Incremental query on partition column

2020-08-13 Thread Sivaprakash
Hi, what design can be used/implemented so that we can re-ingest data without affecting incremental queries? - Is it possible to maintain a delta dataset across partitions (hoodie.datasource.write.partitionpath.field)? In my case it is a date. - Can I do a snapshot query on a...
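[Editor's note] A minimal PySpark sketch of the pattern the question touches on: writing with a date-valued hoodie.datasource.write.partitionpath.field and then pulling changes with an incremental query. This is not from the thread; the table name, paths, and column names (uuid, event_date, ts) are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-incremental-sketch").getOrCreate()

base_path = "s3://my-bucket/hudi/events"  # hypothetical table location

write_opts = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "event_date",  # date partition, as in the question
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "upsert",
}

# Re-ingest a batch; upsert rewrites matching keys instead of duplicating them.
df = spark.read.json("s3://my-bucket/staging/events.json")  # hypothetical input
df.write.format("hudi").options(**write_opts).mode("append").save(base_path)

# Incremental query: only records committed after begin_time come back.
begin_time = "20200801000000"  # a previously observed commit instant
incr = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", begin_time)
    .load(base_path)
)
incr.show()
```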

GDPR - Time Travel Query

2020-07-30 Thread Sivaprakash
Hello, what I see is: if we want to implement GDPR ( https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-HowdoIdeleterecordsinthedatasetusingHudi ) then old versions of commit files should be removed (otherwise an incremental query with point-in-time options can still read the data which was deleted...
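[Editor's note] For context, a sketch (assumed, not from the thread) of the point-in-time incremental read the message refers to: bounding the query with both begin and end instant times. Until the cleaner physically removes older commit files, such a query can still surface records that were deleted later. Path and instants are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-point-in-time-sketch").getOrCreate()

# Bounded incremental ("point-in-time") read over hypothetical instants.
point_in_time = (
    spark.read.format("hudi")
    .option("hoodie.datasource.query.type", "incremental")
    .option("hoodie.datasource.read.begin.instanttime", "000")           # from the first commit
    .option("hoodie.datasource.read.end.instanttime", "20200715000000")  # up to an old instant
    .load("s3://my-bucket/hudi/events")  # hypothetical table path
)
# If old commit files still exist, deleted records can reappear here,
# which is exactly the GDPR concern raised above.
point_in_time.show()
```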

Re: Hard Delete

2020-07-17 Thread Sivaprakash
2020 at 5:07 PM Balaji Varadarajan wrote: > Hi Sivaprakash, > You can configure the cleaner to clean the older file versions which contain > those records to be deleted. You can take a look at > https://cwiki.apache.org/confluence/display/HUDI/FAQ#FAQ-WhatdoestheHudicleanerdo >
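[Editor's note] A hedged sketch of the cleaner configuration Balaji points at. These are real Hudi options, but the retention value here is illustrative; retaining only the latest file version means older files containing the deleted records are removed on the next clean.

```python
# Cleaner settings to aggressively drop older file versions (value illustrative).
cleaner_opts = {
    "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
    "hoodie.cleaner.fileversions.retained": "1",
}
# Merged into the usual write options on subsequent commits, e.g.:
#   df.write.format("hudi").options(**write_opts, **cleaner_opts) \
#       .mode("append").save(base_path)
```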

Hard Delete

2020-07-17 Thread Sivaprakash
Hello, do we have any option to delete a record from every partition? Which means I want to completely wipe out a particular record from the complete data set (first commit, all the changes, delta commits, etc.). Currently, when I delete, it affects only the last commit, but if I do an incremental query on th...
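[Editor's note] For reference, a minimal PySpark delete sketch (assumed, not from the thread). The delete commit removes the records from the latest snapshot, but older commit files still contain them until the cleaner runs, which matches the behaviour described above. Table path and column names are hypothetical.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-delete-sketch").getOrCreate()
base_path = "s3://my-bucket/hudi/events"  # hypothetical

# DataFrame holding the keys (and partition values) of the records to delete.
# Older Hudi releases may need a partition glob, e.g. load(base_path + "/*").
to_delete = spark.read.format("hudi").load(base_path).filter("uuid = 'abc-123'")

delete_opts = {
    "hoodie.table.name": "events",
    "hoodie.datasource.write.recordkey.field": "uuid",
    "hoodie.datasource.write.partitionpath.field": "event_date",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.operation": "delete",  # issues a delete commit
}
to_delete.write.format("hudi").options(**delete_opts).mode("append").save(base_path)
```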

Re: Handling delta

2020-07-16 Thread Sivaprakash
Great!! Got it working!! 'hoodie.datasource.write.recordkey.field': 'COL1,COL2', 'hoodie.datasource.write.keygenerator.class': 'org.apache.hudi.keygen.ComplexKeyGenerator', Thank you. On Thu, Jul 16, 2020 at 7:10 PM Adam Feldman wrote: > Hi Si...
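[Editor's note] Putting the working answer in context: a composite record key is given comma-separated and needs the ComplexKeyGenerator, exactly as in the two settings above. The surrounding values (table name, precombine field) are illustrative placeholders.

```python
hudi_options = {
    "hoodie.table.name": "my_table",  # placeholder
    # The two settings from the thread: comma-separated composite key
    # plus the ComplexKeyGenerator.
    "hoodie.datasource.write.recordkey.field": "COL1,COL2",
    "hoodie.datasource.write.keygenerator.class": "org.apache.hudi.keygen.ComplexKeyGenerator",
    "hoodie.datasource.write.precombine.field": "ts",  # placeholder
    "hoodie.datasource.write.operation": "upsert",
}
```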

Re: Handling delta

2020-07-16 Thread Sivaprakash
...dot notation, e.g. a.b.c. However, I couldn't provide more than one column like this... COL1.COL2 'hoodie.datasource.write.recordkey.field': 'COL1.COL2' Anything wrong with the syntax? (tried with a comma as well) On Thu, Jul 16, 2020 at 6:41 PM Sivaprakash wrote: > Hello B...

Re: Handling delta

2020-07-16 Thread Sivaprakash
...ates 3 records again? I thought it would create only the 2nd record - Trying to understand the storage volume efficiency here - Does some configuration have to be enabled to fix this? The configuration that I use - COPY_ON_WRITE, Append, Upsert - First column (NR001) is configured as *hoodie...

Re: Handling delta

2020-07-16 Thread Sivaprakash
should be only 50 records). On Thu, Jul 16, 2020 at 4:01 PM Allen Underwood wrote: > Hi Sivaprakash, > > So I'm by no means an expert on this, but I think you might find what > you're looking for here: > https://hudi.apache.org/docs/concepts.html > > I'm no

Handling delta

2020-07-16 Thread Sivaprakash
This might be a basic question - I'm experimenting with Hudi (PySpark). I have used the Insert/Upsert options to write deltas into my data lake. However, one thing is not clear to me. Step 1: I write 50 records. Step 2: I'm writing 50 records, out of which only *10 records have been changed* (I'm using upsert...
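[Editor's note] A runnable sketch of the two-step experiment described (assumed setup; column names and path are hypothetical). With the upsert operation, the second batch updates the rows whose record keys match instead of duplicating them, so the snapshot stays at 50 records.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("hudi-upsert-sketch").getOrCreate()
base_path = "/tmp/hudi/demo"  # hypothetical

opts = {
    "hoodie.table.name": "demo",
    "hoodie.datasource.write.recordkey.field": "id",
    "hoodie.datasource.write.partitionpath.field": "part",
    "hoodie.datasource.write.precombine.field": "ts",
    "hoodie.datasource.write.table.type": "COPY_ON_WRITE",
    "hoodie.datasource.write.operation": "upsert",
}

# Step 1: initial 50 records.
step1 = spark.range(50).selectExpr(
    "id", "unix_timestamp() as ts", "'v1' as val", "'p0' as part"
)
step1.write.format("hudi").options(**opts).mode("overwrite").save(base_path)

# Step 2: the same 50 keys, 10 of them with changed values; upsert
# updates those 10 in place, so the snapshot still holds 50 rows.
step2 = spark.range(50).selectExpr(
    "id", "unix_timestamp() as ts", "if(id < 10, 'v2', 'v1') as val", "'p0' as part"
)
step2.write.format("hudi").options(**opts).mode("append").save(base_path)

# Older Hudi releases may need a partition glob: load(base_path + "/*").
print(spark.read.format("hudi").load(base_path).count())  # expect 50
```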