Re: Quick but probably silly question...

2017-01-17 Thread Jörn Franke
You use compaction, i.e. you save the modified/deleted records in a dedicated delta file. Every now and then you compare the original and delta files and create a new version. When querying before compaction, you need to check both the original and the delta file. I don't think ORC needs Tez for it, but it
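
To make the original-plus-delta idea concrete, here is a minimal sketch in plain Python. It is not the Hive or ORC implementation: the in-memory dicts stand in for files, and the names `read_merged`, `compact`, and `TOMBSTONE` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Optional, Tuple

Row = Dict[str, str]
Base = Dict[int, Row]              # stands in for the immutable original file
Delta = Dict[int, Optional[Row]]   # stands in for the delta file

TOMBSTONE = None  # assumption: a deleted key is recorded as None in the delta


def read_merged(base: Base, delta: Delta) -> Base:
    """Query before compaction: check the original and the delta together."""
    merged = dict(base)  # copy, so the original stays untouched
    for key, row in delta.items():
        if row is TOMBSTONE:
            merged.pop(key, None)  # deleted record: hide it from readers
        else:
            merged[key] = row      # updated or newly inserted record
    return merged


def compact(base: Base, delta: Delta) -> Tuple[Base, Delta]:
    """Compaction: fold the delta into a new version and start an empty delta."""
    return read_merged(base, delta), {}


base = {1: {"name": "a"}, 2: {"name": "b"}, 3: {"name": "c"}}
delta = {2: {"name": "B"}, 3: TOMBSTONE, 4: {"name": "d"}}

view = read_merged(base, delta)            # what a query sees before compaction
new_base, new_delta = compact(base, delta)  # new version after compaction
```

In this sketch a reader never sees both the old and the new version of a row, because the delta entry for a key always overrides the original entry for the same key.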

Quick but probably silly question...

2017-01-17 Thread Michael Segel
Hi, While the Parquet file is immutable and the data sets are immutable, how does Spark SQL handle updates or deletes? I mean, if I read a file into an RDD using SQL, mutate it, e.g. delete a row, and then persist it, I now have two files. If I reread the table back in … will I see duplicates