Hi, 
Since Parquet files are immutable (and therefore the data sets built on them are too), how does Spark SQL handle updates or deletes?
I mean, if I read a file into an RDD using SQL, mutate it (e.g. delete a row), and then persist it, I now have two files. If I re-read the table back in … will I see duplicates or not?
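
Concretely, something like this is what I have in mind (a rough sketch using the DataFrame API; the paths and the "id" column are invented):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("parquet-delete-question").getOrCreate()

  // Read the original, immutable Parquet data.
  val df = spark.read.parquet("/data/events/")

  // "Delete" a row by filtering it out, then persist the result.
  // Parquet is write-once, so this necessarily lands in new files.
  df.filter("id <> 42").write.parquet("/data/events_v2/")

  // If both locations are later treated as one table, every surviving
  // row shows up twice -- is that what happens, or does Spark SQL
  // somehow reconcile the two copies?
  val reread = spark.read.parquet("/data/events/", "/data/events_v2/")
  reread.show()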

The larger issue is how to handle mutable data in a multi-user / multi-tenant
situation while using Parquet as the storage layer.
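
The only pattern I can see for simulating mutation on top of immutable files is full-snapshot versioning: write each new state of the table to a fresh directory and have readers agree to pick up only the latest one. A rough sketch (the version directories are invented):

  import org.apache.spark.sql.SparkSession

  val spark = SparkSession.builder().appName("snapshot-versioning").getOrCreate()

  // Every logical update/delete produces a brand-new snapshot directory.
  val v1 = spark.read.parquet("/data/events/v1/")
  v1.filter("id <> 42").write.parquet("/data/events/v2/")

  // Readers only ever load the newest snapshot, so stale files are
  // invisible rather than duplicated -- but some external bookkeeping
  // has to track which version is current.
  val latest = spark.read.parquet("/data/events/v2/")

That seems workable for a single writer, but it is not obvious how it holds up with many concurrent writers, which is really my question.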

Would Parquet be the right tool for this?

W.r.t. ORC files, mutation is handled by Tez.

Thanks in advance,

-Mike
