Hi,

Given that Parquet files and the underlying data sets are immutable, how does Spark SQL handle updates or deletes? For example, if I read a file via Spark SQL into an RDD, mutate it (say, delete a row), and then persist it, I now have two files. If I then re-read the table, will I see duplicates or not?
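To make the scenario concrete, here is a minimal plain-Python sketch of the semantics I am asking about (JSON files stand in for Parquet part-files, and a directory scan stands in for the Spark read; no Spark required, and the file names are made up for illustration):

```python
import json, os, tempfile

# Sketch: a table is a directory of immutable part-files. "Deleting a row"
# means writing a NEW file with the row filtered out; unless the old
# part-file is removed or replaced, a re-read of the directory sees the
# rows from BOTH files.

table_dir = tempfile.mkdtemp()

def write_part(name, rows):
    # one immutable "part-file" (JSON stands in for Parquet here)
    with open(os.path.join(table_dir, name), "w") as f:
        json.dump(rows, f)

def read_table():
    # Spark-style read: the union of every part-file in the directory
    rows = []
    for name in sorted(os.listdir(table_dir)):
        with open(os.path.join(table_dir, name)) as f:
            rows.extend(json.load(f))
    return rows

write_part("part-00000", [{"id": 1}, {"id": 2}, {"id": 3}])

# "Delete" id == 2 by persisting a filtered copy alongside the original
mutated = [r for r in read_table() if r["id"] != 2]
write_part("part-00001", mutated)

print(len(read_table()))  # 5 rows: the originals AND the filtered copy coexist
```

So my question is essentially whether Spark SQL's write path replaces the old files (no duplicates) or leaves them in place alongside the new ones (duplicates on re-read).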
The larger issue is how to handle mutable data in a multi-user / multi-tenant situation with Parquet as the storage format. Would Parquet be the right tool here? With ORC files, by comparison, mutation is handled by Hive (running on Tez).

Thanks in advance,
-Mike