Parquet files don't support mutation. If you want to remove records, you have to rewrite the file and filter out the records you don't want. I think this is probably better for compliance anyway, because delete markers don't actually delete the data or make it inaccessible.
I'm not sure what else the format could provide here. Maybe making it easier to delete a column would be sufficient? If you had data that wasn't tied to a person except for some ID column, you could anonymize by removing that column without re-encoding the rest of the data (though this would require rewriting the file). That wouldn't be too difficult to do, but unfortunately requires planning ahead to know what columns you can delete to reach compliance.

Another idea here is to replace the ID column with hash(ID) so you'd still have relationships, but no information to tie rows to individuals.

rb

On Tue, Nov 7, 2017 at 1:11 AM, Machiel Groeneveld <machi...@gmail.com> wrote:

> Hi,
>
> The upcoming cross EU law GDPR requires companies to remove data collected
> from consumers as requested. I'm exploring the options concerning our
> Parquet tables.
>
> I don't see any support for mutating parquet files, if it's not there is it
> possible to add that?
>
> I wonder if anyone has any knowledge of how a deletion could be processed
> in the parquet world. Of course there is the option to sift through
> billions of records and recreate all our tables for each deletion request
> but I'm hoping for a more efficient method. Perhaps a delete flag could be
> added to the format or is there a way to zero out existing data?
>
> At some point all companies storing data of EU citizens will need to have
> an answer to this. Simply locking the data behind more restrictions is not
> an option, data should be erased. Companies are already looking into ways
> to delete data from tape backups, the law is that far reaching.
>

--
Ryan Blue
Software Engineer
Netflix
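
For reference, the hash(ID) idea from the reply could look roughly like the sketch below. The function names and the row-as-dict layout are hypothetical. It also swaps in a keyed hash (HMAC-SHA-256) rather than a plain hash, on the assumption that IDs are guessable enough that an unkeyed hash could be reversed by hashing every candidate ID. Whether this is enough for any particular regulation is not something the sketch claims.

```python
import hashlib
import hmac


def pseudonymize_id(raw_id: str, secret_key: bytes) -> str:
    """Return a keyed SHA-256 digest of raw_id as a hex string."""
    return hmac.new(secret_key, raw_id.encode("utf-8"), hashlib.sha256).hexdigest()


def pseudonymize_rows(rows, id_column, secret_key):
    """Return copies of rows with id_column replaced by its keyed hash.

    Equal IDs map to equal hashes, so joins and group-bys on the column
    still work, but the original ID is no longer stored in the rows.
    """
    return [
        {**row, id_column: pseudonymize_id(str(row[id_column]), secret_key)}
        for row in rows
    ]
```

The rows would then be written out to a new Parquet file in place of the original, which again means a full rewrite.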