If you use HDFS as raw file storage, where a single file intermingles data from 
all users, it is hard to achieve what you are trying to do.

Instead, using systems (e.g. HBase, Hive) that support updates and deletes of 
individual records is the only way to go.
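
For example, with each customer's records keyed by a user id in an HBase table, 
honouring a deletion request becomes a single per-row operation. A minimal sketch, 
assuming a table named "user_data" keyed by the user id (both names are 
placeholders, adjust to your schema):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Delete;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class ForgetUser {
        public static void main(String[] args) throws Exception {
            Configuration conf = HBaseConfiguration.create();
            try (Connection conn = ConnectionFactory.createConnection(conf);
                 Table table = conn.getTable(TableName.valueOf("user_data"))) {
                // Removes every cell stored under this row key,
                // i.e. all of the user's records.
                table.delete(new Delete(Bytes.toBytes("john")));
            }
        }
    }

With Hive, the equivalent is an ACID (transactional) table plus a plain 
DELETE ... WHERE statement.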

+Vinod

> On Apr 15, 2019, at 1:32 AM, Ivan Panico <iv.pan...@gmail.com> wrote:
> 
> Hi,
> 
> The recent GDPR introduced a new right for people: the right to be forgotten. 
> This right means that if an organization is asked by a customer to delete all of 
> his data, the organization has to comply most of the time (there are conditions 
> that can suspend this right, but that's beside my point).
> 
> Now, HDFS being WORM (Write Once, Read Many times), I guess you see where 
> I'm going. What would be the best way to implement this line-deletion feature, 
> supposing that when a customer asks for the deletion of all his data, the 
> organization has to delete some lines in some HDFS files?
> 
> Right now I'm going for the following:
> Create a key-value store mapping (user, [files]).
> On file writing, feed this store with the users and the file locations (by 
> appending to or updating a key).
> When a deletion is requested by the user "john", look up the "john" key in that 
> store and rewrite every file it lists (read the file into memory, drop the lines 
> belonging to "john", and write the file back, as sketched below).
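> 
> To make the rewrite step concrete, here is roughly what I had in mind (just a 
> sketch; the path, the user token and the matching logic are placeholders, and I 
> assume the files are line-oriented text):
> 
>     import java.io.BufferedReader;
>     import java.io.BufferedWriter;
>     import java.io.InputStreamReader;
>     import java.io.OutputStreamWriter;
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.fs.FileSystem;
>     import org.apache.hadoop.fs.Path;
> 
>     public class RewriteWithoutUser {
>         public static void main(String[] args) throws Exception {
>             FileSystem fs = FileSystem.get(new Configuration());
>             // Placeholder: one of the files listed under the "john" key.
>             Path src = new Path("/data/part-00000.txt");
>             Path tmp = new Path(src + ".rewrite");
>             String user = "john";
>             try (BufferedReader in = new BufferedReader(new InputStreamReader(fs.open(src)));
>                  BufferedWriter out = new BufferedWriter(new OutputStreamWriter(fs.create(tmp, true)))) {
>                 String line;
>                 while ((line = in.readLine()) != null) {
>                     // Placeholder test: keep only lines that do not belong to the user.
>                     if (!line.contains(user)) {
>                         out.write(line);
>                         out.newLine();
>                     }
>                 }
>             }
>             // Replace the original file with the filtered copy.
>             fs.delete(src, false);
>             fs.rename(tmp, src);
>         }
>     }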
> 
> Would this be the most Hadoop-like way to do that?
> I discarded crypto-shredding-like solutions because the HDFS data has to be 
> readable by several proprietary software products and by users at some point, 
> and I'm not sure how to incorporate a decryption step for all of those use 
> cases.
> Also, I came up with this table solution because a brute-force grep for some key 
> across the whole HDFS tree seemed unlikely to scale, but maybe I'm mistaken?
> 
> Thanks for your help,
> Best regards
