Re: Data Remanence in HDFS

Jim Halfpenny Fri, 12 Jan 2024 08:04:20 -0800

Hi Danny,
This depends on the circumstances, mostly on file permissions. If, for 
example, a file is deleted without the -skipTrash option then it will be 
moved to the .Trash directory. From there it could be read, but the original 
file permissions are preserved. So a user who did not have read access before 
the file was deleted won’t be able to read it from .Trash, and a user who did 
have read access ought to keep it.
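
As a rough illustration of where a trashed file ends up, here is a small 
Python sketch. It assumes trash is enabled (fs.trash.interval > 0), the usual 
/user/<name> home directories, and a file outside any encryption zone. The 
function name trash_location is made up for this example.

```python
import posixpath


def trash_location(user: str, path: str, home_root: str = "/user") -> str:
    """Return where `hdfs dfs -rm` (without -skipTrash) would move `path`.

    Assumption: default trash layout of
    <home_root>/<user>/.Trash/Current/<original absolute path>,
    where `user` is the account that ran the delete.
    """
    if not path.startswith("/"):
        raise ValueError("expected an absolute HDFS path")
    normalized = posixpath.normpath(path)
    return posixpath.join(home_root, user, ".Trash", "Current") + normalized


print(trash_location("danny", "/data/reports/q4.csv"))
# prints /user/danny/.Trash/Current/data/reports/q4.csv
```

The file keeps its owner, group and mode at that location, so the same 
permission checks apply as before the delete.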

If a file is deleted then the blocks are marked for deletion by the namenode 
and won’t be available through HDFS, but there will be some lag between the 
HDFS delete operation and the block files being removed from the datanodes. 
It’s possible that someone could read the block from the datanode file system 
directly, but not through the HDFS file system. The blocks will exist on disk 
until the datanode itself deletes them.

The way HDFS works, you won’t get previous data when you create a new block, 
since unallocated space doesn’t exist in the same way as it does on a regular 
file system. Each HDFS block maps to a file on the datanodes, and block files 
can be an arbitrary size, unlike the fixed block/extent size of a regular file 
system. You don’t “reuse” HDFS blocks; a block in HDFS is just a file on the 
datanode. You could potentially recover data from unallocated space on the 
datanode disk the same way you would for any other deleted file.
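
To make the "a block is just a file" point concrete, here is a minimal Python 
sketch that walks a datanode data directory on the local file system and 
lists the block files with their sizes. It assumes the usual naming, where 
block data lives in files called blk_<id> next to blk_<id>_<genstamp>.meta 
checksum files. The function name and the example path are hypothetical.

```python
from pathlib import Path


def list_block_files(data_dir: str) -> list[tuple[str, int]]:
    """Return (file name, size in bytes) for each block file under data_dir.

    Assumption: block data files are named blk_<id>; the matching
    blk_<id>_<genstamp>.meta checksum files are skipped.
    """
    blocks = []
    for p in Path(data_dir).rglob("blk_*"):
        if p.is_file() and p.suffix != ".meta":
            blocks.append((p.name, p.stat().st_size))
    return sorted(blocks)


# Example call; the path is hypothetical and should match whatever
# dfs.datanode.data.dir is set to on the datanode:
# for name, size in list_block_files("/hadoop/dfs/data"):
#     print(name, size)
```

Anyone with local read access to that directory sees ordinary files of 
varying size, which is why recovery after deletion is a matter for the 
datanode's own file system rather than for HDFS.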

If you want to remove the chance of data recovery on HDFS then encrypting the 
blocks using HDFS transparent encryption is the way to do it. The per-file 
encryption keys are stored, in encrypted form, in the namenode metadata, so 
once they are deleted the data in that file is effectively lost. Beware of 
snapshots though, since a file deleted from the live HDFS view may still exist 
in a previous snapshot.
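
For reference, a hedged Python sketch of the commands typically involved in 
setting up an encryption zone. It only builds and prints the command lines; 
it does not run them. It assumes a KMS is already configured for the cluster 
and that the zone directory is new and empty. The key name and path are 
placeholders, and the helper name is invented for this example.

```python
import shlex


def encryption_zone_commands(key_name: str, zone_path: str) -> list[list[str]]:
    """Build the CLI calls for creating a key and an encryption zone.

    Assumptions: a KMS is configured, the caller has the needed
    privileges, and zone_path does not yet contain any files.
    """
    return [
        # Create the encryption zone key in the KMS.
        ["hadoop", "key", "create", key_name],
        # Create the (empty) directory that will become the zone.
        ["hdfs", "dfs", "-mkdir", "-p", zone_path],
        # Turn the directory into an encryption zone using that key.
        ["hdfs", "crypto", "-createZone", "-keyName", key_name, "-path", zone_path],
    ]


for cmd in encryption_zone_commands("example-key", "/secure/zone"):
    print(shlex.join(cmd))
```

Files written under the zone afterwards are encrypted on the datanode disks, 
so raw block files recovered from a disk are not readable without the keys.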

Kind regards,
Jim

> On 11 Jan 2024, at 21:50, Daniel Howard <danny...@toldme.com> wrote:
> 
> Is it possible for a user with HDFS access to read the contents of a file 
> previously deleted by a different user?
> 
> I know a user can employ KMS to encrypt files with a personal key, making 
> this sort of data leakage effectively impossible. But, without KMS, is it 
> possible to allocate a file with uninitialized data, and then read the data 
> that exists on the underlying disk?
> 
> Thanks,
> -danny
> 
> --
> http://dannyman.toldme.com <http://dannyman.toldme.com/>
