Thanks Jim,

The scenario I have in mind is something like:
1) Ask HDFS to create a file that is 32k in length.
2) Attempt to read the contents of the file.
Can I even attempt to read the contents of a file that has not yet been
written? If so, what data would get sent?

For example, I asked a version of this question of ganeti with regard to
creating VMs. You can, by default, read the previous contents of the disk
in your new VM, but they have an option to wipe newly allocated VM disks
for added security.[1]

[1]: https://groups.google.com/g/ganeti/c/-c_KoLd6mnI

Thanks,
-danny

On Fri, Jan 12, 2024 at 8:03 AM Jim Halfpenny <[email protected]> wrote:

> Hi Danny,
> This does depend on a number of circumstances, mostly based on file
> permissions. If for example a file is deleted without the -skipTrash option
> then it will be moved to the .Trash directory. From here it could be read,
> but the original file permissions will be preserved. Therefore if a user
> did not have read access before it was deleted then they won’t be able to
> read it from .Trash, and if they did have read access then this ought to
> remain the case.
>
> If a file is deleted then the blocks are marked for deletion by the
> namenode and won’t be available through HDFS, but there will be some lag
> between the HDFS delete operation and the block files being removed from
> the datanodes. It’s possible that someone could read the block from the
> datanode file system directly, but not through the HDFS file system. The
> blocks will exist on disk until the datanode itself deletes them.
>
> The way HDFS works you won’t get previous data when you create a new block,
> since unallocated space doesn’t exist in the same way as it does on a
> regular file system. Each HDFS block maps to a file on the datanodes, and
> block files can be an arbitrary size, unlike the fixed block/extent size of
> a regular file system. You don’t “reuse” HDFS blocks; a block in HDFS is
> just a file on the data node. You could potentially recover data from
> unallocated space on the datanode disk the same way you would for any other
> deleted file.
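An aside on the regular-filesystem side of this comparison: a POSIX filesystem also won't hand back stale disk contents when a file is extended without being written — the unwritten range reads back as zeros. A quick local Python sketch (plain local files, not HDFS) showing the 32k scenario from the thread:

```python
import os
import tempfile

# Create a 32 KiB file without ever writing data to it, then read it back.
# On a POSIX filesystem the unwritten region is returned as zero bytes --
# the kernel never exposes stale data from previously freed disk blocks.
with tempfile.NamedTemporaryFile(delete=False) as f:
    path = f.name

os.truncate(path, 32 * 1024)  # allocate 32k of "uninitialized" length

with open(path, "rb") as f:
    data = f.read()

print(len(data))                        # 32768
print(data == b"\x00" * (32 * 1024))    # True: zeros, not old disk contents

os.unlink(path)
```

So on both a regular filesystem and HDFS, "allocate then read" yields zeros rather than leaked data; the mechanisms just differ (kernel zero-fill versus HDFS blocks simply being files of exactly the written length).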
>
> If you want to remove the chance of data recovery on HDFS then encrypting
> the blocks using HDFS transparent encryption is the way to do it. The
> encryption keys reside in the namenode metadata, so once they are deleted
> the data in that file is effectively lost. Beware of snapshots though, since
> a deleted file in the live HDFS view may exist in a previous snapshot.
>
> Kind regards,
> Jim
>
>
> On 11 Jan 2024, at 21:50, Daniel Howard <[email protected]> wrote:
>
> Is it possible for a user with HDFS access to read the contents of a file
> previously deleted by a different user?
>
> I know a user can employ KMS to encrypt files with a personal key, making
> this sort of data leakage effectively impossible. But, without KMS, is it
> possible to allocate a file with uninitialized data, and then read the data
> that exists on the underlying disk?
>
> Thanks,
> -danny
>
> --
> http://dannyman.toldme.com

--
http://dannyman.toldme.com
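To illustrate the crypto-shredding idea from the thread — once the key is deleted, ciphertext left on a datanode disk is useless — here is a toy Python sketch. This is not HDFS's actual mechanism (HDFS transparent encryption uses AES/CTR with keys managed via the KMS); the XOR keystream and the key value below are purely illustrative:

```python
import hashlib

def xor_keystream(key: bytes, data: bytes) -> bytes:
    """Toy stream cipher: XOR data against a SHA-256-derived keystream.
    NOT HDFS's cipher -- only demonstrates that losing the key loses
    the data, which is the crypto-shredding property Jim describes."""
    stream = bytearray()
    counter = 0
    while len(stream) < len(data):
        stream += hashlib.sha256(key + counter.to_bytes(8, "big")).digest()
        counter += 1
    return bytes(b ^ k for b, k in zip(data, stream))

# Hypothetical per-file key, standing in for the key material that
# transparent encryption keeps out of the block files themselves.
key = b"hypothetical-per-file-key"
plaintext = b"sensitive file contents"
ciphertext = xor_keystream(key, plaintext)

# With the key, the data round-trips:
assert xor_keystream(key, ciphertext) == plaintext

# Discard the key (as deleting the file's key material would) and the
# raw ciphertext bytes lingering on a datanode disk stay unreadable.
key = None
```

The same reasoning is why the snapshot caveat matters: a snapshot that still references the file keeps its key material reachable, so deletion in the live view alone is not enough.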
