Re: Data Remanence in HDFS

2024-01-13 Thread Jim Halfpenny
Hi Daniel,
In short, you can’t create an HDFS block with unallocated data. You can create
a zero-length block, which will result in a zero-byte file being created on
the datanode, but you can’t create a sparse file in HDFS. While HDFS has a
nominal block size, e.g. 128 MB, if you create a small file then the file on
the datanode will be sized to the actual data, not to the block length;
creating a 32 kB HDFS file will in turn create a single 32 kB file on the
datanodes. HDFS is not built like a traditional file system, with fixed-size
blocks/extents at fixed disk locations.
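
To illustrate, here’s a minimal sketch using the Hadoop FileSystem API (the
path and class name are invented): writing 32 kB into a file whose block size
is 128 MB still reports a 32 kB length, and the block file on the datanode is
only 32 kB on disk.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class SmallFileDemo {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/tmp/small-file-demo");
            try (FSDataOutputStream out = fs.create(p)) {
                out.write(new byte[32 * 1024]); // write 32 kB of zeros
            }
            FileStatus st = fs.getFileStatus(p);
            // Prints e.g. "length=32768 blockSize=134217728"; the block
            // file on the datanode matches the 32 kB length, not 128 MB.
            System.out.println("length=" + st.getLen()
                    + " blockSize=" + st.getBlockSize());
        }
    }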

Kind regards,
Jim


Re: Data Remanence in HDFS

2024-01-12 Thread Daniel Howard
Thanks, Jim,

The scenario I have in mind is something like:
1) Ask HDFS to create a file that is 32k in length.
2) Attempt to read the contents of the file.

Can I even attempt to read the contents of a file that has not yet been
written? If so, what data would get sent?
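
To make this concrete, here is a rough sketch of those two steps using the
Hadoop FileSystem API (the path and class name are invented, and I am only
guessing that step 2 is even permitted):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class RemanenceProbe {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path p = new Path("/tmp/remanence-probe");
            // 1) Create the file, but do not write any data yet.
            FSDataOutputStream out = fs.create(p);
            // 2) Attempt to read the contents before anything is written.
            try (FSDataInputStream in = fs.open(p)) {
                System.out.println("first read() returned: " + in.read());
            }
            out.close();
        }
    }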

For example, I asked a version of this question of Ganeti with regard to
creating VMs. By default, you can read the previous contents of the disk
in your new VM, but there is an option to wipe newly allocated VM disks
for added security.[1]

[1]: https://groups.google.com/g/ganeti/c/-c_KoLd6mnI

Thanks,
-danny

-- 
http://dannyman.toldme.com


Re: Data Remanence in HDFS

2024-01-12 Thread Jim Halfpenny
Hi Danny,
This depends on a number of circumstances, mostly around file permissions.
If, for example, a file is deleted without the -skipTrash option then it will
be moved to the .Trash directory. From there it could still be read, but the
original file permissions are preserved. Therefore, if a user did not have
read access before the file was deleted then they won’t be able to read it
from .Trash, and if they did have read access then this ought to remain the
case.
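
As a sketch (the user name and path below are invented; the layout assumes
the default trash policy of /user/<user>/.Trash/Current/<original path>), you
can confirm that the trashed copy keeps its original owner and permissions:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class TrashPermissions {
        public static void main(String[] args) throws Exception {
            FileSystem fs = FileSystem.get(new Configuration());
            Path trashed =
                new Path("/user/alice/.Trash/Current/data/report.csv");
            FileStatus st = fs.getFileStatus(trashed);
            // Owner, group and mode are carried over from the original
            // file, so the same read checks apply inside .Trash.
            System.out.println(st.getPermission() + " " + st.getOwner()
                    + " " + st.getGroup());
        }
    }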

When a file is deleted the blocks are marked for deletion by the namenode and
won’t be available through HDFS, but there will be some lag between the HDFS
delete operation and the block files being removed from the datanodes. During
that window someone could read the blocks directly from the datanode’s local
file system, though not through HDFS; the block files remain on disk until
the datanode itself deletes them.
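
To illustrate (the directory below is a common default for
dfs.datanode.data.dir and will vary per cluster), the replicas are ordinary
local files named blk_<id>, readable by anyone with access to the datanode
host:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.Paths;
    import java.util.stream.Stream;

    public class ListBlockFiles {
        public static void main(String[] args) throws IOException {
            // Run locally on a datanode host, bypassing HDFS entirely.
            Path dataDir = Paths.get("/hadoop/dfs/data/current");
            try (Stream<Path> paths = Files.walk(dataDir)) {
                paths.filter(p -> p.getFileName().toString()
                                   .startsWith("blk_"))
                     .forEach(System.out::println);
            }
        }
    }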

The way HDFS works, you won’t get previous data when you create a new block,
since unallocated space doesn’t exist in the same way it does on a regular
file system. Each HDFS block maps to a file on the datanodes, and block files
can be an arbitrary size, unlike the fixed block/extent size of a regular
file system. You don’t “reuse” HDFS blocks; a block in HDFS is just a file on
the datanode. You could potentially recover data from unallocated space on
the datanode disk the same way you would for any other deleted file.

If you want to remove the chance of data recovery on HDFS then encrypting the
blocks using HDFS transparent encryption is the way to do it. The per-file
encryption keys are stored, in encrypted form, in the namenode metadata, so
once they are deleted the data in that file is effectively lost. Beware of
snapshots though, since a file deleted from the live HDFS view may still
exist in a previous snapshot.
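
A sketch of setting one up (the namenode URI, zone path and key name are
invented, and it assumes a KMS is configured with a key named "zonekey",
e.g. created with "hadoop key create zonekey"):

    import java.net.URI;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hdfs.client.HdfsAdmin;

    public class CreateZone {
        public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            HdfsAdmin admin =
                new HdfsAdmin(new URI("hdfs://namenode:8020"), conf);
            // Every file written under /secure gets its own data
            // encryption key; deleting the file discards the encrypted
            // copy of that key from the namenode metadata with it.
            admin.createEncryptionZone(new Path("/secure"), "zonekey");
        }
    }

Once the per-file key material is gone, any leftover block files on the
datanodes are just ciphertext.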

Kind regards,
Jim


> On 11 Jan 2024, at 21:50, Daniel Howard wrote:
> 
> Is it possible for a user with HDFS access to read the contents of a file 
> previously deleted by a different user?
> 
> I know a user can employ KMS to encrypt files with a personal key, making 
> this sort of data leakage effectively impossible. But, without KMS, is it 
> possible to allocate a file with uninitialized data, and then read the data 
> that exists on the underlying disk?
> 
> Thanks,
> -danny
> 
> --
> http://dannyman.toldme.com