Hi All,

I have a requirement to extend the functionality of HDFS so that sensitive
information (SSNs, bank account numbers) is filtered out when data is read.
The solution has to sit at the API layer (the FSDataInputStream API) so
that it works with all our existing ETL programs.
I looked into FSDataInputStream (
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataInputStream.html)
and DistributedFileSystem (
https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#open(org.apache.hadoop.fs.Path)
).
One possibility is to read the stream from FSDataInputStream, sanitize it
by removing the SSNs, and then create a new FSDataInputStream over the
sanitized data and hand that back to the client. Could someone suggest a
better way to achieve the same thing? I am hoping to build this as an
extension to the HDFS API (FSDataInputStream).
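
To make the idea concrete, here is a rough sketch of the sanitizing step
using only java.io, not a working HDFS extension. The class name
RedactingInputStream, the ddd-dd-dddd SSN pattern, the XXX-XX-XXXX mask and
the assumption of line-oriented UTF-8 text are all placeholders of mine. As
far as I can tell, FSDataInputStream only accepts a wrapped stream that
implements Seekable and PositionedReadable, so a real version would need to
implement those as well, and would probably be returned from an overridden
open() in a FilterFileSystem subclass.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Hypothetical wrapper that masks SSN-like tokens, one line at a time. */
class RedactingInputStream extends InputStream {
    // Assumed SSN format (ddd-dd-dddd); real data may need more patterns.
    private static final Pattern SSN =
            Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    // Same length as a match, so the length of each line is unchanged.
    private static final String MASK = "XXX-XX-XXXX";

    private final BufferedReader reader;
    private byte[] buf = new byte[0]; // current sanitized line
    private int pos = 0;              // next byte of buf to hand out

    RedactingInputStream(InputStream in) {
        this.reader = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8));
    }

    /** Replaces every SSN-like token in a single line with the mask. */
    static String redact(String line) {
        return SSN.matcher(line).replaceAll(MASK);
    }

    /** Loads the next sanitized line; returns false at end of stream. */
    private boolean fill() throws IOException {
        while (pos >= buf.length) {
            String line = reader.readLine();
            if (line == null) {
                return false;
            }
            // readLine() drops the terminator, so every line is re-emitted
            // with "\n" (CRLF input and a missing final newline change).
            buf = (redact(line) + "\n").getBytes(StandardCharsets.UTF_8);
            pos = 0;
        }
        return true;
    }

    @Override
    public int read() throws IOException {
        return fill() ? (buf[pos++] & 0xFF) : -1;
    }

    @Override
    public void close() throws IOException {
        reader.close();
    }

    /** Convenience helper for trying the wrapper on an in-memory string. */
    static String redactAll(String text) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (InputStream in = new RedactingInputStream(
                new ByteArrayInputStream(
                        text.getBytes(StandardCharsets.UTF_8)))) {
            int b;
            while ((b = in.read()) != -1) {
                out.write(b);
            }
        }
        return new String(out.toByteArray(), StandardCharsets.UTF_8);
    }
}
```

For example, RedactingInputStream.redactAll("name,123-45-6789\n") should
give back the same line with the number replaced by the mask. Working line
by line keeps memory use bounded, but it would not be safe for binary or
compressed files, and an SSN split across a line break would be missed.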

Any pointers are appreciated.

Thanks,
Rahul
