Hi All, I have a requirement to extend the functionality of HDFS so that sensitive information (SSN, bank account numbers) is filtered out when data is read. The solution has to sit at the API layer (the FSDataInputStream API) so that it works with all our existing ETL programs without changes. I looked into FSDataInputStream (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FSDataInputStream.html) and the open() method that DistributedFileSystem inherits from FileSystem (https://hadoop.apache.org/docs/current/api/org/apache/hadoop/fs/FileSystem.html#open(org.apache.hadoop.fs.Path)). One possibility is to read the stream from the original FSDataInputStream, sanitize it by removing the SSNs, and then create a new FSDataInputStream over the sanitized data and hand that back to the client. Could someone tell me whether there is a better way to achieve the same thing? I am hoping to build this as an extension to the HDFS API (FSDataInputStream).
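To make the "sanitize, then re-wrap" idea concrete, here is a minimal sketch of the sanitizing half only, written against plain java.io so it runs without Hadoop on the classpath. The class name SsnMaskingInputStream, the ddd-dd-dddd pattern, and the fixed-width mask are all my own assumptions for illustration. It masks rather than removes each match so that byte offsets stay the same as in the original file. Note that the FSDataInputStream constructor only accepts a stream that implements Seekable and PositionedReadable, so a real version would have to add those on top of something like this.

```java
import java.io.ByteArrayOutputStream;
import java.io.FilterInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;
import java.util.regex.Pattern;

/** Hypothetical sketch: masks SSN-shaped tokens (ddd-dd-dddd) one line at a time. */
class SsnMaskingInputStream extends FilterInputStream {
    // Assumed SSN layout; bank account numbers would need their own pattern.
    private static final Pattern SSN = Pattern.compile("\\b\\d{3}-\\d{2}-\\d{4}\\b");
    private static final String MASK = "XXX-XX-XXXX"; // same length as a match

    private byte[] line = new byte[0]; // current sanitized line
    private int pos = 0;               // next unread index in line

    SsnMaskingInputStream(InputStream in) {
        super(in);
    }

    /** Loads and masks the next line (including its '\n'); false at end of stream. */
    private boolean fill() throws IOException {
        ByteArrayOutputStream raw = new ByteArrayOutputStream();
        int b;
        // Byte-at-a-time reads: wrap the source in a BufferedInputStream for speed.
        while ((b = in.read()) != -1) {
            raw.write(b);
            if (b == '\n') break;
        }
        if (raw.size() == 0) return false;
        // ISO-8859-1 maps every byte to exactly one char, so non-ASCII bytes
        // pass through the decode/encode round trip unchanged.
        String text = new String(raw.toByteArray(), StandardCharsets.ISO_8859_1);
        line = SSN.matcher(text).replaceAll(MASK).getBytes(StandardCharsets.ISO_8859_1);
        pos = 0;
        return true;
    }

    @Override
    public int read() throws IOException {
        if (pos >= line.length && !fill()) return -1;
        return line[pos++] & 0xFF;
    }

    @Override
    public int read(byte[] buf, int off, int len) throws IOException {
        if (len == 0) return 0;
        if (pos >= line.length && !fill()) return -1;
        int n = Math.min(len, line.length - pos);
        System.arraycopy(line, pos, buf, off, n);
        pos += n;
        return n;
    }

    @Override
    public long skip(long n) throws IOException {
        // Skip through the masking path so the line buffer stays consistent.
        long skipped = 0;
        while (skipped < n && read() != -1) skipped++;
        return skipped;
    }

    @Override
    public boolean markSupported() {
        return false; // mark/reset is not handled in this sketch
    }
}
```

Usage would be along the lines of new SsnMaskingInputStream(originalStream), with the result read like any other InputStream. An SSN split across a line break is not caught, and a very long line is buffered whole in memory, so this is only a starting point.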
Any pointers are appreciated. Thanks, Rahul