Hi, I am new to NiFi and exploring it as an open-source ETL tool.
As per my understanding, flow files are stored on the local disk and contain the actual data. If that is true, consider the two scenarios below.

Scenario 1:
- A spool directory receives terabytes (5-6 TB/day) of files from external sources.
- I want to push those files to HDFS as-is, without any changes.

Scenario 2:
- A spool directory receives terabytes (5-6 TB/day) of files from external sources.
- I want to mask some of the sensitive columns.
- Then send one copy to HDFS and another copy to Kafka.

Questions for Scenario 1:
1.a In that case, those 5-6 TB of data will be written to the local disk again as flow files, causing double I/O, which may slow performance due to an I/O bottleneck. Is there any way to bypass writing flow files to disk, or to pass those files directly to HDFS as-is?
1.b If the files in the spool directory are compressed (zip/gzip), can we store them on HDFS uncompressed?

Questions for Scenario 2:
2.a Can we use our existing Java code for the masking? If yes, then how? (See the sketch below for what I have in mind.)
2.b For this scenario we also want to bypass storing flow files on disk. Can we do it on the fly, masking and then storing on HDFS?
2.c If the source files are compressed (zip/gzip), is there any issue with masking here?

In fact, I tried the above using Flume with Flume interceptors. Everything works fine with smaller files, but once the source files grow beyond 50 MB, Flume chokes :(. So I am exploring options in NiFi.

Hope I will get some guidance from you guys. Thanks in advance.

-Obaid
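P.S. To make question 2.a concrete, here is roughly what I had in mind: wrapping our existing masking routine in a custom processor that streams the content line by line. This is just a minimal, untested sketch based on my reading of the processor API; "MaskColumnsProcessor" and "MaskingUtils.mask(...)" are placeholder names standing in for our own Java code, not anything that exists in NiFi.

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.nio.charset.StandardCharsets;
import java.util.Collections;
import java.util.Set;

import org.apache.nifi.flowfile.FlowFile;
import org.apache.nifi.processor.AbstractProcessor;
import org.apache.nifi.processor.ProcessContext;
import org.apache.nifi.processor.ProcessSession;
import org.apache.nifi.processor.Relationship;
import org.apache.nifi.processor.exception.ProcessException;
import org.apache.nifi.processor.io.StreamCallback;

public class MaskColumnsProcessor extends AbstractProcessor {

    public static final Relationship REL_SUCCESS = new Relationship.Builder()
            .name("success")
            .description("FlowFiles whose sensitive columns have been masked")
            .build();

    @Override
    public Set<Relationship> getRelationships() {
        return Collections.singleton(REL_SUCCESS);
    }

    @Override
    public void onTrigger(ProcessContext context, ProcessSession session)
            throws ProcessException {
        FlowFile flowFile = session.get();
        if (flowFile == null) {
            return;
        }
        // Rewrite the content as a stream, line by line, so large files
        // never have to be held in memory all at once.
        flowFile = session.write(flowFile, new StreamCallback() {
            @Override
            public void process(InputStream in, OutputStream out) throws IOException {
                BufferedReader reader = new BufferedReader(
                        new InputStreamReader(in, StandardCharsets.UTF_8));
                BufferedWriter writer = new BufferedWriter(
                        new OutputStreamWriter(out, StandardCharsets.UTF_8));
                String line;
                while ((line = reader.readLine()) != null) {
                    // Placeholder for our existing Java masking call.
                    writer.write(MaskingUtils.mask(line));
                    writer.newLine();
                }
                writer.flush();
            }
        });
        session.transfer(flowFile, REL_SUCCESS);
    }
}

If something like this is workable, my question 2.b still applies: does the session.write() stream go through the content repository on local disk anyway, or can NiFi pipe it through to HDFS without landing it locally?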