So, I stumbled onto NiFi at a Laurel, MD Spark meetup and was pretty excited about using it. I'm running HDP and need to construct an ETL-like flow, and as a new NiFi user I'd like to start with a "best practice" approach. Wondering if some of you more seasoned users might share some thoughts on my problem?
1. 160 zip files/day show up on an NFS share in various subdirectories, and their filenames contain the yyyymmddHHMMSS of when the stats were generated.
2. Each zip file contains 4 or more large CSV files.
3. I need just one of those CSVs from each zip file each day, and they all add up to about 10 GB uncompressed.
4. I need to extract that one file from each zip, strip off the first line (the headers), and store it in HDFS, compressed again using gzip or snappy.
5. I cannot delete the NFS file after the copy to HDFS because others need access to it for some time.

Where I am having a hard time visualizing this in NiFi is with the first step. I need to scan the NFS files after 8 AM every day (when I know all files for the previous 24 hours will be present), find that day's set of files using the yyyymmdd part of the filenames, then extract the one file I need from each and process it into HDFS.

I could imagine a processor that runs once every 24 hours on a cron schedule. I could also imagine running an ExecuteProcess processor against a bash script to get the list of all the files that match the yyyymmdd. Then I get stuck: how do I take this list of 160 file paths and kick off the ETL flow for each of them in parallel?

Thanks in advance for any ideas.
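To be concrete, here's roughly what I mean by the per-file work, sketched in Python (the directory names, the target CSV name inside the zip, and the staging directory are all hypothetical placeholders; I'm assuming the gzipped output would land in a local staging area that something like a PutHDFS processor or `hdfs dfs -put` then moves into HDFS):

```python
import glob
import gzip
import os
import zipfile
from datetime import datetime, timedelta

def process_day(nfs_root, staging_dir, csv_name, day=None):
    """Find one day's zips under nfs_root, extract csv_name from each,
    strip its header line, and write it gzip-compressed to staging_dir.
    The source zips on NFS are never modified or deleted."""
    # Default to yesterday, matching the "run after 8 AM for the
    # previous 24 hours" schedule described above.
    day = day or (datetime.now() - timedelta(days=1)).strftime("%Y%m%d")
    # Zips sit in various subdirectories; filenames embed yyyymmddHHMMSS.
    pattern = os.path.join(nfs_root, "**", f"*{day}*.zip")
    for zip_path in sorted(glob.glob(pattern, recursive=True)):
        out_name = os.path.basename(zip_path).replace(".zip", ".csv.gz")
        out_path = os.path.join(staging_dir, out_name)
        with zipfile.ZipFile(zip_path) as zf, \
                zf.open(csv_name) as src, \
                gzip.open(out_path, "wb") as dst:
            src.readline()        # drop the first line (the headers)
            for line in src:      # stream the rest without loading 10 GB
                dst.write(line)
```

This is just the sequential logic; my actual question is how to express the fan-out over the 160 files inside NiFi itself.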