Mark, I'm far from seasoned, but I'll take a swing at it to check my understanding (or lack thereof). I'd break the task into two parts:
Identify and move the files to a staging location, then process the zip files from that staging location.

Flow1: Run a cron-driven *GenerateFlowFile* processor to start the process every 24 hours, shortly after 8 AM -> *ExecuteStreamCommand* to run your bash script that streams the 160 filenames of interest (a rough sketch of such a script is at the bottom of this mail) -> *SplitText* to generate a new flow file for each zip filename. That can be routed into a *DistributeLoad* processor, which will distribute the flow files to *ExtractText* processors to extract the filename and path from each flow file's content, then pass to *UpdateAttribute* so that the filename and path are accessible via NiFi Expression Language. Finally, pass each flow file to *ExecuteStreamCommand* (cp /${path_attribute}/${filename} /location2/${filename}); this copies the zip file to another directory (location2), keeping the files at the source for other users.

Flow2: *GetFile* from location2 -> *UnpackContent* -> *RouteOnAttribute* (to select the CSV of interest and discard the rest) -> *ExecuteStreamCommand* (sed '1d') to remove the header -> *CompressContent* -> *PutHDFS*

Hope this helps, and I hope this isn't too far off.

Thanks,
Lee

On Sat, Oct 24, 2015 at 10:25 PM, Mark Petronic <markpetro...@gmail.com>
wrote:
> Reading some other posts, I stumbled on this JIRA [1] which seems to
> directly relate to my question in this post.
>
> [1] https://issues.apache.org/jira/browse/NIFI-631
>
> On Sat, Oct 24, 2015 at 11:44 PM, Mark Petronic <markpetro...@gmail.com>
> wrote:
> > So, I stumbled onto NiFi at a Laurel, MD Spark meetup and was pretty
> > excited about using it. I'm running HDP and need to construct an
> > ETL-like flow and would like to try to start, as a new user to NiFi,
> > using a "best practice" approach. Wondering if some of you more
> > seasoned users might provide some thoughts on my problem?
> >
> > 1. 160 zip files/day show up on an NFS share in various
> > subdirectories, and their filenames contain the yyyymmddHHMMSS of
> > when the stats were generated.
> > 2. Each zip file contains 4 or more large CSV files.
> > 3. I need just one of those CSVs from each zip file each day, and
> > they all add up to about 10GB uncompressed.
> > 4. I need to extract that one file from each zip, strip off the
> > first line (the headers), and store it in HDFS compressed again
> > using gzip or snappy.
> > 5. I cannot delete the NFS file after the copy to HDFS because
> > others need access to it for some time.
> >
> > So, where I am having a hard time visualizing doing this in NiFi is
> > with the first step. I need to scan the NFS files after 8 AM every
> > day (when I know all files for the previous 24 hours will be
> > present), find the set of files for that day using the yyyymmdd
> > part of the filenames, then perform the extract of the one file I
> > need and process it into HDFS.
> >
> > I could imagine a processor that runs once every 24 hours on a cron
> > schedule. I could imagine running an ExecuteProcess processor
> > against a bash script to get the list of all the files that match
> > the yyyymmdd. Then I get stuck. How do I take this list of 160 file
> > paths and start the job of processing each one of them in parallel
> > to run the ETL flow?
> >
> > Thanks in advance for any ideas
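P.S. Since the listing script does most of the heavy lifting in Flow1, here is a minimal sketch of the kind of script *ExecuteStreamCommand* could run. The mount point, directory layout, and date handling are my assumptions, not details from your post; the only real requirement is that it prints one zip path per line:

#!/usr/bin/env bash
# Hypothetical listing script for Flow1's ExecuteStreamCommand step.
# NFS_ROOT and the "yesterday" logic are assumptions; adjust to the real share.
# Prints one zip path per line so SplitText can make one flow file per path.

NFS_ROOT=/mnt/nfs/stats            # assumed NFS mount point
DAY=$(date -d yesterday +%Y%m%d)   # filenames embed yyyymmddHHMMSS

# Find the previous day's zips in all subdirectories of the share.
find "$NFS_ROOT" -type f -name "*${DAY}*.zip"

With *SplitText*'s Line Split Count set to 1, each printed path becomes its own flow file, which is what lets *DistributeLoad* fan the 160 copies out in parallel.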