These have been invaluable insights, Mark. Thank you very much for your help. -Jim
On Tue, Jan 10, 2017 at 2:13 PM, Mark Payne <marka...@hotmail.com> wrote:

> Jim,
>
> Off the top of my head, I don't remember the reason for two dates,
> specifically. I think it may have had to do with ensuring that if we run
> at time X, we could potentially pick up a file that also has a timestamp
> of X. Then, one or more additional files could come in at time X after
> the processor finished running. If we only looked at the one timestamp,
> we could miss those files that came in later but within the same second
> or millisecond, or whatever granularity your operating system provides
> for file modification times. Someone else on the list may have more
> insight into the exact meaning of the two timestamps, as I didn't come
> up with the algorithm.
>
> Yes, the ListFile processor will scan through the directory each time
> that it runs to find any new files. I would recommend that you not
> schedule ListFile to run with the default "0 sec" run schedule, but
> instead set it to something like "1 min", or however often you can
> afford/need to. I believe that if it is scheduled to run too frequently,
> it will actually yield itself, which would cause it to 'pause' for
> 1 second (by default; this is configured in the Settings for the
> Processor as well).
>
> The files that you mention there are simply the internals of the
> Write-Ahead Log. When the WAL is updated, it picks a partition to write
> the update to (the partition directories) and appends to whichever
> journal file it is currently writing to. If we did this forever, those
> files would grow indefinitely, and aside from running out of disk space,
> restarting NiFi would take ages. So periodically (by default, every
> 2 minutes), the WAL is checkpointed.
>
> When this happens, it creates the 'snapshot' file, writes the current
> state of the system to that file, and then starts a new journal file for
> each partition. So there's a 'snapshot' file that is a snapshot of the
> system state, and then the journal files that indicate a series of
> changes to apply to the snapshot to get back to the most recent state.
>
> You may occasionally see some other files, such as multiple journal
> files, snapshot.part files, etc.; these are temporary artifacts
> generated to provide better performance and to ensure reliability across
> system crashes and restarts.
>
> The wali.lock file is simply there to ensure that we don't start NiFi
> twice and have two different processes trying to write to those files at
> the same time.
>
> Hope this helps!
>
> Thanks
> -Mark
>
>
> On Jan 10, 2017, at 10:01 AM, James McMahon <jsmcmah...@gmail.com> wrote:
>
> Thank you very much, Mark. This is very helpful. Can I ask you just a
> few quick follow-up questions in an effort to better understand?
>
> How does NiFi use those two dates? It seems that the timestamp of the
> last listing would be sufficient to permit NiFi to identify newly
> received content. Why is it necessary to also maintain the timestamp of
> the most recent file it has sent out?
>
> How does NiFi quickly determine which files throughout the nested
> directory structure were received after the last date it logged? Is it
> scanning through listings of all the associated directories, flagging
> for processing those files with later dates?
>
> I looked more closely at my ./state/local directory and subdirectories.
> Can you offer a few words about the purpose of each of the following?
> * the snapshot file
> * the wali.lock file
> * the partition[0-15] subdirectories, each of which appears to own a
>   journal file
> * the journal files
>
> Where are the dates you referenced?
>
> Thank you again for your insights.
>
> On Tue, Jan 10, 2017 at 8:51 AM, Mark Payne <marka...@hotmail.com> wrote:
>
>> Hi Jim,
>>
>> ListFile does not maintain a list of files with datetime stamps.
>> Instead, it stores just two timestamps: the timestamp of when a listing
>> was last performed, and the timestamp of the newest file that it has
>> sent out. This is done precisely because we need it to be able to scale
>> as the input becomes large.
>>
>> Where this information is stored depends on a couple of things.
>> ListFile has a property named "Input Directory Location." If that is
>> set to "Remote" and the NiFi instance is clustered, then this
>> information is stored in ZooKeeper. This allows the Processor to run on
>> the Primary Node only, and if a new node is elected Primary, it is able
>> to pick up where the previous Primary Node left off.
>>
>> If the Input Directory Location is set to "Local" (or if NiFi is not
>> clustered), then the state will be stored by the Local State Manager,
>> which is backed by a write-ahead log. By default it is written to
>> ./state/local, but this can be configured in conf/state-management.xml.
>> So if you want to be really sure that you don't lose the information,
>> you could potentially change the location to some place that has a RAID
>> configuration for redundancy.
>>
>> Thanks
>> -Mark
>>
>>
>> > On Jan 10, 2017, at 8:38 AM, James McMahon <jsmcmah...@gmail.com>
>> > wrote:
>> >
>> > I am using ListFile followed by FetchFile to recurse and detect new
>> > files that show up in a large nested directory structure that grows
>> > over time. I need to better understand how this approach scales. What
>> > are the practical and performance limitations of using this tandem of
>> > processors to feed new files to NiFi? If anyone has used this
>> > approach in a large-scale data environment to manage new content for
>> > NiFi, I would welcome your thoughts.
>> >
>> > Where does ListFile maintain its list of files with datetime stamps?
>> > Does this get persisted as a hash map in memory? Is it also persisted
>> > into one of the NiFi repositories as a backup? My concern is avoiding
>> > having to reprocess the entire directory structure should that list
>> > ever get lost or destroyed.
>> >
>> > Thank you in advance once again for your assistance. -Jim
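
A minimal sketch, in Java, of the two-timestamp listing idea Mark
describes above. This is not NiFi's actual ListFile code: the names
(TwoTimestampLister, newestEmitted, emittedAtBoundary) are invented, it
scans a single flat directory, and where Mark's description resolves ties
using the time of the last listing, this sketch remembers the file names
seen at the boundary timestamp instead. The point it illustrates is the
same: a file whose modification time equals the newest timestamp already
emitted may still be new, because it can arrive after a run within the
same second or millisecond, so files at that boundary cannot simply be
skipped.

    import java.io.File;
    import java.util.ArrayList;
    import java.util.HashSet;
    import java.util.List;
    import java.util.Set;

    // Hypothetical illustration of two-timestamp-style listing; this is
    // not NiFi's ListFile source.
    public class TwoTimestampLister {

        // Newest modification time among files already emitted.
        private long newestEmitted = Long.MIN_VALUE;
        // Names of files already emitted at exactly that boundary timestamp.
        private final Set<String> emittedAtBoundary = new HashSet<>();

        public List<File> listNewFiles(final File dir) {
            final List<File> newFiles = new ArrayList<>();
            final File[] files = dir.listFiles();
            if (files == null) {
                return newFiles; // not a directory, or an I/O error
            }

            for (final File f : files) {
                final long mod = f.lastModified();
                if (mod > newestEmitted) {
                    // Strictly newer than anything sent out: definitely new.
                    newFiles.add(f);
                } else if (mod == newestEmitted
                        && !emittedAtBoundary.contains(f.getName())) {
                    // Same timestamp as the newest file already sent, but
                    // not yet emitted: it arrived after the previous run,
                    // within the same clock tick. A single "last run"
                    // timestamp alone would silently miss this file.
                    newFiles.add(f);
                }
            }

            // Advance the boundary and remember which names were seen at it.
            for (final File f : newFiles) {
                newestEmitted = Math.max(newestEmitted, f.lastModified());
            }
            emittedAtBoundary.clear();
            for (final File f : files) {
                if (f.lastModified() == newestEmitted) {
                    emittedAtBoundary.add(f.getName());
                }
            }
            return newFiles;
        }
    }

Note that the sketch keeps its state only in memory; as Mark explains,
NiFi persists the equivalent state through the Local State Manager's
write-ahead log (or ZooKeeper when clustered), so a restart does not force
a full re-listing.

In the same spirit, here is a toy model of the snapshot-plus-journal cycle
Mark describes for the write-ahead log under ./state/local. TinyWal and
its single journal file are invented for illustration and are far simpler
than the real wali library, which uses multiple partitions (each with its
own journal) and a wali.lock file; the checkpoint-via-a-".part"-file step
is an assumption about why transient snapshot.part files appear.

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Path;
    import java.nio.file.StandardCopyOption;
    import java.nio.file.StandardOpenOption;
    import java.util.ArrayList;
    import java.util.Collections;
    import java.util.List;

    // Toy snapshot-plus-journal write-ahead log; not the real wali code.
    public class TinyWal {

        private final Path journal;   // appended to on every update
        private final Path snapshot;  // full state as of the last checkpoint

        public TinyWal(final Path journal, final Path snapshot) {
            this.journal = journal;
            this.snapshot = snapshot;
        }

        // Normal operation: append each update to the journal. Done forever,
        // this file grows without bound and restart/replay takes ages.
        public void logUpdate(final String update) throws IOException {
            Files.write(journal, Collections.singletonList(update),
                    StandardOpenOption.CREATE, StandardOpenOption.APPEND);
        }

        // Periodic checkpoint (NiFi's default is every 2 minutes): write the
        // full current state to a temporary ".part" file, swap it in as the
        // snapshot, then start over with an empty journal.
        public void checkpoint(final List<String> currentState) throws IOException {
            final Path part = snapshot.resolveSibling(snapshot.getFileName() + ".part");
            Files.write(part, currentState);
            Files.move(part, snapshot, StandardCopyOption.REPLACE_EXISTING);
            Files.deleteIfExists(journal); // its entries are now in the snapshot
        }

        // Recovery after a restart: load the snapshot, then replay whatever
        // journal entries accumulated since the last checkpoint.
        public List<String> recover() throws IOException {
            final List<String> state = new ArrayList<>();
            if (Files.exists(snapshot)) {
                state.addAll(Files.readAllLines(snapshot));
            }
            if (Files.exists(journal)) {
                state.addAll(Files.readAllLines(journal));
            }
            return state;
        }
    }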