These have been invaluable insights Mark. Thank you very much for your
help. -Jim

On Tue, Jan 10, 2017 at 2:13 PM, Mark Payne <marka...@hotmail.com> wrote:

> Jim,
>
> Off the top of my head, I don't remember the exact reason for two dates.
> I think it had to do with ensuring that if we run at time X, we could
> pick up a file that also has a timestamp of X, and then have one or more
> additional files come in at time X after the processor finished running.
> If we only kept the one timestamp, we could miss those later files that
> arrived during the same second or millisecond, or whatever precision your
> operating system provides for file modification times. Someone else on
> the list may have more insight into the exact meaning of the two
> timestamps, as I didn't come up with the algorithm.
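>
> To make the idea concrete, here is a rough, self-contained Java sketch of
> one way two timestamps can be used. This is NOT the actual ListFile
> source; the one-second MTIME_PRECISION_MS assumption and the "would emit"
> stand-in are made up purely for illustration:
>
>     import java.io.IOException;
>     import java.nio.file.*;
>     import java.util.stream.Collectors;
>     import java.util.stream.Stream;
>
>     public class TwoTimestampListing {
>         static final long MTIME_PRECISION_MS = 1000; // assume 1s mtime precision
>
>         // State that a real processor would persist between runs.
>         static long previousListingTime = 0;
>         static long newestEmittedMtime = 0;
>
>         public static void main(String[] args) throws IOException {
>             Path inputDirectory = Paths.get(args.length > 0 ? args[0] : ".");
>             long thisListingTime = System.currentTimeMillis();
>             // Anything newer than this cutoff is "too fresh" to emit now:
>             // more files could still arrive with the same coarse mtime.
>             long freshnessCutoff = thisListingTime - MTIME_PRECISION_MS;
>
>             try (Stream<Path> paths = Files.walk(inputDirectory)) {
>                 for (Path p : paths.filter(Files::isRegularFile).collect(Collectors.toList())) {
>                     long mtime = Files.getLastModifiedTime(p).toMillis();
>                     if (mtime <= newestEmittedMtime) continue; // emitted on an earlier run
>                     if (mtime > freshnessCutoff) continue;     // defer to the next run
>                     System.out.println("would emit: " + p);    // stand-in for creating a FlowFile
>                     newestEmittedMtime = Math.max(newestEmittedMtime, mtime);
>                 }
>             }
>             // Persisting the listing time as well tells the next run (or a
>             // restarted node) which window of mtimes was already covered.
>             previousListingTime = thisListingTime;
>         }
>     }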
>
> Yes, the ListFile processor will scan through the directory each time it
> runs to find any new files. I would recommend that you not schedule
> ListFile with the default "0 sec" run schedule but instead set it to
> something like "1 min", or however often you can afford/need to. I
> believe that if it is scheduled to run too frequently, it will actually
> yield itself, which causes it to 'pause' for 1 second (by default; this
> is also configured in the Processor's Settings).
>
> The files that you mention there are simply the internals of the
> Write-Ahead Log. When the WAL is updated, it picks a partition to write
> the update to (the partition directories) and appends to whichever
> journal file it is currently writing to. If we did this forever, those
> files would grow indefinitely and, aside from running out of disk space,
> restarting NiFi would take ages. So periodically (by default, every 2
> minutes), the WAL is checkpointed.
>
> When this happens, it creates the 'snapshot' file, writes the current
> state of the system to it, and then starts a new journal file for each
> partition. So there is a 'snapshot' file that captures the system state,
> and then the journal files that record the series of changes to apply on
> top of the snapshot to get back to the most recent state.
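>
> If it helps to picture the cycle, here is a small conceptual sketch in
> Java. This is not NiFi's actual write-ahead log implementation, just the
> shape of the journal/snapshot/checkpoint idea described above:
>
>     import java.util.*;
>
>     public class MiniWal {
>         private final Map<String, String> snapshot = new HashMap<>();     // last checkpointed state
>         private final List<List<String[]>> journals = new ArrayList<>();  // one journal per partition
>
>         public MiniWal(int partitions) {
>             for (int i = 0; i < partitions; i++) journals.add(new ArrayList<>());
>         }
>
>         // Each update is appended to one partition's current journal.
>         public void update(String key, String value) {
>             int partition = Math.floorMod(key.hashCode(), journals.size());
>             journals.get(partition).add(new String[] { key, value });
>         }
>
>         // Checkpoint: fold every journaled change into the snapshot, then
>         // start fresh journals so they never grow without bound. (The real
>         // repository writes this out to the 'snapshot' file on disk and
>         // opens new journal files in the partition directories.)
>         public void checkpoint() {
>             for (List<String[]> journal : journals) {
>                 for (String[] change : journal) snapshot.put(change[0], change[1]);
>                 journal.clear();
>             }
>         }
>
>         // Recovery after a restart: start from the snapshot, then replay
>         // whatever is still in the journals.
>         public Map<String, String> recover() {
>             Map<String, String> state = new HashMap<>(snapshot);
>             for (List<String[]> journal : journals)
>                 for (String[] change : journal) state.put(change[0], change[1]);
>             return state;
>         }
>     }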
>
> You may occasionally see some other files, such as multiple journal files,
> snapshot.part files, etc. that are temporary
> artifacts generated in order to provide better performance and ensure
> reliability across system crashes/restarts.
>
> The wali.lock is simply there to ensure that we don't start NiFi twice and
> have 2 different processes trying to write to
> those files at the same time.
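>
> For what that looks like in practice, here is a minimal, hypothetical
> sketch of the kind of guard a lock file provides (illustrative only, not
> NiFi's code): the second process fails to acquire the lock and can refuse
> to start rather than risk corrupting the repository.
>
>     import java.io.File;
>     import java.io.RandomAccessFile;
>     import java.nio.channels.FileLock;
>
>     public class SingleWriterGuard {
>         public static void main(String[] args) throws Exception {
>             File lockFile = new File("./state/local/wali.lock");
>             RandomAccessFile raf = new RandomAccessFile(lockFile, "rw");
>             FileLock lock = raf.getChannel().tryLock();  // null if another process holds it
>             if (lock == null) {
>                 System.err.println("Write-Ahead Log is already in use; refusing to start.");
>                 System.exit(1);
>             }
>             // Hold the lock (and the open file) for the life of the process;
>             // it is now safe to write to the journal/snapshot files.
>         }
>     }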
>
> Hope this helps!
>
> Thanks
> -Mark
>
>
> On Jan 10, 2017, at 10:01 AM, James McMahon <jsmcmah...@gmail.com> wrote:
>
> Thank you very much Mark. This is very helpful. Can I ask you just a few
> quick follow-up questions in an effort to better understand?
>
> How does NiFi use those two dates? It seems that the timestamp of last
> listing would be sufficient to permit NiFi to identify newly received
> content. Why is it necessary to maintain the timestamp of the most recent
> file it has sent out?
>
> How does NiFi quickly determine which files throughout the nested
> directory structure were received after the last date it logged? Is it
> scanning through listings of all the associated directories flagging for
> processing those files with later dates?
>
> I looked more closely at my ./state/local directory and subdirectories.
> Can you offer a few words about the purpose of each of the following?
> * file snapshot
> * file wali.lock
> * the partition[0-15] subdirectories, each of which appears to own a
> journal file
> * the journal file
> Where are the dates you referenced?
>
> Thank you again for your insights.
>
> On Tue, Jan 10, 2017 at 8:51 AM, Mark Payne <marka...@hotmail.com> wrote:
>
>> Hi Jim,
>>
>> ListFile does not maintain a list of files with datetime stamps.
>> Instead, it stores just two timestamps: the timestamp of when a listing
>> was last performed, and the timestamp of the newest file that it has
>> sent out. This is done precisely because we need it to scale as the
>> input becomes large.
>>
>> Where this information is stored depends on a couple of things. ListFile
>> has a property named "Input Directory Location." If that is set to
>> "Remote" and the NiFi instance is clustered, then this information is
>> stored in ZooKeeper. This allows the Processor to run on the Primary
>> Node only, and if a new node is elected Primary, it is able to pick up
>> where the previous Primary Node left off.
>>
>> If the Input Directory Location is set to "Local" (or if NiFi is not
>> clustered), then the state is stored in the local State Manager, which
>> is backed by a write-ahead log. By default it is written to
>> ./state/local, but this can be configured in conf/state-management.xml.
>> So if you want to be really sure that you don't lose the information,
>> you could change the location to some place that has a RAID
>> configuration for redundancy.
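>>
>> If you ever write a custom processor and want to see where such state
>> goes, the standard mechanism is the StateManager API. The sketch below
>> is only an illustration -- the key names are made up -- but Scope.CLUSTER
>> vs. Scope.LOCAL is exactly the Remote-vs-Local distinction above
>> (ZooKeeper vs. the local write-ahead log under ./state/local):
>>
>>     import java.io.IOException;
>>     import java.util.HashMap;
>>     import java.util.Map;
>>     import org.apache.nifi.components.state.Scope;
>>     import org.apache.nifi.components.state.StateManager;
>>     import org.apache.nifi.components.state.StateMap;
>>
>>     class ListingStateExample {
>>         // In a processor, the StateManager typically comes from the ProcessContext.
>>         void saveListingState(StateManager stateManager, boolean remoteDirectory,
>>                               long lastListingTime, long newestSentTimestamp) throws IOException {
>>             Scope scope = remoteDirectory ? Scope.CLUSTER : Scope.LOCAL;
>>             Map<String, String> state = new HashMap<>();
>>             state.put("last.listing.time", Long.toString(lastListingTime));
>>             state.put("newest.sent.timestamp", Long.toString(newestSentTimestamp));
>>             stateManager.setState(state, scope);
>>         }
>>
>>         long readNewestSentTimestamp(StateManager stateManager, Scope scope) throws IOException {
>>             StateMap stateMap = stateManager.getState(scope);
>>             String value = stateMap.get("newest.sent.timestamp");
>>             return value == null ? 0L : Long.parseLong(value);
>>         }
>>     }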
>>
>> Thanks
>> -Mark
>>
>>
>> > On Jan 10, 2017, at 8:38 AM, James McMahon <jsmcmah...@gmail.com>
>> wrote:
>> >
>> > I am using ListFile followed by FetchFile to recurse and detect new
>> files that show up in a large nested directory structure that grows over
>> time. I need to better understand how this approach scales. What are the
>> practical and the performance limitations to using this tandem of
>> processors for feeding new files to NiFi? If anyone has used this approach
>> in a large-scale data environment to manage new content to NiFi, I would
>> welcome your thoughts.
>> >
>> > Where does ListFile maintain its list of files with datetime stamps?
>> Does this get persisted as a hash map in memory? Is it also persisted into
>> one of the NiFi repositories as a backup? My concern is avoiding having to
>> reprocess the entire directory structure should that list ever get lost or
>> destroyed.
>> >
>> > Thank you in advance once again for your assistance. -Jim
>>
>>
>
>
