Thank you very much Mark. This is very helpful. Can I ask you just a few
quick follow-up questions in an effort to better understand?

How does NiFi use those two dates? It seems that the timestamp of last
listing would be sufficient to permit NiFi to identify newly received
content. Why is it necessary to maintain the timestamp of the most recent
file it has sent out?

How does NiFi quickly determine which files throughout the nested directory
structure were received after the last date it logged? Is it scanning
through listings of all the associated directories flagging for processing
those files with later dates?
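
To make that second question concrete, here is a minimal Python sketch of
the kind of scan I am imagining. It is only my guess at the general
approach, not NiFi's actual implementation. The function name
`list_new_files` and the `state` dictionary with its two keys are
hypothetical, chosen just to mirror the two timestamps described in the
quoted reply below.

```python
import os
import time


def list_new_files(root, state):
    """Return paths under root modified after state['newest_emitted'].

    state is a hypothetical dict holding two epoch-second timestamps:
    'last_listing' (when a listing last ran) and 'newest_emitted'
    (modification time of the newest file already sent out).
    """
    cutoff = state.get("newest_emitted", 0.0)
    found = []
    # Walk every directory; no per-file history is kept between runs.
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            mtime = os.path.getmtime(path)
            if mtime > cutoff:
                found.append((mtime, path))
    found.sort()  # oldest first
    state["last_listing"] = time.time()
    if found:
        state["newest_emitted"] = found[-1][0]
    return [path for _mtime, path in found]
```

In this sketch only the two timestamps survive between runs, so the stored
state stays the same size no matter how many files exist, but every run
still walks the whole tree.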

I looked more closely at my ./state/local directory and subdirectories. Can
you offer a few words about the purpose of each of the following?
* file snapshot
* file wali.lock
* the partition[0-15] subdirectories, each of which appears to own a
journal file
* the journal file
Where are the dates you referenced?

Thank you again for your insights.

On Tue, Jan 10, 2017 at 8:51 AM, Mark Payne <marka...@hotmail.com> wrote:

> Hi Jim,
>
> ListFile does not maintain a list of files w/ datetime stamps. Instead, it
> stores just two timestamps:
> the timestamp of when a listing was last performed, and the timestamp of
> the newest file that it has
> sent out. This is done precisely because we need it to be able to scale as
> the input becomes large.
>
> The location of where this information is stored depends on a couple of
> things. ListFile has a property named
> "Input Directory Location." If that is set to "Remote" and the NiFi
> instance is clustered, then this information is
> stored in ZooKeeper. This allows the Processor to run on Primary Node only
> and if a new node is elected Primary,
> then it is able to pick up where the previous Primary Node left off.
>
> If the Input Directory Location is set to "Local" (or if NiFi is not
> clustered) then the state will be stored to the Local
> State manager, which is backed by a write-ahead log. By default it is
> written to ./state/local but this can be configured
> in the conf/state-management.xml. So if you want to be really sure that
> you don't lose the information, you could
> potentially change the location to some place that has a RAID
> configuration for redundancy.
>
> Thanks
> -Mark
>
>
> > On Jan 10, 2017, at 8:38 AM, James McMahon <jsmcmah...@gmail.com> wrote:
> >
> > I am using ListFile followed by FetchFile to recurse and detect new
> > files that show up in a large nested directory structure that grows over
> > time. I need to better understand how this approach scales. What are the
> > practical and the performance limitations to using this tandem of
> > processors for feeding new files to NiFi? If anyone has used this approach
> > in a large-scale data environment to manage new content to NiFi, I would
> > welcome your thoughts.
> >
> > Where does ListFile maintain its list of files with datetime stamps?
> > Does this get persisted as a hash map in memory? Is it also persisted into
> > one of the NiFi repositories as a backup? My concern is avoiding having to
> > reprocess the entire directory structure should that list ever get lost or
> > destroyed.
> >
> > Thank you in advance once again for your assistance. -Jim
>
>
