Hi guys,

I have the following situation:

There is an SMB-mounted folder on one NiFi worker, and it has many subfolders
with nested subfolders (the depth of nesting is not known in advance).
Whenever a new file appears in that directory tree, or an existing file is
moved or copied, its modification timestamp changes. This is achieved
with some other tools, configs, etc.

What is the best way to list new/updated/new-old files with NiFi, given that
there is an enormous number of subfolders and files in them (let's
say millions)?

With ListFile using 'Tracking Entities', I am concerned about the
following:


  *   How much I/O does ListFile generate if it constantly checks the directory
tree and all files?
It can be CRON-scheduled so that it does not run all the time, but if it is
not, how does ListFile do that in the background?
  *   How big is the cache of listed entities, and what is actually stored in
the cache: just metadata, or something more?
  *   What if the cache is not persisted and a NiFi restart occurs? Will the
cache be inconsistent?
If it is persisted, what happens if a restart occurs at the moment when
ListFile is checking new/old entities and their size, name, etc.?
What is the interval for persisting the cache, and is it related to the
snapshots NiFi takes at configured intervals?
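
To make the cache-size question concrete, here is a rough back-of-the-envelope
sketch in Python. It assumes that each tracked entity holds only a path, a
timestamp and a size, serialized as JSON. That entry layout is my guess, not
something I have confirmed for NiFi, and `estimate_cache_bytes` and the average
path length of 120 characters are made up for illustration.

```python
import json


def estimate_cache_bytes(num_files, avg_path_len=120):
    """Rough uncompressed size of a metadata-only cache serialized as JSON."""
    # Hypothetical entry layout: path -> {timestamp, size}.
    # This is an assumption for the estimate, not NiFi's confirmed format.
    sample_path = "/" + "x" * (avg_path_len - 1)
    entry = {sample_path: {"timestamp": 1700000000000, "size": 1048576}}
    per_entry = len(json.dumps(entry).encode("utf-8"))
    return per_entry * num_files


if __name__ == "__main__":
    for n in (100_000, 1_000_000, 5_000_000):
        megabytes = estimate_cache_bytes(n) / 1e6
        print(f"{n:>9} files -> ~{megabytes:.0f} MB uncompressed")
```

If the real entries carry more than metadata, the numbers would of course be
larger, which is why I am asking what is actually stored.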

What is the best and most efficient way to do this? Maybe use some extra tools
or engines to find the difference and to
persist the last known state, such as Elastic or some DB?
Or construct the list of paths that need to be fetched using some Python
scripts?
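
To illustrate the Python-script idea, here is a minimal sketch of what I have
in mind. It walks the tree with `os.scandir`, compares each file's modification
time and size against the last known state, and returns the paths that are new
or changed. The function names and the JSON state file are hypothetical, and
keeping the whole state in one JSON file is only an assumption to keep the
example short; with millions of files a DB would probably be a better fit.

```python
import json
import os


def scan_tree(root):
    """Yield (path, mtime_ns, size) for every regular file under root."""
    stack = [root]
    while stack:
        current = stack.pop()
        try:
            with os.scandir(current) as entries:
                for entry in entries:
                    try:
                        if entry.is_dir(follow_symlinks=False):
                            stack.append(entry.path)
                        elif entry.is_file(follow_symlinks=False):
                            st = entry.stat(follow_symlinks=False)
                            yield entry.path, st.st_mtime_ns, st.st_size
                    except OSError:
                        # Entry vanished or is unreadable; skip it.
                        continue
        except OSError:
            # Directory vanished or is unreadable; skip it.
            continue


def load_state(state_file):
    """Return the last known state, or an empty dict on the first run."""
    try:
        with open(state_file, "r", encoding="utf-8") as fh:
            return json.load(fh)
    except FileNotFoundError:
        return {}


def save_state(state, state_file):
    """Write to a temp file first, then rename, so a crash mid-write
    does not leave a half-written state file behind."""
    tmp = state_file + ".tmp"
    with open(tmp, "w", encoding="utf-8") as fh:
        json.dump(state, fh)
    os.replace(tmp, state_file)


def find_changed(root, state_file):
    """Return paths that are new or whose mtime/size changed since last run."""
    previous = load_state(state_file)
    current = {}
    changed = []
    for path, mtime_ns, size in scan_tree(root):
        current[path] = [mtime_ns, size]
        if previous.get(path) != current[path]:
            changed.append(path)
    save_state(current, state_file)
    return changed
```

A call such as `find_changed("/mnt/share", "/var/tmp/share_state.json")` (both
paths are placeholders) would still have to touch every directory on the SMB
mount on each run, so it does not remove the I/O concern; it only moves the
state tracking outside NiFi.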

The constraint here is the shared (mounted) folders: even if the modification
date is changed for every new/updated file,
how can a big directory tree be monitored efficiently, or how can ListFile
(the NiFi flow) be triggered efficiently to fetch new/old-new files?

In the case of 'Tracking Entities', maybe having a separate standalone NiFi
instance on a separate server, with a configured CacheServer
to serve as the cache, is not a bad idea?

Thanks in advance,

Tom


