[ 
https://issues.apache.org/jira/browse/NIFI-12595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

endzeit reassigned NIFI-12595:
------------------------------

    Assignee: endzeit

> Introduce an "Entity Tracking Mode" and "Track Last Listing Time" to 
> ListedEntityTracker
> ----------------------------------------------------------------------------------------
>
>                 Key: NIFI-12595
>                 URL: https://issues.apache.org/jira/browse/NIFI-12595
>             Project: Apache NiFi
>          Issue Type: New Feature
>    Affects Versions: 1.24.0
>            Reporter: endzeit
>            Assignee: endzeit
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> h1. Situation
> The existing {{ListX}} processors support different "Listing Strategies". One 
> commonly used "Listing Strategy" is "Tracking Entities", whereby key 
> information about all recently listed entities, e.g. files, is remembered.
> On every listing, if information about an entity has already been 
> remembered, the entity is not listed again (unless it was modified).
> This has several benefits over other available "Listing Strategies". For 
> example, unlike with "No Tracking", the same entity is not listed repeatedly.
> Unlike with "Tracking Timestamps", entities with an older timestamp than ones 
> previously listed can be picked up.
> However, the strategy comes with its own problems.
> h1. Problem
> Because available memory and performance are always constrained, 
> entities cannot be tracked indefinitely.
> That's why the {{ListedEntityTracker}}, used for implementing "Tracking 
> Entities" by most processors, introduces the notion of an "Entity Tracking 
> Time Window".
> All remembered entities that are outside the time window (they are older than 
> the current time minus the time window) are removed from the tracking cache 
> to limit memory use. Additionally, not yet listed entities that are outside 
> the time window are excluded from listing, as they would be removed from the 
> "cache" immediately on the next run, resulting in them being listed over and 
> over.
> However, this results in entities "older" than the specified "Entity Tracking 
> Time Window" not being picked up. For example, suppose entities are listed 
> from a remote server and this server is unavailable for some time. Once the 
> server is available again, the listing continues. However, all entities / 
> files that were created before the start of the time window will be silently 
> ignored.
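
To make the behavior described above concrete, here is a minimal sketch of the timestamp-based window. It is not the actual {{ListedEntityTracker}} code: the class name, the {{Entity}} type, and the "modified means newer timestamp" check are assumptions made for illustration; only {{minTimestampToList}} is a name taken from the issue text.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the current ("Track Entity Timestamp") behavior.
class TimestampWindowSketch {

    // Hypothetical stand-in for a listed entity, e.g. a remote file.
    static final class Entity {
        final String id;
        final long timestamp;

        Entity(String id, long timestamp) {
            this.id = id;
            this.timestamp = timestamp;
        }
    }

    // Maps entity id to the entity timestamp seen at its last listing.
    private final Map<String, Long> cache = new HashMap<>();
    private final long windowMillis;

    TimestampWindowSketch(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    List<Entity> list(List<Entity> available, long now) {
        final long minTimestampToList = now - windowMillis;

        // Remembered entities outside the window are dropped to limit memory use.
        cache.values().removeIf(ts -> ts < minTimestampToList);

        List<Entity> listed = new ArrayList<>();
        for (Entity e : available) {
            if (e.timestamp < minTimestampToList) {
                continue; // outside the window: silently ignored
            }
            Long known = cache.get(e.id);
            // Assumption: "modified" is approximated by a newer timestamp.
            if (known == null || known < e.timestamp) {
                cache.put(e.id, e.timestamp);
                listed.add(e);
            }
        }
        return listed;
    }
}
```

With a window of 1000 ms and a listing at time 5000, an entity with timestamp 100 is skipped on every run, however long it has been waiting on the server.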
> As of now, this can be solved by manual intervention, i.e. by restarting the 
> {{ListX}} processor: the "Entity Tracking Time Window" is ignored upon 
> initial listing when the "Entity Tracking Initial Listing Target" is set to 
> "All Available" (default).
> However, this requires the NiFi user to be aware of lingering old entities 
> on the connected remote source. Additionally, the need for manual 
> intervention might be undesirable / impractical when a large number of 
> sources are connected.
> Alternatively, the "Entity Tracking Time Window" can be increased to account 
> for longer time frames. However, this only mitigates the problem and does 
> not solve it. There is also a limit to this approach, as a larger window 
> increases the memory needed.
> h1. Proposal
> This issue proposes introducing the notion of an "Entity Tracking Mode", 
> whereby the current behavior could be understood as "Track Entity Timestamp".
> A new mode, "Track Last Listing Time", is added. Unlike the existing 
> "Track Entity Timestamp" mode, this would not impose any prerequisites on the 
> entities regarding their timestamp (see {{minTimestampToList}}). Instead, all 
> entities would be considered.
> However, this mode needs a way to limit / clean the entity cache as well. 
> Instead of measuring the time window by the timestamp of the entity, the mode 
> should remember the last time the entity was tracked, i.e. was part of a call 
> to "listEntities" in "trackEntities". In other words, every time an entity is 
> listed, its cache entry is renewed. After every listing, only the cache 
> entries that have been updated within the time window are kept. All other 
> entries, for entities that have not been listed for longer than the time 
> window, are removed from the cache.
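
The proposed mode could be sketched as follows. This is an illustration under assumptions, not a proposed patch: the class, the {{CacheEntry}} type, and the field names are hypothetical, and "modified" is again approximated by a newer entity timestamp.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical, simplified model of the proposed "Track Last Listing Time" mode.
class LastListingTimeSketch {

    // Hypothetical stand-in for a listed entity, e.g. a remote file.
    static final class Entity {
        final String id;
        final long timestamp;

        Entity(String id, long timestamp) {
            this.id = id;
            this.timestamp = timestamp;
        }
    }

    // Hypothetical cache entry: entity timestamp plus the time it was last seen.
    static final class CacheEntry {
        final long entityTimestamp;
        final long lastListedAt;

        CacheEntry(long entityTimestamp, long lastListedAt) {
            this.entityTimestamp = entityTimestamp;
            this.lastListedAt = lastListedAt;
        }
    }

    private final Map<String, CacheEntry> cache = new HashMap<>();
    private final long windowMillis;

    LastListingTimeSketch(long windowMillis) {
        this.windowMillis = windowMillis;
    }

    List<Entity> list(List<Entity> available, long now) {
        List<Entity> listed = new ArrayList<>();
        for (Entity e : available) {
            // No minimum timestamp check: entities of any age are considered.
            CacheEntry known = cache.get(e.id);
            if (known == null || known.entityTimestamp < e.timestamp) {
                listed.add(e);
            }
            // Every sighting renews the cache entry, whether or not it was emitted.
            cache.put(e.id, new CacheEntry(e.timestamp, now));
        }
        // Keep only entries that were renewed within the time window.
        cache.values().removeIf(entry -> entry.lastListedAt < now - windowMillis);
        return listed;
    }

    int cacheSize() {
        return cache.size();
    }
}
```

In this sketch an old entity is listed once when it first appears, stays cached for as long as the source keeps returning it, and is evicted only after it has been absent from listings for longer than the window.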
> In case users want to limit a processor to only list entities up to a certain 
> age, most processors have support for this with a separate property already, 
> e.g. "Maximum File Age" in ListSFTP. 
> While this mode solves the problem of listing "old" entities, it comes with 
> its own downsides. Because the restriction on {{minTimestampToList}} is 
> lifted, more entities can be listed, potentially leading to long listing 
> times. Additionally, as with the existing "Track Entity Timestamp" mode, 
> there is no enforced upper limit on the number of cache entries. See 
> NIFI-12609 for a proposal that may address both problems.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)