[ https://issues.apache.org/jira/browse/NIFI-12595?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
endzeit reassigned NIFI-12595:
------------------------------

    Assignee: endzeit

> Introduce an "Entity Tracking Mode" and "Track Last Listing Time" to
> ListedEntityTracker
> ----------------------------------------------------------------------------------------
>
>                 Key: NIFI-12595
>                 URL: https://issues.apache.org/jira/browse/NIFI-12595
>             Project: Apache NiFi
>          Issue Type: New Feature
>    Affects Versions: 1.24.0
>            Reporter: endzeit
>            Assignee: endzeit
>            Priority: Major
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> h1. Situation
> The existing {{ListX}} processors support different "Listing Strategies". One
> commonly used "Listing Strategy" is "Tracking Entities", whereby crucial
> information about all recently listed entities, e.g. files, is remembered.
> On every listing, if information about an entity has been remembered
> before, the entity is not listed again (unless it was modified).
> This has several benefits over other available "Listing Strategies". For
> example, unlike with "No Tracking", the same entity is not listed repeatedly.
> Unlike with "Tracking Timestamps", entities with an older timestamp than ones
> previously listed can be picked up.
> However, the strategy comes with its own problems.
> h1. Problem
> Because available memory and performance are always limited,
> entities cannot be tracked indefinitely.
> That's why the {{ListedEntityTracker}}, used for implementing "Tracking
> Entities" by most processors, introduces the notion of an "Entity Tracking
> Time Window".
> All remembered entities that are outside the time window (they are older than
> the current time minus the time window) are removed from the tracking cache
> to limit memory use. Additionally, not yet listed entities that are outside
> the time window are exempt from listing, as they would be removed from the
> "cache" on the next run immediately, resulting in them being listed over and
> over.
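The time-window behavior described above can be sketched roughly as follows. This is a simplified illustration in Python, not NiFi's Java implementation: `Entity`, `list_by_entity_timestamp` and the dictionary-based cache are hypothetical names, "modified" is reduced to "timestamp changed", and the "Entity Tracking Initial Listing Target" option is left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    identifier: str
    timestamp: int  # last-modified time in milliseconds (assumed unit)


def list_by_entity_timestamp(cache, available, now, window):
    """Run one listing; `cache` maps identifier -> entity timestamp."""
    min_timestamp_to_list = now - window

    # Evict remembered entities whose own timestamp fell out of the window.
    for identifier in [k for k, ts in cache.items() if ts < min_timestamp_to_list]:
        del cache[identifier]

    listed = []
    for entity in available:
        if entity.timestamp < min_timestamp_to_list:
            continue  # too old: skipped silently, never listed
        if cache.get(entity.identifier) == entity.timestamp:
            continue  # already listed and unmodified
        cache[entity.identifier] = entity.timestamp
        listed.append(entity)
    return listed


HOUR = 3_600_000
now = 100 * HOUR
cache = {}
available = [Entity("old.txt", now - 5 * HOUR), Entity("new.txt", now - 1 * HOUR)]
listed = list_by_entity_timestamp(cache, available, now, window=3 * HOUR)
print([e.identifier for e in listed])  # ['new.txt']
```

With a three-hour window, `old.txt` (five hours old) is never listed, no matter how often the listing runs.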
> However, this results in entities "older" than the specified "Entity Tracking
> Time Window" not being picked up. For example, suppose entities are listed from
> a remote server and this server is unavailable for some time. Once the
> server is available again, the listing continues. However, all entities /
> files that were created before the defined time window will be silently
> ignored.
> As of now, this can be solved by manual intervention, i.e. restarting the ListX
> processor. The
> "Entity Tracking Time Window" can be ignored upon initial listing when the
> "Entity Tracking Initial Listing Target" is set to "All Available" (default).
> However, this requires the NiFi user to be aware of lingering old entities
> on the connected remote source. Additionally, the need for
> manual intervention might be undesired / impractical when many
> sources are connected.
> Additionally, the "Entity Tracking Time Window" can be increased to account
> for longer time frames. However, this only mitigates the problem and
> does not solve it. There is also a limit to this, as it increases
> the memory needed.
> h1. Proposal
> This issue proposes introducing the notion of an "Entity Tracking Mode",
> whereby the current behavior could be understood as "Track Entity Timestamp".
> A new mode, "Track Last Listing Time", is added. Unlike the existing
> "Track Entity Timestamp" mode, this would not impose any prerequisites on the
> entities regarding their timestamp (see {{minTimestampToList}}). Instead, all
> entities would be considered.
> However, this strategy needs a way to limit / clean the entity cache as well.
> Instead of measuring the time window by the timestamp of the entity, the mode
> should remember the last time the entity was tracked; that is, part of a call
> to "listEntities" in "trackEntities". In other words, every time an entity is
> listed, its cache entry is renewed.
> After every listing, only the cache
> entries that have been updated within the time window will be kept. All other
> entries, i.e. those for entities that have not been listed for longer than
> the time window, are removed from the cache.
> If users want to limit a processor to only list entities up to a certain
> age, most processors already support this with a separate property,
> e.g. "Maximum File Age" in ListSFTP.
> While this mode solves the problem of listing "old" entities, it comes with
> its own downsides. Because the restriction on {{minTimestampToList}} is lifted,
> more entities can be listed, potentially leading to long listing times.
> Additionally, as with the existing "Track Entity Timestamp" mode, there is no
> enforced upper limit on the number of cache entries. See NIFI-12609
> for a proposal that may address both problems.
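The proposed "Track Last Listing Time" mode could look roughly like the sketch below. It is written in Python for brevity and is only an illustration under stated assumptions: `Entity`, `list_by_last_listing_time` and the cache layout (identifier mapped to entity timestamp and last listing time) are hypothetical and not taken from the NiFi code base, and "modified" is again reduced to "timestamp changed".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    identifier: str
    timestamp: int  # last-modified time in milliseconds (assumed unit)


def list_by_last_listing_time(cache, available, now, window):
    """Run one listing; `cache` maps identifier -> (entity timestamp, last seen)."""
    listed = []
    for entity in available:
        previous = cache.get(entity.identifier)
        # No minimum timestamp: every new or modified entity is listed.
        if previous is None or previous[0] != entity.timestamp:
            listed.append(entity)
        # Renew the cache entry each time the entity shows up in a listing.
        cache[entity.identifier] = (entity.timestamp, now)

    # Keep only entries that were renewed within the time window.
    for identifier in [k for k, (_, seen) in cache.items() if seen < now - window]:
        del cache[identifier]
    return listed


HOUR = 3_600_000
cache = {}
old = Entity("old.txt", 0)  # far older than the window

first = list_by_last_listing_time(cache, [old], now=10 * HOUR, window=3 * HOUR)
second = list_by_last_listing_time(cache, [old], now=11 * HOUR, window=3 * HOUR)
# The entity disappears from the source; its entry expires after the window.
third = list_by_last_listing_time(cache, [], now=15 * HOUR, window=3 * HOUR)
```

Here `old.txt` is listed once despite its age, is not listed again while it remains on the source, and its cache entry is dropped once it has not been seen for longer than the window. The cache grows with the number of entities present on the source, which is the downside noted above.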