dariuszseweryn commented on PR #10053:
URL: https://github.com/apache/nifi/pull/10053#issuecomment-3108084521

   There are two aspects that give me pause:
   - flows that assume sequential FlowFile contents
   - humans auditing produced FlowFile contents completeness
   
   # Flows assuming sequential FlowFile contents
   
   All produced FlowFiles carry the `aws.kinesis.sequence.number` of their last record. Given that the flow uses FIFO or `aws.kinesis.sequence.number` as a prioritizer, subsequent FlowFiles for the same shard contain non-overlapping ranges of consecutive records under normal circumstances.
   
   Using grouping by default could break this assumption. To remain backwards-compatible in this respect, the default strategy should be to close one FlowFile and start a new one when the schema changes, with grouping available as an opt-in.
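
A minimal sketch of that default strategy (not NiFi API; names and shapes are assumptions for illustration): walk the records in arrival order and close the current group whenever the schema changes, so each resulting FlowFile still covers a contiguous record range.

```python
# Illustrative sketch only, not the processor's implementation:
# split a batch of records into FlowFile-sized groups, closing the
# current group whenever the record's schema differs from the previous
# one, so each group holds a contiguous run of same-schema records.
def split_on_schema_change(records):
    """records: iterable of (schema_id, payload) tuples in arrival order.
    Returns a list of (schema_id, [payloads]) groups."""
    groups = []
    current_schema = None
    current = []
    for schema_id, payload in records:
        if current and schema_id != current_schema:
            groups.append((current_schema, current))
            current = []
        current_schema = schema_id
        current.append(payload)
    if current:
        groups.append((current_schema, current))
    return groups
```

Note that with this strategy an alternating schema sequence (A, B, A, B, ...) degenerates to one FlowFile per record, which is exactly the inefficiency grouping is meant to address; hence grouping as an option rather than removing the sequential default.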
   
   # Humans auditing produced FlowFile contents completeness
   
   I am considering the auditability of processing completeness, since this processor does not work well with the Stateless Engine, i.e. it does not support Exactly Once semantics. Not all users will use the wrapping mechanism, so having a way to determine/audit from the produced FlowFiles whether all records were processed successfully could be useful.
   
   Until now, FlowFiles contained consecutive records minus those that could not be parsed and were routed to the Parsing Failure relationship. An incremental check of sequential FlowFiles, counting the records between sequence numbers, was enough to verify completeness.
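
Such an incremental check could look roughly like this (a sketch under assumptions: FlowFiles for one shard arrive ordered, and sequence numbers are modelled as integers here, whereas real Kinesis sequence numbers are opaque decimal strings that should be compared as big integers):

```python
# Illustrative sketch: verify that the last-record sequence numbers of
# consecutive FlowFiles from one shard are strictly increasing, i.e.
# that no two FlowFiles cover overlapping record ranges.
def check_sequential(flowfiles):
    """flowfiles: FlowFile attribute dicts in processing order, each with
    an 'aws.kinesis.sequence.number' of its last record (int here for
    simplicity). Returns True if the sequence is strictly increasing."""
    last = None
    for ff in flowfiles:
        seq = ff["aws.kinesis.sequence.number"]
        if last is not None and seq <= last:
            return False  # overlap or out-of-order FlowFile detected
        last = seq
    return True
```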
   
   With grouping it is harder to reason about whether all records were processed just by looking at the attributes of subsequent FlowFiles: one would need to match on schema to find the sequential FlowFiles for it, and there is no guarantee how far back one would need to look for the previous FlowFile in sequence for a given schema.
   
   Easy ways to allow verification with grouping:
   1. In wrapper mode this is less of a problem: we have sequence/subsequence information on the processed records themselves.
   2. In non-wrapping mode, apart from carrying the sequence/subsequence number of its last record, every FlowFile created from a single batch should carry some identification of the batch it came from: the sequence/subsequence number of the first record in the batch and the count of FlowFiles produced by the batch. This would allow easy grouping of the FlowFiles produced by a batch and counting of the records processed.
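
Proposal 2 could be audited along these lines. This is a sketch only; the attribute names `batch.first.sequence` and `batch.flowfile.count` are hypothetical placeholders, not existing NiFi attributes:

```python
from collections import defaultdict

# Illustrative sketch of auditing the proposed batch attributes:
# group FlowFiles by the batch they came from and verify each batch
# produced exactly the number of FlowFiles it declared.
# 'batch.first.sequence' and 'batch.flowfile.count' are hypothetical names.
def audit_batches(flowfiles):
    """flowfiles: FlowFile attribute dicts. Returns the set of batch ids
    (first-record sequence numbers) that are incomplete or inconsistent."""
    groups = defaultdict(list)
    for ff in flowfiles:
        groups[ff["batch.first.sequence"]].append(ff)
    bad = set()
    for batch_id, ffs in groups.items():
        declared = {ff["batch.flowfile.count"] for ff in ffs}
        # a batch is suspect if its FlowFiles disagree on the declared
        # count, or if fewer/more FlowFiles arrived than declared
        if len(declared) != 1 or len(ffs) != declared.pop():
            bad.add(batch_id)
    return bad
```

With these attributes the auditor no longer needs to scan back an unbounded distance per schema: each batch is self-describing, and completeness is a local check over the FlowFiles sharing one batch id.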


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
