Github user markap14 commented on the issue: https://github.com/apache/nifi/pull/2416

@devriesb that was the approach that I started with. However, to ensure correctness, it turned out not to be a simple tweak to the existing implementation. Initially, my thoughts were along the lines of:

Multiple Partitions
- Each FlowFile must be sticky to a partition.
- `PartitionId = Hash<Record ID> % numPartitions` (see the first sketch at the end of this comment)

But now consider the following scenario: MergeContent merges 2 FlowFiles (A and B) to create a new FlowFile (C). This requires updates to Partitions 1 and 2 for the transfer to 'original', and to Partition 3 for the new FlowFile. If Partitions 1 & 2 are flushed to disk but Partition 3 is not, then on restart we have transferred FlowFiles to the 'original' relationship without the newly created FlowFile, so we have data loss. This is not OK.

To avoid this, we can consider that the transaction is not complete unless all partitions in the transaction agree. Now consider, though, that Partitions 1 & 2 are flushed to disk but Partition 3 is not. Then FlowFile A is updated again, this time by itself. Partition 1 is flushed to disk but Partition 3 still is not. Now on restart, we throw out the 'merge' transaction because Partition 3 was not flushed to disk. But we carry on with the subsequent transaction for FlowFile A, effectively missing one of the updates to the FlowFile -- the *same situation we are in now*. To avoid this from happening, we have to consider that since our 'merge' transaction didn't complete (encountered EOF), all subsequent transactions for the FlowFiles involved in the 'merge' transaction have to be rolled back as well.

This still leaves us with the following possibility, though: FlowFiles A and B were merged into C, and A and B were transferred to 'original'. This transaction failed as described above because Partition 3 was not flushed to disk. Now, FlowFile A was forked into 5 children, and these 5 children were written to Partitions 4 and 5. Partitions 1, 4, and 5 are flushed to disk, but Partition 3 is not. On restart, we see that Partition 3 was not flushed, so we discard the 'merge' transaction and the FlowFiles are placed back on the queue for MergeContent. But now we have children of FlowFile A running through the flow, and we will re-merge A with other data.

Additionally, in such a case, we have the overhead of not only writing to potentially many files for a single update to the repository but also writing to a transaction log that is a single point through which all updates must go, and this can amount to quite an expensive update.

This amounts to more than a tweak to the code, unfortunately. Given the criticality of this branch of the code, correctness is definitely more important than performance, but performance is still critical, and we should strive to be both correct and performant. Given that, I started looking at an alternative: serialize the data to byte arrays outside of any lock contention, then perform only the write of that data to a file within the contended portion of code (see the second sketch below). The good news is that a lot of the serialization and error handling is copied & pasted from the existing implementation and then modified to fit a cleaner software design. So while it is a new implementation, there is still a large amount of reuse.
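For concreteness, a minimal sketch of the sticky-partition selection I had in mind. The class name and the use of the record id as the hash key are made up for the example, not the actual NiFi code:

```java
// Minimal illustration of "PartitionId = Hash<Record ID> % numPartitions".
// PartitionSelector is a hypothetical helper, not a NiFi class.
public final class PartitionSelector {

    private PartitionSelector() {
    }

    public static int partitionFor(final long recordId, final int numPartitions) {
        // Math.floorMod guards against a negative hash value producing a
        // negative partition index, so the same record id always maps to
        // the same partition.
        return Math.floorMod(Long.hashCode(recordId), numPartitions);
    }
}
```

So, for example, `PartitionSelector.partitionFor(recordId, 16)` always yields the same partition for a given record id, which is what makes a FlowFile "sticky" to one partition.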
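And a minimal sketch of the "serialize outside the lock" alternative. `RecordUpdate`, its fields, and `ContendedWriter` are hypothetical stand-ins for illustration only; the real implementation's serialization and error handling are more involved:

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.util.concurrent.locks.ReentrantLock;

// Sketch: expensive serialization happens outside the lock so that many
// threads can serialize concurrently; only the cheap sequential file write
// happens inside the contended section.
public class ContendedWriter {
    private final ReentrantLock writeLock = new ReentrantLock();
    private final OutputStream journal;

    public ContendedWriter(final OutputStream journal) {
        this.journal = journal;
    }

    public void update(final RecordUpdate update) throws IOException {
        // Serialize to a byte array with no lock held.
        final byte[] serialized = serialize(update);

        // Hold the lock only for the write itself.
        writeLock.lock();
        try {
            journal.write(serialized);
            journal.flush();
        } finally {
            writeLock.unlock();
        }
    }

    private byte[] serialize(final RecordUpdate update) throws IOException {
        final ByteArrayOutputStream baos = new ByteArrayOutputStream();
        try (final DataOutputStream dos = new DataOutputStream(baos)) {
            dos.writeLong(update.getRecordId());  // hypothetical fields
            dos.writeUTF(update.getPayload());
        }
        return baos.toByteArray();
    }

    // Hypothetical record type for the example.
    public static class RecordUpdate {
        private final long recordId;
        private final String payload;

        public RecordUpdate(final long recordId, final String payload) {
            this.recordId = recordId;
            this.payload = payload;
        }

        public long getRecordId() { return recordId; }
        public String getPayload() { return payload; }
    }
}
```

The point of the design is that the time spent holding the lock is proportional only to the size of the already-serialized bytes, not to the cost of serialization itself.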
---