Github user markap14 commented on the issue:

    https://github.com/apache/nifi/pull/2416
  
@devriesb that was the approach that I started with. However, to ensure correctness, it turned out not to be a simple tweak to the existing implementation. Initially, my thoughts were along the lines of:
    
    Multiple Partitions
    - Each FlowFile must be sticky to a partition.
    - PartitionId = Hash<Record ID> % numPartitions.
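    
    For illustration, a minimal sketch of that assignment scheme (the class and method names here are hypothetical, assuming record IDs are longs; this is not the actual implementation):
    
        // Sticky partition assignment: all updates for a given record always
        // land in the same partition, so PartitionId is a pure function of
        // the record ID.
        public final class PartitionAssigner {
            private final int numPartitions;
    
            public PartitionAssigner(final int numPartitions) {
                this.numPartitions = numPartitions;
            }
    
            public int getPartitionId(final long recordId) {
                // floorMod keeps the index non-negative even if the hash is negative.
                return Math.floorMod(Long.hashCode(recordId), numPartitions);
            }
        }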
    
    But now consider the following scenario:
    
    MergeContent merges two FlowFiles (A and B) to create a new FlowFile (C).
    This requires updates to Partitions 1 and 2 (to transfer A and B to the 'original' relationship) and to Partition 3 (for the new FlowFile C).
    If Partitions 1 and 2 are flushed to disk but Partition 3 is not, then on restart we have transferred the FlowFiles to the 'original' relationship without the newly created FlowFile, so we have data loss. This is not OK.
    
    To avoid this, we can consider the transaction incomplete unless all partitions involved in the transaction agree.
    Now consider, though, that Partitions 1 and 2 are flushed to disk but Partition 3 is not. Then FlowFile A is updated again, this time by itself, and Partition 1 is flushed to disk while Partition 3 still is not.
    On restart, we throw out the 'merge' transaction because Partition 3 was not flushed to disk, but we carry on with the subsequent transaction for FlowFile A, effectively missing one of the updates to the FlowFile. That is the *same situation we are in now*.
    
    To prevent this, we have to consider that since our 'merge' transaction did not complete (we encountered an EOF), all subsequent transactions for the FlowFiles involved in the 'merge' transaction have to be rolled back as well. This still leaves us with the following possibility, though:
    
        FlowFiles A and B were merged into C, and A and B were transferred to 'original'.
        This transaction failed as described above because Partition 3 was not flushed to disk.
        Then FlowFile A was forked into 5 children, and those 5 children were written to Partitions 4 and 5.
        Partitions 1, 4, and 5 are flushed to disk, but not Partition 3.
    
        On restart, we see that Partition 3 was not flushed, so we discard the 'merge' transaction and the FlowFiles are placed back on the queue for MergeContent. But now we have children of FlowFile A running through the flow, and we will re-merge A with other data.
    
    Additionally, in such a case, a single update to the repository carries the overhead not only of writing to potentially many files but also of writing to a transaction log that is a single point through which all updates must go, which can make each update quite expensive. This amounts to more than a tweak to the code, unfortunately.
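    
    For concreteness, the kind of recovery pass this approach would require might look like the following sketch. All of the types and method names here (Journal, TransactionRecord, replay, and the accessors) are hypothetical stand-ins, not NiFi code:
    
        import java.util.HashSet;
        import java.util.Set;
    
        // Hypothetical recovery pass for the multi-partition design described
        // above: replay a transaction only if every partition it touched was
        // flushed, and cascade the rollback to any later transaction involving
        // a FlowFile from a discarded transaction.
        class RecoveryPass {
            private final Journal journal;                  // hypothetical journal
            private final Set<Integer> flushedPartitions;   // partitions fully on disk
    
            RecoveryPass(final Journal journal, final Set<Integer> flushedPartitions) {
                this.journal = journal;
                this.flushedPartitions = flushedPartitions;
            }
    
            void recover() {
                // FlowFiles from any discarded transaction; every later
                // transaction that touches one of them must also be rolled back.
                final Set<Long> poisoned = new HashSet<>();
                for (final TransactionRecord txn : journal.readInOrder()) {
                    final boolean complete = flushedPartitions.containsAll(txn.getPartitionIds());
                    final boolean touchesPoisoned =
                            txn.getFlowFileIds().stream().anyMatch(poisoned::contains);
                    if (complete && !touchesPoisoned) {
                        replay(txn);
                    } else {
                        poisoned.addAll(txn.getFlowFileIds());
                    }
                }
            }
    
            interface Journal {
                Iterable<TransactionRecord> readInOrder();
            }
    
            interface TransactionRecord {
                Set<Integer> getPartitionIds();
                Set<Long> getFlowFileIds();
            }
    
            void replay(final TransactionRecord txn) { /* apply the updates */ }
        }
    
    Even with this cascading rollback, the re-merge scenario above is still possible, and every transaction must flow through the single, ordered transaction log.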
    
    Given the criticality of this part of the code, correctness is definitely more important than performance, but performance is still critical, and we should strive to be both correct and highly performant.
    
    Given that, I started looking at an alternative: serialize the data to byte arrays outside of any lock contention, then perform only the write of that data to the file within the contended portion of the code.
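    
    Roughly, the idea is the following (the class and names here are illustrative, not the actual code):
    
        import java.io.IOException;
        import java.io.OutputStream;
    
        // Illustrative sketch: the expensive serialization happens outside the
        // lock, so threads contend only for the cheap sequential file write.
        class PartitionWriter {
            private final OutputStream journalOut;  // hypothetical journal stream
    
            PartitionWriter(final OutputStream journalOut) {
                this.journalOut = journalOut;
            }
    
            void update(final RecordSerde serde, final Object record) throws IOException {
                // Serialize to a byte array with no lock held; many threads
                // can do this concurrently.
                final byte[] bytes = serde.serialize(record);
    
                // The contended portion is reduced to a single write call.
                synchronized (journalOut) {
                    journalOut.write(bytes);
                }
            }
    
            interface RecordSerde {
                byte[] serialize(Object record) throws IOException;
            }
        }
    
    This keeps the contended portion proportional to the cost of a buffer write rather than the cost of serialization itself.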
    
    The good news is that a lot of the serialization and error handling was copied and pasted from the existing implementation and then modified to fit a cleaner software design. So while it is a new implementation, there is still a large amount of reuse.

