[ https://issues.apache.org/jira/browse/NIFI-4950?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Michael Moser resolved NIFI-4950.
---------------------------------
       Resolution: Fixed
    Fix Version/s: 1.6.0

> MergeContent: Defragment can improperly reassemble
> --------------------------------------------------
>
>                 Key: NIFI-4950
>                 URL: https://issues.apache.org/jira/browse/NIFI-4950
>             Project: Apache NiFi
>          Issue Type: Bug
>          Components: Extensions
>    Affects Versions: 1.5.0
>            Reporter: Brandon DeVries
>            Assignee: Mark Bean
>            Priority: Minor
>             Fix For: 1.6.0
>
>
> In Defragment mode, MergeContent can improperly reassemble the pieces of a
> split file.  I understand this was previously discussed in NIFI-378, and the
> outcome was to update the documentation for fragment.index [1]:
> {quote} Applicable only if the <Merge Strategy> property is set to
> Defragment. This attribute indicates the order in which the fragments should
> be assembled. This attribute must be present on all FlowFiles when using the
> Defragment Merge Strategy and must be a unique (i.e., unique across all
> FlowFiles that have the same value for the "fragment.identifier" attribute)
> integer between 0 and the value of the fragment.count attribute. If two or
> more FlowFiles have the same value for the "fragment.identifier" attribute
> and the same value for the "fragment.index" attribute, the behavior of this
> Processor is undefined.
> {quote}
> I believe this could (and probably should) be improved upon.  Specifically,
> the discussion around NIFI-378 focused on the "improper" use of MergeContent,
> in using the same fragment.identifier to "pair up" files.  The situation I've
> encountered isn't really unusual in any way...
> I have a file, being split and sent via PostHTTP to another NiFi instance.
> If something "goes wrong", the sending NiFi may not get an acknowledgement of
> success even if the file made it to the receiving NiFi.  It then sends the
> segment again.  NiFi favors duplication over loss, so this is not unexpected.
> However, I now have a file broken into X fragments arriving on the other
> side as X+1 (or more).  The reassembly may work... or both duplicates may be
> chosen, resulting in an incorrectly recreated file.
> To satisfy the contract as it exists, you would need to use a DetectDuplicate
> before the MergeContent to filter these out.  However, that could potentially
> incur a great deal of overhead.  In contrast, simply checking that there are
> no duplicate fragment IDs in a bin should be relatively straightforward.  How
> to handle duplicates is a legitimate question... are they ignored, or are
> they discarded (if they're actually the same)?  If the FlowFiles with
> duplicate IDs aren't identical, what is the behavior?  Personally, I would
> say if you have actual duplicates, drop one and continue with the merge...
> if you have unequal "duplicates", fail the bin.  But there's room for
> discussion there.
> The point is, in this circumstance it is very easy for a user to do a very
> reasonable thing and end up with a corrupt file for reasons that are somewhat
> esoteric.  Then, we would need to explain to them why "defragment" doesn't
> actually defragment, but just kind of sorts a bin of matching things.  I
> think we can do better than that.
> [1]
> [http://nifi.apache.org/docs/nifi-docs/components/org.apache.nifi/nifi-standard-nar/1.5.0/org.apache.nifi.processors.standard.MergeContent/index.html]
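
The duplicate handling suggested in the description (drop one of two identical fragments, fail the bin when two fragments share an index but differ) could be sketched roughly as follows. This is only an illustration in Python, not NiFi code and not necessarily what the 1.6.0 fix does. The names Fragment, ConflictingFragmentError and defragment are hypothetical, and the sketch assumes a bin is simply a list of fragments that already share one fragment.identifier. It does not check fragment.count.

```python
from dataclasses import dataclass


class ConflictingFragmentError(Exception):
    """Two fragments share an index but carry different content."""


@dataclass(frozen=True)
class Fragment:
    # Hypothetical stand-in for a FlowFile in a Defragment bin.
    identifier: str  # value of the fragment.identifier attribute
    index: int       # value of the fragment.index attribute
    content: bytes


def defragment(bin_fragments):
    """Reassemble one bin, tolerating exact duplicates only."""
    by_index = {}
    for frag in bin_fragments:
        seen = by_index.get(frag.index)
        if seen is None:
            by_index[frag.index] = frag
        elif seen.content != frag.content:
            # Unequal "duplicates": fail the whole bin.
            raise ConflictingFragmentError(
                f"fragment.index {frag.index} seen with differing content"
            )
        # Otherwise it is an exact duplicate; keep the first copy and drop this one.
    # Concatenate the surviving fragments in index order.
    return b"".join(by_index[i].content for i in sorted(by_index))
```

With a resent segment in the bin, for example indexes 0, 1, 1, 2 where both copies of index 1 are identical, the sketch merges as if the segment had arrived once. If the two copies differ, it raises instead of producing a corrupt file.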