Suggestions for splitting, then reassembling documents

Russell Bateman Tue, 18 Aug 2020 09:38:37 -0700

I am writing custom processors that juggle medical documents (in a more or less proprietary format). The documents are always XML and contain two major parts:


1. an original document which may be text, HL7v2 or XML and may contain
   HTML between <document> ... </document>, could be many megabytes in size
2. XML structure representing data extracted from #1 in myriad XML
   elements, rarely more than a few hundred kilobytes in size

I am using XStream to serialize #2 after I've parsed it into POJOs for later use. It would be over the top to base-64 encode #1 just so it survives serialization by XStream, and it buys me nothing except the convenience of making #1 conceptually identical to #2. Since I don't need to dig down into #1 or analyze it, and it's so big, processing it at all is costly and undesirable.

What I thought I'd investigate is the possibility of splitting <document> ... </document> (#1) into a separate flowfile to be reassembled by a later processor. But with (literally) millions of these files flowing through NiFi, I wonder about the advisability of splitting them up and then hoping I can unite the correct parts, and about how to accomplish that (discrete attribute ids on constituent parts?).
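
To make the idea concrete, here is a minimal sketch of the split step in plain Java, outside of any NiFi API. It assumes the <document> tag carries no attributes and that a simple string scan is good enough to find it. The class name, the placeholder element and the attribute names "split.id" and "split.role" are all hypothetical, invented for illustration only.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/** Hypothetical sketch: cuts one record into two parts that share a correlation id. */
final class DocumentSplitter {
    static final String OPEN = "<document>";        // assumption: tag has no attributes
    static final String CLOSE = "</document>";
    static final String PLACEHOLDER = "<document/>"; // marks where #1 used to be
    static final String ID_ATTR = "split.id";        // hypothetical attribute name
    static final String ROLE_ATTR = "split.role";    // hypothetical attribute name

    /** Content plus attributes, loosely standing in for a flowfile. */
    static final class Part {
        final String content;
        final Map<String, String> attributes;

        Part(String content, Map<String, String> attributes) {
            this.content = content;
            this.attributes = attributes;
        }
    }

    /** Returns { original (#1), remainder holding the extracted data (#2) }. */
    static Part[] split(String xml) {
        int start = xml.indexOf(OPEN);
        // Use the last closing tag in case the embedded original contains its own.
        int end = xml.lastIndexOf(CLOSE);
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("no <document> element found");
        }
        String original = xml.substring(start + OPEN.length(), end);
        String remainder = xml.substring(0, start) + PLACEHOLDER
                + xml.substring(end + CLOSE.length());
        String id = UUID.randomUUID().toString(); // same id goes on both parts
        return new Part[] { part(original, id, "original"), part(remainder, id, "extracted") };
    }

    private static Part part(String content, String id, String role) {
        Map<String, String> attrs = new HashMap<>();
        attrs.put(ID_ATTR, id);
        attrs.put(ROLE_ATTR, role);
        return new Part(content, attrs);
    }
}
```

In a real processor the large original would presumably be streamed rather than held in a String; the sketch only shows the shared id and the placeholder left behind for reinsertion.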

The reassembly would involve deserializing #2, working with that data to generate a new document (HL7v4/FHIR, other formats) along with reinserting #1.

Yes, I have examined /SplitXml/ and /SplitContent/, but I need to do much more than just split the flowfile at the time I have it in my hands, hence, a custom processor. Similarly, /MergeContent/ will not be helpful for reassembly.

So, specifically, I can easily generate a flowfile attribute, an id that uniquely identifies these two now separate documents as suitable to weld back together. However, I have not yet experimented with flowfiles randomly (?) showing up together later in the flow within easy reach of one processor for reassembly. Obviously, /Split/- and /MergeContent/ must be in the habit of dealing with this situation, but I have no experience with them outside my primitive imagination.
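
What I imagine for the reassembly side is roughly the following, again as a plain-Java sketch and not NiFi code. Parts arrive in any order; each waits until its partner with the same id shows up. The class, its methods and the role strings "original" and "extracted" are hypothetical names made up for this illustration.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: pairs up two halves that arrive in arbitrary order. */
final class PairingBuffer {
    /** A reunited pair: the original document (#1) and the extracted data (#2). */
    static final class Pair {
        final String original;
        final String extracted;

        Pair(String original, String extracted) {
            this.original = original;
            this.extracted = extracted;
        }
    }

    // Halves still waiting for their partner, keyed by the shared id.
    private final Map<String, String> pendingOriginals = new HashMap<>();
    private final Map<String, String> pendingExtracted = new HashMap<>();

    /** Offers one half; returns the pair if the partner was already waiting. */
    Optional<Pair> offer(String id, String role, String content) {
        boolean isOriginal = "original".equals(role);
        Map<String, String> own = isOriginal ? pendingOriginals : pendingExtracted;
        Map<String, String> other = isOriginal ? pendingExtracted : pendingOriginals;

        String partner = other.remove(id);
        if (partner == null) {
            own.put(id, content); // partner not here yet, so wait
            return Optional.empty();
        }
        return Optional.of(isOriginal ? new Pair(content, partner)
                                      : new Pair(partner, content));
    }

    /** Number of halves still waiting; orphans would accumulate here. */
    int pending() {
        return pendingOriginals.size() + pendingExtracted.size();
    }
}
```

The obvious worries with anything shaped like this are halves that never find their partner and the memory cost of holding multi-megabyte originals while they wait, which is part of why I am asking.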

I'm asking for suggestions, best practice, gotchas, warnings or any other thoughts.


Russ
