I am writing custom processors that juggle medical documents (in a more or less proprietary format). The documents are always XML and contain two major parts:

1. an original document, which may be text, HL7v2 or XML and may contain
   HTML, held between <document> ... </document>; it can be many
   megabytes in size
2. an XML structure representing data extracted from #1 in myriad XML
   elements, rarely more than a few hundred kilobytes in size

I am using XStream to serialize #2 after I've parsed it into POJOs for later use. It seems over the top to base-64 encode #1 just so that it survives serialization by XStream; that buys me nothing except the convenience of making #1 conceptually identical to #2. Since I don't need to dig down into #1 or analyze it, and it's so big, processing it at all is costly and undesirable.

What I thought I'd investigate is the possibility of splitting <document> ... </document> (#1) into a separate flowfile to be reassembled by a later processor. With (literally) millions of these files flowing through NiFi, though, I wonder about the advisability of splitting them up and then hoping I can reunite the correct parts, and about how to accomplish that (discrete attribute ids on the constituent parts?).
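To make the idea concrete, here is a rough sketch of what such a split could look like. It is plain Java, not NiFi API code, and everything in it is an assumption for illustration: the names DocumentSplitter, Parts and pairId are hypothetical, it assumes exactly one <document> element with no attributes, and it holds the whole file in a String, which a real processor handling multi-megabyte content would probably replace with streaming.

```java
import java.util.UUID;

// Hypothetical helper: cuts the <document> ... </document> span (#1) out of
// the incoming XML without parsing it, and tags both halves with one shared id.
final class DocumentSplitter {

    /** The two halves of one incoming file, tied together by a shared id. */
    static final class Parts {
        final String pairId;     // would become a flowfile attribute on both parts
        final String original;   // the untouched <document> ... </document> span
        final String remainder;  // everything else, with a placeholder where #1 was

        Parts(String pairId, String original, String remainder) {
            this.pairId = pairId;
            this.original = original;
            this.remainder = remainder;
        }
    }

    static final String OPEN = "<document>";
    static final String CLOSE = "</document>";
    static final String PLACEHOLDER = "<document/>";

    static Parts split(String xml) {
        int start = xml.indexOf(OPEN);
        // Use the last closing tag in case the embedded original contains the same text.
        int end = xml.lastIndexOf(CLOSE);
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("no <document> element found");
        }
        end += CLOSE.length();
        String original = xml.substring(start, end);
        String remainder = xml.substring(0, start) + PLACEHOLDER + xml.substring(end);
        return new Parts(UUID.randomUUID().toString(), original, remainder);
    }
}
```

Leaving a placeholder behind in the remainder is one possible way to remember where #1 has to be reinserted later; it is a design choice of this sketch, not a requirement.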

The reassembly would involve deserializing #2, working with that data to generate a new document (HL7v4/FHIR, other formats) along with reinserting #1.

Yes, I have examined /SplitXml/ and /SplitContent/, but I need to do much more than just split the flowfile at the time I have it in my hands, hence, a custom processor. Similarly, /MergeContent/ will not be helpful for reassembly.

So, specifically, I can easily generate a flowfile attribute, an id that discretely identifies these two now separate documents as suitable to weld back together. However, I have not yet experimented with flowfiles randomly (?) showing up together later in the flow within easy reach of one processor for reassembly. Obviously, /Split/- and /MergeContent/ must be in the habit of dealing with this situation, but I have no experience with them outside my primitive imagination.
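The matching logic itself can be sketched as follows, again in plain Java rather than against the NiFi API. PairBuffer, Kind and Pair are hypothetical names, and the sketch keeps content in memory purely for brevity; with millions of large files, a real processor would presumably hold only references and would need some expiry for parts whose partner never arrives.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical buffer: parts arrive in any order, and a pair is released
// as soon as both parts carrying the same id have been seen.
final class PairBuffer {

    enum Kind { ORIGINAL, EXTRACTED }

    /** A reunited pair: the original document (#1) and the extracted data (#2). */
    static final class Pair {
        final String original;
        final String extracted;

        Pair(String original, String extracted) {
            this.original = original;
            this.extracted = extracted;
        }
    }

    private final Map<String, String> waitingOriginals = new HashMap<>();
    private final Map<String, String> waitingExtracted = new HashMap<>();

    /** Offers one part; returns the completed pair if its partner was already waiting. */
    Optional<Pair> offer(String pairId, Kind kind, String content) {
        if (kind == Kind.ORIGINAL) {
            String extracted = waitingExtracted.remove(pairId);
            if (extracted == null) {
                waitingOriginals.put(pairId, content);
                return Optional.empty();
            }
            return Optional.of(new Pair(content, extracted));
        }
        String original = waitingOriginals.remove(pairId);
        if (original == null) {
            waitingExtracted.put(pairId, content);
            return Optional.empty();
        }
        return Optional.of(new Pair(original, content));
    }

    /** Number of parts still waiting for a partner; worth watching for orphans. */
    int pending() {
        return waitingOriginals.size() + waitingExtracted.size();
    }
}
```

The obvious gotcha this exposes is the orphan: if one half is dropped or routed elsewhere upstream, the other half waits forever unless something evicts it.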

I'm asking for suggestions, best practice, gotchas, warnings or any other thoughts.

Russ

