I am writing custom processors that juggle medical documents (in a more
or less proprietary format). The documents are always XML and contain
two major parts:
1. an original document which may be text, HL7v2 or XML and may contain
HTML between <document> ... </document>, could be many megabytes in size
2. an XML structure representing data extracted from #1 in myriad XML
elements, rarely more than a few hundred kilobytes in size
I am using XStream to serialize #2 after I've parsed it into POJOs for
later use. Base-64 encoding #1 so that it survives serialization by
XStream seems over the top, and it buys me nothing except the convenience
of making #1 conceptually identical to #2. Since I don't need to dig down
into #1 or analyze it, and it's so big, processing it at all is costly
and undesirable.
What I thought I'd investigate is the possibility of splitting
<document> ... </document> (#1) into a separate flowfile to be
reassembled by a later processor, but with (literally) millions of these
files flowing through NiFi, I wonder whether it is advisable to split
them up and then hope I can reunite the correct parts, and how to
accomplish that (a unique attribute id shared by the constituent parts?).
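To make the splitting idea concrete, here is a minimal sketch in plain Java, deliberately without any NiFi API. The class name DocumentSplitter, the Parts holder and the use of a random UUID as the shared id are all my own assumptions for illustration; a real processor would stream the content rather than hold it in a String, and would write the id as a flowfile attribute on both parts.

```java
import java.util.UUID;

/** Hypothetical sketch: separates the <document> payload (#1) from the rest of the XML (#2). */
final class DocumentSplitter {
    static final String OPEN = "<document>";
    static final String CLOSE = "</document>";

    /** The two halves plus the id that ties them together. */
    static final class Parts {
        final String correlationId; // assumed to become an attribute on both flowfiles
        final String original;      // part #1, carried along untouched
        final String remainder;     // everything else, with an empty <document></document> left in place

        Parts(String correlationId, String original, String remainder) {
            this.correlationId = correlationId;
            this.original = original;
            this.remainder = remainder;
        }
    }

    static Parts split(String xml) {
        int start = xml.indexOf(OPEN);
        // lastIndexOf tolerates a literal "</document>" inside the payload itself,
        // but assumes none appears after the real closing tag.
        int end = xml.lastIndexOf(CLOSE);
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("no <document> element found");
        }
        String original = xml.substring(start + OPEN.length(), end);
        String remainder = xml.substring(0, start) + OPEN + CLOSE
                + xml.substring(end + CLOSE.length());
        return new Parts(UUID.randomUUID().toString(), original, remainder);
    }
}
```

The empty element left behind in the remainder marks where #1 has to be reinserted later, so the reassembly step does not need to guess the position.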
The reassembly would involve deserializing #2, working with that data to
generate a new document (HL7v4/FHIR, other formats) along with
reinserting #1.
Yes, I have examined /SplitXml/ and /SplitContent/, but I need to do
much more than just split the flowfile at the time I have it in my
hands, hence, a custom processor. Similarly, /MergeContent/ will not be
helpful for reassembly.
So, specifically, I can easily generate a flowfile attribute, an id that
uniquely identifies these two now-separate documents as suitable to
weld back together. However, I have not yet experimented with flowfiles
randomly (?) showing up together later in the flow within easy reach of
one processor for reassembly. Obviously, /Split/- and /MergeContent/
must be in the habit of dealing with this situation, but I have no
experience with them outside my primitive imagination.
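For what it's worth, the pairing I have in mind could be sketched like this, again in plain Java with no NiFi API. PartPairer, Kind and Pair are hypothetical names of my own; the sketch simply parks whichever half arrives first under its id and releases the pair when the partner turns up. How a real processor would get both flowfiles within reach is exactly what I don't know yet.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical sketch: holds the first-arriving half of a document until its partner shows up. */
final class PartPairer {
    enum Kind { ORIGINAL, EXTRACTED }

    /** A matched pair, ready for reassembly. */
    static final class Pair {
        final String original;  // part #1
        final String extracted; // part #2 (the serialized POJOs)

        Pair(String original, String extracted) {
            this.original = original;
            this.extracted = extracted;
        }
    }

    private final Map<String, String> waitingOriginals = new HashMap<>();
    private final Map<String, String> waitingExtracted = new HashMap<>();

    /** Returns the completed pair if the partner was already waiting, otherwise parks this part. */
    Optional<Pair> offer(String id, Kind kind, String content) {
        if (kind == Kind.ORIGINAL) {
            String partner = waitingExtracted.remove(id);
            if (partner == null) {
                waitingOriginals.put(id, content);
                return Optional.empty();
            }
            return Optional.of(new Pair(content, partner));
        }
        String partner = waitingOriginals.remove(id);
        if (partner == null) {
            waitingExtracted.put(id, content);
            return Optional.empty();
        }
        return Optional.of(new Pair(partner, content));
    }

    /** Number of halves still waiting for a partner. */
    int pending() {
        return waitingOriginals.size() + waitingExtracted.size();
    }
}
```

One gotcha this already exposes: if one half is lost or routed to failure upstream, its partner waits forever, so anything along these lines would presumably need a timeout or an expiry for orphans.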
I'm asking for suggestions, best practice, gotchas, warnings or any
other thoughts.
Russ