Re: Suggestions for splitting, then reassembling documents

Russell Bateman Fri, 21 Aug 2020 05:33:42 -0700

Hey, thanks, Jason. I will give this approach a try. I confess I had not even thought of /Wait/ for that. Thanks too for pointing out side effects.


Russ


On 8/21/20 5:46 AM, Sherman, Jason wrote:

This sounds like a good use case for wait/notify, which I've used
successfully multiple times.  Once the document is split, the original
document part would sit at the wait processor until a notify processor
signals the completion of the flow.  I would first try using the original
file's UUID for the wait/notify signal.

Also, you can set attributes on the notify processor that can get added to
the document at the wait processor. Then, use that added information to
build whatever output you need with the original document.

However, with such large documents, this will likely slow down the
processing for the XML portion, depending on how much processing they have
to go through.
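
The wait/notify hand-off described above can be modeled outside NiFi as a small in-memory gate. This is only a sketch of the idea, not NiFi's Wait or Notify API: the class names `Parked` and `WaitNotifySketch`, the method names, and the attribute key are all hypothetical, and unlike the real processors this toy version drops a signal that arrives before anything is parked under its id.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Toy stand-in for a flowfile: a content string plus string attributes. */
final class Parked {
    final String content;
    final Map<String, String> attributes = new HashMap<>();

    Parked(String content) {
        this.content = content;
    }
}

/** Hypothetical in-memory model of the wait/notify hand-off; not NiFi's API. */
final class WaitNotifySketch {
    private final Map<String, Parked> waiting = new HashMap<>();

    /** "Wait" side: park the large original part under a signal id, e.g. the original file's UUID. */
    void park(String signalId, Parked original) {
        waiting.put(signalId, original);
    }

    /** "Notify" side: release the parked part and copy the signal's attributes onto it. */
    Optional<Parked> signal(String signalId, Map<String, String> signalAttributes) {
        Parked released = waiting.remove(signalId);
        if (released == null) {
            // Simplification: nothing is parked under this id, so the signal is dropped.
            return Optional.empty();
        }
        released.attributes.putAll(signalAttributes);
        return Optional.of(released);
    }

    int waitingCount() {
        return waiting.size();
    }

    public static void main(String[] args) {
        WaitNotifySketch gate = new WaitNotifySketch();
        gate.park("uuid-1", new Parked("<document>...</document>"));
        // The XML branch finishes and signals with whatever it wants to hand over.
        Optional<Parked> out = gate.signal("uuid-1", Map.of("result.format", "FHIR"));
        out.ifPresent(p -> System.out.println(p.content + " " + p.attributes));
    }
}
```

The large part is never read or copied while it waits; only the small attribute map travels with the signal, which is the property that makes this pattern attractive for multi-megabyte originals.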

Cheers,
Jason
--
Jason C. Sherman, CSSLP, CISSP
Owner
Logical Software Solutions, LLC
Solid. Secure. Software.

http://logicalsoftware.co/
.co?  Yes, your data isn't always what you expect.  We'll make sense of it.

https://www.linkedin.com/in/lss-js/


On Tue, Aug 18, 2020 at 12:38 PM Russell Bateman <r...@windofkeltia.com>
wrote:

I am writing custom processors that juggle medical documents (in a more
or less proprietary format). The documents are always XML and contain
two major parts:

  1. an original document which may be text, HL7v2 or XML and may contain
     HTML between <document> ... </document>, could be many megabytes in
     size
  2. XML structure representing data extracted from #1 in myriad XML
     elements, rarely more than a few hundred kilobytes in size

I am using XStream to serialize #2 after I've parsed it into POJOs for
later use. It's too over the top to base-64 encode #1 to survive
serialization by XStream and it buys me nothing except the convenience of
making #1 conceptually identical to #2. Since I don't need to dig down
into #1 or analyze it, and it's so big, processing it at all is costly
and undesirable.

What I thought I'd investigate is the possibility of splitting
<document> ... </document> (#1) into a separate flowfile to be
reassembled by a later processor, but with (literally) millions of these
files flowing through NiFi, I wonder about the advisability of splitting
them up then hoping I can unite the correct parts and how to accomplish
that (discrete attribute ids on constituent parts?).

The reassembly would involve deserializing #2, working with that data to
generate a new document (HL7v4/FHIR, other formats) along with
reinserting #1.
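
A minimal sketch of the split and the later reinsertion, using plain string slicing rather than an XML parser so that #1 is never parsed. Everything here is an assumption made for illustration: the class `DocumentSplitter`, the `ref` placeholder element, and the premise that the `<document>` start tag carries no attributes and that `</document>` does not occur after the end of #1. In a custom processor the correlation id would presumably be written as an attribute on both resulting flowfiles.

```java
import java.util.UUID;

/** Hypothetical helper: plain string slicing, not a NiFi processor or an XML parser. */
final class DocumentSplitter {
    static final String OPEN = "<document>";
    static final String CLOSE = "</document>";

    /** The two halves plus the shared id that ties them together. */
    static final class Parts {
        final String correlationId;
        final String original;   // part #1, tags included
        final String remainder;  // everything else, with a placeholder where #1 was

        Parts(String correlationId, String original, String remainder) {
            this.correlationId = correlationId;
            this.original = original;
            this.remainder = remainder;
        }
    }

    /** Small marker left behind in the remainder so #1 can be put back in place. */
    static String placeholder(String id) {
        return "<document ref=\"" + id + "\"/>";
    }

    static Parts split(String xml) {
        int start = xml.indexOf(OPEN);
        // Last close tag, in case the embedded original itself contains </document>.
        int end = xml.lastIndexOf(CLOSE);
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("no <document> element found");
        }
        end += CLOSE.length();
        String id = UUID.randomUUID().toString();
        String remainder = xml.substring(0, start) + placeholder(id) + xml.substring(end);
        return new Parts(id, xml.substring(start, end), remainder);
    }

    static String reassemble(Parts parts) {
        // String.replace is a literal replacement, so the id needs no escaping.
        return parts.remainder.replace(placeholder(parts.correlationId), parts.original);
    }

    public static void main(String[] args) {
        String xml = "<record><document>Hello <b>world</b></document><data><a>1</a></data></record>";
        Parts parts = split(xml);
        System.out.println(parts.remainder);
        System.out.println(reassemble(parts).equals(xml));
    }
}
```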

Yes, I have examined /SplitXml/ and /SplitContent/, but I need to do
much more than just split the flowfile at the time I have it in my
hands, hence, a custom processor. Similarly, /MergeContent/ will not be
helpful for reassembly.

So, specifically, I can easily generate a flowfile attribute, an id that
discretely identifies these two now separate documents as suitable to
weld back together. However, I have not yet experimented with flowfiles
randomly (?) showing up together later in the flow within easy reach of
one processor for reassembly. Obviously, /Split/- and /MergeContent/
must be in the habit of dealing with this situation, but I have no
experience with them outside my primitive imagination.
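
One way to picture the two parts "showing up" in arbitrary order is a buffer keyed by the shared id that holds whichever part arrives first until its partner appears. The sketch below is a toy, with hypothetical names (`PairingBuffer`, `Kind`, `offer`), and it keeps everything in memory; with millions of flowfiles a real flow would presumably need bounded or external state plus some expiry for parts whose partner never arrives.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Optional;

/** Hypothetical pairing buffer: holds one half until the other half with the same id arrives. */
final class PairingBuffer {
    enum Kind { ORIGINAL, EXTRACTED }

    /** A matched pair, ready for reassembly. */
    static final class Pair {
        final String original;
        final String extracted;

        Pair(String original, String extracted) {
            this.original = original;
            this.extracted = extracted;
        }
    }

    private final Map<String, String> heldOriginals = new HashMap<>();
    private final Map<String, String> heldExtracted = new HashMap<>();

    /** Offer one part; returns the completed pair if its partner was already waiting. */
    Optional<Pair> offer(String id, Kind kind, String content) {
        if (kind == Kind.ORIGINAL) {
            String partner = heldExtracted.remove(id);
            if (partner == null) {
                heldOriginals.put(id, content);
                return Optional.empty();
            }
            return Optional.of(new Pair(content, partner));
        }
        String partner = heldOriginals.remove(id);
        if (partner == null) {
            heldExtracted.put(id, content);
            return Optional.empty();
        }
        return Optional.of(new Pair(partner, content));
    }

    /** Number of parts still waiting for a partner. */
    int pending() {
        return heldOriginals.size() + heldExtracted.size();
    }

    public static void main(String[] args) {
        PairingBuffer buffer = new PairingBuffer();
        buffer.offer("id-2", Kind.EXTRACTED, "<data>b</data>");
        buffer.offer("id-1", Kind.ORIGINAL, "<document>a</document>");
        Optional<Pair> pair = buffer.offer("id-1", Kind.EXTRACTED, "<data>a</data>");
        pair.ifPresent(p -> System.out.println(p.original + " + " + p.extracted));
        System.out.println("still pending: " + buffer.pending());
    }
}
```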

I'm asking for suggestions, best practice, gotchas, warnings or any
other thoughts.

Russ
