Re: Suggestions for splitting, then reassembling documents

2020-08-21 Thread Sherman, Jason
You're welcome.  I hope it helps.

v/r
Jason
--
Jason C. Sherman, CSSLP, CISSP
Owner
Logical Software Solutions, LLC
Solid. Secure. Software.

http://logicalsoftware.co/
.co?  Yes, your data isn't always what you expect.  We'll make sense of it.

https://www.linkedin.com/in/lss-js/




Re: Suggestions for splitting, then reassembling documents

2020-08-21 Thread Russell Bateman
Hey, thanks, Jason. I will give this approach a try. I confess I had not 
even thought of /Wait/ for that. Thanks too for pointing out side effects.


Russ



Re: Suggestions for splitting, then reassembling documents

2020-08-21 Thread Sherman, Jason
This sounds like a good use case for Wait/Notify, which I've used
successfully multiple times.  Once the document is split, the original
document part would sit at the Wait processor until a Notify processor
signals the completion of the flow.  I would first try using the original
file's UUID for the wait/notify signal.

Also, you can set attributes on the Notify processor that get added to
the document at the Wait processor. Then, use that added information to
build whatever output you need with the original document.

However, with such large documents, this will likely slow down processing
of the original part, since it has to sit and wait for however much
processing the XML portion goes through.
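A rough sketch of the flow, with illustrative property values (exact
property names may vary by NiFi version, and correlation.id here is a
placeholder attribute, not a stock one):

```
                          +-> XML part ----> ...processing... --> Notify
original --> (split) -----|
                          +-> original part --> Wait --> (reassemble)

Wait
  Distributed Cache Service : a shared DistributedMapCacheClientService
  Release Signal Identifier : ${correlation.id}
Notify
  Distributed Cache Service : the same cache service
  Release Signal Identifier : ${correlation.id}
  Attribute Cache Regex     : .*    (attributes handed to the waiting file)
```

One caveat: splitting produces child flowfiles with new UUIDs, so copy the
original file's UUID into an attribute (correlation.id above) before the
split so both branches carry the same signal identifier.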

Cheers,
Jason
--
Jason C. Sherman, CSSLP, CISSP
Owner
Logical Software Solutions, LLC
Solid. Secure. Software.

http://logicalsoftware.co/
.co?  Yes, your data isn't always what you expect.  We'll make sense of it.

https://www.linkedin.com/in/lss-js/




Suggestions for splitting, then reassembling documents

2020-08-18 Thread Russell Bateman
I am writing custom processors that juggle medical documents (in a more 
or less proprietary format). The documents are always XML and contain 
two major parts:


1. an original document, which may be text, HL7v2 or XML, may contain
   HTML between its opening and closing tags, and can be many megabytes in size
2. XML structure representing data extracted from #1 in myriad XML
   elements, rarely more than a few hundred kilobytes in size

I am using XStream to serialize #2 after I've parsed it into POJOs for 
later use. It's too over the top to Base64-encode #1 to survive 
serialization by XStream, and it buys me nothing except the convenience of 
making #1 conceptually identical to #2. Since I don't need to dig down 
into #1 or analyze it, and it's so big, processing it at all is costly 
and undesirable.


What I thought I'd investigate is the possibility of splitting the 
original document (#1) into a separate flowfile to be reassembled by a 
later processor. But with (literally) millions of these files flowing 
through NiFi, I wonder about the advisability of splitting them up and 
then hoping I can unite the correct parts, and how to accomplish 
that (discrete attribute ids on the constituent parts?).
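To make the correlation-id idea concrete, here is a minimal, NiFi-free
sketch in plain Java. The Part class and the correlation.id attribute
name are stand-ins for a flowfile and its attribute map, not real NiFi
API; a real processor would create the children via the process session.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

// Sketch: split a combined document into two parts that share a
// correlation id, mimicking a custom processor that sets the same
// attribute on both resulting flowfiles.
public class SplitSketch {

    // Stand-in for a flowfile: content plus attributes.
    static class Part {
        final String content;
        final Map<String, String> attributes = new HashMap<>();
        Part(String content) { this.content = content; }
    }

    // Split the original at a known boundary (assumed present) and tag
    // both halves with the same id so they can be welded back together.
    static Part[] split(String original, String boundary) {
        int at = original.indexOf(boundary);
        String id = UUID.randomUUID().toString();
        Part doc = new Part(original.substring(0, at)); // part #1
        Part xml = new Part(original.substring(at));    // part #2
        doc.attributes.put("correlation.id", id);
        xml.attributes.put("correlation.id", id);
        return new Part[] { doc, xml };
    }
}
```

The only essential point is that the id is minted once, before the split,
and stamped on every child.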


The reassembly would involve deserializing #2, working with that data to 
generate a new document (HL7v4/FHIR, other formats) along with 
reinserting #1.


Yes, I have examined /SplitXml/ and /SplitContent/, but I need to do 
much more than just split the flowfile at the time I have it in my 
hands, hence, a custom processor. Similarly, /MergeContent/ will not be 
helpful for reassembly.


So, specifically, I can easily generate a flowfile attribute, an id that 
uniquely identifies these two now-separate documents as suitable to 
weld back together. However, I have not yet experimented with flowfiles 
randomly (?) showing up together later in the flow within easy reach of 
one processor for reassembly. Obviously, /Split/- and /MergeContent/ 
must be in the habit of dealing with this situation, but I have no 
experience with them outside my primitive imagination.
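On the "showing up together later" worry: the usual pattern is to park
the first-arriving part keyed by its id until the partner arrives, which
is roughly what /MergeContent/'s Defragment strategy or Wait/Notify do
internally. A hypothetical, NiFi-free sketch of that pattern:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the reassembly side: the two parts of a document can arrive
// in any order, so hold the first arrival keyed by the shared correlation
// id and emit the reunited pair when the partner shows up. A real
// processor would also order the parts (e.g. by a part-index attribute)
// and expire entries whose partner never arrives.
public class ReassemblySketch {
    private final Map<String, String> pending = new HashMap<>();

    // Returns the merged content once both parts have arrived,
    // or null while still waiting on the partner.
    public String offer(String correlationId, String content) {
        String partner = pending.remove(correlationId);
        if (partner == null) {
            pending.put(correlationId, content);
            return null;
        }
        return partner + content; // arrival order; real code would sort
    }
}
```

With millions of files in flight, the expiration and back-pressure story
for that pending map is the part to think hardest about.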


I'm asking for suggestions, best practices, gotchas, warnings or any 
other thoughts.


Russ