Re: Suggestions for splitting, then reassembling documents
You're welcome. I hope it helps.

v/r
Jason

--
Jason C. Sherman, CSSLP, CISSP
Owner
Logical Software Solutions, LLC
Solid. Secure. Software.
http://logicalsoftware.co/
.co? Yes, your data isn't always what you expect. We'll make sense of it.
https://www.linkedin.com/in/lss-js/

On Fri, Aug 21, 2020 at 8:33 AM Russell Bateman wrote:
> Hey, thanks, Jason. I will give this approach a try. I confess I had not
> even thought of /Wait/ for that. Thanks too for pointing out side effects.
>
> Russ
Re: Suggestions for splitting, then reassembling documents
Hey, thanks, Jason. I will give this approach a try. I confess I had not even thought of /Wait/ for that. Thanks too for pointing out side effects.

Russ

On 8/21/20 5:46 AM, Sherman, Jason wrote:
> This sounds like a good use case for Wait/Notify, which I've used
> successfully multiple times. Once the document is split, the original
> document part would sit at the Wait processor until a Notify processor
> signals the completion of the flow. I would first try using the original
> file's UUID for the wait/notify signal.
Re: Suggestions for splitting, then reassembling documents
This sounds like a good use case for Wait/Notify, which I've used successfully multiple times. Once the document is split, the original document part would sit at the Wait processor until a Notify processor signals the completion of the flow. I would first try using the original file's UUID for the wait/notify signal.

Also, you can set attributes on the Notify processor that get added to the document at the Wait processor. Then, use that added information to build whatever output you need with the original document.

However, with such large documents, this will likely slow down processing for the XML portion, depending on how much processing it has to go through.

Cheers,
Jason

On Tue, Aug 18, 2020 at 12:38 PM Russell Bateman wrote:
> I am writing custom processors that juggle medical documents (in a more
> or less proprietary format). [...]
>
> I'm asking for suggestions, best practice, gotchas, warnings or any
> other thoughts.
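For anyone reading this thread in the archive: the Wait/Notify handshake described above boils down to "park a flowfile under a signal key until the other branch releases it, possibly handing attributes across." A toy analogue in plain Java may make the shape clearer — this is a sketch of the idea only, not NiFi's actual API, and the class and method names are invented for the illustration:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Toy analogue of NiFi's Wait/Notify handshake. A flowfile parks under a
 * signal key (e.g. the original file's UUID) until the other branch
 * notifies with that key, optionally attaching attributes that the
 * waiting side picks up. Not NiFi code; just the shape of the pattern.
 */
public class WaitNotifyBoard {

    // Signals posted by the "Notify" side, keyed by signal id.
    private final Map<String, Map<String, String>> signals = new ConcurrentHashMap<>();

    /** Notify side: post a signal, carrying any attributes to hand over. */
    public void notifySignal(String signalId, Map<String, String> attributes) {
        signals.put(signalId, attributes);
    }

    /**
     * Wait side: check for the signal. Returns the attached attributes if
     * the signal has arrived (consuming it), or null to keep waiting —
     * NiFi's Wait processor likewise re-queues the flowfile until then.
     */
    public Map<String, String> checkSignal(String signalId) {
        return signals.remove(signalId);
    }
}
```

The waiting side polls with the same id the notifying side will eventually post, which is why using a stable id (such as the original file's UUID) on both halves matters.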
Suggestions for splitting, then reassembling documents
I am writing custom processors that juggle medical documents (in a more or less proprietary format). The documents are always XML and contain two major parts:

1. an original document, which may be text, HL7v2 or XML and may contain HTML between ... ; this part can be many megabytes in size
2. an XML structure representing data extracted from #1 in myriad XML elements, rarely more than a few hundred kilobytes in size

I am using XStream to serialize #2 after I've parsed it into POJOs for later use. It's too over the top to base64-encode #1 to survive serialization by XStream, and it buys me nothing except the convenience of making #1 conceptually identical to #2. Since I don't need to dig down into #1 or analyze it, and it's so big, processing it at all is costly and undesirable.

What I thought I'd investigate is the possibility of splitting ... (#1) into a separate flowfile to be reassembled by a later processor, but with (literally) millions of these files flowing through NiFi, I wonder about the advisability of splitting them up, then hoping I can unite the correct parts, and how to accomplish that (discrete attribute ids on constituent parts?).

The reassembly would involve deserializing #2, working with that data to generate a new document (HL7v4/FHIR, other formats), along with reinserting #1.

Yes, I have examined /SplitXml/ and /SplitContent/, but I need to do much more than just split the flowfile at the time I have it in my hands; hence, a custom processor. Similarly, /MergeContent/ will not be helpful for reassembly.

So, specifically, I can easily generate a flowfile attribute, an id that discretely identifies these two now-separate documents as suitable to weld back together. However, I have not yet experimented with flowfiles randomly (?) showing up together later in the flow within easy reach of one processor for reassembly. Obviously, /SplitContent/ and /MergeContent/ must be in the habit of dealing with this situation, but I have no experience with them outside my primitive imagination.

I'm asking for suggestions, best practice, gotchas, warnings or any other thoughts.

Russ
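For what it's worth, the bookkeeping described here — cut out the big part, tag both halves with a shared id, then weld them back together when both have arrived — can be sketched outside NiFi in plain Java. The class, the boundary-tag argument, and the placeholder comment are all invented for the sketch; the real processors would carry the id as a flowfile attribute instead:

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

/**
 * Illustrative bookkeeping for splitting a two-part document and welding
 * the halves back together via a shared correlation id. Plain Java, no
 * NiFi classes; element names are made up for the sketch.
 */
public class DocumentSplitter {

    private static final String PLACEHOLDER = "<!--original-document-->";

    /** The two halves plus the id that ties them together. */
    public static final class SplitResult {
        public final String correlationId = UUID.randomUUID().toString();
        public final String originalPart;   // #1: the big embedded original
        public final String extractedPart;  // #2: the rest, with a placeholder
        SplitResult(String original, String extracted) {
            this.originalPart = original;
            this.extractedPart = extracted;
        }
    }

    /** Cut out the boundary element (#1), leaving a placeholder in #2. */
    public static SplitResult split(String document, String boundaryTag) {
        String open  = "<" + boundaryTag + ">";
        String close = "</" + boundaryTag + ">";
        int start = document.indexOf(open);
        int end   = document.indexOf(close);
        if (start < 0 || end < 0) {
            throw new IllegalArgumentException("boundary element not found");
        }
        end += close.length();
        String original  = document.substring(start, end);
        String extracted = document.substring(0, start) + PLACEHOLDER
                         + document.substring(end);
        return new SplitResult(original, extracted);
    }

    // Halves that arrived first, parked until their partner shows up.
    private final Map<String, String> pendingOriginals = new HashMap<>();
    private final Map<String, String> pendingExtracted = new HashMap<>();

    /**
     * Offer one half; returns the reassembled document once both halves
     * for the same id have been seen, or null while still waiting.
     */
    public String offer(String id, String part, boolean isOriginal) {
        Map<String, String> mine   = isOriginal ? pendingOriginals : pendingExtracted;
        Map<String, String> theirs = isOriginal ? pendingExtracted : pendingOriginals;
        String partner = theirs.remove(id);
        if (partner == null) {
            mine.put(id, part);   // first half to arrive: park it
            return null;
        }
        String original  = isOriginal ? part : partner;
        String extracted = isOriginal ? partner : part;
        return extracted.replace(PLACEHOLDER, original);
    }
}
```

The pending maps stand in for the queue of flowfiles waiting at the reassembly processor; either half may arrive first, and the id is the only thing that pairs them.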