Re: Can ExecuteStreamCommand do this?

James McMahon Fri, 30 Sep 2022 06:01:04 -0700

Mike, let me make sure I understand this. Gzip outputs gz files that have
some reasonable level of compression. Because NiFi natively handles gzip
compressed files - presumably .gz extensions and some associated mime.type
- that is good enough for your purposes. You avoid 7za compression because
NiFi doesn't handle such compressed files natively, and because the gain in
compression is of little utility when S3 storage comes so cheaply; gzip
results are good enough.
Is that the gist of it?
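
For what it's worth, here is a rough sketch (Python, not NiFi's own Java) of why gzip fits a stream-only model: the decoder only ever reads forward. The names `ForwardOnlyStream` and `gunzip_stream` are made up for illustration and are not NiFi APIs.

```python
import gzip
import io
import zlib


class ForwardOnlyStream(io.RawIOBase):
    """Hypothetical stand-in for a stream that cannot seek (like a pipe)."""

    def __init__(self, data: bytes):
        self._buf = io.BytesIO(data)

    def readable(self) -> bool:
        return True

    def seekable(self) -> bool:
        return False

    def readinto(self, b) -> int:
        # Copy the next bytes into the caller's buffer; 0 signals end of stream.
        chunk = self._buf.read(len(b))
        b[:len(chunk)] = chunk
        return len(chunk)


def gunzip_stream(src, chunk_size: int = 64) -> bytes:
    """Decompress gzip data using forward reads only, never seeking."""
    # Adding 16 to the window bits tells zlib to expect gzip framing.
    decoder = zlib.decompressobj(16 + zlib.MAX_WBITS)
    out = bytearray()
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        out += decoder.decompress(chunk)
    out += decoder.flush()
    return bytes(out)


payload = b"larry, moe, curly\n" * 100
stream = ForwardOnlyStream(gzip.compress(payload))
restored = gunzip_stream(stream)
```

The data is handed to the decoder in small chunks as it arrives, so nothing in the format forces a jump to the end of the input.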


On Fri, Sep 30, 2022 at 8:27 AM Mike Thomsen <[email protected]> wrote:

> I don't know what your use case is, but we avoid anything beyond gzip
> because S3 is so cheap.
>
> On Thu, Sep 29, 2022 at 10:51 AM James McMahon <[email protected]>
> wrote:
> >
> > Thank you Mark. Had no idea there was this file-based dependency to 7z
> files. Since my workaround appears to be working I think I may just move
> forward with that.
> > Steve, Mark - thank you again for replying.
> > Jim
> >
> > On Thu, Sep 29, 2022 at 9:15 AM Mark Payne <[email protected]> wrote:
> >>
> >> It’s been a while. But if I remember correctly, the reason that NiFi
> does not natively support 7-zip format is that with 7-zip, the dictionary
> is written at the end of the file.
> >> So when data is compressed, the dictionary is built up during
> compression and written at the end. This makes sense from a compression
> standpoint.
> >> However, what it means, is that in order to decompress it, you must
> first jump to the end of the file in order to access the dictionary. Then
> jump back to the beginning of the file in order to perform the
> decompression.
> >> NiFi makes use of Input Streams and Output Streams for FlowFile access
> - it doesn’t provide a File-based approach. And this ability to jump to the
> end, read the dictionary, and then jump back to the beginning isn’t really
> possible with Input/Output Streams - at least, not without buffering
> everything into memory.
> >>
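A small sketch of the point above, in Python. It uses a toy layout (a body followed by a 4-byte length trailer), which is an assumption for illustration and NOT the real 7z format. `build_toy_archive` and `read_trailer` are hypothetical names. With a seekable source the trailer is read directly; with a pipe the seek fails and the whole stream has to be buffered first.

```python
import io
import os
import struct


def build_toy_archive(body: bytes) -> bytes:
    """Toy layout (not real 7z): the body, then a 4-byte big-endian body length."""
    return body + struct.pack(">I", len(body))


def read_trailer(stream):
    """Return (body_length, buffered).

    buffered is True when the stream could not seek and the whole
    content had to be read into memory to reach the trailer.
    """
    try:
        stream.seek(-4, io.SEEK_END)
        return struct.unpack(">I", stream.read(4))[0], False
    except OSError:
        # io.UnsupportedOperation is a subclass of OSError; a pipe lands here.
        data = stream.read()
        return struct.unpack(">I", data[-4:])[0], True


archive = build_toy_archive(b"moe larry curly")

# Seekable source, comparable to a file on disk.
length_a, buffered_a = read_trailer(io.BytesIO(archive))

# Non-seekable source: a real OS pipe, comparable to STDIN.
read_fd, write_fd = os.pipe()
with os.fdopen(write_fd, "wb") as writer:
    writer.write(archive)
with os.fdopen(read_fd, "rb") as piped:
    length_b, buffered_b = read_trailer(piped)
```

Both calls find the same trailer, but only the pipe case pays for holding everything in memory, which is the cost described above.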
> >> So it would make sense that there would be a “Not Implemented” error
> when attempting to do the same thing using the 7-zip application directly,
> when attempting to use input streams & output streams.
> >> I think that if you’re stuck with 7-zip, your only option will be to do
> what you’re doing - write the data out as a file, run the 7-zip application
> against that file, writing the output to some directory, and then picking
> up the files from that directory.
> >> The alternative, of course, would be to update the source so that it’s
> creating zip files instead of 7-zip files, if you have sway over the source
> producer.
> >>
> >> Thanks
> >> -Mark
> >>
> >>
> >> On Sep 29, 2022, at 8:58 AM, stephen.hindmarch.bt.com via users <
> [email protected]> wrote:
> >>
> >> James,
> >>
> >> E_NOTIMPL means that feature is not implemented. I can see there is
> discussion about this down at sourceforge but the detail is blocked by my
> employer’s firewall.
> >>
> >> p7zip / Discussion / Help: E_NOTIMPL for stdin / stdout pipe
> >>
> >> https://sourceforge.net/p/p7zip/discussion/383044/thread/8066736d
> >>
> >> Steve Hindmarch
> >>
> >> From: James McMahon <[email protected]>
> >> Sent: 29 September 2022 12:12
> >> To: Hindmarch,SJ,Stephen,VIR R <[email protected]>
> >> Cc: [email protected]
> >> Subject: Re: Can ExecuteStreamCommand do this?
> >>
> >> I ran with these Command Arguments in the ExecuteStreamCommand
> configuration:
> >> x;-si;-so;-spf;-aou
> >> ${filename} removed, -si indicating use of STDIN, -so STDOUT.
> >>
> >> The same error is thrown by 7z through ExecuteStreamCommand: Executable
> command /bin/7za ended in an error: ERROR: Can not open the file as an
> archive  E_NOTIMPL
> >>
> >> I tried this at the command line, getting the same failure:
> >> cat testArchive.7z | 7za x -si -so | dd of=stooges.txt
> >>
> >>
> >> On Thu, Sep 29, 2022 at 6:44 AM James McMahon <[email protected]>
> wrote:
> >>
> >> Good morning, Steve. Indeed, that second paragraph is exactly how I did
> get this to work. I unpack to disk and then read in the twelve results
> using a GetFile. So far it is working well. It just feels a little wrong to
> me to do this, as I have introduced an extra write to and read from disk,
> which is going to be slower than doing it all in memory within the JVM.
> While that may not seem like anything significant for a single 7z file, as
> we work across thousands and thousands it can be significant.
> >>
> >> I am about to try what you suggested above: dropping the ${filename}
> entirely from the STDIN / STDOUT configuration. I realize it is not likely
> going to give me the twelve output flowfiles I'm seeking in the "output
> stream" path from ExecuteStreamCommand. I just want to see if it works
> without throwing that error.
> >>
> >> Welcome any other thoughts or comments you may have. Thanks again for
> your comments so far.
> >>
> >> Jim
> >>
> >> On Thu, Sep 29, 2022 at 5:23 AM <[email protected]> wrote:
> >>
> >> James,
> >>
> >> I have been thinking more about your problem and this may be the wrong
> approach. If you successfully unpack your files into the flow file content,
> you will still have one output flow file containing the unpacked contents
> of all of your files. If you need 12 separate files in their own flowfiles
> then you will need to find some way of splitting them up. Is there a byte
> sequence you can use in a SplitContent process, or a specific file length
> you can use in SplitText?
> >>
> >> Otherwise you may be better off using ExecuteStreamCommand to unpack
> the files on disk. Run it verbosely and use the output of that step to
> create a list of the locations where your recently unpacked files are. Or
> create a temporary directory to unpack in and fetch all the files in there,
> cleaning up afterwards. Then you can load the files with FetchFile.
> FetchFile can be instructed to delete the file it has just read so can also
> clean up after itself.
> >>
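The temporary-directory approach could look roughly like this outside NiFi. This is a sketch under assumptions: `unpack_and_collect`, `seven_zip_argv` and `zip_stand_in_argv` are hypothetical helper names, and the 7za variant assumes a `7za` binary on the PATH, using only the `x`, `-o` and `-aou` switches that appear in this thread. So that the sketch runs without 7za installed, the demo uses Python's standard zipfile command line as a stand-in extractor.

```python
import subprocess
import sys
import tempfile
import zipfile
from pathlib import Path


def seven_zip_argv(archive, dest):
    # Assumption: 7za is on the PATH. Not exercised by the demo below.
    return ["7za", "x", str(archive), f"-o{dest}", "-aou"]


def zip_stand_in_argv(archive, dest):
    # Stand-in extractor: the stdlib zipfile CLI, "-e <archive> <output_dir>".
    return [sys.executable, "-m", "zipfile", "-e", str(archive), str(dest)]


def unpack_and_collect(build_argv, archive_path):
    """Unpack into a fresh temp dir, read every file, then remove the dir."""
    with tempfile.TemporaryDirectory() as workdir:
        subprocess.run(build_argv(archive_path, workdir),
                       check=True, capture_output=True)
        # Map each unpacked file's relative name to its content.
        return {
            str(p.relative_to(workdir)): p.read_bytes()
            for p in sorted(Path(workdir).rglob("*"))
            if p.is_file()
        }


# Demo: build a small archive, then unpack it via the stand-in command.
with tempfile.TemporaryDirectory() as src:
    archive = Path(src) / "testArchive.zip"
    with zipfile.ZipFile(archive, "w") as zf:
        for name in ("larry.txt", "moe.txt", "curly.txt"):
            zf.writestr(name, name.upper())
    files = unpack_and_collect(zip_stand_in_argv, archive)
```

Each entry in `files` corresponds to what would become one flowfile, and the temporary directory is gone once the function returns, so nothing is left on disk to clean up.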
> >> Steve Hindmarch
> >>
> >> From: stephen.hindmarch.bt.com via users <[email protected]>
> >> Sent: 29 September 2022 09:19
> >> To: [email protected]; [email protected]
> >> Subject: RE: Can ExecuteStreamCommand do this?
> >>
> >> James,
> >>
> >> Using ${filename} and -si together seems wrong to me. What happens when
> you try that on the command line?
> >>
> >> Steve Hindmarch
> >>
> >> From: James McMahon <[email protected]>
> >> Sent: 28 September 2022 13:49
> >> To: [email protected]; Hindmarch,SJ,Stephen,VIR R <
> [email protected]>
> >> Subject: Re: Can ExecuteStreamCommand do this?
> >>
> >> Thank you Steve. I've employed a ListFile/FetchFile to load the 7z
> files into the flow. When I have my ESC configured as follows, I get my
> unpacked files written to the #{unpacked.destination} directory on
> disk:
> >> Command Arguments             x;${filename};-spf;-o#{unpacked.destination};-aou
> >> Command Path                  /bin/7za
> >> Ignore STDIN                  true
> >> Working Directory             #{unpacked.destination}
> >> Argument Delimiter            ;
> >> Output Destination Attribute  No value set
> >> I get twelve files in my output destination folder.
> >>
> >> When I try this one, I get an error and no output:
> >> Command Arguments             x;${filename};-si;-so;-spf;-aou
> >> Command Path                  /bin/7za
> >> Ignore STDIN                  false
> >> Working Directory             #{unpacked.destination}
> >> Argument Delimiter            ;
> >> Output Destination Attribute  No value set
> >>
> >> This yields this error...
> >> Executable command /bin/7za ended in an error: ERROR: Can not open the
> file as archive
> >> E_NOTIMPL
> >> ...and it yields only one flowfile result in Output Stream: a brief
> text/plain report of the 7za extraction, shown below.
> >>
> >> The report indicates it did indeed find my 7z file and did indeed
> identify the 12 files in it, yet I still get no output to my outgoing flow
> path:
> >> Extracting archive: /parent/subparent/testArchive.7z
> >> - -
> >> Path = /parentdir/subdir/testArchive.7z
> >> Type = 7z
> >> Physical Size = 7204
> >> Headers Size = 298
> >> Method = LZMA2:96k
> >> Solid = +
> >> Blocks = 1
> >>
> >> Everything is Ok
> >>
> >> Folders: 1
> >> Files: 12
> >> Size: 90238
> >> Compressed: 7204
> >>
> >> ${filename} in both cases is a fully qualified name to the file, like
> this: /dir/subdir/myTestFile.7z.
> >>
> >> I can't seem to get the ESC output stream to be the extracted files.
> Anything jump out at you?
> >>
> >> On Wed, Sep 28, 2022 at 8:06 AM stephen.hindmarch.bt.com via users <
> [email protected]> wrote:
> >>
> >> Hi James,
> >>
> >> I am not in a position to test this right now, but you have to think of
> the flowfile content as STDIN and STDOUT. So with 7zip you need to use the
> “-si” and “-so” flags to ensure there are no files involved. Then if you
> can load the content of a file into a flowfile, eg with GetFile, then you
> should be able to unpack it with ExecuteStreamCommand. Set “Ignore STDIN” =
> “false”.
> >>
> >> I have written up my own use case on github. This involves having a
> Redis script as the input, and results of the script as the output.
> >>
> >> my-nifi-cluster/experiment-redis_direct.md at main ·
> hindmasj/my-nifi-cluster · GitHub
> >>
> >> The first part of the post shows how to do it with the input commands
> on the command line, so a bit like you running “7za ${filename} -so”. The
> second part has the script inside the flowfile and is treated as STDIN, a
> bit like you doing “unzip -si -so”.
> >>
> >> See if that helps. Fundamentally, if you do “7za -si -so < myfile.7z”
> on the command line and see the output on the console, ExecuteStreamCommand
> will behave the same.
> >>
> >> Steve Hindmarch
> >> From: James McMahon <[email protected]>
> >> Sent: 28 September 2022 12:02
> >> To: [email protected]
> >> Subject: Can ExecuteStreamCommand do this?
> >>
> >> I continue to struggle with ExecuteStreamCommand, and am hoping one of
> you from our user community can help me with the following:
> >> 1. Can ExecuteStreamCommand be used as I am trying to use it?
> >> 2. Can you direct me to an example where ExecuteStreamCommand is
> configured to do something similar to my use case?
> >>
> >> My use case:
> >> The incoming flowfiles in my flow path are 7z zips. Based on what I've
> researched so far, NiFi's native processors don't handle unpacking of 7z
> files.
> >>
> >> I want to read the 7z files as STDIN to ExecuteStreamCommand.
> >> I'd like the processor to call out to a 7za app, which will unpack the
> 7z.
> >> One incoming flowfile will yield multiple output files. Let's say
> twelve in this case.
> >> My goal is to output those twelve as new flowfiles out of
> ExecuteStreamCommand, to its output stream path.
> >>
> >> I can't yet get this to work. Best I've been able to do is configure
> ExecuteStreamCommand to unpack ${filename} to a temporary output directory
> on disk. Then I have another path in my flow polling that directory every
> few minutes looking for new data. Am hoping to eliminate that intermediate
> write/read to/from disk by keeping this all within the flow and JVM memory.
> >>
> >> Thanks very much in advance for any assistance.
> >>
> >>
>
