This is proving to be difficult to do in practice. Many of the filenames in
the zip contain spaces and other special characters, and they are not being
passed to tar intact.
This is the command I am testing at the command line to first extract the
filenames:
      unzip -l "budgets-state-govs (1).zip" | awk '$4 != "" {print $4}' |
egrep -v "^Name$" | egrep -v ".*----.*"
Note that I am forced to weed out all the lines that do not contain
filenames. And I have not yet tried to get any case working where there are
recursive directories in the zip.

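This yields partial names such as Virginia for files with names like
Virginia 2023.mdb, because awk splits each line on whitespace and $4 holds
only the first word of the name.
When I try to pass Virginia to the tar command following the pipe, tar
chokes on these partial names and fails.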
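
One way the truncation might be avoided, sketched here under assumptions: ask
unzip for bare entry names in zipinfo mode (unzip -Z1) instead of parsing the
columns of unzip -l, and read each name as a whole line. The function names
list_zip_entries and show_zip_entries are hypothetical, and the sketch assumes
Info-ZIP unzip is installed and that no entry name contains a newline.

```shell
#!/bin/bash
# Sketch: list zip entry names one per line, keeping embedded spaces intact.

list_zip_entries() {
    # -Z switches unzip to zipinfo mode; -1 prints only the entry names,
    # so there are no header, separator, or total lines to weed out.
    unzip -Z1 "$1"
}

show_zip_entries() {
    # IFS= and read -r keep each line whole: no word splitting on spaces
    # and no backslash processing.
    list_zip_entries "$1" | while IFS= read -r name; do
        printf '[%s]\n' "$name"
    done
}
```

Calling show_zip_entries "budgets-state-govs (1).zip" should then print each
entry in brackets, so a name like Virginia 2023.mdb can be checked by eye to
confirm it arrives as a single item.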

I will continue to hammer away at this. If I find a solution I can employ
in an ExecuteStreamCommand, I will circle back and post it.
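
One direction that may remove the need to handle filenames at all: extract the
zip into a temporary directory and let tar archive that directory as a whole,
following the script outline quoted below. This is only a sketch. The function
name zip_to_tar is hypothetical, and it assumes unzip and tar are on the PATH
of the NiFi host.

```shell
#!/bin/bash
# Sketch: repackage a zip read from stdin as a tar written to stdout.
# No filename is ever passed on a command line, so spaces and nested
# directories inside the zip are left to unzip and tar themselves.

zip_to_tar() {
    local tmpzip tmpdir status=0
    tmpzip=$(mktemp) || return 1
    tmpdir=$(mktemp -d) || { rm -f "$tmpzip"; return 1; }

    # Capture the incoming flowfile content (stdin) as a temporary zip file.
    cat > "$tmpzip"

    # unzip restores each file's stored modification time by default.
    # Its messages go to stderr so they cannot corrupt the tar on stdout.
    if unzip -q "$tmpzip" -d "$tmpdir" >&2; then
        # Archive the directory contents; tar records each file's mtime.
        tar -C "$tmpdir" -cf - . || status=$?
    else
        status=$?
    fi

    # Clean up the temporary zip and extraction directory.
    rm -rf "$tmpzip" "$tmpdir"
    return "$status"
}
```

Saved as /path/to/script.sh with a final line that simply calls zip_to_tar, it
would be run with the Command Path and Command Arguments settings described in
the quoted reply. Entries in the resulting tar carry a leading ./ prefix.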

On Fri, Feb 2, 2024 at 6:03 PM Michael Moser <moser...@gmail.com> wrote:

> Yes, that's exactly what those commands do.  Your linux commands like
> unzip and tar can probably read directly from /dev/stdin and write directly
> to /dev/stdout if you want to.
>
> -- Mike
>
>
> On Fri, Feb 2, 2024 at 9:22 AM James McMahon <jsmcmah...@gmail.com> wrote:
>
>> Hi Michael. This is a very clever approach: convert from a zip (for which
>> UnpackContent does not preserve file metadata for extracted files) to a tar
>> (for which UnpackContent does preserve file metadata), then employ the
>> UnpackContent.
>>
>> One quick followup question. The ExecuteStreamCommand will be in the nifi
>> flow, and so its input will be streaming incoming flowfiles, and its output
>> will be streamed as a flowfile. Are these two commands in the script where
>> we capture the incoming flowfile
>>
>> cat /dev/stdin >> $tmpzipfile
>>
>> ...and where we create the output flowfile from the ExecuteStreamCommand
>> processor?
>>
>> cat $tmptarfile >> /dev/stdout
>>
>>
>> On Thu, Feb 1, 2024 at 10:11 AM Michael Moser <moser...@gmail.com> wrote:
>>
>>> Hi Jim,
>>>
>>> The ExecuteStreamCommand will only output 1 flowfile, so using it to
>>> unzip in this fashion won't yield the results you need.
>>>
>>> Instead, you might try a workaround with ExecuteStreamCommand to unzip
>>> your file and then tar to repackage it.  Then UnpackContent should be able
>>> to read the tar file metadata.  I have used ExecuteStreamCommand to execute
>>> bash scripts.  An example is shown below, which you can modify for your
>>> needs.  The ExecuteStreamCommand properties "Command Path=/bin/bash" and
>>> "Command Arguments=/path/to/script.sh" are all you need for this script to
>>> work.
>>>
>>> #!/bin/bash
>>> tmpzipfile=$(mktemp)
>>> tmptarfile=$(mktemp)
>>> #remove the tmptarfile file, we just need a temporary filename, and will
>>> #recreate it below
>>> rm -f $tmptarfile
>>> #create a directory to unzip files to
>>> tmpdir=$(mktemp -d)
>>>
>>> cat /dev/stdin >> $tmpzipfile
>>> # here is your unzip command to unzip $tmpzipfile to $tmpdir, preserving
>>> # file metadata
>>> # here is your tar command to tar $tmpdir to $tmptarfile
>>> cat $tmptarfile >> /dev/stdout
>>>
>>> #cleanup
>>> rm -f $tmpzipfile
>>> rm -f $tmptarfile
>>> rm -rf $tmpdir
>>>
>>>
>>>
>>> On Wed, Jan 31, 2024 at 12:55 PM James McMahon <jsmcmah...@gmail.com>
>>> wrote:
>>>
>>>> If anyone can show me how to get my ExecuteStreamCommand configured
>>>> properly as a workaround, I am still interested in that.
>>>> Jim
>>>>
>>>> On Wed, Jan 31, 2024 at 12:39 PM James McMahon <jsmcmah...@gmail.com>
>>>> wrote:
>>>>
>>>>> I tried to find a Create option for tickets here,
>>>>> https://issues.apache.org/jira/projects/NIFI/issues/NIFI-11859?filter=allopenissues
>>>>> .
>>>>> I did not find one, and suspect I may not have that privilege.
>>>>> In any case, thank you for creating that.
>>>>> Jim
>>>>>
>>>>> On Wed, Jan 31, 2024 at 12:37 PM Joe Witt <joe.w...@gmail.com> wrote:
>>>>>
>>>>>> I went ahead and wrote it up here
>>>>>> https://issues.apache.org/jira/browse/NIFI-12709
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>> On Wed, Jan 31, 2024 at 10:30 AM James McMahon <jsmcmah...@gmail.com>
>>>>>> wrote:
>>>>>>
>>>>>>> Happy to do that Joe. How do I create and submit a JIRA for
>>>>>>> consideration? I have not done one - at least, not for years.
>>>>>>> If you get me started, I will do a concise and thorough description
>>>>>>> in the ticket.
>>>>>>> Sincerely,
>>>>>>> Jim
>>>>>>>
>>>>>>> On Wed, Jan 31, 2024 at 12:12 PM Joe Witt <joe.w...@gmail.com>
>>>>>>> wrote:
>>>>>>>
>>>>>>>> James,
>>>>>>>>
>>>>>>>> Makes sense to create a JIRA to improve UnpackContent to extract
>>>>>>>> these attributes in the event of a zip file that happens to present
>>>>>>>> them. The concept of lastModifiedDate does appear easily accessed if
>>>>>>>> available in the metadata.  Owner/Creator/Creation information looks
>>>>>>>> less standard in the case of a Zip but perhaps still capturable as
>>>>>>>> extra fields.
>>>>>>>>
>>>>>>>> Thanks
>>>>>>>>
>>>>>>>> On Wed, Jan 31, 2024 at 10:01 AM James McMahon <
>>>>>>>> jsmcmah...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> I tried to use UnpackContent to extract the files within a zip
>>>>>>>>> file named ABC DEF (1).zip. (the filename has spaces in its name).
>>>>>>>>>
>>>>>>>>> UnpackContent seemed to work, but it did not preserve file
>>>>>>>>> attributes from the files in the zip. For example, the
>>>>>>>>> lastModifiedTime is not available so downstream I am unable to do
>>>>>>>>> this:
>>>>>>>>> ${file.lastModifiedTime:toDate("yyyy-MM-dd'T'HH:mm:ssZ"):format("yyyyMMddHHmmss")}
>>>>>>>>>
>>>>>>>>> I did some digging and found that on the UnpackContent page, it
>>>>>>>>> says:
>>>>>>>>> file.lastModifiedTime  "The date and time that the unpacked file
>>>>>>>>> was last modified (*tar only*)."
>>>>>>>>>
>>>>>>>>> I need these file attributes for those files I extract from the
>>>>>>>>> zip. So as an alternative I tried configuring an
>>>>>>>>> ExecuteStreamCommand processor like this:
>>>>>>>>> Command Arguments  -c;"unzip -p -q < -"
>>>>>>>>> Command Path  /bin/bash
>>>>>>>>> Argument Delimiter   ;
>>>>>>>>>
>>>>>>>>> It throws these errors:
>>>>>>>>>
>>>>>>>>> 16:41:30 UTC ERROR 13023d28-6154-17fd-b4e8-7a30b35980ca
>>>>>>>>> ExecuteStreamCommand[id=13023d28-6154-17fd-b4e8-7a30b35980ca] Failed to
>>>>>>>>> write flow file to stdin due to Broken pipe: java.io.IOException: Broken
>>>>>>>>> pipe
>>>>>>>>>
>>>>>>>>> 16:41:30 UTC ERROR 13023d28-6154-17fd-b4e8-7a30b35980ca
>>>>>>>>> ExecuteStreamCommand[id=13023d28-6154-17fd-b4e8-7a30b35980ca] Transferring
>>>>>>>>> flow file FlowFile[filename=ABC DEF (1).zip] to nonzero status. Executable
>>>>>>>>> command /bin/bash ended in an error: /bin/bash: -: No such file or directory
>>>>>>>>>
>>>>>>>>> It does not seem to be applying the unzip to the stdin of the ESC
>>>>>>>>> processor. None of the files in the zip archive are output from ESC.
>>>>>>>>>
>>>>>>>>> What needs to be changed in my ESC configuration?
>>>>>>>>>
>>>>>>>>> Thank you in advance for any help.
>>>>>>>>>
>>>>>>>>>
