Re: Listing S3

Bryan Bende Tue, 25 Sep 2018 06:56:38 -0700

Hi Martijn,

The request for the "list" processors to support incoming flow files
comes up frequently. The issue is that the list processors are meant
to continuously watch a given directory/bucket and maintain state
about what has been seen and only find newer stuff. So if you let the
processor support incoming flow files then it means the directory can
potentially be different on every execution of the processor, which
then makes it problematic for maintaining state. How do we know if
there will ever be another flow file indicating the same directory, and
whether we need to keep the state around? How much state can we actually
store? etc.


I don't know exactly what your use case is, but I think it would be
reasonable to support a variation of each "list" processor that
supports incoming flow files, but does NOT maintain state. Meaning, it
would be used to perform a one-time listing based on the incoming flow
file, and if another flow file came in later with the same
directory/bucket, it would have no knowledge of the previous execution
and thus list everything again.
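
To make the trade-off concrete, here is a minimal sketch in plain Python. It is
not how NiFi implements its list processors; the names StatefulLister and
list_once, and the (name, timestamp) entry shape, are made up for illustration.
It only shows why per-directory state grows with every distinct directory,
while a stateless listing simply repeats itself.

```python
from typing import Dict, Iterable, List, Tuple

# Hypothetical entry shape: (object name, last-modified timestamp).
Entry = Tuple[str, int]


class StatefulLister:
    """Remembers the newest timestamp seen for each directory/bucket."""

    def __init__(self) -> None:
        # One state entry per distinct directory ever listed. With flow-file
        # driven directories this map can grow without bound, and nothing
        # tells us when an entry is safe to drop.
        self.latest_seen: Dict[str, int] = {}

    def list_new(self, directory: str, entries: Iterable[Entry]) -> List[str]:
        cutoff = self.latest_seen.get(directory, -1)
        fresh = [(name, ts) for name, ts in entries if ts > cutoff]
        if fresh:
            self.latest_seen[directory] = max(ts for _, ts in fresh)
        return [name for name, _ in fresh]


def list_once(entries: Iterable[Entry]) -> List[str]:
    """Stateless variant: no memory of earlier runs, so it lists everything."""
    return [name for name, _ in entries]
```

Calling list_new twice with the same directory and the same entries returns
the names the first time and an empty list the second time, whereas list_once
returns the full listing on every call.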

-Bryan

On Tue, Sep 25, 2018 at 2:02 AM Martijn Dekkers <mart...@dekkers.org.uk> wrote:
>
> Hi Koji,
>
> Thanks, that is exactly the path we took in the end. This is a repeating 
> pattern for us, and we would have preferred to keep it all contained in an 
> ISP. Since the output of the listing is very large, we run into some memory 
> issues at the SplitText step, so we use a few of those in sequence, which is 
> all a bit hacky. When we have some time we will get back to this, and 
> hopefully get it done "correctly".
>
> I am trying to work out what the reasoning is for none of the List-type
> processors accepting incoming connections. We use them frequently and have to
> resort to all kinds of acrobatics to work around this. In this instance we 
> use an external script, in some others we have to set up infrastructure 
> outside of NiFi to set parameters via the API. It would be a lot easier and 
> smoother if we could simply accept an incoming connection and use attributes.
>
> Thanks,
>
> Martijn
>
> On Tue, 25 Sep 2018 at 02:37, Koji Kawamura <ijokaruma...@gmail.com> wrote:
>>
>> Hi Martijn,
>>
>> I'm not an expert on Jython, but if you already have a python script
>> using boto3 working fine, then I'd suggest using ExecuteStreamCommand
>> instead.
>> For example:
>> - you can design the python script to print out JSON formatted string
>> about listed files
>> - then connect the outputs to SplitJson
>> - and use EvaluateJsonPath to extract required values into FlowFile attributes
>> - finally, use FetchS3Object
>>
>> Thanks,
>> Koji
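
The script Koji describes could look roughly like the sketch below. It assumes
boto3 is installed and credentials are configured in the environment; the
output field names (bucket, key, size) and the command-line arguments are
illustrative choices, not anything NiFi requires. It prints a single JSON
array so that SplitJson can break it into one flow file per object.

```python
import json
import sys
from typing import Any, Dict, Iterator, TextIO


def iter_objects(client: Any, bucket: str, prefix: str = "") -> Iterator[Dict[str, Any]]:
    """Yield one small record per object under bucket/prefix."""
    # list_objects_v2 returns at most 1000 keys per call; the paginator
    # follows the continuation tokens for us.
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent when a page has no matching objects.
        for obj in page.get("Contents", []):
            yield {"bucket": bucket, "key": obj["Key"], "size": obj["Size"]}


def write_listing(client: Any, bucket: str, prefix: str, out: TextIO) -> int:
    """Write the listing as one JSON array and return the object count."""
    records = list(iter_objects(client, bucket, prefix))
    json.dump(records, out)
    out.write("\n")
    return len(records)


def main(argv: list) -> None:
    # Imported here so the helpers above can be exercised without boto3.
    import boto3

    bucket = argv[1]
    prefix = argv[2] if len(argv) > 2 else ""
    write_listing(boto3.client("s3"), bucket, prefix, sys.stdout)


# Hypothetical usage: python list_s3.py my-bucket some/prefix/
if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Because the whole listing is built in memory before it is written, a very
large bucket would have the same memory pressure Martijn mentions; printing
one JSON object per line and splitting by line is a possible variation.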
