Hi Bryan,

Thanks for your reply! Let me highlight our use case in some more detail.

This is a manufacturing setting. We have hundreds of machines that perform
tests on millions of objects. As part of those tests, the machines
generate a variety of files. We use a NiFi installation on each
machine to manage those files. However, we don't always want to pick up
those files, and we don't always want to treat them in the same way. Our
preferred solution is to have the NiFi instance listen using a
HandleHTTPRequest processor and we send it a POST with some properties
(files from which folders, what kind of files, what to do with them, etc.).
We then perform some validation of the request, and would next use the
properties to perform actions on those files (list, fetch, etc.). In this
case, we are looking for files that are being generated, so we need to look
for new files in a folder and thus require state. However, we can perform
multiple runs on a given folder. In this particular use case we don't move
files - we copy them as they are generated, and a later invocation of the
flow with different properties deletes them.
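
For illustration, a trigger request of the kind described might look like
the following sketch (the field names and values here are hypothetical, not
our actual schema):

```python
import json

# Hypothetical payload POSTed to the HandleHttpRequest endpoint on a machine.
# Field names are illustrative only.
trigger = {
    "source_folder": "/data/machine-42/results",
    "file_pattern": "*.csv",
    "action": "copy",      # a later invocation might send "delete"
    "keep_state": True,    # watch for new files as they are generated
}

body = json.dumps(trigger)
print(body)
```

The flow would validate these properties and then use them to drive the
listing and fetching of the matching files.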

We use many variations on the above theme. We have some flows that need to
fetch the files generated in the above example from their permanent storage
location, and perform some analysis. So in this case all the files are
already "in-place" in a folder, and we just need to do something with all
of those files.

Currently, in the first case, we configure, start, stop, and clear the
state of the flow for a given machine using the API. In the second case, we
shell out to a script that generates the listing, and go from there.
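
For reference, the per-machine automation in the first case amounts to
roughly the following sketch against the NiFi REST API (base URL and
processor id are placeholders; authentication and error handling are
omitted; endpoints as documented for NiFi 1.x):

```python
import json
import urllib.request

BASE = "http://localhost:8080/nifi-api"  # placeholder per-machine URL


def run_status_payload(revision_version, state):
    """Request body for PUT /processors/{id}/run-status."""
    return {"revision": {"version": revision_version}, "state": state}


def _call(method, url, payload=None):
    data = json.dumps(payload).encode() if payload is not None else None
    req = urllib.request.Request(
        url, data=data, method=method,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def stop_and_clear(processor_id):
    # 1. Look up the current revision (NiFi requires it on writes).
    comp = _call("GET", f"{BASE}/processors/{processor_id}")
    version = comp["revision"]["version"]
    # 2. Stop the processor.
    _call("PUT", f"{BASE}/processors/{processor_id}/run-status",
          run_status_payload(version, "STOPPED"))
    # 3. Clear its local state so the next run re-lists everything.
    _call("POST", f"{BASE}/processors/{processor_id}/state/clear-requests")
```

Keeping this inside NiFi would remove the need for this kind of external
orchestration entirely.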

Whilst this approach works, it would be nice (and powerful) to be able to
keep everything in NiFi. We have many flows, many moving parts, and having
to use "external" scripts increases complexity in different ways. We now
have to ensure that the various scripts that do listings are always
reachable from NiFi, so NiFi deployment has become just a little bit more
complex. Given the number of NiFi instances and the number of processes we
have, all these little complexities add up. We have to make sure that
future developers understand that, besides the flow itself, they also have to
look after these helper scripts - another additional complexity. The
requirement to manipulate the API in some cases also limits the potential
developer pool.

Personally, I believe that being able to selectively allow incoming
flowfiles and maintain state would be a real benefit. In our variations
outlined above, we have a situation where an incoming flowfile generates
tens of thousands of subsequent flowfiles - the incoming flowfile is simply
a trigger. In some cases we'd like to be able to say "do this, keep state"
and in other cases say "do this, don't worry about state".

I hope that clarifies things a bit. Thanks again for looking into this.

Martijn



On Tue, 25 Sep 2018 at 15:55, Bryan Bende <bbe...@gmail.com> wrote:

> Hi Martijn,
>
> The request for the "list" processors to support incoming flow files
> comes up frequently. The issue is that the list processors are meant
> to continuously watch a given directory/bucket and maintain state
> about what has been seen and only find newer stuff. So if you let the
> processor support incoming flow files then it means the directory can
> potentially be different on every execution of the processor, which
> then makes it problematic for maintaining state... how do we know if
> there will ever be another flow file indicating the same directory and
> whether we need to keep the state around? How much state can we actually
> store? etc.
>
> I don't know exactly what your use case is, but I think it would be
> reasonable to support a variation of each "list" processor that
> supports incoming flow files, but does NOT maintain state. Meaning, it
> would be used to perform a one-time listing based on the incoming flow
> file, and if another flow file came in later with the same
> directory/bucket, it would have no knowledge of the previous execution
> and thus list everything again.
>
> -Bryan
>
> On Tue, Sep 25, 2018 at 2:02 AM Martijn Dekkers <mart...@dekkers.org.uk>
> wrote:
> >
> > Hi Koji,
> >
> > Thanks, that is exactly the path we took in the end. This is a repeating
> pattern for us, and we would have preferred to keep it all contained in an
> ISP. Since the output of the listing is very large, we run into some memory
> issues at the SplitText step, so we use a few of those in sequence, which
> is all a bit hacky. When we have some time we will get back to this, and
> hopefully get it done "correctly".
> >
> > I am trying to work out what the reasoning is for none of the List-type
> processors to accept incoming connections, we use them frequently and have
> to resort to all kinds of acrobatics to work around this. In this instance
> we use an external script, in some others we have to set up infrastructure
> outside of NiFi to set parameters via the API. It would be a lot easier and
> smoother if we could simply accept an incoming connection and use
> attributes.
> >
> > Thanks,
> >
> > Martijn
> >
> > On Tue, 25 Sep 2018 at 02:37, Koji Kawamura <ijokaruma...@gmail.com>
> wrote:
> >>
> >> Hi Martijn,
> >>
> >> I'm not an expert on Jython, but if you already have a python script
> >> using boto3 working fine, then I'd suggest using ExecuteStreamCommand
> >> instead.
> >> For example:
> >> - you can design the python script to print out JSON formatted string
> >> about listed files
> >> - then connect the outputs to SplitJson
> >> - and use EvaluateJsonPath to extract required values to FlowFile
> attribute
> >> - finally, use FetchS3Object
> >>
> >> Thanks,
> >> Koji
>
