Re: Listing S3

2021-05-17 Thread Noe Detore
Any update on this? Has the ListS3 processor been updated, or are there
plans to create a new processor?





Re: Listing S3

2018-09-25 Thread Martijn Dekkers
Hi Bryan,

Thanks for your reply! Let me highlight our use case in some more detail.

This is a manufacturing setting. We have machines (hundreds) that perform some
tests on objects (millions). As part of those tests, these machines
generate various different files. We use a NiFi installation on each
machine to manage those files. However, we don't always want to pick up
those files, and we don't always want to treat them in the same way. Our
preferred solution is to have the NiFi instance listen using a
HandleHTTPRequest processor and we send it a POST with some properties
(files from which folders, what kind of files, what to do with them, etc.).
We then perform some validation of the request, and would next use the
properties to perform actions on those files (list, fetch, etc.). In this
case, we are looking for files that are being generated, so we need to look
for new files in a folder and thus require state. However, we can perform
multiple runs on a given folder. In this particular use case we don't move
files - we copy them as they are generated, and in a later invocation of the
flow with different properties we delete them.
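
To make the trigger request concrete, here is a minimal sketch in Python of
the kind of validation step described above. The property names (folder,
pattern, action, keep_state) and the set of allowed actions are assumptions
made for illustration; the thread does not spell out the actual POST body.

```python
from dataclasses import dataclass

# Hypothetical set of actions; the thread only mentions "list, fetch, etc."
ALLOWED_ACTIONS = {"list", "fetch", "copy", "delete"}


@dataclass(frozen=True)
class ListingRequest:
    folder: str       # which folder to look in
    pattern: str      # what kind of files (glob-style, hypothetical)
    action: str       # what to do with the files
    keep_state: bool  # whether this run should remember what it has seen


def validate_request(props: dict) -> ListingRequest:
    """Check the properties of an incoming trigger and return a typed request.

    Raises ValueError when a required property is missing or unsupported.
    """
    missing = [key for key in ("folder", "action") if not props.get(key)]
    if missing:
        raise ValueError(f"missing properties: {', '.join(missing)}")
    action = str(props["action"]).lower()
    if action not in ALLOWED_ACTIONS:
        raise ValueError(f"unsupported action: {action}")
    return ListingRequest(
        folder=str(props["folder"]),
        pattern=str(props.get("pattern", "*")),
        action=action,
        keep_state=bool(props.get("keep_state", False)),
    )
```

In a flow, the validated fields would presumably end up as flowfile
attributes that the downstream list/fetch steps read.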

We use many variations on the above theme. We have some flows that need to
fetch the files generated in the above example from their permanent storage
location, and perform some analysis. So in this case all the files are
already "in-place" in a folder, and we just need to do something with all
of those files.

Currently, in the first case, we configure, start, stop, and clear the
state of the flow for a given machine using the API. In the second case, we
shell out to a script that generates the listing, and go from there.

Whilst this approach works, it would be nice (and powerful) to be able to
keep everything in NiFi. We have many flows, many moving parts, and having
to use "external" scripts increases complexity in different ways. We now
have to ensure that the various scripts that do listings are always
reachable from NiFi, so NiFi deployment has become just a little bit more
complex. Given the number of NiFi instances and the number of processes we
have, all these little complexities add up. We have to make sure that
future developers understand that besides the flowfile, they also have to
look after these helper scripts - another additional complexity. The
requirement of manipulating the API in some cases means that we are
limiting the potential developer pool.

Personally, I believe that being able to selectively allow incoming
flowfiles and maintain state would be a real benefit. In our variations
outlined above, we have a situation where an incoming flowfile generates
tens of thousands of subsequent flowfiles - the incoming flowfile is simply
a trigger. In some cases we'd like to be able to say "do this, keep state"
and in other cases say "do this, don't worry about state".
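
The two modes can be sketched as a single listing function with optional
state. This is a plain-Python illustration against a local folder, not NiFi
code; the function name and the shape of the state (a dict of last-seen
modification times per folder) are assumptions.

```python
import os
from typing import Dict, List, Optional


def list_files(folder: str, state: Optional[Dict[str, float]] = None) -> List[str]:
    """Return the paths of files in `folder`, oldest first.

    With `state`: only files modified after the timestamp recorded for this
    folder are returned, and the timestamp is advanced ("keep state").
    Without `state`: every file is returned on every call.
    """
    last_seen = float("-inf")
    if state is not None:
        last_seen = state.get(folder, float("-inf"))

    entries = []
    with os.scandir(folder) as it:
        for entry in it:
            if entry.is_file():
                mtime = entry.stat().st_mtime
                if mtime > last_seen:
                    entries.append((mtime, entry.path))
    entries.sort()

    # Remember the newest timestamp seen so the next stateful call skips it.
    if state is not None and entries:
        state[folder] = entries[-1][0]
    return [path for _, path in entries]
```

One simplification to be aware of: a file written later with exactly the
same modification time as the last-seen one would be skipped by this sketch.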

I hope that clarifies things a bit. Thanks again for looking into this.

Martijn



On Tue, 25 Sep 2018 at 15:55, Bryan Bende  wrote:

> Hi Martijn,
>
> The request for the "list" processors to support incoming flow files
> comes up frequently. The issue is that the list processors are meant
> to continuously watch a given directory/bucket and maintain state
> about what has been seen and only find newer stuff. So if you let the
> processor support incoming flow files then it means the directory can
> potentially be different on every execution of the processor, which
> then makes it problematic for maintaining state... how do we know if
> there will ever be another flow file indicating the same directory and
> whether we need to keep the state around? how much state can we actually
> store? etc.
>
> I don't know exactly what your use case is, but I think it would be
> reasonable to support a variation of each "list" processor that
> supports incoming flow files, but does NOT maintain state. Meaning, it
> would be used to perform a one-time listing based on the incoming flow
> file, and if another flow file came in later with the same
> directory/bucket, it would have no knowledge of the previous execution
> and thus list everything again.
>
> -Bryan
>
> On Tue, Sep 25, 2018 at 2:02 AM Martijn Dekkers 
> wrote:
> >
> > Hi Koji,
> >
> > Thanks, that is exactly the path we took in the end. This is a repeating
> pattern for us, and we would have preferred to keep it all contained in an
> ISP. Since the output of the listing is very large, we run into some memory
> issues at the SplitText step, so we use a few of those in sequence, which
> is all a bit hacky. When we have some time we will get back to this, and
> hopefully get it done "correctly".
> >
> > I am trying to work out what the reasoning is for none of the List-type
> processors to accept incoming connections, we use them frequently and have
> to resort to all kinds of acrobatics to work around this. In this instance
> we use an external script, in some others we have to set up infrastructure
> outside of NiFi to set parameters via the API. It would be 

Re: Listing S3

2018-09-25 Thread Sivaprasanna
Fair point.

On Tue, Sep 25, 2018 at 8:14 PM Bryan Bende  wrote:

> Yea if it can be done in a way to only clear state once the processor
> is started with state management = false, this way if someone
> accidentally toggles the value and hits apply, but then wants to go
> back before starting, they wouldn't lose the state.
>
> On Tue, Sep 25, 2018 at 10:40 AM Mark Payne  wrote:
> >
> > It certainly would be ideal to clear the state.
> >
> > On Sep 25, 2018, at 10:34 AM, Sivaprasanna 
> wrote:
> >
> > I'm in for a configurable state management property.
> >
> > One question: Let's say we have a processor already running without an
> incoming connection and have the state management property set to 'true'.
> After a couple of iterations, it would have some state set. Later the user
> adds an incoming connection and has the state management property set to
> 'false'. In this case, do we have to clear off the state? Or maintain the
> state as is but just don't consider it?
> >
> > -
> > Sivaprasanna
> >
> > On Tue, Sep 25, 2018 at 7:54 PM Bryan Bende  wrote:
> >>
> >> I like that approach too.
> >> On Tue, Sep 25, 2018 at 10:21 AM Pierre Villard
> >>  wrote:
> >> >
> >> > +1 with Matt's proposal and Mark's comment, I think it'd help
> answering some use cases.
> >> > We just need to be very clear about the processor behavior for each
> possible case/configuration.
> >> >
> >> > Pierre
> >> >
> >> > On Tue, 25 Sep 2018 at 16:19, Mark Payne wrote:
> >> >>
> >> >> Matt,
> >> >>
> >> >> I think it's very dangerous to manipulate the behavior of the
> processor so drastically based
> >> >> on the presence or absence of an incoming connection. I think it is
> fair game, however, to allow
> >> >> for a new property to be added that indicates whether or not state
> is maintained. Then, the processor
> >> >> could be made invalid if attempting to maintain state and has an
> incoming connection.
> >> >>
> >> >> This approach would be nice anyway because there are valid use cases
> to have a ListFile processor,
> >> >> for example, be the 'source processor' and still want to do a full
> listing every hour, let's say, rather than keeping
> >> >> state and only getting the 'diff'.
> >> >>
> >> >> Thanks
> >> >> -Mark
> >> >>
> >> >> > On Sep 25, 2018, at 10:11 AM, Matt Burgess 
> wrote:
> >> >> >
> >> >> > With so many List processors, having a separate version of them
> might
> >> >> > lead to component bloat. GenerateTableFetch is an example of a
> source
> >> >> > processor that can optionally accept incoming flow files only for
> the
> >> >> > purpose of configuration (attributes, e.g.). The analogy to
> Flow-Based
> >> >> > Programming is the OPTIONS named input, which is specifically for
> >> >> > configuration instead of data flow.  For GenerateTableFetch, the
> >> >> > Max-Value Columns and Columns To Return property must be blank or
> >> >> > constant for all possible incoming tables. The former effectively
> >> >> > "disables" state, and the latter ensures single state for column
> >> >> > names/types, although the max values are stored in state by table
> >> >> > name, so the onus is on the user to ensure that the number of
> >> >> > different tables is not so large as to clobber the state store (~1
> MB
> >> >> > in practice IIRC).
> >> >> >
> >> >> > What about doing something similar for List processors? If there
> is no
> >> >> > incoming connection, then they continue to behave as they always
> have.
> >> >> > If there is an incoming connection and no flow file, no work is
> >> >> > performed. If there is an incoming connection with available flow
> >> >> > file(s), then it is more "event-driven" in the sense that state
> will
> >> >> > not be maintained, with the tradeoff that flow file attributes can
> be
> >> >> > used to configure the List properties?
> >> >> >
> >> >> > Regards,
> >> >> > Matt
> >> >> >
> >> >> > On Tue, Sep 25, 2018 at 9:55 AM Bryan Bende 
> wrote:
> >> >> >>
> >> >> >> Hi Martijn,
> >> >> >>
> >> >> >> The request for the "list" processors to support incoming flow
> files
> >> >> >> comes up frequently. The issue is that the list processors are
> meant
> >> >> >> to continuously watch a given directory/bucket and maintain state
> >> >> >> about what has been seen and only find newer stuff. So if you let
> the
> >> >> >> processor support incoming flow files then it means the directory
> can
> >> >> >> potentially be different on every execution of the processor,
> which
> >> >> >> then makes it problematic for maintaining state... how do we know
> if
> >> >> >> there will ever be another flow file indicating the same
> directory and
> >> >> >> whether we need to keep the state around? how much state can we
> actually
> >> >> >> store? etc.
> >> >> >>
> >> >> >> I don't know exactly what your use case is, but I think it
> would be
> >> >> >> reasonable to support a variation of each "list" processor that
> >> >> >> supports incoming flow files, but does NOT maintain state.
> 

Re: Listing S3

2018-09-25 Thread Bryan Bende
Yea, if it can be done in a way that only clears state once the processor
is started with state management = false. That way, if someone
accidentally toggles the value and hits apply, but then wants to go
back before starting, they wouldn't lose the state.
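
A toy model of that rule, in Python rather than NiFi code: applying a
configuration change leaves stored state alone, and state is discarded only
when the processor is actually started with state management disabled. The
class and method names are hypothetical and do not correspond to the NiFi
processor API.

```python
class ListingProcessorSketch:
    """Toy model of a List-type processor with a 'maintain state' property."""

    def __init__(self) -> None:
        self.maintain_state = True  # currently applied property value
        self.state = {}             # stored listing state, e.g. last-seen timestamps
        self.running = False

    def apply_config(self, maintain_state: bool) -> None:
        # Applying a config change never touches stored state,
        # so an accidental toggle can still be reverted.
        if self.running:
            raise RuntimeError("stop the processor before changing its configuration")
        self.maintain_state = maintain_state

    def start(self) -> None:
        # State is discarded only here, once the processor actually
        # starts with state management disabled.
        if not self.maintain_state:
            self.state.clear()
        self.running = True

    def stop(self) -> None:
        self.running = False
```

Toggling the property to false and back to true before starting therefore
keeps the state; only a start with the property set to false drops it.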

On Tue, Sep 25, 2018 at 10:40 AM Mark Payne  wrote:
>
> It certainly would be ideal to clear the state.
>
> On Sep 25, 2018, at 10:34 AM, Sivaprasanna  wrote:
>
> I'm in for a configurable state management property.
>
> One question: Let's say we have a processor already running without an 
> incoming connection and have the state management property set to 'true'. 
> After a couple of iterations, it would have some state set. Later the user 
> adds an incoming connection and has the state management property set to 
> 'false'. In this case, do we have to clear off the state? Or maintain the 
> state as is but just don't consider it?
>
> -
> Sivaprasanna
>
> On Tue, Sep 25, 2018 at 7:54 PM Bryan Bende  wrote:
>>
>> I like that approach too.
>> On Tue, Sep 25, 2018 at 10:21 AM Pierre Villard
>>  wrote:
>> >
>> > +1 with Matt's proposal and Mark's comment, I think it'd help answering 
>> > some use cases.
>> > We just need to be very clear about the processor behavior for each 
>> > possible case/configuration.
>> >
>> > Pierre
>> >
>> > On Tue, 25 Sep 2018 at 16:19, Mark Payne wrote:
>> >>
>> >> Matt,
>> >>
>> >> I think it's very dangerous to manipulate the behavior of the processor 
>> >> so drastically based
>> >> on the presence or absence of an incoming connection. I think it is fair 
>> >> game, however, to allow
>> >> for a new property to be added that indicates whether or not state is 
>> >> maintained. Then, the processor
>> >> could be made invalid if attempting to maintain state and has an incoming 
>> >> connection.
>> >>
>> >> This approach would be nice anyway because there are valid use cases to 
>> >> have a ListFile processor,
>> >> for example, be the 'source processor' and still want to do a full 
>> >> listing every hour, let's say, rather than keeping
>> >> state and only getting the 'diff'.
>> >>
>> >> Thanks
>> >> -Mark
>> >>
>> >> > On Sep 25, 2018, at 10:11 AM, Matt Burgess  wrote:
>> >> >
>> >> > With so many List processors, having a separate version of them might
>> >> > lead to component bloat. GenerateTableFetch is an example of a source
>> >> > processor that can optionally accept incoming flow files only for the
>> >> > purpose of configuration (attributes, e.g.). The analogy to Flow-Based
>> >> > Programming is the OPTIONS named input, which is specifically for
>> >> > configuration instead of data flow.  For GenerateTableFetch, the
>> >> > Max-Value Columns and Columns To Return property must be blank or
>> >> > constant for all possible incoming tables. The former effectively
>> >> > "disables" state, and the latter ensures single state for column
>> >> > names/types, although the max values are stored in state by table
>> >> > name, so the onus is on the user to ensure that the number of
>> >> > different tables is not so large as to clobber the state store (~1 MB
>> >> > in practice IIRC).
>> >> >
>> >> > What about doing something similar for List processors? If there is no
>> >> > incoming connection, then they continue to behave as they always have.
>> >> > If there is an incoming connection and no flow file, no work is
>> >> > performed. If there is an incoming connection with available flow
>> >> > file(s), then it is more "event-driven" in the sense that state will
>> >> > not be maintained, with the tradeoff that flow file attributes can be
>> >> > used to configure the List properties?
>> >> >
>> >> > Regards,
>> >> > Matt
>> >> >
>> >> > On Tue, Sep 25, 2018 at 9:55 AM Bryan Bende  wrote:
>> >> >>
>> >> >> Hi Martijn,
>> >> >>
>> >> >> The request for the "list" processors to support incoming flow files
>> >> >> comes up frequently. The issue is that the list processors are meant
>> >> >> to continuously watch a given directory/bucket and maintain state
>> >> >> about what has been seen and only find newer stuff. So if you let the
>> >> >> processor support incoming flow files then it means the directory can
>> >> >> potentially be different on every execution of the processor, which
>> >> >> then makes it problematic for maintaining state... how do we know if
>> >> >> there will ever be another flow file indicating the same directory and
>> >> >> whether we need to keep the state around? how much state can we actually
>> >> >> store? etc.
>> >> >>
>> >> >> I don't know exactly what your use case is, but I think it would be
>> >> >> reasonable to support a variation of each "list" processor that
>> >> >> supports incoming flow files, but does NOT maintain state. Meaning, it
>> >> >> would be used to perform a one-time listing based on the incoming flow
>> >> >> file, and if another flow file came in later with the same
>> >> >> directory/bucket, it would have no knowledge of the previous execution
>> >> >> 

Re: Listing S3

2018-09-25 Thread Mark Payne
It certainly would be ideal to clear the state.

On Sep 25, 2018, at 10:34 AM, Sivaprasanna <sivaprasanna...@gmail.com> wrote:

I'm in for a configurable state management property.

One question: Let's say we have a processor already running without an incoming 
connection and have the state management property set to 'true'. After a couple 
of iterations, it would have some state set. Later the user adds an incoming 
connection and has the state management property set to 'false'. In this case, 
do we have to clear off the state? Or maintain the state as is but just don't 
consider it?

-
Sivaprasanna

On Tue, Sep 25, 2018 at 7:54 PM Bryan Bende <bbe...@gmail.com> wrote:
I like that approach too.
On Tue, Sep 25, 2018 at 10:21 AM Pierre Villard <pierre.villard...@gmail.com> wrote:
>
> +1 with Matt's proposal and Mark's comment, I think it'd help answering some 
> use cases.
> We just need to be very clear about the processor behavior for each possible 
> case/configuration.
>
> Pierre
>
> On Tue, 25 Sep 2018 at 16:19, Mark Payne <marka...@hotmail.com> wrote:
>>
>> Matt,
>>
>> I think it's very dangerous to manipulate the behavior of the processor so 
>> drastically based
>> on the presence or absence of an incoming connection. I think it is fair 
>> game, however, to allow
>> for a new property to be added that indicates whether or not state is 
>> maintained. Then, the processor
>> could be made invalid if attempting to maintain state and has an incoming 
>> connection.
>>
>> This approach would be nice anyway because there are valid use cases to have 
>> a ListFile processor,
>> for example, be the 'source processor' and still want to do a full listing 
>> every hour, let's say, rather than keeping
>> state and only getting the 'diff'.
>>
>> Thanks
>> -Mark
>>
>> > On Sep 25, 2018, at 10:11 AM, Matt Burgess <mattyb...@apache.org> wrote:
>> >
>> > With so many List processors, having a separate version of them might
>> > lead to component bloat. GenerateTableFetch is an example of a source
>> > processor that can optionally accept incoming flow files only for the
>> > purpose of configuration (attributes, e.g.). The analogy to Flow-Based
>> > Programming is the OPTIONS named input, which is specifically for
>> > configuration instead of data flow.  For GenerateTableFetch, the
>> > Max-Value Columns and Columns To Return property must be blank or
>> > constant for all possible incoming tables. The former effectively
>> > "disables" state, and the latter ensures single state for column
>> > names/types, although the max values are stored in state by table
>> > name, so the onus is on the user to ensure that the number of
>> > different tables is not so large as to clobber the state store (~1 MB
>> > in practice IIRC).
>> >
>> > What about doing something similar for List processors? If there is no
>> > incoming connection, then they continue to behave as they always have.
>> > If there is an incoming connection and no flow file, no work is
>> > performed. If there is an incoming connection with available flow
>> > file(s), then it is more "event-driven" in the sense that state will
>> > not be maintained, with the tradeoff that flow file attributes can be
>> > used to configure the List properties?
>> >
>> > Regards,
>> > Matt
>> >
>> > On Tue, Sep 25, 2018 at 9:55 AM Bryan Bende <bbe...@gmail.com> wrote:
>> >>
>> >> Hi Martijn,
>> >>
>> >> The request for the "list" processors to support incoming flow files
>> >> comes up frequently. The issue is that the list processors are meant
>> >> to continuously watch a given directory/bucket and maintain state
>> >> about what has been seen and only find newer stuff. So if you let the
>> >> processor support incoming flow files then it means the directory can
>> >> potentially be different on every execution of the processor, which
>> >> then makes it problematic for maintaining state... how do we know if
>> >> there will ever be another flow file indicating the same directory and
>> >> whether we need to keep the state around? how much state can we actually
>> >> store? etc.
>> >>
>> >> I don't know exactly what your use case is, but I think it would be
>> >> reasonable to support a variation of each "list" processor that
>> >> supports incoming flow files, but does NOT maintain state. Meaning, it
>> >> would be used to perform a one-time listing based on the incoming flow
>> >> file, and if another flow file came in later with the same
>> >> directory/bucket, it would have no knowledge of the previous execution
>> >> and thus list everything again.
>> >>
>> >> -Bryan
>> >>
>> >> On Tue, Sep 25, 2018 at 2:02 AM Martijn Dekkers <mart...@dekkers.org.uk> wrote:
>> >>>
>> >>> Hi Koji,
>> >>>
>> >>> Thanks, that is exactly the path we took in the end. This is a repeating 
>> >>> pattern for us, and we would have preferred to keep it all contained in 
>> >>> an ISP. Since the output of the 

Re: Listing S3

2018-09-25 Thread Sivaprasanna
I'm in for a configurable state management property.

One question: Let's say we have a processor already running without an
incoming connection and have the state management property set to 'true'.
After a couple of iterations, it would have some state set. Later the user
adds an incoming connection and has the state management property set to
'false'. In this case, do we have to clear off the state? Or maintain the
state as is but just don't consider it?

-
Sivaprasanna

On Tue, Sep 25, 2018 at 7:54 PM Bryan Bende  wrote:

> I like that approach too.
> On Tue, Sep 25, 2018 at 10:21 AM Pierre Villard
>  wrote:
> >
> > +1 with Matt's proposal and Mark's comment, I think it'd help answering
> some use cases.
> > We just need to be very clear about the processor behavior for each
> possible case/configuration.
> >
> > Pierre
> >
> > On Tue, 25 Sep 2018 at 16:19, Mark Payne wrote:
> >>
> >> Matt,
> >>
> >> I think it's very dangerous to manipulate the behavior of the processor
> so drastically based
> >> on the presence or absence of an incoming connection. I think it is
> fair game, however, to allow
> >> for a new property to be added that indicates whether or not state is
> maintained. Then, the processor
> >> could be made invalid if attempting to maintain state and has an
> incoming connection.
> >>
> >> This approach would be nice anyway because there are valid use cases to
> have a ListFile processor,
> >> for example, be the 'source processor' and still want to do a full
> listing every hour, let's say, rather than keeping
> >> state and only getting the 'diff'.
> >>
> >> Thanks
> >> -Mark
> >>
> >> > On Sep 25, 2018, at 10:11 AM, Matt Burgess 
> wrote:
> >> >
> >> > With so many List processors, having a separate version of them might
> >> > lead to component bloat. GenerateTableFetch is an example of a source
> >> > processor that can optionally accept incoming flow files only for the
> >> > purpose of configuration (attributes, e.g.). The analogy to Flow-Based
> >> > Programming is the OPTIONS named input, which is specifically for
> >> > configuration instead of data flow.  For GenerateTableFetch, the
> >> > Max-Value Columns and Columns To Return property must be blank or
> >> > constant for all possible incoming tables. The former effectively
> >> > "disables" state, and the latter ensures single state for column
> >> > names/types, although the max values are stored in state by table
> >> > name, so the onus is on the user to ensure that the number of
> >> > different tables is not so large as to clobber the state store (~1 MB
> >> > in practice IIRC).
> >> >
> >> > What about doing something similar for List processors? If there is no
> >> > incoming connection, then they continue to behave as they always have.
> >> > If there is an incoming connection and no flow file, no work is
> >> > performed. If there is an incoming connection with available flow
> >> > file(s), then it is more "event-driven" in the sense that state will
> >> > not be maintained, with the tradeoff that flow file attributes can be
> >> > used to configure the List properties?
> >> >
> >> > Regards,
> >> > Matt
> >> >
> >> > On Tue, Sep 25, 2018 at 9:55 AM Bryan Bende  wrote:
> >> >>
> >> >> Hi Martijn,
> >> >>
> >> >> The request for the "list" processors to support incoming flow files
> >> >> comes up frequently. The issue is that the list processors are meant
> >> >> to continuously watch a given directory/bucket and maintain state
> >> >> about what has been seen and only find newer stuff. So if you let the
> >> >> processor support incoming flow files then it means the directory can
> >> >> potentially be different on every execution of the processor, which
> >> >> then makes it problematic for maintaining state... how do we know if
> >> >> there will ever be another flow file indicating the same directory
> and
> >> >> whether we need to keep the state around? how much state can we actually
> >> >> store? etc.
> >> >>
> >> >> I don't know exactly what your use case is, but I think it would be
> >> >> reasonable to support a variation of each "list" processor that
> >> >> supports incoming flow files, but does NOT maintain state. Meaning,
> it
> >> >> would be used to perform a one-time listing based on the incoming
> flow
> >> >> file, and if another flow file came in later with the same
> >> >> directory/bucket, it would have no knowledge of the previous
> execution
> >> >> and thus list everything again.
> >> >>
> >> >> -Bryan
> >> >>
> >> >> On Tue, Sep 25, 2018 at 2:02 AM Martijn Dekkers <
> mart...@dekkers.org.uk> wrote:
> >> >>>
> >> >>> Hi Koji,
> >> >>>
> >> >>> Thanks, that is exactly the path we took in the end. This is a
> repeating pattern for us, and we would have preferred to keep it all
> contained in an ISP. Since the output of the listing is very large, we run
> into some memory issues at the SplitText step, so we use a few of those in
> sequence, which is all a bit hacky. When we have some 

Re: Listing S3

2018-09-25 Thread Bryan Bende
I like that approach too.
On Tue, Sep 25, 2018 at 10:21 AM Pierre Villard
 wrote:
>
> +1 with Matt's proposal and Mark's comment, I think it'd help answering some 
> use cases.
> We just need to be very clear about the processor behavior for each possible 
> case/configuration.
>
> Pierre
>
> On Tue, 25 Sep 2018 at 16:19, Mark Payne wrote:
>>
>> Matt,
>>
>> I think it's very dangerous to manipulate the behavior of the processor so 
>> drastically based
>> on the presence or absence of an incoming connection. I think it is fair 
>> game, however, to allow
>> for a new property to be added that indicates whether or not state is 
>> maintained. Then, the processor
>> could be made invalid if attempting to maintain state and has an incoming 
>> connection.
>>
>> This approach would be nice anyway because there are valid use cases to have 
>> a ListFile processor,
>> for example, be the 'source processor' and still want to do a full listing 
>> every hour, let's say, rather than keeping
>> state and only getting the 'diff'.
>>
>> Thanks
>> -Mark
>>
>> > On Sep 25, 2018, at 10:11 AM, Matt Burgess  wrote:
>> >
>> > With so many List processors, having a separate version of them might
>> > lead to component bloat. GenerateTableFetch is an example of a source
>> > processor that can optionally accept incoming flow files only for the
>> > purpose of configuration (attributes, e.g.). The analogy to Flow-Based
>> > Programming is the OPTIONS named input, which is specifically for
>> > configuration instead of data flow.  For GenerateTableFetch, the
>> > Max-Value Columns and Columns To Return property must be blank or
>> > constant for all possible incoming tables. The former effectively
>> > "disables" state, and the latter ensures single state for column
>> > names/types, although the max values are stored in state by table
>> > name, so the onus is on the user to ensure that the number of
>> > different tables is not so large as to clobber the state store (~1 MB
>> > in practice IIRC).
>> >
>> > What about doing something similar for List processors? If there is no
>> > incoming connection, then they continue to behave as they always have.
>> > If there is an incoming connection and no flow file, no work is
>> > performed. If there is an incoming connection with available flow
>> > file(s), then it is more "event-driven" in the sense that state will
>> > not be maintained, with the tradeoff that flow file attributes can be
>> > used to configure the List properties?
>> >
>> > Regards,
>> > Matt
>> >
>> > On Tue, Sep 25, 2018 at 9:55 AM Bryan Bende  wrote:
>> >>
>> >> Hi Martijn,
>> >>
>> >> The request for the "list" processors to support incoming flow files
>> >> comes up frequently. The issue is that the list processors are meant
>> >> to continuously watch a given directory/bucket and maintain state
>> >> about what has been seen and only find newer stuff. So if you let the
>> >> processor support incoming flow files then it means the directory can
>> >> potentially be different on every execution of the processor, which
>> >> then makes it problematic for maintaining state... how do we know if
>> >> there will ever be another flow file indicating the same directory and
>> >> whether we need to keep the state around? how much state can we actually
>> >> store? etc.
>> >>
>> >> I don't know exactly what your use case is, but I think it would be
>> >> reasonable to support a variation of each "list" processor that
>> >> supports incoming flow files, but does NOT maintain state. Meaning, it
>> >> would be used to perform a one-time listing based on the incoming flow
>> >> file, and if another flow file came in later with the same
>> >> directory/bucket, it would have no knowledge of the previous execution
>> >> and thus list everything again.
>> >>
>> >> -Bryan
>> >>
>> >> On Tue, Sep 25, 2018 at 2:02 AM Martijn Dekkers  
>> >> wrote:
>> >>>
>> >>> Hi Koji,
>> >>>
>> >>> Thanks, that is exactly the path we took in the end. This is a repeating 
>> >>> pattern for us, and we would have preferred to keep it all contained in 
>> >>> an ISP. Since the output of the listing is very large, we run into some 
>> >>> memory issues at the SplitText step, so we use a few of those in 
>> >>> sequence, which is all a bit hacky. When we have some time we will get 
>> >>> back to this, and hopefully get it done "correctly".
>> >>>
>> >>> I am trying to work out what the reasoning is for none of the List-type 
>> >>> processors to accept incoming connections, we use them frequently and 
>> >>> have to resort to all kinds of acrobatics to work around this. In this 
>> >>> instance we use an external script, in some others we have to set up 
>> >>> infrastructure outside of NiFi to set parameters via the API. It would 
>> >>> be a lot easier and smoother if we could simply accept an incoming 
>> >>> connection and use attributes.
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Martijn
>> >>>
>> >>> On Tue, 25 Sep 2018 at 02:37, Koji 

Re: Listing S3

2018-09-25 Thread Pierre Villard
+1 to Matt's proposal and Mark's comment; I think it would help address
some use cases.
We just need to be very clear about the processor behavior for each
possible case/configuration.
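
As a starting point for spelling out those cases, here is one possible
reading of the combined proposals as a small Python function. It is a sketch
only: the parameter names and the returned labels are made up for
illustration, and the rules are inferred from Matt's and Mark's messages
quoted below rather than from any existing implementation.

```python
def list_processor_behavior(has_incoming: bool, maintain_state: bool,
                            flowfile_available: bool = False) -> str:
    """Describe what a List-type processor would do for one configuration."""
    # Mark's rule: maintaining state with an incoming connection is invalid.
    if has_incoming and maintain_state:
        return "invalid"
    # No incoming connection: the processor is a source processor.
    if not has_incoming:
        # With state it behaves as it always has; without state it
        # performs a full listing on every scheduled run.
        return "incremental listing" if maintain_state else "full listing"
    # Incoming connection, no state (Matt's "event-driven" case):
    # list once per flow file, configured from its attributes.
    if flowfile_available:
        return "one-time listing from attributes"
    return "no work"
```

Enumerating all combinations of the three inputs gives the full table of
behaviors that the documentation would need to describe.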

Pierre

On Tue, 25 Sep 2018 at 16:19, Mark Payne wrote:

> Matt,
>
> I think it's very dangerous to manipulate the behavior of the processor so
> drastically based
> on the presence or absence of an incoming connection. I think it is fair
> game, however, to allow
> for a new property to be added that indicates whether or not state is
> maintained. Then, the processor
> could be made invalid if attempting to maintain state and has an incoming
> connection.
>
> This approach would be nice anyway because there are valid use cases to
> have a ListFile processor,
> for example, be the 'source processor' and still want to do a full listing
> every hour, let's say, rather than keeping
> state and only getting the 'diff'.
>
> Thanks
> -Mark
>
> > On Sep 25, 2018, at 10:11 AM, Matt Burgess  wrote:
> >
> > With so many List processors, having a separate version of them might
> > lead to component bloat. GenerateTableFetch is an example of a source
> > processor that can optionally accept incoming flow files only for the
> > purpose of configuration (attributes, e.g.). The analogy to Flow-Based
> > Programming is the OPTIONS named input, which is specifically for
> > configuration instead of data flow.  For GenerateTableFetch, the
> > Max-Value Columns and Columns To Return property must be blank or
> > constant for all possible incoming tables. The former effectively
> > "disables" state, and the latter ensures single state for column
> > names/types, although the max values are stored in state by table
> > name, so the onus is on the user to ensure that the number of
> > different tables is not so large as to clobber the state store (~1 MB
> > in practice IIRC).
> >
> > What about doing something similar for List processors? If there is no
> > incoming connection, then they continue to behave as they always have.
> > If there is an incoming connection and no flow file, no work is
> > performed. If there is an incoming connection with available flow
> > file(s), then it is more "event-driven" in the sense that state will
> > not be maintained, with the tradeoff that flow file attributes can be
> > used to configure the List properties?
> >
> > Regards,
> > Matt
> >
> > On Tue, Sep 25, 2018 at 9:55 AM Bryan Bende  wrote:
> >>
> >> Hi Martijn,
> >>
> >> The request for the "list" processors to support incoming flow files
> >> comes up frequently. The issue is that the list processors are meant
> >> to continuously watch a given directory/bucket and maintain state
> >> about what has been seen and only find newer stuff. So if you let the
> >> processor support incoming flow files then it means the directory can
> >> potentially be different on every execution of the processor, which
> >> then makes it problematic for maintaining state... how do we know if
> >> there will ever be another flow file indicating the same directory and
> >> whether we need to keep the state around? How much state can we actually
> >> store? etc.
> >>
> >> I don't know exactly what your use case is, but I think it would be
> >> reasonable to support a variation of each "list" processor that
> >> supports incoming flow files, but does NOT maintain state. Meaning, it
> >> would be used to perform a one-time listing based on the incoming flow
> >> file, and if another flow file came in later with the same
> >> directory/bucket, it would have no knowledge of the previous execution
> >> and thus list everything again.
> >>
> >> -Bryan
> >>
> >> On Tue, Sep 25, 2018 at 2:02 AM Martijn Dekkers wrote:
> >>>
> >>> Hi Koji,
> >>>
> >>> Thanks, that is exactly the path we took in the end. This is a
> >>> repeating pattern for us, and we would have preferred to keep it all
> >>> contained in an ISP. Since the output of the listing is very large, we
> >>> run into some memory issues at the SplitText step, so we use a few of
> >>> those in sequence, which is all a bit hacky. When we have some time we
> >>> will get back to this, and hopefully get it done "correctly".
> >>>
> >>> I am trying to work out what the reasoning is for none of the
> >>> List-type processors to accept incoming connections, we use them
> >>> frequently and have to resort to all kinds of acrobatics to work around
> >>> this. In this instance we use an external script, in some others we
> >>> have to set up infrastructure outside of NiFi to set parameters via the
> >>> API. It would be a lot easier and smoother if we could simply accept an
> >>> incoming connection and use attributes.
> >>>
> >>> Thanks,
> >>>
> >>> Martijn
> >>>
> >>> On Tue, 25 Sep 2018 at 02:37, Koji Kawamura wrote:
> >>>>
> >>>> Hi Martijn,
> >>>>
> >>>> I'm not an expert on Jython, but if you already have a python script
> >>>> using boto3 working fine, then I'd suggest using ExecuteStreamCommand
> >>>> instead.
> >>>> For example:
> >>>> - you can design the

Re: Listing S3

2018-09-25 Thread Mark Payne
Matt,

I think it's very dangerous to manipulate the behavior of the processor so
drastically based on the presence or absence of an incoming connection. I
think it is fair game, however, to allow for a new property to be added that
indicates whether or not state is maintained. Then, the processor could be
made invalid if it is configured to maintain state and has an incoming
connection.

This approach would be nice anyway because there are valid use cases to have
a ListFile processor, for example, be the 'source processor' and still want
to do a full listing every hour, let's say, rather than keeping state and
only getting the 'diff'.

Thanks
-Mark
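
The validation rule Mark describes can be sketched in a few lines. This is
only an illustration in Python, not NiFi code: the class ListConfig, the
field names, and the function validate are hypothetical names chosen for
this sketch, and the "maintain state" property is the one being proposed,
not an existing one.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ListConfig:
    """Hypothetical settings for a List-type processor."""
    maintain_state: bool            # the proposed new property (assumption)
    has_incoming_connection: bool   # whether an upstream connection exists


def validate(config: ListConfig) -> List[str]:
    """Return validation errors; an empty list means the config is valid."""
    errors = []
    # The only combination ruled out by the proposal: keeping state while
    # being driven by incoming flow files.
    if config.maintain_state and config.has_incoming_connection:
        errors.append(
            "Maintaining state is not allowed with an incoming connection"
        )
    return errors


if __name__ == "__main__":
    # Source processor doing a full listing on every run: valid.
    print(validate(ListConfig(maintain_state=False, has_incoming_connection=False)))
    # Stateful listing driven by incoming flow files: invalid.
    print(validate(ListConfig(maintain_state=True, has_incoming_connection=True)))
```

Under this rule the other three combinations stay valid, including the
"source processor with a full listing every hour" case mentioned above.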

> On Sep 25, 2018, at 10:11 AM, Matt Burgess  wrote:
> 
> With so many List processors, having a separate version of them might
> lead to component bloat. GenerateTableFetch is an example of a source
> processor that can optionally accept incoming flow files only for the
> purpose of configuration (attributes, e.g.). The analogy to Flow-Based
> Programming is the OPTIONS named input, which is specifically for
> configuration instead of data flow.  For GenerateTableFetch, the
> Max-Value Columns and Columns To Return property must be blank or
> constant for all possible incoming tables. The former effectively
> "disables" state, and the latter ensures single state for column
> names/types, although the max values are stored in state by table
> name, so the onus is on the user to ensure that the number of
> different tables is not so large as to clobber the state store (~1 MB
> in practice IIRC).
> 
> What about doing something similar for List processors? If there is no
> incoming connection, then they continue to behave as they always have.
> If there is an incoming connection and no flow file, no work is
> performed. If there is an incoming connection with available flow
> file(s), then it is more "event-driven" in the sense that state will
> not be maintained, with the tradeoff that flow file attributes can be
> used to configure the List properties?
> 
> Regards,
> Matt
> 
> On Tue, Sep 25, 2018 at 9:55 AM Bryan Bende  wrote:
>> 
>> Hi Martijn,
>> 
>> The request for the "list" processors to support incoming flow files
>> comes up frequently. The issue is that the list processors are meant
>> to continuously watch a given directory/bucket and maintain state
>> about what has been seen and only find newer stuff. So if you let the
>> processor support incoming flow files then it means the directory can
>> potentially be different on every execution of the processor, which
>> then makes it problematic for maintaining state... how do we know if
>> there will ever be another flow file indicating the same directory and
>> whether we need to keep the state around? How much state can we actually
>> store? etc.
>> 
>> I don't know exactly what your use case is, but I think it would be
>> reasonable to support a variation of each "list" processor that
>> supports incoming flow files, but does NOT maintain state. Meaning, it
>> would be used to perform a one-time listing based on the incoming flow
>> file, and if another flow file came in later with the same
>> directory/bucket, it would have no knowledge of the previous execution
>> and thus list everything again.
>> 
>> -Bryan
>> 
>> On Tue, Sep 25, 2018 at 2:02 AM Martijn Dekkers  
>> wrote:
>>> 
>>> Hi Koji,
>>> 
>>> Thanks, that is exactly the path we took in the end. This is a repeating 
>>> pattern for us, and we would have preferred to keep it all contained in an 
>>> ISP. Since the output of the listing is very large, we run into some memory 
>>> issues at the SplitText step, so we use a few of those in sequence, which 
>>> is all a bit hacky. When we have some time we will get back to this, and 
>>> hopefully get it done "correctly".
>>> 
>>> I am trying to work out what the reasoning is for none of the List-type 
>>> processors to accept incoming connections, we use them frequently and have 
>>> to resort to all kinds of acrobatics to work around this. In this instance 
>>> we use an external script, in some others we have to set up infrastructure 
>>> outside of NiFi to set parameters via the API. It would be a lot easier and 
>>> smoother if we could simply accept an incoming connection and use 
>>> attributes.
>>> 
>>> Thanks,
>>> 
>>> Martijn
>>> 
>>> On Tue, 25 Sep 2018 at 02:37, Koji Kawamura  wrote:
 
>>>> Hi Martijn,
>>>>
>>>> I'm not an expert on Jython, but if you already have a python script
>>>> using boto3 working fine, then I'd suggest using ExecuteStreamCommand
>>>> instead.
>>>> For example:
>>>> - you can design the python script to print out JSON formatted string
>>>> about listed files
>>>> - then connect the outputs to SplitJson
>>>> - and use EvaluateJsonPath to extract required values to FlowFile attribute
>>>> - finally, use FetchS3Object
>>>>
>>>> Thanks,
>>>> Koji



Re: Listing S3

2018-09-25 Thread Matt Burgess
With so many List processors, having a separate version of them might
lead to component bloat. GenerateTableFetch is an example of a source
processor that can optionally accept incoming flow files only for the
purpose of configuration (attributes, e.g.). The analogy to Flow-Based
Programming is the OPTIONS named input, which is specifically for
configuration instead of data flow. For GenerateTableFetch, the
Max-Value Columns and Columns To Return properties must be blank or
constant for all possible incoming tables. The former effectively
"disables" state, and the latter ensures single state for column
names/types, although the max values are stored in state by table
name, so the onus is on the user to ensure that the number of
different tables is not so large as to clobber the state store (~1 MB
in practice IIRC).

What about doing something similar for List processors? If there is no
incoming connection, then they continue to behave as they always have.
If there is an incoming connection and no flow file, no work is
performed. If there is an incoming connection with available flow
file(s), then it is more "event-driven" in the sense that state will
not be maintained, with the tradeoff that flow file attributes can be
used to configure the List properties?

Regards,
Matt
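
The three cases in the proposal above can be written down as a small
decision function. This is a Python sketch for discussion only, not NiFi
code: Mode, choose_mode, resolve_directory, and the "directory" attribute
name are all hypothetical, and a flow file is modelled as a plain dict of
attributes.

```python
from enum import Enum
from typing import Dict, Optional


class Mode(Enum):
    STATEFUL_LISTING = "stateful"     # no incoming connection: behave as today
    NO_WORK = "no_work"               # connection present, nothing queued
    STATELESS_LISTING = "stateless"   # flow file present: list without state


def choose_mode(has_incoming_connection: bool,
                flow_file: Optional[Dict[str, str]]) -> Mode:
    """Pick the behavior for one execution of a List-type processor."""
    if not has_incoming_connection:
        return Mode.STATEFUL_LISTING
    if flow_file is None:
        return Mode.NO_WORK
    return Mode.STATELESS_LISTING


def resolve_directory(configured: str,
                      flow_file: Optional[Dict[str, str]]) -> str:
    """Let a flow file attribute override the configured directory."""
    if flow_file is not None and "directory" in flow_file:
        return flow_file["directory"]
    return configured


if __name__ == "__main__":
    print(choose_mode(False, None))
    print(choose_mode(True, None))
    print(choose_mode(True, {"directory": "/data/run-42"}))
```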

On Tue, Sep 25, 2018 at 9:55 AM Bryan Bende  wrote:
>
> Hi Martijn,
>
> The request for the "list" processors to support incoming flow files
> comes up frequently. The issue is that the list processors are meant
> to continuously watch a given directory/bucket and maintain state
> about what has been seen and only find newer stuff. So if you let the
> processor support incoming flow files then it means the directory can
> potentially be different on every execution of the processor, which
> then makes it problematic for maintaining state... how do we know if
> there will ever be another flow file indicating the same directory and
> whether we need to keep the state around? How much state can we actually
> store? etc.
>
> I don't know exactly what your use case is, but I think it would be
> reasonable to support a variation of each "list" processor that
> supports incoming flow files, but does NOT maintain state. Meaning, it
> would be used to perform a one-time listing based on the incoming flow
> file, and if another flow file came in later with the same
> directory/bucket, it would have no knowledge of the previous execution
> and thus list everything again.
>
> -Bryan
>
> On Tue, Sep 25, 2018 at 2:02 AM Martijn Dekkers  
> wrote:
> >
> > Hi Koji,
> >
> > Thanks, that is exactly the path we took in the end. This is a repeating 
> > pattern for us, and we would have preferred to keep it all contained in an 
> > ISP. Since the output of the listing is very large, we run into some memory 
> > issues at the SplitText step, so we use a few of those in sequence, which 
> > is all a bit hacky. When we have some time we will get back to this, and 
> > hopefully get it done "correctly".
> >
> > I am trying to work out what the reasoning is for none of the List-type 
> > processors to accept incoming connections, we use them frequently and have 
> > to resort to all kinds of acrobatics to work around this. In this instance 
> > we use an external script, in some others we have to set up infrastructure 
> > outside of NiFi to set parameters via the API. It would be a lot easier and 
> > smoother if we could simply accept an incoming connection and use 
> > attributes.
> >
> > Thanks,
> >
> > Martijn
> >
> > On Tue, 25 Sep 2018 at 02:37, Koji Kawamura  wrote:
> >>
> >> Hi Martijn,
> >>
> >> I'm not an expert on Jython, but if you already have a python script
> >> using boto3 working fine, then I'd suggest using ExecuteStreamCommand
> >> instead.
> >> For example:
> >> - you can design the python script to print out JSON formatted string
> >> about listed files
> >> - then connect the outputs to SplitJson
> >> - and use EvaluateJsonPath to extract required values to FlowFile attribute
> >> - finally, use FetchS3Object
> >>
> >> Thanks,
> >> Koji


Re: Listing S3

2018-09-25 Thread Bryan Bende
Hi Martijn,

The request for the "list" processors to support incoming flow files
comes up frequently. The issue is that the list processors are meant
to continuously watch a given directory/bucket and maintain state
about what has been seen and only find newer stuff. So if you let the
processor support incoming flow files then it means the directory can
potentially be different on every execution of the processor, which
then makes it problematic for maintaining state... how do we know if
there will ever be another flow file indicating the same directory and
whether we need to keep the state around? How much state can we actually
store? etc.

I don't know exactly what your use case is, but I think it would be
reasonable to support a variation of each "list" processor that
supports incoming flow files, but does NOT maintain state. Meaning, it
would be used to perform a one-time listing based on the incoming flow
file, and if another flow file came in later with the same
directory/bucket, it would have no knowledge of the previous execution
and thus list everything again.

-Bryan
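
To make the state concern concrete, here is a minimal Python sketch of the
two behaviors. It is a simplification and not how the NiFi processors are
implemented: entries are (name, last-modified) pairs, state is a plain dict
keyed by directory, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional, Tuple

Entry = Tuple[str, float]  # (name, last-modified timestamp)

# One high-water mark per directory. If the directory can change on every
# execution, this dict grows with every distinct directory ever listed.
state: Dict[str, float] = {}


def list_new(entries: Iterable[Entry],
             last_seen: Optional[float]) -> Tuple[List[str], Optional[float]]:
    """Return names newer than last_seen and the updated high-water mark."""
    newest = last_seen
    names = []
    for name, modified in entries:
        if last_seen is None or modified > last_seen:
            names.append(name)
            if newest is None or modified > newest:
                newest = modified
    return names, newest


def stateful_listing(directory: str, entries: Iterable[Entry]) -> List[str]:
    """Only return entries not seen in a previous run for this directory."""
    names, newest = list_new(entries, state.get(directory))
    if newest is not None:
        state[directory] = newest
    return names


def one_time_listing(entries: Iterable[Entry]) -> List[str]:
    """Stateless variant: every call lists everything again."""
    names, _ = list_new(entries, None)
    return names
```

Calling stateful_listing twice with the same entries returns an empty list
the second time, while one_time_listing returns the full listing on every
call, which is the behavior described for the stateless variation.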

On Tue, Sep 25, 2018 at 2:02 AM Martijn Dekkers  wrote:
>
> Hi Koji,
>
> Thanks, that is exactly the path we took in the end. This is a repeating 
> pattern for us, and we would have preferred to keep it all contained in an 
> ISP. Since the output of the listing is very large, we run into some memory 
> issues at the SplitText step, so we use a few of those in sequence, which is 
> all a bit hacky. When we have some time we will get back to this, and 
> hopefully get it done "correctly".
>
> I am trying to work out what the reasoning is for none of the List-type 
> processors to accept incoming connections, we use them frequently and have to 
> resort to all kinds of acrobatics to work around this. In this instance we 
> use an external script, in some others we have to set up infrastructure 
> outside of NiFi to set parameters via the API. It would be a lot easier and 
> smoother if we could simply accept an incoming connection and use attributes.
>
> Thanks,
>
> Martijn
>
> On Tue, 25 Sep 2018 at 02:37, Koji Kawamura  wrote:
>>
>> Hi Martijn,
>>
>> I'm not an expert on Jython, but if you already have a python script
>> using boto3 working fine, then I'd suggest using ExecuteStreamCommand
>> instead.
>> For example:
>> - you can design the python script to print out JSON formatted string
>> about listed files
>> - then connect the outputs to SplitJson
>> - and use EvaluateJsonPath to extract required values to FlowFile attribute
>> - finally, use FetchS3Object
>>
>> Thanks,
>> Koji


Re: Listing S3

2018-09-25 Thread Martijn Dekkers
Hi Koji,

Thanks, that is exactly the path we took in the end. This is a repeating
pattern for us, and we would have preferred to keep it all contained in an
ISP. Since the output of the listing is very large, we run into some memory
issues at the SplitText step, so we use a few of those in sequence, which
is all a bit hacky. When we have some time we will get back to this, and
hopefully get it done "correctly".
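
For readers unfamiliar with the "few SplitText in sequence" workaround, the
idea is to split coarsely first and finely afterwards, so that no single
step has to produce every piece at once. The Python sketch below only
illustrates that idea outside NiFi; the function names and chunk sizes are
assumptions, not settings taken from our flow.

```python
from itertools import islice
from typing import Iterable, Iterator, List


def split_lines(lines: Iterable[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive chunks of at most chunk_size lines."""
    it = iter(lines)
    while True:
        chunk = list(islice(it, chunk_size))
        if not chunk:
            return
        yield chunk


def two_stage_split(lines: Iterable[str],
                    first: int = 1000,
                    second: int = 1) -> Iterator[List[str]]:
    """Split into coarse chunks, then split each chunk into fine pieces."""
    for coarse in split_lines(lines, first):
        # Only one coarse chunk is materialized at a time.
        for fine in split_lines(coarse, second):
            yield fine


if __name__ == "__main__":
    listing = ["file-%d.csv" % i for i in range(5)]
    for piece in two_stage_split(listing, first=2, second=1):
        print(piece)
```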

I am trying to work out why none of the List-type processors accept
incoming connections. We use them frequently and have to resort to all
kinds of acrobatics to work around this. In this instance we use an
external script; in some others we have to set up infrastructure outside
of NiFi to set parameters via the API. It would be a lot easier and
smoother if we could simply accept an incoming connection and use
attributes.

Thanks,

Martijn

On Tue, 25 Sep 2018 at 02:37, Koji Kawamura  wrote:

> Hi Martijn,
>
> I'm not an expert on Jython, but if you already have a python script
> using boto3 working fine, then I'd suggest using ExecuteStreamCommand
> instead.
> For example:
> - you can design the python script to print out JSON formatted string
> about listed files
> - then connect the outputs to SplitJson
> - and use EvaluateJsonPath to extract required values to FlowFile attribute
> - finally, use FetchS3Object
>
> Thanks,
> Koji
>


Re: Listing S3

2018-09-24 Thread Koji Kawamura
Hi Martijn,

I'm not an expert on Jython, but if you already have a Python script
using boto3 working fine, then I'd suggest using ExecuteStreamCommand
instead.
For example:
- you can design the Python script to print out a JSON-formatted string
describing the listed files
- then connect the output to SplitJson
- and use EvaluateJsonPath to extract the required values into FlowFile
attributes
- finally, use FetchS3Object

Thanks,
Koji
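
A script along the lines suggested above might look like the following
sketch. It is not the script from the thread: the function list_keys, the
output field names, and the command-line argument order are assumptions
made for illustration. It relies on the boto3 list_objects_v2 paginator and
prints a single JSON array to stdout.

```python
import json
import sys
from typing import Any, Dict, Iterator


def list_keys(client: Any, bucket: str,
              prefix: str = "", suffix: str = "") -> Iterator[Dict[str, Any]]:
    """Yield one record per object in the bucket whose key ends with suffix."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                yield {"bucket": bucket, "key": obj["Key"], "size": obj["Size"]}


def main(argv):
    # Imported here so list_keys can be exercised without boto3 installed.
    import boto3

    bucket = argv[1]
    suffix = argv[2] if len(argv) > 2 else ""
    # For an S3-compatible store, pass endpoint_url=... to boto3.client.
    client = boto3.client("s3")
    json.dump(list(list_keys(client, bucket, suffix=suffix)), sys.stdout)


if __name__ == "__main__" and len(sys.argv) > 1:
    main(sys.argv)
```

Building the whole list before dumping it keeps the sketch short; with
listings of around 20k keys that is usually fine, but writing one JSON
object per line would avoid holding everything in memory.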


Listing S3

2018-09-19 Thread Martijn Dekkers
Hello all, I have a head-breaking issue, and I hope someone is able to
assist.

We have a requirement to pull a list of files from an S3-compatible store
on the basis of an incoming flowfile containing the required attributes
such as bucket and a few others, including a file suffix. Whilst we can
filter for suffix downstream from the listing, the ListS3 processor doesn't
support incoming flowfiles, so we cannot use it.

We want to implement an InvokeScriptedProcessor with Jython, and we
have a very annoying issue we cannot track down.

We first implemented a simple Python script that uses boto3 to fetch a
list of files, typically 20k files in the list. On the terminal all works
as expected.

When adjusting this for the ISP we receive the following error:

java.lang.reflect.UndeclaredThrowableException: null
[...]
Caused by: javax.script.ScriptException: KeyError: 'ConfigParser' in