Re: Listing S3

Bryan Bende Tue, 25 Sep 2018 07:44:40 -0700

Yeah, ideally it would only clear state once the processor is started
with state management = false. That way, if someone accidentally
toggles the value and hits apply, but then wants to go back before
starting, they wouldn't lose the state.

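A minimal sketch of the behavior discussed in this thread, written as a toy
Python model rather than against NiFi's actual API. The class, the property
names, and the state key are all hypothetical; the only points taken from the
thread are that changing the property should not touch stored state, that
state is cleared only when the processor is started with state management
off, and that maintaining state with an incoming connection is invalid.

```python
from dataclasses import dataclass, field


@dataclass
class ListProcessorModel:
    """Toy model of a List-style processor (hypothetical, not NiFi's API)."""

    maintain_state: bool = True            # hypothetical "state management" property
    has_incoming_connection: bool = False
    state: dict = field(default_factory=dict)
    running: bool = False

    def validate(self) -> list:
        """Return validation errors; an empty list means the config is valid."""
        errors = []
        if self.maintain_state and self.has_incoming_connection:
            errors.append("cannot maintain state with an incoming connection")
        return errors

    def apply_config(self, maintain_state: bool) -> None:
        """Changing the property alone never touches stored state."""
        self.maintain_state = maintain_state

    def start(self) -> None:
        """Refuse to start when invalid; clear state only at start time."""
        errors = self.validate()
        if errors:
            raise ValueError("; ".join(errors))
        if not self.maintain_state:
            self.state.clear()
        self.running = True


# The state key below is made up for illustration.
proc = ListProcessorModel(state={"last.listed.timestamp": "1537886680000"})
proc.apply_config(maintain_state=False)  # accidental toggle, then apply
proc.apply_config(maintain_state=True)   # reverted before starting
proc.start()
print(proc.state)  # state survives because it was never started with it off
```

Clearing in start() rather than in apply_config() is what makes the
accidental toggle harmless.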

On Tue, Sep 25, 2018 at 10:40 AM Mark Payne <marka...@hotmail.com> wrote:
>
> It certainly would be ideal to clear the state.
>
> On Sep 25, 2018, at 10:34 AM, Sivaprasanna <sivaprasanna...@gmail.com> wrote:
>
> I'm in for a configurable state management property.
>
> One question: Let's say we have a processor already running without an 
> incoming connection and have the state management property set to 'true'. 
> After a couple of iterations, it would have some state set. Later the user 
> adds an incoming connection and has the state management property set to 
> 'false'. In this case, do we have to clear off the state? Or maintain the 
> state as is but just don't consider it?
>
> -
> Sivaprasanna
>
> On Tue, Sep 25, 2018 at 7:54 PM Bryan Bende <bbe...@gmail.com> wrote:
>>
>> I like that approach too.
>> On Tue, Sep 25, 2018 at 10:21 AM Pierre Villard
>> <pierre.villard...@gmail.com> wrote:
>> >
>> > +1 to Matt's proposal and Mark's comment, I think it would help address
>> > some use cases.
>> > We just need to be very clear about the processor behavior for each 
>> > possible case/configuration.
>> >
>> > Pierre
>> >
>> > On Tue, Sep 25, 2018 at 16:19, Mark Payne <marka...@hotmail.com> wrote:
>> >>
>> >> Matt,
>> >>
>> >> I think it's very dangerous to manipulate the behavior of the processor
>> >> so drastically based on the presence or absence of an incoming
>> >> connection. I think it is fair game, however, to allow for a new
>> >> property to be added that indicates whether or not state is maintained.
>> >> Then, the processor could be made invalid if it is attempting to
>> >> maintain state and has an incoming connection.
>> >>
>> >> This approach would be nice anyway because there are valid use cases to
>> >> have a ListFile processor, for example, be the 'source processor' and
>> >> still want to do a full listing every hour, let's say, rather than
>> >> keeping state and only getting the 'diff'.
>> >>
>> >> Thanks
>> >> -Mark
>> >>
>> >> > On Sep 25, 2018, at 10:11 AM, Matt Burgess <mattyb...@apache.org> wrote:
>> >> >
>> >> > With so many List processors, having a separate version of them might
>> >> > lead to component bloat. GenerateTableFetch is an example of a source
>> >> > processor that can optionally accept incoming flow files only for the
>> >> > purpose of configuration (attributes, e.g.). The analogy to Flow-Based
>> >> > Programming is the OPTIONS named input, which is specifically for
>> >> > configuration instead of data flow.  For GenerateTableFetch, the
>> >> > Max-Value Columns and Columns To Return properties must be blank or
>> >> > constant for all possible incoming tables. The former effectively
>> >> > "disables" state, and the latter ensures single state for column
>> >> > names/types, although the max values are stored in state by table
>> >> > name, so the onus is on the user to ensure that the number of
>> >> > different tables is not so large as to clobber the state store (~1 MB
>> >> > in practice IIRC).
>> >> >
>> >> > What about doing something similar for List processors? If there is no
>> >> > incoming connection, then they continue to behave as they always have.
>> >> > If there is an incoming connection and no flow file, no work is
>> >> > performed. If there is an incoming connection with available flow
>> >> > file(s), then it is more "event-driven" in the sense that state will
>> >> > not be maintained, with the tradeoff that flow file attributes can be
>> >> > used to configure the List properties?
>> >> >
>> >> > Regards,
>> >> > Matt
>> >> >
>> >> > On Tue, Sep 25, 2018 at 9:55 AM Bryan Bende <bbe...@gmail.com> wrote:
>> >> >>
>> >> >> Hi Martijn,
>> >> >>
>> >> >> The request for the "list" processors to support incoming flow files
>> >> >> comes up frequently. The issue is that the list processors are meant
>> >> >> to continuously watch a given directory/bucket and maintain state
>> >> >> about what has been seen and only find newer stuff. So if you let the
>> >> >> processor support incoming flow files then it means the directory can
>> >> >> potentially be different on every execution of the processor, which
>> >> >> then makes it problematic for maintaining state... how do we know if
>> >> >> there will ever be another flow file indicating the same directory and
>> >> >> whether we need to keep the state around? how much state can we
>> >> >> actually store? etc.
>> >> >>
>> >> >> I don't know exactly what your use case is, but I think it would be
>> >> >> reasonable to support a variation of each "list" processor that
>> >> >> supports incoming flow files, but does NOT maintain state. Meaning, it
>> >> >> would be used to perform a one-time listing based on the incoming flow
>> >> >> file, and if another flow file came in later with the same
>> >> >> directory/bucket, it would have no knowledge of the previous execution
>> >> >> and thus list everything again.
>> >> >>
>> >> >> -Bryan
>> >> >>
>> >> >> On Tue, Sep 25, 2018 at 2:02 AM Martijn Dekkers 
>> >> >> <mart...@dekkers.org.uk> wrote:
>> >> >>>
>> >> >>> Hi Koji,
>> >> >>>
>> >> >>> Thanks, that is exactly the path we took in the end. This is a 
>> >> >>> repeating pattern for us, and we would have preferred to keep it all 
>> >> >>> contained in an ISP. Since the output of the listing is very large, 
>> >> >>> we run into some memory issues at the SplitText step, so we use a few 
>> >> >>> of those in sequence, which is all a bit hacky. When we have some 
>> >> >>> time we will get back to this, and hopefully get it done "correctly".
>> >> >>>
>> >> >>> I am trying to work out what the reasoning is for none of the 
>> >> >>> List-type processors to accept incoming connections. We use them 
>> >> >>> frequently and have to resort to all kinds of acrobatics to work 
>> >> >>> around this. In this instance we use an external script, in some 
>> >> >>> others we have to set up infrastructure outside of NiFi to set 
>> >> >>> parameters via the API. It would be a lot easier and smoother if we 
>> >> >>> could simply accept an incoming connection and use attributes.
>> >> >>>
>> >> >>> Thanks,
>> >> >>>
>> >> >>> Martijn
>> >> >>>
>> >> >>> On Tue, 25 Sep 2018 at 02:37, Koji Kawamura <ijokaruma...@gmail.com> 
>> >> >>> wrote:
>> >> >>>>
>> >> >>>> Hi Martijn,
>> >> >>>>
>> >> >>>> I'm not an expert on Jython, but if you already have a python script
>> >> >>>> using boto3 working fine, then I'd suggest using ExecuteStreamCommand
>> >> >>>> instead.
>> >> >>>> For example:
>> >> >>>> - you can design the python script to print out a JSON-formatted
>> >> >>>> string describing the listed files
>> >> >>>> - then connect the output to SplitJson
>> >> >>>> - and use EvaluateJsonPath to extract the required values into
>> >> >>>> FlowFile attributes
>> >> >>>> - finally, use FetchS3Object
>> >> >>>>
>> >> >>>> Thanks,
>> >> >>>> Koji
>> >>
>
>
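The three cases in Matt's proposal above (no incoming connection, incoming
connection with nothing queued, incoming connection with a flow file) can be
sketched as a small decision function. This is only an illustration in plain
Python, not NiFi code; the function name, the returned dictionary, and the
"directory" attribute name are assumptions made for the example.

```python
from typing import Optional


def plan_listing(has_incoming_connection: bool,
                 flow_file_attributes: Optional[dict],
                 configured_directory: str) -> Optional[dict]:
    """Decide what a List-style processor would do on one trigger.

    Returns None when no work should be performed, otherwise a dict naming
    the directory to list and whether state should be used.
    """
    if not has_incoming_connection:
        # Source-processor mode: behave as always and track state.
        return {"directory": configured_directory, "use_state": True}
    if flow_file_attributes is None:
        # Incoming connection but no flow file available: do nothing.
        return None
    # Event-driven mode: attributes configure the listing, no state is kept.
    directory = flow_file_attributes.get("directory", configured_directory)
    return {"directory": directory, "use_state": False}


print(plan_listing(False, None, "/data"))
print(plan_listing(True, None, "/data"))
print(plan_listing(True, {"directory": "/in/a"}, "/data"))
```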

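Koji's suggestion of a boto3 script that prints the listing as JSON could look
roughly like the sketch below. The script name, the command-line arguments,
and the output field names are assumptions for illustration; the boto3 calls
used are the standard list_objects_v2 paginator. The listing function takes
the client as a parameter so it can be exercised without AWS access.

```python
import json
import sys


def list_objects(s3_client, bucket, prefix=""):
    """Yield one dict per object under the given bucket and prefix."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages with no matching keys have no "Contents" entry.
        for obj in page.get("Contents", []):
            yield {
                "bucket": bucket,
                "key": obj["Key"],
                "size": obj["Size"],
                "lastModified": obj["LastModified"].isoformat(),
            }


def main(argv):
    import boto3  # imported here so list_objects can be used without boto3

    bucket = argv[1]
    prefix = argv[2] if len(argv) > 2 else ""
    # Emit a single JSON array on stdout for a downstream JSON splitter.
    json.dump(list(list_objects(boto3.client("s3"), bucket, prefix)), sys.stdout)


if __name__ == "__main__":
    if len(sys.argv) < 2:
        print("usage: list_s3.py BUCKET [PREFIX]", file=sys.stderr)
    else:
        main(sys.argv)
```

Building the whole array in memory is the simplest form. For very large
listings, printing one JSON object per line instead would avoid holding
everything at once, at the cost of a different split step downstream.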