Re: Listing S3

Bryan Bende Tue, 25 Sep 2018 07:24:25 -0700
I like that approach too.
On Tue, Sep 25, 2018 at 10:21 AM Pierre Villard
<pierre.villard...@gmail.com> wrote:
>
> +1 to Matt's proposal and Mark's comment; I think it would help address
> some use cases.
> We just need to be very clear about the processor's behavior for each
> possible case/configuration.
>
> Pierre
>
> On Tue, Sep 25, 2018 at 16:19, Mark Payne <marka...@hotmail.com> wrote:
>>
>> Matt,
>>
>> I think it's very dangerous to manipulate the behavior of the processor so
>> drastically based on the presence or absence of an incoming connection. I
>> think it is fair game, however, to allow for a new property to be added that
>> indicates whether or not state is maintained. Then, the processor could be
>> made invalid if it is configured to maintain state and has an incoming
>> connection.
>>
>> This approach would be nice anyway because there are valid use cases to have 
>> a ListFile processor,
>> for example, be the 'source processor' and still want to do a full listing 
>> every hour, let's say, rather than keeping
>> state and only getting the 'diff'.
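
The validation rule described above can be sketched roughly as follows. This is an illustration only, written in Python rather than against the NiFi API; the names `ListProcessorConfig`, `maintain_state`, `has_incoming_connection` and `validate` are hypothetical and do not correspond to existing NiFi properties or classes.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ListProcessorConfig:
    # Hypothetical fields, not actual NiFi properties.
    maintain_state: bool            # the proposed new property
    has_incoming_connection: bool   # whether something is wired into the processor


def validate(config: ListProcessorConfig) -> List[str]:
    """Return validation errors; an empty list means the configuration is valid."""
    errors = []
    # The only invalid combination: keeping state while also being driven
    # by incoming flow files.
    if config.maintain_state and config.has_incoming_connection:
        errors.append(
            "State cannot be maintained when the processor has an incoming connection"
        )
    return errors
```

Under this sketch, a source processor with `maintain_state=False` stays valid and simply performs a full listing on every scheduled run, which covers the hourly full-listing use case.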
>>
>> Thanks
>> -Mark
>>
>> > On Sep 25, 2018, at 10:11 AM, Matt Burgess <mattyb...@apache.org> wrote:
>> >
>> > With so many List processors, having a separate version of them might
>> > lead to component bloat. GenerateTableFetch is an example of a source
>> > processor that can optionally accept incoming flow files only for the
>> > purpose of configuration (attributes, e.g.). The analogy to Flow-Based
>> > Programming is the OPTIONS named input, which is specifically for
>> > configuration instead of data flow.  For GenerateTableFetch, the
>> > Max-Value Columns and Columns To Return properties must be blank or
>> > constant for all possible incoming tables. The former effectively
>> > "disables" state, and the latter ensures single state for column
>> > names/types, although the max values are stored in state by table
>> > name, so the onus is on the user to ensure that the number of
>> > different tables is not so large as to clobber the state store (~1 MB
>> > in practice IIRC).
>> >
>> > What about doing something similar for List processors? If there is no
>> > incoming connection, then they continue to behave as they always have.
>> > If there is an incoming connection and no flow file, no work is
>> > performed. If there is an incoming connection with available flow
>> > file(s), then it is more "event-driven" in the sense that state will
>> > not be maintained, with the tradeoff that flow file attributes can be
>> > used to configure the List properties?
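
A minimal sketch of the three cases in this proposal, assuming a much-simplified model: state is a plain set of already-seen names (the real List processors track more than that), the listing call is a hypothetical `lister` function, and the flow file attribute name `target` is made up for illustration.

```python
from typing import Callable, Dict, List, Optional, Set

# Hypothetical stand-in for a listing call: maps a directory/bucket to entry names.
Lister = Callable[[str], List[str]]


class ListProcessorSketch:
    def __init__(self, lister: Lister, configured_target: str,
                 has_incoming_connection: bool) -> None:
        self.lister = lister
        self.configured_target = configured_target
        self.has_incoming_connection = has_incoming_connection
        # Simplified state; only used when there is no incoming connection.
        self.seen: Set[str] = set()

    def on_trigger(self, attributes: Optional[Dict[str, str]] = None) -> List[str]:
        if not self.has_incoming_connection:
            # Case 1: no incoming connection. Behave as before and emit
            # only entries that have not been seen yet.
            entries = self.lister(self.configured_target)
            new_entries = [e for e in entries if e not in self.seen]
            self.seen.update(new_entries)
            return new_entries
        if attributes is None:
            # Case 2: incoming connection but no flow file. Do no work.
            return []
        # Case 3: incoming flow file. The listing is configured from its
        # attributes and no state is read or written.
        target = attributes.get("target", self.configured_target)
        return list(self.lister(target))
```

In this model, two flow files naming the same target both produce the full listing, which is the tradeoff described above.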
>> >
>> > Regards,
>> > Matt
>> >
>> > On Tue, Sep 25, 2018 at 9:55 AM Bryan Bende <bbe...@gmail.com> wrote:
>> >>
>> >> Hi Martijn,
>> >>
>> >> The request for the "list" processors to support incoming flow files
>> >> comes up frequently. The issue is that the list processors are meant
>> >> to continuously watch a given directory/bucket and maintain state
>> >> about what has been seen and only find newer stuff. So if you let the
>> >> processor support incoming flow files then it means the directory can
>> >> potentially be different on every execution of the processor, which
>> >> then makes it problematic for maintaining state... How do we know if
>> >> there will ever be another flow file indicating the same directory, and
>> >> whether we need to keep the state around? How much state can we actually
>> >> store? Etc.
>> >>
>> >> I don't know exactly what your use case is, but I think it would be
>> >> reasonable to support a variation of each "list" processor that
>> >> supports incoming flow files, but does NOT maintain state. Meaning, it
>> >> would be used to perform a one-time listing based on the incoming flow
>> >> file, and if another flow file came in later with the same
>> >> directory/bucket, it would have no knowledge of the previous execution
>> >> and thus list everything again.
>> >>
>> >> -Bryan
>> >>
>> >> On Tue, Sep 25, 2018 at 2:02 AM Martijn Dekkers <mart...@dekkers.org.uk> 
>> >> wrote:
>> >>>
>> >>> Hi Koji,
>> >>>
>> >>> Thanks, that is exactly the path we took in the end. This is a repeating 
>> >>> pattern for us, and we would have preferred to keep it all contained in 
>> >>> an ISP. Since the output of the listing is very large, we run into some 
>> >>> memory issues at the SplitText step, so we use a few of those in 
>> >>> sequence, which is all a bit hacky. When we have some time we will get 
>> >>> back to this, and hopefully get it done "correctly".
>> >>>
>> >>> I am trying to work out why none of the List-type processors accept
>> >>> incoming connections. We use them frequently and
>> >>> have to resort to all kinds of acrobatics to work around this. In this 
>> >>> instance we use an external script, in some others we have to set up 
>> >>> infrastructure outside of NiFi to set parameters via the API. It would 
>> >>> be a lot easier and smoother if we could simply accept an incoming 
>> >>> connection and use attributes.
>> >>>
>> >>> Thanks,
>> >>>
>> >>> Martijn
>> >>>
>> >>> On Tue, 25 Sep 2018 at 02:37, Koji Kawamura <ijokaruma...@gmail.com> 
>> >>> wrote:
>> >>>>
>> >>>> Hi Martijn,
>> >>>>
>> >>>> I'm not an expert on Jython, but if you already have a python script
>> >>>> using boto3 working fine, then I'd suggest using ExecuteStreamCommand
>> >>>> instead.
>> >>>> For example:
>> >>>> - you can design the Python script to print a JSON-formatted string
>> >>>> describing the listed files
>> >>>> - then connect the output to SplitJson
>> >>>> - and use EvaluateJsonPath to extract the required values into FlowFile
>> >>>> attributes
>> >>>> - finally, use FetchS3Object
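
The first step of the list above might look something like the script below. It is a sketch, not the script used in this thread: the file name `list_s3.py`, the command-line arguments and the output field names (`bucket`, `key`, `size`, `lastModified`) are assumptions. The boto3 calls used are `boto3.client("s3")` and the `list_objects_v2` paginator. The output is a single JSON array on stdout, intended to be split downstream into one flow file per object.

```python
import json
import sys


def list_objects(client, bucket, prefix=""):
    """Yield one dict per object under the prefix, following pagination."""
    paginator = client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # "Contents" is absent from a page when nothing matches.
        for obj in page.get("Contents", []):
            yield {
                "bucket": bucket,
                "key": obj["Key"],
                "size": obj["Size"],
                "lastModified": obj["LastModified"].isoformat(),
            }


def main(argv):
    # Imported here so list_objects can be exercised without boto3 installed.
    import boto3

    bucket = argv[1]
    prefix = argv[2] if len(argv) > 2 else ""
    client = boto3.client("s3")
    # Print the whole listing as one JSON array.
    print(json.dumps(list(list_objects(client, bucket, prefix))))


if __name__ == "__main__":
    if len(sys.argv) < 2:
        sys.stderr.write("usage: list_s3.py BUCKET [PREFIX]\n")
    else:
        main(sys.argv)
```

For very large listings, building the whole array in memory may run into the same pressure mentioned earlier in the thread; printing one JSON object per line would be an alternative, at the cost of a different split step.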
>> >>>>
>> >>>> Thanks,
>> >>>> Koji
>>