Re: Listing S3
Any update on this? Was the ListS3 processor updated, or are there plans to create a new processor?
Re: Listing S3
Hi Bryan,

Thanks for your reply! Let me highlight our use case in some more detail.

This is a manufacturing setting. We have machines (hundreds) that perform some tests on objects (millions). As part of those tests, these machines generate various files. We use a NiFi installation on each machine to manage those files. However, we don't always want to pick up those files, and we don't always want to treat them in the same way.

Our preferred solution is to have the NiFi instance listen using a HandleHttpRequest processor, and we send it a POST with some properties (files from which folders, what kind of files, what to do with them, etc.). We then perform some validation of the request, and would next use the properties to perform actions on those files (list, fetch, etc.).

In this case, we are looking for files that are being generated, so we need to look for new files in a folder and thus require state. However, we can perform multiple runs on a given folder. In this particular use case we don't move files: we copy them as they are generated, and in a later invocation of the flow with different properties we delete them.

We use many variations on the above theme. We have some flows that need to fetch the files generated in the above example from their permanent storage location, and perform some analysis. So in this case all the files are already "in-place" in a folder, and we just need to do something with all of those files.

Currently, in the first case, we configure, start, stop, and clear the state of the flow for a given machine using the API. In the second case, we shell out to a script that generates the listing, and go from there.

Whilst this approach works, it would be nice (and powerful) to be able to keep everything in NiFi. We have many flows, many moving parts, and having to use "external" scripts increases complexity in different ways. We now have to ensure that the various scripts that do listings are always reachable from NiFi, so NiFi deployment has become just a little bit more complex. Given the number of NiFi instances and the number of processes we have, all these little complexities add up. We have to make sure that future developers understand that besides the flowfile, they also have to look after these helper scripts, which is another additional complexity. The requirement of manipulating the API in some cases means that we are limiting the potential developer pool.

Personally, I believe that being able to selectively allow incoming flowfiles and maintain state would be a real benefit. In our variations outlined above, we have a situation where an incoming flowfile generates tens of thousands of subsequent flowfiles; the incoming flowfile is simply a trigger. In some cases we'd like to be able to say "do this, keep state" and in other cases say "do this, don't worry about state".

I hope that clarifies things a bit. Thanks again for looking into this.

Martijn
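The two modes asked for in this message ("do this, keep state" and "do this, don't worry about state") can be illustrated with a minimal Python sketch. This is not NiFi code: the `Entry` type, the `list_folder` function and the in-memory `state` dict are hypothetical names chosen for illustration, and the real List processors track state through NiFi's own state management with more involved logic.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass(frozen=True)
class Entry:
    """A file found in a folder (hypothetical type for this sketch)."""
    name: str
    modified: int  # last-modified time, e.g. epoch milliseconds


def list_folder(
    entries: List[Entry],
    folder: str,
    state: Optional[Dict[str, int]] = None,
) -> List[Entry]:
    """Return the entries to emit for `folder`.

    With `state` given, only entries newer than the last recorded timestamp
    are returned and the timestamp is advanced ("do this, keep state").
    With `state=None`, everything is listed on every call
    ("do this, don't worry about state").
    """
    if state is None:
        return sorted(entries, key=lambda e: e.modified)

    last_seen = state.get(folder, -1)
    fresh = sorted(
        (e for e in entries if e.modified > last_seen),
        key=lambda e: e.modified,
    )
    if fresh:
        # Remember the newest timestamp seen for this folder.
        state[folder] = fresh[-1].modified
    return fresh
```

Note that the stateful mode stores one entry per folder it has ever been asked about, which is the state-growth concern raised elsewhere in the thread when the folder can differ on every invocation.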
Re: Listing S3
Fair point.
Re: Listing S3
Yea, if it can be done in a way that only clears state once the processor is started with state management = false. This way, if someone accidentally toggles the value and hits apply, but then wants to go back before starting, they wouldn't lose the state.
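The suggestion at the top of this message (clear state only once the processor is actually started with state management = false) can be sketched in a few lines of Python. This only illustrates the idea and is not how NiFi implements anything: the `ListingProcessor` class, its `maintain_state` flag and the `apply`/`start`/`stop` methods are hypothetical names.

```python
from typing import Dict


class ListingProcessor:
    """Toy model of a List processor with a state-management toggle."""

    def __init__(self) -> None:
        self.maintain_state = True
        self.running = False
        self.state: Dict[str, int] = {}  # e.g. folder -> last seen timestamp

    def apply(self, maintain_state: bool) -> None:
        """Change configuration only; never touches stored state."""
        if self.running:
            raise RuntimeError("stop the processor before reconfiguring it")
        self.maintain_state = maintain_state

    def start(self) -> None:
        """Start the processor; discard state only if it is now unused."""
        if not self.maintain_state:
            self.state.clear()
        self.running = True

    def stop(self) -> None:
        self.running = False
```

With this shape, toggling the flag to false, hitting apply, and toggling it back before starting leaves the stored state intact, which is the accidental-toggle case described above.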
Re: Listing S3
It certainly would be ideal to clear the state.
Re: Listing S3
I'm in for a configurable state management property.

One question: Let's say we have a processor already running without an incoming connection and have the state management property set to 'true'. After a couple of iterations, it would have some state set. Later the user adds an incoming connection and sets the state management property to 'false'. In this case, do we have to clear the state? Or maintain the state as is but just not consider it?

-
Sivaprasanna
Re: Listing S3
I like that approach too.
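The approach endorsed in this part of the thread combines two rules: a property that says whether state is maintained, with the processor considered invalid if it maintains state while having an incoming connection, and the rule that a processor with an incoming connection does work only when a flow file is available. A minimal Python sketch of those two rules follows; the function names and the error text are hypothetical, and none of this is actual NiFi API.

```python
from typing import Optional


def validate(maintain_state: bool, has_incoming_connection: bool) -> Optional[str]:
    """Return a validation error message, or None if the config is valid."""
    if maintain_state and has_incoming_connection:
        return "state cannot be maintained when there is an incoming connection"
    return None


def should_list(has_incoming_connection: bool, flowfile_available: bool) -> bool:
    """Decide whether a trigger of the processor should perform a listing."""
    if not has_incoming_connection:
        # Source processor: list on every scheduled run, as today.
        return True
    # With an incoming connection, only work when a flow file has arrived.
    return flowfile_available
```

A source processor with `maintain_state=False` is valid under this rule, which covers the "full listing every hour" case mentioned in the thread.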
Re: Listing S3
+1 to Matt's proposal and Mark's comment; I think it would help address some use cases. We just need to be very clear about the processor behavior for each possible case/configuration.

Pierre

On Tue, 25 Sep 2018 at 16:19, Mark Payne wrote:
> Matt,
>
> I think it's very dangerous to manipulate the behavior of the processor so drastically based on the presence or absence of an incoming connection. I think it is fair game, however, to allow for a new property to be added that indicates whether or not state is maintained. Then the processor could be made invalid if it attempts to maintain state and has an incoming connection.
>
> This approach would be nice anyway because there are valid use cases to have a ListFile processor, for example, be the 'source processor' and still want to do a full listing every hour, let's say, rather than keeping state and only getting the 'diff'.
>
> Thanks
> -Mark
Re: Listing S3
Matt,

I think it's very dangerous to manipulate the behavior of the processor so drastically based on the presence or absence of an incoming connection. I think it is fair game, however, to allow for a new property to be added that indicates whether or not state is maintained. Then the processor could be made invalid if it attempts to maintain state and has an incoming connection.

This approach would be nice anyway because there are valid use cases to have a ListFile processor, for example, be the 'source processor' and still want to do a full listing every hour, let's say, rather than keeping state and only getting the 'diff'.

Thanks
-Mark

On Sep 25, 2018, at 10:11 AM, Matt Burgess wrote:
> With so many List processors, having a separate version of each might lead to component bloat. GenerateTableFetch is an example of a source processor that can optionally accept incoming flow files only for the purpose of configuration (e.g., attributes). The analogy to Flow-Based Programming is the OPTIONS named input, which is specifically for configuration instead of data flow. For GenerateTableFetch, the Max-Value Columns and Columns To Return properties must be blank or constant for all possible incoming tables. The former effectively "disables" state, and the latter ensures a single state for column names/types, although the max values are stored in state by table name, so the onus is on the user to ensure that the number of different tables is not so large as to clobber the state store (~1 MB in practice IIRC).
>
> What about doing something similar for List processors? If there is no incoming connection, then they continue to behave as they always have. If there is an incoming connection and no flow file, no work is performed. If there is an incoming connection with available flow file(s), then it is more "event-driven" in the sense that state will not be maintained, with the tradeoff that flow file attributes can be used to configure the List properties?
>
> Regards,
> Matt
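For illustration, the validation rule Mark describes could be modeled roughly as follows. This is a plain-Python sketch, not NiFi code: the `ListProcessorConfig` class, its field names, and the `validate` function are hypothetical and only mirror the idea of a "maintain state" property that is rejected when combined with an incoming connection.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class ListProcessorConfig:
    """Hypothetical stand-in for a List processor's configuration."""
    maintain_state: bool           # the proposed new property
    has_incoming_connection: bool  # whether an upstream connection exists


def validate(config: ListProcessorConfig) -> List[str]:
    """Return validation problems; an empty list means the config is valid."""
    problems = []
    if config.maintain_state and config.has_incoming_connection:
        problems.append(
            "State cannot be maintained when the processor has an "
            "incoming connection."
        )
    return problems


# A source processor may keep state (incremental listing) or not
# (full listing on every run); both are valid.
print(validate(ListProcessorConfig(True, False)))   # -> []
print(validate(ListProcessorConfig(False, False)))  # -> []
# A stateless processor driven by incoming flow files is also valid.
print(validate(ListProcessorConfig(False, True)))   # -> []
# Only the combination Mark describes is rejected.
print(len(validate(ListProcessorConfig(True, True))))  # -> 1
```

The point of making the choice an explicit property, as opposed to inferring it from the connection, is that the user sees the behavior in the configuration and gets a validation error for the one unsupported combination.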
Re: Listing S3
With so many List processors, having a separate version of each might lead to component bloat. GenerateTableFetch is an example of a source processor that can optionally accept incoming flow files only for the purpose of configuration (e.g., attributes). The analogy to Flow-Based Programming is the OPTIONS named input, which is specifically for configuration instead of data flow. For GenerateTableFetch, the Max-Value Columns and Columns To Return properties must be blank or constant for all possible incoming tables. The former effectively "disables" state, and the latter ensures a single state for column names/types, although the max values are stored in state by table name, so the onus is on the user to ensure that the number of different tables is not so large as to clobber the state store (~1 MB in practice IIRC).

What about doing something similar for List processors? If there is no incoming connection, then they continue to behave as they always have. If there is an incoming connection and no flow file, no work is performed. If there is an incoming connection with available flow file(s), then it is more "event-driven" in the sense that state will not be maintained, with the tradeoff that flow file attributes can be used to configure the List properties?

Regards,
Matt

On Tue, Sep 25, 2018 at 9:55 AM Bryan Bende wrote:
> Hi Martijn,
>
> The request for the "list" processors to support incoming flow files comes up frequently. The issue is that the list processors are meant to continuously watch a given directory/bucket and maintain state about what has been seen and only find newer stuff. So if you let the processor support incoming flow files then it means the directory can potentially be different on every execution of the processor, which then makes it problematic for maintaining state... how do we know if there will ever be another flow file indicating the same directory and whether we need to keep the state around? how much state can we actually store? etc.
>
> I don't know exactly what your use case is, but I think it would be reasonable to support a variation of each "list" processor that supports incoming flow files, but does NOT maintain state. Meaning, it would be used to perform a one-time listing based on the incoming flow file, and if another flow file came in later with the same directory/bucket, it would have no knowledge of the previous execution and thus list everything again.
>
> -Bryan
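The three cases in Matt's proposal can be written down as a small decision function. The sketch below is plain Python rather than NiFi code; `plan_run`, the `directory` attribute name, and the returned dictionary shape are all hypothetical and exist only to make the cases explicit.

```python
from typing import Dict, Optional


def plan_run(
    has_incoming_connection: bool,
    flowfile_attributes: Optional[Dict[str, str]],
    configured_directory: str,
) -> Optional[Dict[str, object]]:
    """Decide what a List processor would do on one trigger.

    Returns None when no work should be performed, otherwise a dict
    saying which directory to list and whether state is used.
    """
    if not has_incoming_connection:
        # Case 1: classic source behavior, incremental listing with state.
        return {"directory": configured_directory, "use_state": True}
    if flowfile_attributes is None:
        # Case 2: connection exists but no flow file is queued.
        return None
    # Case 3: event-driven, configured from attributes, no state kept.
    directory = flowfile_attributes.get("directory", configured_directory)
    return {"directory": directory, "use_state": False}


print(plan_run(False, None, "/data/in"))
print(plan_run(True, None, "/data/in"))
print(plan_run(True, {"directory": "/data/run-42"}, "/data/in"))
```

Written this way, the concern raised elsewhere in the thread is easy to see: the same processor configuration produces stateful or stateless behavior depending only on how it is wired.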
Re: Listing S3
Hi Martijn,

The request for the "list" processors to support incoming flow files comes up frequently. The issue is that the list processors are meant to continuously watch a given directory/bucket and maintain state about what has been seen and only find newer stuff. So if you let the processor support incoming flow files then it means the directory can potentially be different on every execution of the processor, which then makes it problematic for maintaining state... how do we know if there will ever be another flow file indicating the same directory and whether we need to keep the state around? how much state can we actually store? etc.

I don't know exactly what your use case is, but I think it would be reasonable to support a variation of each "list" processor that supports incoming flow files, but does NOT maintain state. Meaning, it would be used to perform a one-time listing based on the incoming flow file, and if another flow file came in later with the same directory/bucket, it would have no knowledge of the previous execution and thus list everything again.

-Bryan

On Tue, Sep 25, 2018 at 2:02 AM Martijn Dekkers wrote:
> Hi Koji,
>
> Thanks, that is exactly the path we took in the end. This is a repeating pattern for us, and we would have preferred to keep it all contained in an ISP. Since the output of the listing is very large, we run into some memory issues at the SplitText step, so we use a few of those in sequence, which is all a bit hacky. When we have some time we will get back to this, and hopefully get it done "correctly".
>
> I am trying to work out what the reasoning is for none of the List-type processors to accept incoming connections. We use them frequently and have to resort to all kinds of acrobatics to work around this. In this instance we use an external script; in some others we have to set up infrastructure outside of NiFi to set parameters via the API. It would be a lot easier and smoother if we could simply accept an incoming connection and use attributes.
>
> Thanks,
>
> Martijn
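To make the state problem Bryan describes concrete, here is a simplified model in plain Python. It is an assumption-laden sketch, not how NiFi actually tracks listings: the `StatefulLister` class, the timestamp-only tracking, and the `(name, timestamp)` entry shape are invented for illustration.

```python
from typing import Dict, List, Tuple

Entry = Tuple[str, int]  # (file name, last-modified timestamp)


class StatefulLister:
    """Remembers, per directory, the newest timestamp already emitted."""

    def __init__(self) -> None:
        self.last_seen: Dict[str, int] = {}

    def list_new(self, directory: str, entries: List[Entry]) -> List[Entry]:
        cutoff = self.last_seen.get(directory, -1)
        fresh = [e for e in entries if e[1] > cutoff]
        if fresh:
            self.last_seen[directory] = max(ts for _, ts in fresh)
        return fresh


def list_all(entries: List[Entry]) -> List[Entry]:
    """Stateless variant: every invocation returns the full listing."""
    return list(entries)


lister = StatefulLister()
run1 = [("x.csv", 100), ("y.csv", 200)]
run2 = run1 + [("z.csv", 300)]

print(lister.list_new("/data/a", run1))  # both files are new
print(lister.list_new("/data/a", run2))  # only z.csv is new
print(list_all(run2))                    # stateless: all three again

# Each distinct directory adds a state entry that is never known to be
# safe to drop, which is the open question when directories arrive via
# flow files.
lister.list_new("/data/b", [("p.csv", 50)])
print(len(lister.last_seen))
```

With a fixed directory the state stays at one entry; with directories supplied per flow file it grows with every new value, and nothing tells the processor when an entry will never be needed again.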
Re: Listing S3
Hi Koji,

Thanks, that is exactly the path we took in the end. This is a repeating pattern for us, and we would have preferred to keep it all contained in an ISP (InvokeScriptedProcessor). Since the output of the listing is very large, we run into some memory issues at the SplitText step, so we use a few of those in sequence, which is all a bit hacky. When we have some time we will get back to this, and hopefully get it done "correctly".

I am trying to work out what the reasoning is for none of the List-type processors to accept incoming connections. We use them frequently and have to resort to all kinds of acrobatics to work around this. In this instance we use an external script; in some others we have to set up infrastructure outside of NiFi to set parameters via the API. It would be a lot easier and smoother if we could simply accept an incoming connection and use attributes.

Thanks,

Martijn

On Tue, 25 Sep 2018 at 02:37, Koji Kawamura wrote:
> Hi Martijn,
>
> I'm not an expert on Jython, but if you already have a Python script using boto3 working fine, then I'd suggest using ExecuteStreamCommand instead.
> For example:
> - you can design the Python script to print a JSON-formatted string describing the listed files
> - then connect the output to SplitJson
> - and use EvaluateJsonPath to extract the required values into FlowFile attributes
> - finally, use FetchS3Object
>
> Thanks,
> Koji
Re: Listing S3
Hi Martijn,

I'm not an expert on Jython, but if you already have a Python script using boto3 working fine, then I'd suggest using ExecuteStreamCommand instead.
For example:
- you can design the Python script to print a JSON-formatted string describing the listed files
- then connect the output to SplitJson
- and use EvaluateJsonPath to extract the required values into FlowFile attributes
- finally, use FetchS3Object

Thanks,
Koji
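A script of the kind Koji describes might look roughly like the sketch below. It is not the script used in this thread: the function names, the command-line argument order, and the `key`/`size` output fields are assumptions made for illustration. The boto3 calls used (`boto3.client`, `get_paginator("list_objects_v2")`, `paginate`) are standard, and `endpoint_url` is the usual way to point the client at an S3-compatible store.

```python
import json
import sys
from typing import Dict, Iterable, List, Optional


def summarize_pages(pages: Iterable[dict], suffix: str = "") -> List[Dict[str, object]]:
    """Flatten list_objects_v2 response pages into small dicts,
    keeping only keys that end with the given suffix."""
    listed = []
    for page in pages:
        # A page for an empty result has no "Contents" entry.
        for obj in page.get("Contents", []):
            if obj["Key"].endswith(suffix):
                listed.append({"key": obj["Key"], "size": obj["Size"]})
    return listed


def list_bucket(bucket: str, prefix: str = "", suffix: str = "",
                endpoint_url: Optional[str] = None) -> List[Dict[str, object]]:
    """List a bucket page by page using boto3's paginator."""
    import boto3  # imported here so the helpers above work without boto3

    client = boto3.client("s3", endpoint_url=endpoint_url)
    paginator = client.get_paginator("list_objects_v2")
    return summarize_pages(paginator.paginate(Bucket=bucket, Prefix=prefix), suffix)


if __name__ == "__main__" and len(sys.argv) >= 2:
    # Hypothetical usage: python list_s3.py BUCKET [PREFIX] [SUFFIX]
    args = sys.argv[1:4]
    # One JSON array on stdout, ready for a downstream JSON splitter.
    print(json.dumps(list_bucket(*args)))
```

ExecuteStreamCommand would run this with the bucket, prefix, and suffix passed as command arguments; the JSON array on stdout can then be split into one flow file per object (for example with a JsonPath such as `$[*]` in SplitJson), and EvaluateJsonPath can pull `$.key` into an attribute for FetchS3Object. Filtering by suffix inside the script keeps the output smaller, which may help with the memory pressure mentioned elsewhere in the thread.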
Listing S3
Hello all,

I have a head-breaking issue, and I hope someone is able to assist. We have a requirement to pull a list of files from an S3-compatible store on the basis of an incoming flowfile containing the required attributes, such as the bucket and a few others, including a file suffix. Whilst we can filter for the suffix downstream from the listing, the ListS3 processor doesn't support incoming flowfiles, so we cannot use it.

We want to implement an InvokeScriptedProcessor (ISP) with Jython, and we have a very annoying issue we cannot track down. We first implemented a simple Python script that uses boto3 to fetch a list of files, typically 20k files in the list. On the terminal all works as expected. When adjusting this for the ISP we receive the following error:

java.lang.reflect.UndeclaredThrowableException: null [...] Caused by: javax.script.ScriptException: KeyError: 'ConfigParser' in