Re: Implementing a “join” between a DataStream and a “set of rules”

Fabian Hueske Wed, 06 Jun 2018 04:09:02 -0700

Hi Turar,

Managed state is a general concept in Flink's DataStream API and not
specifically designed for windows (although they use internally).
I'd recommend the broadcast state that Aljoscha proposed. It was
specifically designed for these use cases.


It is true that the state is currently maintained in memory, but it is not
bound to 5MB but rather the size of your heap (e.g., 100s MBs / GBs) if you
configure a state backend that writes to a distributed file system (eg.g.,
FSStateBackend or RocksDBStateBackend). There is some ongoing work to also
support broadcast state in RocksDB.

Best, Fabian



2018-06-05 22:53 GMT+02:00 Sandybayev, Turar (CAI - Atlanta) <
turar.sandyba...@coxautoinc.com>:

> Hi Amit,
>
> In my current approach the idea for updating rule set data was to have
> some kind of a "control" stream that will trigger an update to a local data
> structure, or a "control" event within the main data stream that will
> trigger the same.
>
> Using external system like a cache or database is also an option, but that
> still will require some kind of a trigger to reload rule set or a single
> rule, in case of any updates to it.
>
> Others have suggested using Flink managed state, but I'm still not sure
> whether that is a generally recommended approach in this scenario, as it
> seems like it was more meant for windowing-type processing instead?
>
> Thanks,
> Turar
>
> On 6/5/18, 8:46 AM, "Amit Jain" <aj201...@gmail.com> wrote:
>
>     Hi Sandybayev,
>
>     In the current state, Flink does not provide a solution to the
>     mentioned use case. However, there is open FLIP[1] [2] which has been
>     created to address the same.
>
>     I can see in your current approach, you are not able to update the
>     rule set data. I think you can update rule set data by building
>     DataStream around changelogs which are stored in message
>     queue/distributed file system.
>     OR
>     You can store rule set data in the external system where you can query
>     for incoming keys from Flink.
>
>     [1]: https://cwiki.apache.org/confluence/display/FLINK/FLIP-
> 17+Side+Inputs+for+DataStream+API
>     [2]: https://issues.apache.org/jira/browse/FLINK-6131
>
>     On Tue, Jun 5, 2018 at 5:52 PM, Sandybayev, Turar (CAI - Atlanta)
>     <turar.sandyba...@coxautoinc.com> wrote:
>     > Hi,
>     >
>     >
>     >
>     > What is the best practice recommendation for the following use case?
> We need
>     > to match a stream against a set of “rules”, which are essentially a
> Flink
>     > DataSet concept. Updates to this “rules set" are possible but not
> frequent.
>     > Each stream event must be checked against all the records in “rules
> set”,
>     > and each match produces one or more events into a sink. Number of
> records in
>     > a rule set are in the 6 digit range.
>     >
>     >
>     >
>     > Currently we're simply loading rules into a local List of rules and
> using
>     > flatMap over an incoming DataStream. Inside flatMap, we're just
> iterating
>     > over a list comparing each event to each rule.
>     >
>     >
>     >
>     > To speed up the iteration, we can also split the list into several
> batches,
>     > essentially creating a list of lists, and creating a separate thread
> to
>     > iterate over each sub-list (using Futures in either Java or Scala).
>     >
>     >
>     >
>     > Questions:
>     >
>     > 1.            Is there a better way to do this kind of a join?
>     >
>     > 2.            If not, is it safe to add additional parallelism by
> creating
>     > new threads inside each flatMap operation, on top of what Flink is
> already
>     > doing?
>     >
>     >
>     >
>     > Thanks in advance!
>     >
>     > Turar
>     >
>     >
>
>
>

Re: Implementing a “join” between a DataStream and a “set of rules”

Reply via email to