Answers inline:

On Wed, Feb 3, 2021 at 3:55 PM Marco Villalobos <mvillalo...@kineteque.com>
wrote:

> Hi Gorden,
>
> Thank you very much for the detailed response.
>
> I considered using the state-state processor API, however, our enrichment
> requirements make the state-processor API a bit inconvenient.
> 1. if an element from the stream matches a record in the database then it
> can remain in the cache a very long time (potentially forever).
> 2. if an element from the stream does not match a record in the database
> then that miss cannot be cached a very long time because that record might
> be added to the database and we have to pick it up in a timely manner.
> 3. Our stream has many elements that lack enrichment information in the
> database.
>
> Thus, for that reason, the state processor api only really helps with
> records that already exist in the database, even though the stream has many
> records that do not exist.
>
> That is why I was brainstorming over my idea of using an iterative stream
> that uses caching in the body, but AsyncIO in a feedback loop.
>
> You mentioned "in general I think it is currently discouraged to us it
> (iterative streams)." May I ask what is your source for that statement? I
> see no mention of any discouragement in Flink's documentation.
>

This SO thread contains some answers, and links to some further answers:
https://stackoverflow.com/questions/61710605/flink-iterations-in-data-stream-api-disadvantages


>
> I will look into how State Functions can help me in this scenario. I have
> not read up much on stateful functions.
>
> If I were to write a proof of concept, and my database queries were
> performed with JDBC, could I just write an embedded function that performs
> the JDBC call directly (I want to avoid changing our deployment topology
> for now) and package it with my Data Stream Job?
>

Yes, you can establish JDBC connections directly in Flink user functions.

>
> Thank you.
>
> > On Feb 2, 2021, at 10:08 PM, Tzu-Li (Gordon) Tai <tzuli...@apache.org>
> wrote:
> >
> > Hi Marco,
> >
> > In the ideal setup, enrichment data existing in external databases is
> > bootstrapped into the streaming job via Flink's State Processor API, and
> any
> > follow-up changes to the enrichment data is streamed into the job as a
> > second union input on the enrichment operator.
> > For this solution to scale, lookups to the enrichment data needs to be by
> > the same key as the input data, i.e. the enrichment data is
> co-partitioned
> > with the input data stream.
> >
> > I assume you've already thought about whether or not this would work for
> > your case, as it's a common setup for streaming enrichment.
> >
> > Otherwise, I believe your brainstorming is heading in the right
> direction,
> > in the case that remote database lookups + local caching in state is a
> must.
> > I'm personally not familiar with the iterative streams in Flink, but in
> > general I think it is currently discouraged to use it.
> >
> > On the other hand, I think using Stateful Function's [1] programing
> > abstraction might work here, as it allows arbitrary messaging between
> > functions and cyclic dataflows.
> > There's also an SDK that allows you to embed StateFun functions within a
> > Flink DataStream job [2].
> >
> > Very briefly, the way you would model this database cache hit / remote
> > lookup is by implementing a function, e.g. called DatabaseCache.
> > The function would expect message types of Lookup(lookupKey), and replies
> > with a response of Result(lookupKey, value). The abstraction allows you,
> for
> > on incoming message, to register state (similar to vanilla Flink), as
> well
> > as register async operations with which you'll use to perform remote
> > database lookups in case of cache / state miss. It also provides means
> for
> > "timers" in the form of delayed messages being sent to itself, if you
> need
> > some mechanism for cache invalidation.
> >
> > Hope this provides some direction for you to think about!
> >
> > Cheers,
> > Gordon
> >
> > [1]
> https://ci.apache.org/projects/flink/flink-statefun-docs-release-2.2/
> > [2]
> >
> https://ci.apache.org/projects/flink/flink-statefun-docs-release-2.2/sdk/flink-datastream.html
> >
> >
> >
> > --
> > Sent from:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/
>
>

Reply via email to