I agree with Dawid.

Maybe one thing to add is that reusing parts of a pipeline is possible via StatementSets in the TableEnvironment. They allow you to add multiple queries that consume from a common part of the pipeline (for example, a common source). However, all of that is compiled into one big job that is static at runtime; the queries are not isolated from each other.
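A minimal sketch of the StatementSet approach (table names, schemas, and connector options below are made up for illustration; API names as in recent Flink releases):

```java
import org.apache.flink.table.api.EnvironmentSettings;
import org.apache.flink.table.api.StatementSet;
import org.apache.flink.table.api.TableEnvironment;

public class SharedSourcePipeline {
    public static void main(String[] args) {
        TableEnvironment tEnv = TableEnvironment.create(
                EnvironmentSettings.newInstance().inStreamingMode().build());

        // A single Kafka source table, shared by both queries below.
        tEnv.executeSql(
                "CREATE TABLE events (user_id STRING, amount INT) WITH ("
                + " 'connector' = 'kafka',"
                + " 'topic' = 'events',"
                + " 'properties.bootstrap.servers' = 'localhost:9092',"
                + " 'format' = 'json')");

        tEnv.executeSql(
                "CREATE TABLE big_amounts (user_id STRING, amount INT) WITH ("
                + " 'connector' = 'kafka',"
                + " 'topic' = 'big-amounts',"
                + " 'properties.bootstrap.servers' = 'localhost:9092',"
                + " 'format' = 'json')");

        tEnv.executeSql(
                "CREATE TABLE small_amounts (user_id STRING, amount INT) WITH ("
                + " 'connector' = 'kafka',"
                + " 'topic' = 'small-amounts',"
                + " 'properties.bootstrap.servers' = 'localhost:9092',"
                + " 'format' = 'json')");

        // Both INSERTs are added to one StatementSet and compiled into a
        // single job graph, so the 'events' topic is read only once.
        StatementSet set = tEnv.createStatementSet();
        set.addInsertSql(
                "INSERT INTO big_amounts SELECT user_id, amount FROM events WHERE amount > 100");
        set.addInsertSql(
                "INSERT INTO small_amounts SELECT user_id, amount FROM events WHERE amount <= 100");
        set.execute();
    }
}
```

Note that both queries live in the same job: you cannot stop, upgrade, or scale one of them without restarting the whole job.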

One option is to introduce an additional Flink job that multiplexes the source Kafka topic into multiple Kafka topics, so that isolated downstream jobs can consume this intermediate storage independently.
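Such a fan-out job could be sketched roughly as follows (the table names and the `event_type` column are made up; this assumes a TableEnvironment `tEnv` with the source table `all_events` and the two Kafka sink tables already registered):

```java
// Fan-out job: reads the shared topic once and writes each category to
// its own Kafka topic. Isolated downstream jobs can then each consume
// only the intermediate topic they need, and can be started, stopped,
// and upgraded independently of each other.
StatementSet fanOut = tEnv.createStatementSet();
fanOut.addInsertSql(
        "INSERT INTO orders_topic SELECT * FROM all_events WHERE event_type = 'order'");
fanOut.addInsertSql(
        "INSERT INTO clicks_topic SELECT * FROM all_events WHERE event_type = 'click'");
fanOut.execute();
```

The trade-off is an extra hop through Kafka (added latency and storage) in exchange for isolation between the downstream jobs.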

I hope this helps.

Regards,
Timo

On 24.11.20 16:54, Dawid Wysakowicz wrote:
Hi,

Really sorry for a late reply.

To the best of my knowledge there is no possibility to "attach" to a source/reader of a different job. Every job reads the source separately.
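To illustrate: submitting two INSERT statements independently (rather than through a single StatementSet) produces two separate jobs, each with its own reader (a sketch; `tEnv`, `events`, and the sink tables are assumed to be set up already):

```java
// Each executeSql("INSERT INTO ...") call compiles and submits its own
// job graph, so each job gets its own Kafka consumer for 'events' --
// the topic is read twice, once per job.
tEnv.executeSql("INSERT INTO sink_a SELECT * FROM events WHERE amount > 100");
tEnv.executeSql("INSERT INTO sink_b SELECT * FROM events WHERE amount <= 100");
```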

`The GenericInMemoryCatalog is an in-memory
implementation of a catalog. All objects will be available only for the
lifetime of the session.`. I presume, in session mode, we can share Kafka
source for multiple SQL jobs?

Unfortunately this is a wrong assumption. Catalogs store the metadata of tables, such as connection parameters, schema, etc., not the data itself or parts of the job graph. The information from a catalog can be used to create an execution graph that can be submitted to a cluster. It has nothing to do with a session cluster; the "session" here means the lifetime of the GenericInMemoryCatalog within a single job/program.

Both queries will share the same reader as they are part of a single job
graph. Can we somehow take a snapshot of this and submit another query with
them again under the same job graph?

Again, unfortunately, there are no guarantees this will work. As of now it is a limitation of SQL that it does not support stateful upgrades of a graph or of the Flink version. As Till said, if the plans contain the same sub-plans, they should be able to match. However, with such extensive changes to the graph I would not count on that happening. It can work for rather simple changes, e.g. changing a predicate (but even that can greatly affect the plan if the predicate could have been optimized). There have been, and still are, discussions going on to improve the situation here.

A proper solution to the problem for a STREAMING job would be rather hard in my opinion, as we would have to somehow keep the state of the shared source across multiple different jobs. We would need to know, e.g., the offsets that a certain job has consumed up to a certain checkpoint, what to do if a particular query requests to start reading from offsets in the past, etc.

There is some ongoing effort to support caching of query results that could be reused between jobs in the same cluster, as better support for interactive programming [1], but I don't think it will support the STREAMING mode.

Just as a side note: I am not a Spark expert and I might be completely wrong, but as far as I am familiar with Spark, it also does not support dynamically reusing streaming sources. It does implement caching of intermediate shuffles, which is what FLIP-36 resembles.

Best regards,

Dawid

[1] https://cwiki.apache.org/confluence/x/8hclBg

On 23/11/2020 21:09, lalala wrote:
Hi Till,

Thank you for your comment. I am looking forward to hearing from Timo and
Dawid as well.

Best regards,



