Re: [DISCUSS] FLIP-419: Optimize multi-sink query plan generation

Jeyhun Karimov Wed, 17 Jan 2024 12:11:30 -0800

Hi Jim,

Thanks for your comments. Please find my answers below:


   1. StreamOptimizeContext may still be needed to pass the fact that we
>    are optimizing a streaming query.  I don't think this class will go away
>    completely.  (I agree it may become more simple if the kind or
>    mini-batch configuration can be removed.)


What I meant is that it might go away if we get rid of
*isUpdateBeforeRequired* and *getMiniBatchInterval *fields.
Of course if we can get rid of only one of them, then the
*StreamOptimizeContext* class will not be removed but get simpler.
Will update the doc accordingly.

   2. How are the mini-batch and changelog inference rules tightly coupled?
>    I looked a little bit and I haven't seen any connection between them.
> It
>    seems like the changelog inference is what needs to run multiple times.


Sorry for the misunderstanding. The mini-batch and changelog inference are
not coupled among themselves but with the high-level optimization logic.
The idea is to separate the query optimization into 1) optimize 2) enrich
with changelog inference 3) enrich with mini-batch interval inference and
4) rewrite

   3. I think your point about code complexity is unnecessary.
> StreamOptimizeContext
>    extends org.apache.calcite.plan.Context which is used an interface to
> pass
>    information and objects through the Calcite stack.


I partially agree. Please see my answer above for the question 1.

   4. Is an alternative where the complexity of the changelog optimization
>    can be moved into the `FlinkChangelogModeInferenceProgram`?  (If this is
>    coupling between the mini-batch and changelog rules, then this would not
>    make sense.)


Good point. Yes, this is definitely an alternative.

   5. There are some other smaller refactorings.  I tried some of them
>    here: https://github.com/apache/flink/pull/24108 Mostly, it is syntax
>    and using lazy vals to avoid recomputing various things.  (Feel free to
>    take whatever actually works; I haven't run the tests.)


I took a look at your PR. For sure, some of the refactorings I will reuse
(probably rebase by the time I have this ready :))


Separately, folks on the Calcite dev list are thinking about multi-query
> optimization:
> https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k
> https://issues.apache.org/jira/browse/CALCITE-6188


Seems interesting. But Calcite's MQO approach will probably require some
drastic changes in our codebase once we adopt it.
This approach is more incremental.

Hope my comments answer your questions.

Regards,
Jeyhun Karimov

On Wed, Jan 17, 2024 at 2:36 AM Jim Hughes <[email protected]>
wrote:

> Hi Jeyhun,
>
>
> Generally, I like the idea of speeding up the optimizer in the case of
> multiple queries!
>
>
> I am new to the optimizer, but I have a few comments / questions.
>
>
>
>    1. StreamOptimizeContext may still be needed to pass the fact that we
>    are optimizing a streaming query.  I don't think this class will go away
>    completely.  (I agree it may become more simple if the kind or
>    mini-batch configuration can be removed.)
>    2. How are the mini-batch and changelog inference rules tightly coupled?
>    I looked a little bit and I haven't seen any connection between them.
> It
>    seems like the changelog inference is what needs to run multiple times.
>    3. I think your point about code complexity is unnecessary.
> StreamOptimizeContext
>    extends org.apache.calcite.plan.Context which is used an interface to
> pass
>    information and objects through the Calcite stack.
>    4. Is an alternative where the complexity of the changelog optimization
>    can be moved into the `FlinkChangelogModeInferenceProgram`?  (If this is
>    coupling between the mini-batch and changelog rules, then this would not
>    make sense.)
>    5. There are some other smaller refactorings.  I tried some of them
>    here: https://github.com/apache/flink/pull/24108 Mostly, it is syntax
>    and using lazy vals to avoid recomputing various things.  (Feel free to
>    take whatever actually works; I haven't run the tests.)
>
> Separately, folks on the Calcite dev list are thinking about multi-query
> optimization:
> https://lists.apache.org/thread/mcdqwrtpx0os54t2nn9vtk17spkp5o5k
> https://issues.apache.org/jira/browse/CALCITE-6188
>
> Cheers,
>
>
> Jim
>
> On Tue, Jan 16, 2024 at 5:45 PM Jeyhun Karimov <[email protected]>
> wrote:
>
> > Hi devs,
> >
> > I’d like to start a discussion on FLIP-419: Optimize multi-sink query
> plan
> > generation [1].
> >
> >
> > Currently, the optimization process of multi-sink query plans are
> > suboptimal: 1) it requires to go through the optimization process several
> > times and 2) as a result of this some low-level code complexity is
> > introduced on high level optimization classes such
> > as StreamCommonSubGraphBasedOptimizer.
> >
> >
> > To address this issue, this FLIP introduces  to decouple changelog and
> > mini-batch interval inference from the main optimization process.
> >
> > Please find more details in the FLIP wiki document [1]. Looking forward
> to
> > your feedback.
> >
> > [1]
> >
> >
> https://cwiki.apache.org/confluence/display/FLINK/FLIP-419%3A+Optimize+multi-sink+query+plan+generation
> >
> >
> > Regards,
> > Jeyhun Karimov
> >
>

Re: [DISCUSS] FLIP-419: Optimize multi-sink query plan generation

Reply via email to