Re: Allow Cascades driver invoking "derive" on the nodes produced by "passThrough"

Alessandro Solimando Fri, 11 Feb 2022 01:15:43 -0800

Hello everyone,
@Vladimir, +1 on the change introducing "enforceDerive()".


@Roman, could you walk us through the limitations you found that forced you
to copy-paste the whole class?

Maybe there is some middle ground for your problem(s) too, similar in
spirit to what Vladimir proposed for the other limitation.

I am not against making the class more public if necessary, but it would be
nice to have a discussion here before going down that path.
If the discussion leads to a better design of the original class, all
projects would benefit from that.

Best regards,
Alessandro

On Fri, 11 Feb 2022 at 04:14, Roman Kondakov <kondako...@mail.ru.invalid>
wrote:

> Hi Vladimir,
>
> +1 for making the rule driver more public. We've faced similar problems
> in the downstream project. The solution was to copy and paste the
> TopDownRuleDrive code with small fixes since it was not possible to
> override the default behavior.
>
> --
> Roman Kondakov
>
>
> On 11.02.2022 02:50, Vladimir Ozerov wrote:
> > Hi,
> >
> > In the Cascades driver, it is possible to propagate the requests top-down
> > using the "passThrough", method and then notify parents bottom-up about
> the
> > concrete physical implementations of inputs using the "derive" method.
> >
> > In some optimizers, the valid parent node cannot be created before the
> > trait sets of inputs are known. An example is a custom distribution trait
> > that includes the number of shards in the system. The parent operator
> alone
> > may guess the distribution keys, but cannot know the number of input
> > shards. To mitigate this, you may create a "template" node with an
> infinite
> > cost from within the optimization rule that will propagate the
> > passThrough/drive calls but would never participate in the final plan.
> >
> > Currency, the top-down driver designed in a way that the nodes created
> from
> > the "passThrough" method are not notified on the "derive" stage. This
> leads
> > to the incomplete exploration of the search space. For example, the rule
> > may produce the node "A1.template" that will be converted into a normal
> > "A1" node in the derive phase. However, if the parent operator produced
> > "A2.template" from "A1.template" using pass-through mechanics, the
> > "A2.template" will never be notified about the concrete input traits,
> > possibly losing the optimal plan. This is especially painful in
> distributed
> > engines, where the number of shards is important for the placement of
> > Shuffle operators.
> >
> > It seems that the problem could be solved with relatively low effort. The
> > "derive" is not invoked on the nodes created from the "passThrough"
> method,
> > because such nodes are placed in the "passThroughCache" collection.
> Instead
> > of doing this unconditionally, we may introduce an additional predicate
> > that would selectively enforce "derive" on such nodes. For example, this
> > could be a default method in the PhysicalNode interface, like:
> >
> > interface PhysicalNode {
> >    default boolean enforceDerive() { return false; }
> > }
> >
> > If there are no objections, I'll proceed with this change.
> >
> > Alternatively, we may make the TopDownRuleDriver more "public", so that
> the
> > user can extend it and decide within the driver whether to cache a
> > particular node or not.
> >
> > I would appreciate your feedback on the matter.
> >
> > Regards,
> > Vladimir.
> >
>

Re: Allow Cascades driver invoking "derive" on the nodes produced by "passThrough"

Reply via email to