Stamatis,

Small comment from me, for distributed systems is more efficient if
SINGLETON distribution appears only on root node, in other words as late as
possible, it allows to filter or cut rows (project) on source nodes which
significantly reduces network costs.

To do that (In case we use original volcano approach) we need to request
all possible distribution traits on project node (some of them we cannot
request, because we don't know whether a repartitioning and rehashing
happens in the middle) to provide a place for possible converted inputs
(possibly leaving empty subsets, which causes exceptions in current
implementation). Such approach significant slowdowns planning phase.

Regards,
Igor

пт, 1 нояб. 2019 г., 11:42 Stamatis Zampetakis <zabe...@gmail.com>:

> The discussion is interesting thanks for the nice examples.
> I am replying in this thread since it has all the examples and the history
> of the conversation.
>
> *Part I: Distribution physical or logical property*
>
> The Volcano paper writes the following regarding properties:
>
> "Logical properties can be derived from the logical algebra expression and
> include schema, expected size, etc., while physical properties depend on
> algorithms, e.g., sort order, partitioning, etc."
>
> We could say that partitioning is not only a property which depends on an
> algorithm but a property which depends on the schema. When you have a table
> you know in advance that the particular table is replicated.
>
> Basically, this means that the starting "logical" plan in your case could
> be the following:
> RootLogicalRel[convention=LOGICAL, distribution=ANY]
> -> ProjectLogicalRel[convention=LOGICAL, distribution=ANY]
>   -> MapScanLogicalRel[convention=LOGICAL, distribution=REPLICATED]
>
> When you are about to create your physical projection it is not necessary
> to create three equivalent paths but the most promising by asking the input
> what can it offer as traits. Again this partially overlaps with the
> discussion about "On demand traitset" [1] where as I wrote there is already
> a mechanism to derive traits from children (collation, distribution, etc.).
>
> Note that I am not saying that there is nothing more to be done to improve
> the Calcite optimizer; I am just trying to provide a viable alternative
> given the current situation of the project.
>
> *Part II: Example with the original Volcano optimizer*
>
> Now let me try to show how the original (from the paper) Volcano optimizer
> could handle your query.
>
> The optimization starts and we request two properties convention and
> distribution.
>
> RootLogicalRel[convention=PHYSICAL, distribution=SINGLETON]
> -> ProjectLogicalRel
>   -> MapScanLogicalRel
>
> Rule 1: RootLogicalRel -> RootPhysicalRel applies the algorithm but cannot
> satisfy the SINGLETON property so it propagates the demand
>
> RootPhysicalRel
> -> ProjectLogicalRel [convention=PHYSICAL, distribution=SINGLETON]
>   -> MapScanLogicalRel
>
> Rule 2: ProjectLogicalRel -> ProjectPhysicalRel applies the algorithm but
> cannot satisfy the SINGLETON property so it propagates the demand
>
> RootPhysicalRel
> -> ProjectPhysicalRel
>   -> MapScanLogicalRel [convention=PHYSICAL, distribution=SINGLETON]
>
> Rule 3: MapScanLogicalRel -> MapScanPhysicalRel applies the algorithm and
> given that the table is replicated it satisfies the distribution property.
>
> RootPhysicalRel
> -> ProjectPhysicalRel
>   -> MapScanPhysicalRel
>
> So we end up with a complete plan that satisfies all properties and does
> not have redundant exchange operators.
> Moreover we didn't require all possible distributions when we applied Rule
> 2 since we just want to satisfy the SINGLETON property requested by the
> parent.
> The example above does not demonstrate the additional optimization paths
> which would be apply an enforcer (an Exchange operator to satisfy the
> SINGLETON property) at a higher level than the scan.
>
> Would this be an acceptable approach? Does it still suffer from a big
> search space?
>
> To make the analog with the actual Volcano optimizer in Calcite the thing
> that may be missing is to be able to know what trait was requested by the
> parent during the application of Rule 2.
>
> Best,
> Stamatis
>
> [1]
>
> https://lists.apache.org/thread.html/79dac47ea50b5dfbd3f234e368ed61d247fb0eb989f87fe01aedaf25@%3Cdev.calcite.apache.org%3E
>
> On Wed, Oct 30, 2019 at 7:24 PM Vladimir Ozerov <ppoze...@gmail.com>
> wrote:
>
> > Hi Stamatis,
> >
> > The problem that the presented reproducer is a very simplified version of
> > what is actually needed. In reality, there are several distribution
> types -
> > PARTITIONED, REPLICATED, SINGLETON. To make things worse, PARTITIONED may
> > or may not have concrete distribution fields. In theory, I can create one
> > transformation per distribution type, but that would increase plan space
> > significantly. In my sample "Root <- Project <- Scan" plan, there is no
> > room for optimization at all, we only need to convert the nodes and
> > propagate traits. But if I follow the proposed approach, the planner
> would
> > create three equivalent paths. For complex queries, this may increase
> > optimization time significantly.
> >
> > What I need instead is to gradually convert and optimize nodes
> *bottom-up*,
> > instead of top-bottom. That is, create a physical scan first, then
> create a
> > physical project on top of it, etc. But at the same time, some rules
> still
> > require the top-bottom approach. So essentially I need the optimizer to
> do
> > both. Abstract converters help me establish bottom-up preparation but do
> > this at the cost of considering too many trait pairs, and as a result,
> also
> > pollute the search space.
> >
> > To the contrast, precise command "I transformed the node A to node B,
> > please re-trigger the rules for A's parents" would allow us to re-trigger
> > only required rules, without adding more nodes.
> >
> > Does it make sense?
> >
> > Regards,
> > Vladimir.
> >
> > ср, 30 окт. 2019 г. в 21:02, Stamatis Zampetakis <zabe...@gmail.com>:
> >
> > > I admit that I didn't thoroughly read the exchanges so far but forcing
> > the
> > > rules trigger does not seem the right approach.
> > >
> > > Other than that I would like to backtrack a bit to point 4.3 raised by
> > > Vladimir.
> > >
> > > "ProjectPhysicalRule [6] - transforms logical project to physical
> > > project *ONLY* if there is an underlying physical input with REPLICATED
> > or
> > > SINGLETON distribution"
> > >
> > > The rule could be modified to do the following two transformations:
> > > 1. Create a physical project and require the input to be REPLICATED.
> > > 2. Create a physical project and require
> > > the input to be SINGLETON.
> > >
> > > I would assume that afterwards when your scan rule fires it should go
> to
> > > the appropriate subset and give you back the desired plan. Is there a
> > > problem with this approach?
> > >
> > > Best,
> > > Stamatis
> > >
> > > On Wed, Oct 30, 2019, 5:52 PM Seliverstov Igor <gvvinbl...@gmail.com>
> > > wrote:
> > >
> > > > Unfortunately it requires package-private API usage.
> > > >
> > > > Best solution there if Calcite provide it for end users.
> > > >
> > > > ср, 30 окт. 2019 г., 18:48 Vladimir Ozerov <ppoze...@gmail.com>:
> > > >
> > > > > “e pension” == “expand conversion” :)
> > > > >
> > > > > ср, 30 окт. 2019 г. в 18:46, Vladimir Ozerov <ppoze...@gmail.com>:
> > > > >
> > > > > > Yes, that may work. Even if e pension rule is used, for the most
> > > cases
> > > > it
> > > > > > will not trigger any real conversions, since we are moving from
> > > > abstract
> > > > > > convention to physical, and created converters will have the
> > opposite
> > > > > trait
> > > > > > direction (from physical to abstract).
> > > > > >
> > > > > > But again - ideally we only need to re-trigger the rules for a
> > > specific
> > > > > > node, no more than that. So API support like
> > > > > > “VolcanoPlanner.forceRules(RelNode)” would be very convenient.
> > > > > >
> > > > > > What do you think?
> > > > > >
> > > > > > ср, 30 окт. 2019 г. в 17:56, Seliverstov Igor <
> > gvvinbl...@gmail.com
> > > >:
> > > > > >
> > > > > >> I considered manual rules calling too, for now we use abstract
> > > > > converters
> > > > > >> +
> > > > > >> ExpandConversionRule for exchanges producing.
> > > > > >>
> > > > > >> You may create such converters manually (checking appropriate
> > > subset)
> > > > > this
> > > > > >> case you may reduce created converters count, also, a converter
> > is a
> > > > > quite
> > > > > >> special node, that does almost nothing (without corresponding
> > rule)
> > > it
> > > > > may
> > > > > >> be used just as a rule trigger.
> > > > > >>
> > > > > >> Regards,
> > > > > >> Igor
> > > > > >>
> > > > > >> ср, 30 окт. 2019 г., 17:31 Vladimir Ozerov <ppoze...@gmail.com
> >:
> > > > > >>
> > > > > >> > One funny hack which helped me is manual registration of a
> fake
> > > > > RelNode
> > > > > >> > with desired traits through VolcanoPlanner.register() method.
> > But
> > > > > again,
> > > > > >> > this leads to trashing. What could really help is a call to
> > > > > >> > VolcanoPlanner.fireRules() with desired rel. But this doesn't
> > work
> > > > out
> > > > > >> of
> > > > > >> > the box since some internals of the rule queue needs to be
> > > adjusted.
> > > > > >> >
> > > > > >> > What does the community think about adding a method which will
> > > > re-add
> > > > > >> rules
> > > > > >> > applicable to the specific RelNode to the rule queue?
> > > > > >> >
> > > > > >> > ср, 30 окт. 2019 г. в 17:00, Vladimir Ozerov <
> > ppoze...@gmail.com
> > > >:
> > > > > >> >
> > > > > >> > > Hi Igor,
> > > > > >> > >
> > > > > >> > > Yes, I came to the same conclusion, thank you. This is how
> it
> > > > > >> basically
> > > > > >> > > happens when converters are disabled:
> > > > > >> > > 1) We start with initial tree: [LogicalProject] <-
> > [LogicalScan]
> > > > > >> > > 2) Then we convert LogicalScan to PhysicalScan, so it is
> added
> > > to
> > > > > the
> > > > > >> > > set: [LogicalProject] <- [LogicalScan, PhysicalScan]
> > > > > >> > > 3) Finally, when it is time to fire a rule for PhysicalScan,
> > we
> > > > try
> > > > > to
> > > > > >> > get
> > > > > >> > > parents of that scan set with traits of the PhysicalScan.
> > Since
> > > > > there
> > > > > >> are
> > > > > >> > > no such parents (we skipped it intentionally), the rule is
> not
> > > > > queued.
> > > > > >> > >
> > > > > >> > > But when converters are enabled, a converter rel is created:
> > > > > >> > [LogicalProject]
> > > > > >> > > <- [LogicalScan, PhysicalScan,
> > ConverterFromPhysicalToLogical].
> > > No
> > > > > >> rules
> > > > > >> > > are fired for PhysicalScan again, but they are fired for
> > > converter
> > > > > >> since
> > > > > >> > > it has the necessary LOGICAL trait.
> > > > > >> > >
> > > > > >> > > It makes sense, that converters essentially allow forcing
> rule
> > > > > >> invocation
> > > > > >> > > on parents, even if the child was created with different
> > traits.
> > > > But
> > > > > >> it
> > > > > >> > > seems that this mechanism may be too heavy for complex
> queries
> > > > > >> because it
> > > > > >> > > literally creates hundreds of new converter rels and
> triggers
> > > > rules
> > > > > >> over
> > > > > >> > > and over again.
> > > > > >> > >
> > > > > >> > > We need some fine-grained alternative. Basically, what would
> > be
> > > > > enough
> > > > > >> > for
> > > > > >> > > me is to let the planner know somehow: "I created that rel,
> > and
> > > I
> > > > > want
> > > > > >> > you
> > > > > >> > > to execute parent rules not only using its trait but also on
> > > this
> > > > > and
> > > > > >> > those
> > > > > >> > > traits."
> > > > > >> > > Is there any API in Calcite which allows doing this without
> > > > > creating a
> > > > > >> > new
> > > > > >> > > rel node?
> > > > > >> > >
> > > > > >> > > Regards,
> > > > > >> > > Vladimir.
> > > > > >> > >
> > > > > >> > >
> > > > > >> > > ср, 30 окт. 2019 г. в 09:25, Seliverstov Igor <
> > > > gvvinbl...@gmail.com
> > > > > >:
> > > > > >> > >
> > > > > >> > >> Vladimir,
> > > > > >> > >>
> > > > > >> > >> Probably it'll help you:
> > > > > >> > >>
> > > > > >> > >> Seems the cause of issue in RelSubset.getParentRels()  The
> > > check
> > > > > used
> > > > > >> > when
> > > > > >> > >> the planner schedules newly matched rules after successful
> > > > > >> > transformation
> > > > > >> > >> (see VolcanoRuleCall.matchRecurse), it prevents the parent
> > rule
> > > > be
> > > > > >> > applied
> > > > > >> > >> once again (here your logical project with an input having
> > ANY
> > > > > >> > >> distribution
> > > > > >> > >> doesn't satisfy a transformed input traits).
> > > > > >> > >>
> > > > > >> > >> In our case we use another workaround, so there are also
> much
> > > > more
> > > > > >> > >> transformations than we wanted, so the desired rule is
> > > triggered.
> > > > > >> > >>
> > > > > >> > >>
> > > > > >> > >> вт, 29 окт. 2019 г., 14:46 Vladimir Ozerov <
> > ppoze...@gmail.com
> > > >:
> > > > > >> > >>
> > > > > >> > >> > Hi Vladimir,
> > > > > >> > >> >
> > > > > >> > >> > I am sorry. Pushed, it works now.
> > > > > >> > >> >
> > > > > >> > >> > вт, 29 окт. 2019 г. в 14:41, Vladimir Sitnikov <
> > > > > >> > >> > sitnikov.vladi...@gmail.com
> > > > > >> > >> > >:
> > > > > >> > >> >
> > > > > >> > >> > > > mvn clean test
> > > > > >> > >> > >
> > > > > >> > >> > > [ERROR] The goal you specified requires a project to
> > > execute
> > > > > but
> > > > > >> > >> there is
> > > > > >> > >> > > no POM in this directory
> > > > > >> > >> > >
> > > > > >> > >> > > Vladimir, please push missing files
> > > > > >> > >> > >
> > > > > >> > >> > > Vladimir
> > > > > >> > >> > >
> > > > > >> > >> >
> > > > > >> > >>
> > > > > >> > >
> > > > > >> >
> > > > > >>
> > > > > >
> > > > >
> > > >
> > >
> >
>

Reply via email to