Re: [DISCUSS] Towards Cascades Optimizer

2020-04-20 Thread hanu mapr
Hello Haisheng,


Thanks for the detailed analysis on the support for cascades framework. I
am quite interested to be part of the new optimization framework. I believe
this a very important infrastructural work to make calcite a robust query
optimizer.


I like your approach on the trait propagation and derivation of the traits
from child nodes. I have a question though in the example you provided.
Shouldn't HashJoin make more sense in place of MergeJoin as it can
passThrough the traits like Hash whereas MergeJoin needs the sorted
traits.(Please correct me if I am missing anything here).


How are you imagining the new Optimizer interface from user point of view.
Are you planning to use the same interface as that of VolcanoPlanner?


The concern about the optimizer can take up huge time for optimizing large
queries is a genuine. At the time when I was part of writing an optimizer
for a federated engine, I also dealt with it by making the search
time-bound. As mentioned even iteration based approach might be viable but
I think a little more thought might be needed to finalize.


I am of the opinion that option b add a new Planner might be the right
approach. We can continuously work on the new optimizer and make it robust
so that switching to the new optimizer can be seamless and can be
controlled by the user. From my experience any bug in the optimizer is
quite subtle and can have high magnitude degradation in performance, so
tweaking the existing VolcanoOptimizer can be risky.


Thanks,

-Hanumath Maduri

On Mon, Apr 20, 2020 at 11:04 AM Xiening Dai  wrote:

> Hi Jinpeng,
>
> Regarding this comment - I believe there are ways to calculate the logical
> cost, but I think it’s not that simple as  "cardinality * unit_copy_cost.”,
> would  you provide more details of other different ways? Just the algorithm
> description or pseudo code would help us understand. Thanks.
>
> Regarding the approach of creating new planner, I don’t think a new
> planner would lower the risk. We don’t know what we don’t know. If we
> introduced an issue while modifying the planner, most likely we would do
> the same with new planner class. A new planner doesn’t necessarily prevent
> the issue from happening, but just delay its surfacing, which is worse IMO.
>
> There’s one obvious benefit with new planner is that we can provide some
> sort of isolation so the change won’t cause test baseline updates, which
> could be painful at times.  We should see if we could use planner config to
> achieve the same if we decide to just modify the current planner.
>
>
> > On Apr 20, 2020, at 8:32 AM, Haisheng Yuan  wrote:
> >
> > Igor,
> >
> > That's great.
> >
> > On 2020/04/20 11:17:49, Seliverstov Igor  wrote:
> >> Haisheng, Xiening,
> >>
> >> Ok, Now I see how it should work.
> >>
> >> Thanks for your replies.
> >>
> >> Regards,
> >> Igor
> >>
> >>> 20 апр. 2020 г., в 09:56, Seliverstov Igor 
> написал(а):
> >>>
> >>> Haisheng, Xiening,
> >>>
> >>> Thanks for clarifying.
> >>>
> >>> In this proposal, we are not trying to split logical and physical
> planning entirely. - actually I was in doubt about an idea of entire
> splitting logical and physical phases, if you aren't going to, I have no
> objections.
> >>>
> >>> But it returns me to my first question: how we will propagate traits
> in bottom-up manner using proposed approach. (See [DISCUSS] Proposal to add
> API to force rules matching specific rels for details)
> >>>
> >>> One of inconveniences of current VolcanoPlanner implementation is
> amount of tricks that we need to get desired behaviour. It would be great
> if some of issues (or all of them) were solved in the new approach.
> >>>
> >>> Regards,
> >>> Igor
> >>>
> >>> пн, 20 апр. 2020 г., 7:02 Xiening Dai  xndai@gmail.com>>:
> >>> Hi Igor,
> >>>
> >>> Your comment - "because actual cost may be calculated correctly using
> physical operators only. So won't be able to implement Branch and Bound
> Space Pruning.“ is actually not true. In Cascade’s lower bound / upper
> bound pruning algorithm, you can get cost lower bound of input RelNode
> using cardinality * unit_copy_cost. The unit_copy_cost is a constant, which
> stands for the minimal cost for processing a tuple from the input. So this
> will be the minimal cost the input RelNode can achieve, and if this is
> indeed larger than the current best cost, this input node can be pruned.
> >>>
> >>> In this proposal, we are not trying to split logical and physical
> planning entirely. But for any given equivalent set, we would need to
> finish exploration before implementation. But for the entire memo tree,
> each set could be in different planning stage, and an equivalent set can be
> pruned even before it’s implemented. A text book example of aforementioned
> “lower bound / upper bound pruning” would be the join ordering case.
> >>>
> >>> Regarding #3, I think we can still achieve that (partially) through
> this proposal. Remember every time when we start the optimization, we 

[contribution request] calcite contribution.

2020-03-02 Thread hanu mapr
Hi All,

I would like to contribute to calcite. Can you please provide me the
required contribution permissions to assign a bug to my JIRA account?

JIRA account ID: hanu.ncr

Thanks,
-Hanu


Re: SemiJoin is not considered during join ordering.

2019-01-15 Thread hanu mapr
ship_date_sk`, `ws_order_number`]]])


I was actually looking for a plan with DrillSemiJoinRel pushed down when
the semi join is enabled.
Thank you for sharing the paper.

Regards,
-Hanu


On Mon, Jan 14, 2019 at 5:18 AM Stamatis Zampetakis 
wrote:

> @Hanu: Can you clarify a bit more where is the problem?
> From a quick loon in LoptOptimizeJoinRule it seems that all joins to be
> re-ordered are grouped using a MultiJoin operator.
> Have you checked that the SemiJoin in question is part of the MultiJoin
> operator?
>
> In general semi joins are more difficult to re-order with other joins since
> they do not have neither the commutative nor the associative property.
> There are other properties that allow the re-ordering of semi-joins but I
> am not aware if they are exploited by LoptOptimizeJoinRule. If you plan
> to work on this you may find interesting the work by Moerkotte et. al. on
> join enumeration [1].
>
> Best,
> Stamatis
>
> [1] On the Correct and Complete Enumeration of the CoreSearch Space
> <
> https://www.researchgate.net/profile/Guido_Moerkotte/publication/262216932_On_the_correct_and_complete_enumeration_of_the_core_search_space/links/58467f2c08ae61f75ddb205c/On-the-correct-and-complete-enumeration-of-the-core-search-space.pdf
> >
>
>
> Στις Δευ, 14 Ιαν 2019 στις 7:31 π.μ., ο/η Julian Hyde 
> έγραψε:
>
> > I don't see any reason in principle why LoptOptimizeJoinRule could not
> > handle semi-joins. Of course it will have to remember that these are
> > semi-joins so that it can re-create them as semi-joins afterwards. And
> > there may be complications because a semi-join eliminates duplicates
> > in its right-hand input (or rather, the number of rows is not
> > multiplied if a key has multiple values) and therefore moving a
> > semi-join up and down the join graph may not affect row-counts in the
> > expected way.
> >
> > It's certainly beneficial to push semi-joins down in many cases.
> >
> > Julian
> >
> >
> > On Fri, Jan 11, 2019 at 8:41 PM hanu mapr  wrote:
> > >
> > > Hello all,
> > >
> > > I am working on an issue wherein one of the tpcds query is regressing
> > when
> > > semijoin is also present with other inner joins in the query.
> Currently,
> > > DRILL uses LoptOptimizeJoinRule for join order optimization. During the
> > > debugging process, I observed that LoptOptimizeJoinRule is not
> > considering
> > > semi joins for join order optimization.
> > > Please do let me know if my understanding is correct. If so what is the
> > > best option for handling this scenario?.
> > >
> > > Thanks in advance.
> > >
> > > Regards,
> > > -Hanu
> >
>


SemiJoin is not considered during join ordering.

2019-01-11 Thread hanu mapr
Hello all,

I am working on an issue wherein one of the tpcds query is regressing when
semijoin is also present with other inner joins in the query. Currently,
DRILL uses LoptOptimizeJoinRule for join order optimization. During the
debugging process, I observed that LoptOptimizeJoinRule is not considering
semi joins for join order optimization.
Please do let me know if my understanding is correct. If so what is the
best option for handling this scenario?.

Thanks in advance.

Regards,
-Hanu