Re: Drill query planning error

weijie tong Wed, 26 Jul 2017 17:33:15 -0700

another tips ,the coun distinct query examples mentioned above are all
transferred by my rule,that is from Aggregate,Aggregate,Project,Scan to
Project,Scan.    Jira will be logged soon later as I am on road.


On Thu, 27 Jul 2017 at 8:17 AM Julian Hyde <jh...@apache.org> wrote:

> I think the “multiple distinct count” code path also gets triggered if
> there’s a mixture of distinct and non-distinct aggregates.
>
> > On Jul 26, 2017, at 5:16 PM, Aman Sinha <amansi...@apache.org> wrote:
> >
> > It sounds like you have narrowed it down further and it could be an issue
> > with the scalar check when 2 or more cartesian joins are present.
> > Please file a JIRA with relevant details.
> >
> > On Wed, Jul 26, 2017 at 3:13 PM, weijie tong <tongweijie...@gmail.com
> <mailto:tongweijie...@gmail.com>>
> > wrote:
> >
> >> Thanks for pointing out the possible reasons @Aman @Julian . I am not
> sure
> >> that's the real problem .
> >>
> >> As I initially mentioned,sorry for not clearly described ,queries like
> >> "select count(*),sum(a),count(distinct b) from t where dt=xxx " will be
> >> right.but "select count(*),sum(a),count (distinct b),count(distinct c)
> from
> >> t where dt=xxx"will be wrong.  The only one count(distinct) scenario
> does
> >> do the Cartesian join check too.It will and a broadcast operator on my
> >> transferred relnode. So I think I have not loose the scalar property.
> But
> >> when I debug the two count(distinct ) , I find the NestedLoopJoinPrule
> >> check for satisfying the Cartesian conditions happened only once.It
> sounds
> >> should happen twice as there are two nested loop join operators at the
> two
> >> count(distinct) scenarios .If it turns out a bug ,I will log a Jira .
> >>
> >> Best Regards
> >>
> >>
> >> On Thu, 27 Jul 2017 at 3:10 AM Aman Sinha <amansi...@apache.org
> <mailto:amansi...@apache.org>> wrote:
> >>
> >>> Yes, the RelMdMaxRowCount statistic would be useful for this.  Thanks
> for
> >>> pointing that out.   I'll see if we can leverage that.  The grouping
> sets
> >>> is not yet supported in Drill, but that would be a better solution
> since
> >> it
> >>> avoids extra scans.
> >>>
> >>> -Aman
> >>>
> >>> On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <jh...@apache.org
> <mailto:jh...@apache.org>> wrote:
> >>>
> >>>> Aman,
> >>>>
> >>>> Thanks for moving dev@calcite to Bcc. This is properly a Drill
> >> question.
> >>>>
> >>>> A blanket restriction on cartesian joins is a blunt instrument.
> >> Sometimes
> >>>> cartesian joins are valid, safe, and the best plan for a query. This
> >> is a
> >>>> case in point. Users shouldn’t have to change config parameters to get
> >> it
> >>>> to work.
> >>>>
> >>>> (Actually I don’t know the query, but
> >>>>
> >>>>  select count(distinct deptno), count(distinct gender) from emp
> >>>>
> >>>> is equivalent.)
> >>>>
> >>>> Drill should detect that a relational expression can return at most
> one
> >>>> row, and allow a cartesian join if one side is such. Calcite has a
> >>>> RelMdMaxRowCount statistic for this. This was added as part of
> >>>> http://issues.apache.org/jira/browse/CALCITE-604 <
> >>>> http://issues.apache.org/jira/browse/CALCITE-604 <
> http://issues.apache.org/jira/browse/CALCITE-604>>. This rule is 100%
> >>>> safe. No config parameters required.
> >>>>
> >>>> Also, Calcite has an alternative way of handling multiple distinct
> >>>> aggregates that rewrites to use grouping sets. It doesn’t generate
> >>>> self-joins, cartesian or otherwise.  http://issues.apache.org/jira/ <
> http://issues.apache.org/jira/>
> >>>> browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732
> <http://issues.apache.org/jira/browse/CALCITE-732>>.
> >>>>
> >>>> Julian
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>
> >>>>> On Jul 26, 2017, at 9:20 AM, Aman Sinha <amansi...@apache.org>
> >> wrote:
> >>>>>
> >>>>> [Since this is Drill specific, I put dev@calcite on BCC].
> >>>>>
> >>>>> If you have two aggregates: Count(distinct a), Count(distinct b), the
> >>>>> Calcite logical plan consists of a cartesian join of 2 subqueries
> >> each
> >>> of
> >>>>> which first does a group-by on the distinct column followed by a
> >> count
> >>>>> aggregate.   By default,  Drill only processes cartesian join if one
> >>>> input
> >>>>> of the join is known to be scalar (single row).  It sounds like after
> >>> you
> >>>>> did the transformation to use the cache, that scalar property somehow
> >>> did
> >>>>> not get propagated.
> >>>>> You can override this behavior by a session configuration:  (this
> >> will
> >>>> use
> >>>>> a nested loop join even if inputs are not provably scalar, but it
> >>> should
> >>>> be
> >>>>> used for specific query only).    For a more general solution, I
> >>> believe
> >>>>> you may have to create an enhancement JIRA with appropriate details.
> >>>>>  'alter session set planner.enable_nljoin_for_scalar_only = false';
> >>>>>
> >>>>> On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <
> >> tongweijie...@gmail.com>
> >>>>> wrote:
> >>>>>
> >>>>>> HI all:
> >>>>>>
> >>>>>> I materialize the count distinct query result to a cache, then when
> >>>> user
> >>>>>> query the count distinct , a specific rule will translate the query
> >> to
> >>>> the
> >>>>>> cache. It turns out right when the query has only one count
> >> (distinct
> >>> )
> >>>>>> operator ,but when it has two count (distinct ) ,it causes error
> >> .The
> >>>> error
> >>>>>> info is here:
> >>>>>> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269
> >>>>>>
> >>>>>>
> >>>>>> Best Regards.
>
>

Re: Drill query planning error

Reply via email to