Re: Drill query planning error

weijie tong Wed, 26 Jul 2017 19:32:27 -0700

here is the jira issue link:
https://issues.apache.org/jira/browse/DRILL-5691


On Thu, Jul 27, 2017 at 8:32 AM, weijie tong <[email protected]>
wrote:

> another tips ,the coun distinct query examples mentioned above are all
> transferred by my rule,that is from Aggregate,Aggregate,Project,Scan to
> Project,Scan.    Jira will be logged soon later as I am on road.
>
> On Thu, 27 Jul 2017 at 8:17 AM Julian Hyde <[email protected]> wrote:
>
>> I think the “multiple distinct count” code path also gets triggered if
>> there’s a mixture of distinct and non-distinct aggregates.
>>
>> > On Jul 26, 2017, at 5:16 PM, Aman Sinha <[email protected]> wrote:
>> >
>> > It sounds like you have narrowed it down further and it could be an
>> issue
>> > with the scalar check when 2 or more cartesian joins are present.
>> > Please file a JIRA with relevant details.
>> >
>> > On Wed, Jul 26, 2017 at 3:13 PM, weijie tong <[email protected]
>> <mailto:[email protected]>>
>> > wrote:
>> >
>> >> Thanks for pointing out the possible reasons @Aman @Julian . I am not
>> sure
>> >> that's the real problem .
>> >>
>> >> As I initially mentioned,sorry for not clearly described ,queries like
>> >> "select count(*),sum(a),count(distinct b) from t where dt=xxx " will be
>> >> right.but "select count(*),sum(a),count (distinct b),count(distinct c)
>> from
>> >> t where dt=xxx"will be wrong.  The only one count(distinct) scenario
>> does
>> >> do the Cartesian join check too.It will and a broadcast operator on my
>> >> transferred relnode. So I think I have not loose the scalar property.
>> But
>> >> when I debug the two count(distinct ) , I find the NestedLoopJoinPrule
>> >> check for satisfying the Cartesian conditions happened only once.It
>> sounds
>> >> should happen twice as there are two nested loop join operators at the
>> two
>> >> count(distinct) scenarios .If it turns out a bug ,I will log a Jira .
>> >>
>> >> Best Regards
>> >>
>> >>
>> >> On Thu, 27 Jul 2017 at 3:10 AM Aman Sinha <[email protected]
>> <mailto:[email protected]>> wrote:
>> >>
>> >>> Yes, the RelMdMaxRowCount statistic would be useful for this.  Thanks
>> for
>> >>> pointing that out.   I'll see if we can leverage that.  The grouping
>> sets
>> >>> is not yet supported in Drill, but that would be a better solution
>> since
>> >> it
>> >>> avoids extra scans.
>> >>>
>> >>> -Aman
>> >>>
>> >>> On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <[email protected]
>> <mailto:[email protected]>> wrote:
>> >>>
>> >>>> Aman,
>> >>>>
>> >>>> Thanks for moving dev@calcite to Bcc. This is properly a Drill
>> >> question.
>> >>>>
>> >>>> A blanket restriction on cartesian joins is a blunt instrument.
>> >> Sometimes
>> >>>> cartesian joins are valid, safe, and the best plan for a query. This
>> >> is a
>> >>>> case in point. Users shouldn’t have to change config parameters to
>> get
>> >> it
>> >>>> to work.
>> >>>>
>> >>>> (Actually I don’t know the query, but
>> >>>>
>> >>>>  select count(distinct deptno), count(distinct gender) from emp
>> >>>>
>> >>>> is equivalent.)
>> >>>>
>> >>>> Drill should detect that a relational expression can return at most
>> one
>> >>>> row, and allow a cartesian join if one side is such. Calcite has a
>> >>>> RelMdMaxRowCount statistic for this. This was added as part of
>> >>>> http://issues.apache.org/jira/browse/CALCITE-604 <
>> >>>> http://issues.apache.org/jira/browse/CALCITE-604 <
>> http://issues.apache.org/jira/browse/CALCITE-604>>. This rule is 100%
>> >>>> safe. No config parameters required.
>> >>>>
>> >>>> Also, Calcite has an alternative way of handling multiple distinct
>> >>>> aggregates that rewrites to use grouping sets. It doesn’t generate
>> >>>> self-joins, cartesian or otherwise.  http://issues.apache.org/jira/
>> <http://issues.apache.org/jira/>
>> >>>> browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732
>> <http://issues.apache.org/jira/browse/CALCITE-732>>.
>> >>>>
>> >>>> Julian
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>
>> >>>>> On Jul 26, 2017, at 9:20 AM, Aman Sinha <[email protected]>
>> >> wrote:
>> >>>>>
>> >>>>> [Since this is Drill specific, I put dev@calcite on BCC].
>> >>>>>
>> >>>>> If you have two aggregates: Count(distinct a), Count(distinct b),
>> the
>> >>>>> Calcite logical plan consists of a cartesian join of 2 subqueries
>> >> each
>> >>> of
>> >>>>> which first does a group-by on the distinct column followed by a
>> >> count
>> >>>>> aggregate.   By default,  Drill only processes cartesian join if one
>> >>>> input
>> >>>>> of the join is known to be scalar (single row).  It sounds like
>> after
>> >>> you
>> >>>>> did the transformation to use the cache, that scalar property
>> somehow
>> >>> did
>> >>>>> not get propagated.
>> >>>>> You can override this behavior by a session configuration:  (this
>> >> will
>> >>>> use
>> >>>>> a nested loop join even if inputs are not provably scalar, but it
>> >>> should
>> >>>> be
>> >>>>> used for specific query only).    For a more general solution, I
>> >>> believe
>> >>>>> you may have to create an enhancement JIRA with appropriate details.
>> >>>>>  'alter session set planner.enable_nljoin_for_scalar_only = false';
>> >>>>>
>> >>>>> On Wed, Jul 26, 2017 at 4:14 AM, weijie tong <
>> >> [email protected]>
>> >>>>> wrote:
>> >>>>>
>> >>>>>> HI all:
>> >>>>>>
>> >>>>>> I materialize the count distinct query result to a cache, then when
>> >>>> user
>> >>>>>> query the count distinct , a specific rule will translate the query
>> >> to
>> >>>> the
>> >>>>>> cache. It turns out right when the query has only one count
>> >> (distinct
>> >>> )
>> >>>>>> operator ,but when it has two count (distinct ) ,it causes error
>> >> .The
>> >>>> error
>> >>>>>> info is here:
>> >>>>>> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee522
>> 69
>> >>>>>>
>> >>>>>>
>> >>>>>> Best Regards.
>>
>>

Re: Drill query planning error

Reply via email to