here is the jira issue link: https://issues.apache.org/jira/browse/DRILL-5691
On Thu, Jul 27, 2017 at 8:32 AM, weijie tong <tongweijie...@gmail.com> wrote: > another tips ,the coun distinct query examples mentioned above are all > transferred by my rule,that is from Aggregate,Aggregate,Project,Scan to > Project,Scan. Jira will be logged soon later as I am on road. > > On Thu, 27 Jul 2017 at 8:17 AM Julian Hyde <jh...@apache.org> wrote: > >> I think the “multiple distinct count” code path also gets triggered if >> there’s a mixture of distinct and non-distinct aggregates. >> >> > On Jul 26, 2017, at 5:16 PM, Aman Sinha <amansi...@apache.org> wrote: >> > >> > It sounds like you have narrowed it down further and it could be an >> issue >> > with the scalar check when 2 or more cartesian joins are present. >> > Please file a JIRA with relevant details. >> > >> > On Wed, Jul 26, 2017 at 3:13 PM, weijie tong <tongweijie...@gmail.com >> <mailto:tongweijie...@gmail.com>> >> > wrote: >> > >> >> Thanks for pointing out the possible reasons @Aman @Julian . I am not >> sure >> >> that's the real problem . >> >> >> >> As I initially mentioned,sorry for not clearly described ,queries like >> >> "select count(*),sum(a),count(distinct b) from t where dt=xxx " will be >> >> right.but "select count(*),sum(a),count (distinct b),count(distinct c) >> from >> >> t where dt=xxx"will be wrong. The only one count(distinct) scenario >> does >> >> do the Cartesian join check too.It will and a broadcast operator on my >> >> transferred relnode. So I think I have not loose the scalar property. >> But >> >> when I debug the two count(distinct ) , I find the NestedLoopJoinPrule >> >> check for satisfying the Cartesian conditions happened only once.It >> sounds >> >> should happen twice as there are two nested loop join operators at the >> two >> >> count(distinct) scenarios .If it turns out a bug ,I will log a Jira . >> >> >> >> Best Regards >> >> >> >> >> >> On Thu, 27 Jul 2017 at 3:10 AM Aman Sinha <amansi...@apache.org >> <mailto:amansi...@apache.org>> wrote: >> >> >> >>> Yes, the RelMdMaxRowCount statistic would be useful for this. Thanks >> for >> >>> pointing that out. I'll see if we can leverage that. The grouping >> sets >> >>> is not yet supported in Drill, but that would be a better solution >> since >> >> it >> >>> avoids extra scans. >> >>> >> >>> -Aman >> >>> >> >>> On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <jh...@apache.org >> <mailto:jh...@apache.org>> wrote: >> >>> >> >>>> Aman, >> >>>> >> >>>> Thanks for moving dev@calcite to Bcc. This is properly a Drill >> >> question. >> >>>> >> >>>> A blanket restriction on cartesian joins is a blunt instrument. >> >> Sometimes >> >>>> cartesian joins are valid, safe, and the best plan for a query. This >> >> is a >> >>>> case in point. Users shouldn’t have to change config parameters to >> get >> >> it >> >>>> to work. >> >>>> >> >>>> (Actually I don’t know the query, but >> >>>> >> >>>> select count(distinct deptno), count(distinct gender) from emp >> >>>> >> >>>> is equivalent.) >> >>>> >> >>>> Drill should detect that a relational expression can return at most >> one >> >>>> row, and allow a cartesian join if one side is such. Calcite has a >> >>>> RelMdMaxRowCount statistic for this. This was added as part of >> >>>> http://issues.apache.org/jira/browse/CALCITE-604 < >> >>>> http://issues.apache.org/jira/browse/CALCITE-604 < >> http://issues.apache.org/jira/browse/CALCITE-604>>. This rule is 100% >> >>>> safe. No config parameters required. >> >>>> >> >>>> Also, Calcite has an alternative way of handling multiple distinct >> >>>> aggregates that rewrites to use grouping sets. It doesn’t generate >> >>>> self-joins, cartesian or otherwise. http://issues.apache.org/jira/ >> <http://issues.apache.org/jira/> >> >>>> browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732 >> <http://issues.apache.org/jira/browse/CALCITE-732>>. >> >>>> >> >>>> Julian >> >>>> >> >>>> >> >>>> >> >>>> >> >>>> >> >>>> >> >>>>> On Jul 26, 2017, at 9:20 AM, Aman Sinha <amansi...@apache.org> >> >> wrote: >> >>>>> >> >>>>> [Since this is Drill specific, I put dev@calcite on BCC]. >> >>>>> >> >>>>> If you have two aggregates: Count(distinct a), Count(distinct b), >> the >> >>>>> Calcite logical plan consists of a cartesian join of 2 subqueries >> >> each >> >>> of >> >>>>> which first does a group-by on the distinct column followed by a >> >> count >> >>>>> aggregate. By default, Drill only processes cartesian join if one >> >>>> input >> >>>>> of the join is known to be scalar (single row). It sounds like >> after >> >>> you >> >>>>> did the transformation to use the cache, that scalar property >> somehow >> >>> did >> >>>>> not get propagated. >> >>>>> You can override this behavior by a session configuration: (this >> >> will >> >>>> use >> >>>>> a nested loop join even if inputs are not provably scalar, but it >> >>> should >> >>>> be >> >>>>> used for specific query only). For a more general solution, I >> >>> believe >> >>>>> you may have to create an enhancement JIRA with appropriate details. >> >>>>> 'alter session set planner.enable_nljoin_for_scalar_only = false'; >> >>>>> >> >>>>> On Wed, Jul 26, 2017 at 4:14 AM, weijie tong < >> >> tongweijie...@gmail.com> >> >>>>> wrote: >> >>>>> >> >>>>>> HI all: >> >>>>>> >> >>>>>> I materialize the count distinct query result to a cache, then when >> >>>> user >> >>>>>> query the count distinct , a specific rule will translate the query >> >> to >> >>>> the >> >>>>>> cache. It turns out right when the query has only one count >> >> (distinct >> >>> ) >> >>>>>> operator ,but when it has two count (distinct ) ,it causes error >> >> .The >> >>>> error >> >>>>>> info is here: >> >>>>>> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee522 >> 69 >> >>>>>> >> >>>>>> >> >>>>>> Best Regards. >> >>