another tips ,the coun distinct query examples mentioned above are all transferred by my rule,that is from Aggregate,Aggregate,Project,Scan to Project,Scan. Jira will be logged soon later as I am on road.
On Thu, 27 Jul 2017 at 8:17 AM Julian Hyde <jh...@apache.org> wrote: > I think the “multiple distinct count” code path also gets triggered if > there’s a mixture of distinct and non-distinct aggregates. > > > On Jul 26, 2017, at 5:16 PM, Aman Sinha <amansi...@apache.org> wrote: > > > > It sounds like you have narrowed it down further and it could be an issue > > with the scalar check when 2 or more cartesian joins are present. > > Please file a JIRA with relevant details. > > > > On Wed, Jul 26, 2017 at 3:13 PM, weijie tong <tongweijie...@gmail.com > <mailto:tongweijie...@gmail.com>> > > wrote: > > > >> Thanks for pointing out the possible reasons @Aman @Julian . I am not > sure > >> that's the real problem . > >> > >> As I initially mentioned,sorry for not clearly described ,queries like > >> "select count(*),sum(a),count(distinct b) from t where dt=xxx " will be > >> right.but "select count(*),sum(a),count (distinct b),count(distinct c) > from > >> t where dt=xxx"will be wrong. The only one count(distinct) scenario > does > >> do the Cartesian join check too.It will and a broadcast operator on my > >> transferred relnode. So I think I have not loose the scalar property. > But > >> when I debug the two count(distinct ) , I find the NestedLoopJoinPrule > >> check for satisfying the Cartesian conditions happened only once.It > sounds > >> should happen twice as there are two nested loop join operators at the > two > >> count(distinct) scenarios .If it turns out a bug ,I will log a Jira . > >> > >> Best Regards > >> > >> > >> On Thu, 27 Jul 2017 at 3:10 AM Aman Sinha <amansi...@apache.org > <mailto:amansi...@apache.org>> wrote: > >> > >>> Yes, the RelMdMaxRowCount statistic would be useful for this. Thanks > for > >>> pointing that out. I'll see if we can leverage that. The grouping > sets > >>> is not yet supported in Drill, but that would be a better solution > since > >> it > >>> avoids extra scans. > >>> > >>> -Aman > >>> > >>> On Wed, Jul 26, 2017 at 11:25 AM, Julian Hyde <jh...@apache.org > <mailto:jh...@apache.org>> wrote: > >>> > >>>> Aman, > >>>> > >>>> Thanks for moving dev@calcite to Bcc. This is properly a Drill > >> question. > >>>> > >>>> A blanket restriction on cartesian joins is a blunt instrument. > >> Sometimes > >>>> cartesian joins are valid, safe, and the best plan for a query. This > >> is a > >>>> case in point. Users shouldn’t have to change config parameters to get > >> it > >>>> to work. > >>>> > >>>> (Actually I don’t know the query, but > >>>> > >>>> select count(distinct deptno), count(distinct gender) from emp > >>>> > >>>> is equivalent.) > >>>> > >>>> Drill should detect that a relational expression can return at most > one > >>>> row, and allow a cartesian join if one side is such. Calcite has a > >>>> RelMdMaxRowCount statistic for this. This was added as part of > >>>> http://issues.apache.org/jira/browse/CALCITE-604 < > >>>> http://issues.apache.org/jira/browse/CALCITE-604 < > http://issues.apache.org/jira/browse/CALCITE-604>>. This rule is 100% > >>>> safe. No config parameters required. > >>>> > >>>> Also, Calcite has an alternative way of handling multiple distinct > >>>> aggregates that rewrites to use grouping sets. It doesn’t generate > >>>> self-joins, cartesian or otherwise. http://issues.apache.org/jira/ < > http://issues.apache.org/jira/> > >>>> browse/CALCITE-732 <http://issues.apache.org/jira/browse/CALCITE-732 > <http://issues.apache.org/jira/browse/CALCITE-732>>. > >>>> > >>>> Julian > >>>> > >>>> > >>>> > >>>> > >>>> > >>>> > >>>>> On Jul 26, 2017, at 9:20 AM, Aman Sinha <amansi...@apache.org> > >> wrote: > >>>>> > >>>>> [Since this is Drill specific, I put dev@calcite on BCC]. > >>>>> > >>>>> If you have two aggregates: Count(distinct a), Count(distinct b), the > >>>>> Calcite logical plan consists of a cartesian join of 2 subqueries > >> each > >>> of > >>>>> which first does a group-by on the distinct column followed by a > >> count > >>>>> aggregate. By default, Drill only processes cartesian join if one > >>>> input > >>>>> of the join is known to be scalar (single row). It sounds like after > >>> you > >>>>> did the transformation to use the cache, that scalar property somehow > >>> did > >>>>> not get propagated. > >>>>> You can override this behavior by a session configuration: (this > >> will > >>>> use > >>>>> a nested loop join even if inputs are not provably scalar, but it > >>> should > >>>> be > >>>>> used for specific query only). For a more general solution, I > >>> believe > >>>>> you may have to create an enhancement JIRA with appropriate details. > >>>>> 'alter session set planner.enable_nljoin_for_scalar_only = false'; > >>>>> > >>>>> On Wed, Jul 26, 2017 at 4:14 AM, weijie tong < > >> tongweijie...@gmail.com> > >>>>> wrote: > >>>>> > >>>>>> HI all: > >>>>>> > >>>>>> I materialize the count distinct query result to a cache, then when > >>>> user > >>>>>> query the count distinct , a specific rule will translate the query > >> to > >>>> the > >>>>>> cache. It turns out right when the query has only one count > >> (distinct > >>> ) > >>>>>> operator ,but when it has two count (distinct ) ,it causes error > >> .The > >>>> error > >>>>>> info is here: > >>>>>> https://gist.github.com/weijietong/1b8ed12db9490bf006e8b3fe0ee52269 > >>>>>> > >>>>>> > >>>>>> Best Regards. > >