Adding user list too.
---------- Forwarded message ----------
From: Reynold Xin <r...@databricks.com>
Date: Tue, Oct 6, 2015 at 5:54 PM
Subject: Re: multiple count distinct in SQL/DataFrame?
To: "dev@spark.apache.org" <dev@spark.apache.org>

To provide more context, if we do remove this feature, the following SQL query would throw an AnalysisException:

select count(distinct colA), count(distinct colB) from foo;

The following should still work:

select count(distinct colA) from foo;

The following should also work:

select count(distinct colA, colB) from foo;

On Tue, Oct 6, 2015 at 5:51 PM, Reynold Xin <r...@databricks.com> wrote:

> The current implementation of multiple count distinct in a single query is
> very inferior in terms of performance and robustness, and it is also hard
> to guarantee correctness of the implementation in some of the refactorings
> for Tungsten. Supporting a better version of it is possible in the future,
> but will take a lot of engineering efforts. Most other Hadoop-based SQL
> systems (e.g. Hive, Impala) don't support this feature.
>
> As a result, we are considering removing support for multiple count
> distinct in a single query in the next Spark release (1.6). If you use this
> feature, please reply to this email. Thanks.
>
> Note that if you don't care about null values, it is relatively easy to
> reconstruct a query using joins to support multiple distincts.
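
For anyone wondering what the join-based rewrite mentioned in the note might look like, here is a minimal sketch. It is not taken from the Spark code base: it runs the SQL through Python's built-in sqlite3 module purely so the example is self-contained, and the sample rows in foo are made up for illustration. Each distinct count is computed in its own single-row subquery, and the subqueries are then cross joined.

```python
import sqlite3

# In-memory database standing in for the table "foo" from the email.
# The rows below are hypothetical sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (colA INTEGER, colB TEXT)")
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(1, "x"), (1, "y"), (2, "y"), (3, "y"), (3, "x")],
)

# The form that would be rejected if the feature were removed:
# two distinct aggregates in one SELECT.
original = conn.execute(
    "SELECT count(DISTINCT colA), count(DISTINCT colB) FROM foo"
).fetchone()

# Join-based rewrite: one distinct aggregate per subquery.
# Each subquery returns exactly one row, so the cross join
# also returns exactly one row.
rewritten = conn.execute(
    """
    SELECT a.cnt_a, b.cnt_b
    FROM (SELECT count(DISTINCT colA) AS cnt_a FROM foo) a
    CROSS JOIN (SELECT count(DISTINCT colB) AS cnt_b FROM foo) b
    """
).fetchone()

print(original, rewritten)
```

With a GROUP BY, the subqueries would instead be joined on the grouping columns. An equality join does not match rows whose key is NULL, which is presumably the null caveat in the note above.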