To provide more context, if we do remove this feature, the following SQL query would throw an AnalysisException:

select count(distinct colA), count(distinct colB) from foo;

The following should still work:

select count(distinct colA) from foo;

The following should also work:

select count(distinct colA, colB) from foo;

On Tue, Oct 6, 2015 at 5:51 PM, Reynold Xin <r...@databricks.com> wrote:

> The current implementation of multiple count distinct in a single query is
> very inferior in terms of performance and robustness, and it is also hard
> to guarantee correctness of the implementation in some of the refactorings
> for Tungsten. Supporting a better version of it is possible in the future,
> but will take a lot of engineering efforts. Most other Hadoop-based SQL
> systems (e.g. Hive, Impala) don't support this feature.
>
> As a result, we are considering removing support for multiple count
> distinct in a single query in the next Spark release (1.6). If you use this
> feature, please reply to this email. Thanks.
>
> Note that if you don't care about null values, it is relatively easy to
> reconstruct a query using joins to support multiple distincts.
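
For anyone wondering what the join-based rewrite mentioned in the quoted note might look like, here is a minimal sketch for the ungrouped query above. It is only an illustration, not necessarily the exact rewrite Reynold has in mind. It uses Python's built-in sqlite3 module as a stand-in engine so the two queries can be compared on a tiny table; the sample rows and the aliases `a`, `b`, `cnt_a` and `cnt_b` are made up for the example, and nothing here exercises Spark itself.

```python
import sqlite3

# In-memory SQLite database used purely as a stand-in engine (assumption:
# the rewrite is plain SQL; this does not test Spark's behaviour).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE foo (colA INTEGER, colB TEXT)")

# Hypothetical sample rows, including one NULL in colA.
conn.executemany(
    "INSERT INTO foo VALUES (?, ?)",
    [(1, "x"), (1, "y"), (2, "y"), (3, "y"), (None, "z")],
)

# The query shape that would be rejected if the feature is removed.
original = "SELECT count(DISTINCT colA), count(DISTINCT colB) FROM foo"

# Rewrite: one single-distinct aggregate per subquery, then combine the two
# one-row results with a cross join.
rewritten = """
SELECT a.cnt_a, b.cnt_b
FROM (SELECT count(DISTINCT colA) AS cnt_a FROM foo) a
CROSS JOIN (SELECT count(DISTINCT colB) AS cnt_b FROM foo) b
"""

original_result = conn.execute(original).fetchone()
rewritten_result = conn.execute(rewritten).fetchone()
print(original_result, rewritten_result)
```

With a GROUP BY, each subquery would presumably group on the same columns and the join would be on those grouping columns instead of a cross join. Rows whose grouping key is NULL would then not match under a plain equality join, which is my guess at the null caveat in the note.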