@Reynold I seem to be missing something. Aren't the functions listed here <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$> supposed to be treated as SQL operators as well? I do see that they are described as functions available for DataFrame <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrame.html>, but it would be great if you could clarify this.
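(For context, the distinction Reynold draws below can be seen directly in a Spark 1.5.x shell. A minimal sketch, assuming sqlContext is in scope; the "people" and "age" names are illustrative, not from the thread. The functions object backs the DataFrame API, while SQL text is resolved separately against FunctionRegistry, so the two namespaces need not match.)

    import org.apache.spark.sql.functions._  // mean, countDistinct, etc. back the DataFrame API

    // Illustrative data; any DataFrame with a numeric column works.
    val df = sqlContext.createDataFrame(Seq(("a", 21.0), ("b", 30.0))).toDF("name", "age")
    df.registerTempTable("people")

    // DataFrame API: names resolve against org.apache.spark.sql.functions at compile time.
    df.agg(mean("age"), countDistinct("age")).show()

    // SQL: names resolve at analysis time against FunctionRegistry, which in 1.5.1
    // knows "avg" and "count" but has no "mean" or "countDistinct" entries.
    sqlContext.sql("SELECT avg(age), count(distinct age) FROM people").show()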
On Wed, Oct 28, 2015 at 4:12 PM, Reynold Xin <r...@databricks.com> wrote:

> I don't think these are bugs. The SQL standard for average is "avg", not
> "mean". Similarly, a distinct count is supposed to be written as
> "count(distinct col)", not "countDistinct(col)".
>
> We can, however, make "mean" an alias for "avg" to improve compatibility
> between DataFrame and SQL.
>
> On Wed, Oct 28, 2015 at 11:38 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>
>> Also, are the other aggregate functions to be treated as bugs or not?
>>
>> On Wed, Oct 28, 2015 at 4:08 PM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>
>>> Wouldn't it be:
>>>
>>> + expression[Max]("avg"),
>>>
>>> On Wed, Oct 28, 2015 at 4:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Since there is already Average, the simplest change is the following:
>>>>
>>>> $ git diff sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>> diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>> index 3dce6c1..920f95b 100644
>>>> --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>> +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>> @@ -184,6 +184,7 @@ object FunctionRegistry {
>>>>      expression[Last]("last"),
>>>>      expression[Last]("last_value"),
>>>>      expression[Max]("max"),
>>>> +    expression[Average]("mean"),
>>>>      expression[Min]("min"),
>>>>      expression[Stddev]("stddev"),
>>>>      expression[StddevPop]("stddev_pop"),
>>>>
>>>> FYI
>>>>
>>>> On Wed, Oct 28, 2015 at 2:07 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>
>>>>> I tried adding the aggregate functions in the registry and they work,
>>>>> other than mean, for which Ted has forwarded some code changes. I will
>>>>> try out those changes and update the status here.
>>>>>
>>>>> On Wed, Oct 28, 2015 at 9:03 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>>
>>>>>> Yup, avg works fine. So we have alternate functions to use in place
>>>>>> of the functions pointed out earlier. But my point is: are those
>>>>>> original aggregate functions not supposed to be used, am I using them
>>>>>> the wrong way, or is it a bug, as I asked in my first mail?
>>>>>>
>>>>>> On Wed, Oct 28, 2015 at 3:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> Have you tried using avg in place of mean?
>>>>>>>
>>>>>>> (1 to 5).foreach { i => val df = (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i") }
>>>>>>> sqlContext.sql("""
>>>>>>> CREATE TEMPORARY TABLE partitionedParquet
>>>>>>> USING org.apache.spark.sql.parquet
>>>>>>> OPTIONS (
>>>>>>>   path '/tmp/partitioned'
>>>>>>> )""")
>>>>>>> sqlContext.sql("""select avg(a) from partitionedParquet""").show()
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> So I tried @Reynold's suggestion. I could get countDistinct and
>>>>>>>> sumDistinct running, but mean and approxCountDistinct do not work.
>>>>>>>> (I guess I am using the wrong syntax for approxCountDistinct.) For
>>>>>>>> mean, I think the registry entry is missing. Can someone clarify
>>>>>>>> that as well?
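(For reference, a sketch of the SQL-standard spellings that the thread confirms, or suggests, do resolve, reusing Ted's partitionedParquet table from above. No working SQL spelling for approxCountDistinct is confirmed in this thread, so it is omitted.)

    sqlContext.sql("SELECT avg(a) FROM partitionedParquet").show()             // instead of mean(a)
    sqlContext.sql("SELECT count(distinct a) FROM partitionedParquet").show()  // instead of countDistinct(a)
    sqlContext.sql("SELECT sum(distinct a) FROM partitionedParquet").show()    // instead of sumDistinct(a)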
>>>>>>>>
>>>>>>>> On Tue, Oct 27, 2015 at 8:02 PM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Will try in a while when I get back. I assume this applies to all
>>>>>>>>> functions other than mean. Also, countDistinct is defined along with
>>>>>>>>> all the other SQL functions, so I don't get the "distinct is not
>>>>>>>>> part of the function name" part.
>>>>>>>>>
>>>>>>>>> On 27 Oct 2015 19:58, "Reynold Xin" <r...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> Try
>>>>>>>>>>
>>>>>>>>>> count(distinct columnName)
>>>>>>>>>>
>>>>>>>>>> In SQL, distinct is not part of the function name.
>>>>>>>>>>
>>>>>>>>>> On Tuesday, October 27, 2015, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Oops, seems I made a mistake. The error message is: Exception in
>>>>>>>>>>> thread "main" org.apache.spark.sql.AnalysisException: undefined
>>>>>>>>>>> function countDistinct
>>>>>>>>>>>
>>>>>>>>>>> On 27 Oct 2015 15:49, "Shagun Sodhani" <sshagunsodh...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi! I was trying out some aggregate functions in SparkSql and I
>>>>>>>>>>>> noticed that certain aggregate operators are not working. These
>>>>>>>>>>>> include:
>>>>>>>>>>>>
>>>>>>>>>>>> approxCountDistinct
>>>>>>>>>>>> countDistinct
>>>>>>>>>>>> mean
>>>>>>>>>>>> sumDistinct
>>>>>>>>>>>>
>>>>>>>>>>>> For example, using countDistinct results in an error saying
>>>>>>>>>>>> *Exception in thread "main"
>>>>>>>>>>>> org.apache.spark.sql.AnalysisException: undefined function cosh;*
>>>>>>>>>>>>
>>>>>>>>>>>> I had a similar issue with the cosh operator
>>>>>>>>>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Exception-when-using-cosh-td14724.html>
>>>>>>>>>>>> some time back, and it turned out that it was not registered in
>>>>>>>>>>>> the registry:
>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>>>>>>>>>>
>>>>>>>>>>>> *I think it is the same issue again and would be glad to send
>>>>>>>>>>>> over a PR if someone can confirm that this is an actual bug and
>>>>>>>>>>>> not some mistake on my part.*
>>>>>>>>>>>>
>>>>>>>>>>>> Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
>>>>>>>>>>>> Spark Version: 10.4
>>>>>>>>>>>> SparkSql Version: 1.5.1
>>>>>>>>>>>>
>>>>>>>>>>>> I am using the standard example of a (name, age) schema (though
>>>>>>>>>>>> I am setting age as Double and not Int, as I am trying out maths
>>>>>>>>>>>> functions).
>>>>>>>>>>>>
>>>>>>>>>>>> The entire error stack can be found here
>>>>>>>>>>>> <http://pastebin.com/G6YzQXnn>.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
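(Putting the thread together: a sketch of the original repro and the workarounds discussed, assuming a Spark 1.5.1 shell with sqlContext in scope; the rows are made up to match the (name, age-as-Double) schema described above.)

    import org.apache.spark.sql.functions.countDistinct

    val people = sqlContext.createDataFrame(
      Seq(("alice", 21.0), ("bob", 30.0), ("carol", 21.0))).toDF("name", "age")
    people.registerTempTable("table")

    // Reported failure on 1.5.1:
    // sqlContext.sql("SELECT countDistinct(`age`) as `data` FROM `table`")
    // => org.apache.spark.sql.AnalysisException: undefined function countDistinct

    // Workaround 1: the SQL-standard spelling (Reynold's suggestion).
    sqlContext.sql("SELECT count(distinct `age`) as `data` FROM `table`").show()

    // Workaround 2: the DataFrame-API function of the same name.
    people.agg(countDistinct("age").as("data")).show()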