I don't think these are bugs. The SQL standard for average is "avg", not
"mean". Similarly, a distinct count is supposed to be written as
"count(distinct col)", not "countDistinct(col)".

We can, however, make "mean" an alias for "avg" to improve compatibility
between DataFrame and SQL.
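To illustrate the alias idea: a minimal sketch in plain Scala (no Spark dependency — `AliasSketch` and its `builders` map are illustrative stand-ins, not Spark's actual FunctionRegistry API) showing how two names can resolve to the same implementation, the way "mean" could map to the same `Average` expression as "avg":

```scala
// Hypothetical stand-in for a function registry: both "avg" and "mean"
// point at the same builder, so a query can use either name.
object AliasSketch {
  val builders: Map[String, Seq[Double] => Double] = Map(
    "avg"  -> ((xs: Seq[Double]) => xs.sum / xs.size),
    "mean" -> ((xs: Seq[Double]) => xs.sum / xs.size)  // alias of "avg"
  )

  // Look up the function by (case-insensitive) name and apply it;
  // an unknown name fails the same way an unregistered SQL function would.
  def apply(name: String, args: Seq[Double]): Double =
    builders.getOrElse(name.toLowerCase,
      sys.error(s"undefined function $name"))(args)
}
```

In the real registry the analogous change is a one-line entry mapping the extra name to the existing expression, as in Ted's diff below.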


On Wed, Oct 28, 2015 at 11:38 AM, Shagun Sodhani <sshagunsodh...@gmail.com>
wrote:

> Also are the other aggregate functions to be treated as bugs or not?
>
> On Wed, Oct 28, 2015 at 4:08 PM, Shagun Sodhani <sshagunsodh...@gmail.com>
> wrote:
>
>> Wouldn't it be:
>>
>> +    expression[Max]("avg"),
>>
>> On Wed, Oct 28, 2015 at 4:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>
>>> Since there is already Average, the simplest change is the following:
>>>
>>> $ git diff
>>> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>> diff --git
>>> a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>> b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Functi
>>> index 3dce6c1..920f95b 100644
>>> ---
>>> a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>> +++
>>> b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>> @@ -184,6 +184,7 @@ object FunctionRegistry {
>>>      expression[Last]("last"),
>>>      expression[Last]("last_value"),
>>>      expression[Max]("max"),
>>> +    expression[Average]("mean"),
>>>      expression[Min]("min"),
>>>      expression[Stddev]("stddev"),
>>>      expression[StddevPop]("stddev_pop"),
>>>
>>> FYI
>>>
>>> On Wed, Oct 28, 2015 at 2:07 AM, Shagun Sodhani <
>>> sshagunsodh...@gmail.com> wrote:
>>>
>>>> I tried adding the aggregate functions in the registry and they work,
>>>> other than mean, for which Ted has forwarded some code changes. I will try
>>>> out those changes and update the status here.
>>>>
>>>> On Wed, Oct 28, 2015 at 9:03 AM, Shagun Sodhani <
>>>> sshagunsodh...@gmail.com> wrote:
>>>>
>>>>> Yup, avg works fine. So we have alternate functions to use in place of
>>>>> the functions pointed out earlier. But my point is: are those original
>>>>> aggregate functions not supposed to be used, am I using them the wrong
>>>>> way, or is it a bug, as I asked in my first mail?
>>>>>
>>>>> On Wed, Oct 28, 2015 at 3:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>
>>>>>> Have you tried using avg in place of mean ?
>>>>>>
>>>>>> (1 to 5).foreach { i => val df = (1 to 1000).map(j => (j,
>>>>>> s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i") }
>>>>>>     sqlContext.sql("""
>>>>>>     CREATE TEMPORARY TABLE partitionedParquet
>>>>>>     USING org.apache.spark.sql.parquet
>>>>>>     OPTIONS (
>>>>>>       path '/tmp/partitioned'
>>>>>>     )""")
>>>>>> sqlContext.sql("""select avg(a) from partitionedParquet""").show()
>>>>>>
>>>>>> Cheers
>>>>>>
>>>>>> On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sodhani <
>>>>>> sshagunsodh...@gmail.com> wrote:
>>>>>>
>>>>>>> So I tried @Reynold's suggestion. I could get countDistinct and
>>>>>>> sumDistinct running, but mean and approxCountDistinct do not work
>>>>>>> (I guess I am using the wrong syntax for approxCountDistinct). For
>>>>>>> mean, I think the registry entry is missing. Can someone clarify that
>>>>>>> as well?
>>>>>>>
>>>>>>> On Tue, Oct 27, 2015 at 8:02 PM, Shagun Sodhani <
>>>>>>> sshagunsodh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Will try in a while when I get back. I assume this applies to all
>>>>>>>> functions other than mean. Also, countDistinct is defined along with
>>>>>>>> all the other SQL functions, so I don't get the "distinct is not part
>>>>>>>> of the function name" part.
>>>>>>>> On 27 Oct 2015 19:58, "Reynold Xin" <r...@databricks.com> wrote:
>>>>>>>>
>>>>>>>>> Try
>>>>>>>>>
>>>>>>>>> count(distinct columnName)
>>>>>>>>>
>>>>>>>>> In SQL distinct is not part of the function name.
>>>>>>>>>
>>>>>>>>> On Tuesday, October 27, 2015, Shagun Sodhani <
>>>>>>>>> sshagunsodh...@gmail.com> wrote:
>>>>>>>>>
>>>>>>>>>> Oops seems I made a mistake. The error message is : Exception in
>>>>>>>>>> thread "main" org.apache.spark.sql.AnalysisException: undefined 
>>>>>>>>>> function
>>>>>>>>>> countDistinct
>>>>>>>>>> On 27 Oct 2015 15:49, "Shagun Sodhani" <sshagunsodh...@gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Hi! I was trying out some aggregate functions in SparkSql and I
>>>>>>>>>>> noticed that certain aggregate operators are not working. These
>>>>>>>>>>> include:
>>>>>>>>>>>
>>>>>>>>>>> approxCountDistinct
>>>>>>>>>>> countDistinct
>>>>>>>>>>> mean
>>>>>>>>>>> sumDistinct
>>>>>>>>>>>
>>>>>>>>>>> For example using countDistinct results in an error saying
>>>>>>>>>>> *Exception in thread "main"
>>>>>>>>>>> org.apache.spark.sql.AnalysisException: undefined function cosh;*
>>>>>>>>>>>
>>>>>>>>>>> I had a similar issue with cosh operator
>>>>>>>>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Exception-when-using-cosh-td14724.html>
>>>>>>>>>>> as well some time back and it turned out that it was not registered 
>>>>>>>>>>> in the
>>>>>>>>>>> registry:
>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> *I* *think it is the same issue again and would be glad to send
>>>>>>>>>>> over a PR if someone can confirm if this is an actual bug and not 
>>>>>>>>>>> some
>>>>>>>>>>> mistake on my part.*
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Query I am using: SELECT countDistinct(`age`) as `data` FROM
>>>>>>>>>>> `table`
>>>>>>>>>>> Spark Version: 10.4
>>>>>>>>>>> SparkSql Version: 1.5.1
>>>>>>>>>>>
>>>>>>>>>>> I am using the standard example of (name, age) schema (though I
>>>>>>>>>>> am setting age as Double and not Int as I am trying out maths 
>>>>>>>>>>> functions).
>>>>>>>>>>>
>>>>>>>>>>> The entire error stack can be found here
>>>>>>>>>>> <http://pastebin.com/G6YzQXnn>.
>>>>>>>>>>>
>>>>>>>>>>> Thanks!
>>>>>>>>>>>
>>>>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>>
>
