@Reynold I seem to be missing something. Aren't the functions listed here <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$> supposed to be treated as SQL operators as well? I do see that they are described as functions available for DataFrame <http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrame.html>, but it would be great if you could clarify this.
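(For context, the distinction Reynold draws below can be seen directly in a Spark 1.5.x shell. A minimal sketch, assuming sqlContext is in scope; the "people" and "age" names are illustrative, not from the thread. The functions object backs the DataFrame API, while SQL text is resolved separately against FunctionRegistry, so the two namespaces need not match.)

    import org.apache.spark.sql.functions._  // mean, countDistinct, etc. back the DataFrame API

    // Illustrative data; any DataFrame with a numeric column works.
    val df = sqlContext.createDataFrame(Seq(("a", 21.0), ("b", 30.0))).toDF("name", "age")
    df.registerTempTable("people")

    // DataFrame API: names resolve against org.apache.spark.sql.functions at compile time.
    df.agg(mean("age"), countDistinct("age")).show()

    // SQL: names resolve at analysis time against FunctionRegistry, which in 1.5.1
    // knows "avg" and "count" but has no "mean" or "countDistinct" entries.
    sqlContext.sql("SELECT avg(age), count(distinct age) FROM people").show()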
On Wed, Oct 28, 2015 at 4:12 PM, Reynold Xin <r...@databricks.com> wrote:

> I don't think these are bugs. The SQL standard for average is "avg", not
> "mean". Similarly, a distinct count is supposed to be written as
> "count(distinct col)", not "countDistinct(col)".
>
> We can, however, make "mean" an alias for "avg" to improve compatibility
> between DataFrame and SQL.
>
> On Wed, Oct 28, 2015 at 11:38 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>
>> Also, are the other aggregate functions to be treated as bugs or not?
>>
>> On Wed, Oct 28, 2015 at 4:08 PM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>
>>> Wouldn't it be:
>>>
>>> + expression[Max]("avg"),
>>>
>>> On Wed, Oct 28, 2015 at 4:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Since there is already Average, the simplest change is the following:
>>>>
>>>> $ git diff sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>> diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>> index 3dce6c1..920f95b 100644
>>>> --- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>> +++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>> @@ -184,6 +184,7 @@ object FunctionRegistry {
>>>>      expression[Last]("last"),
>>>>      expression[Last]("last_value"),
>>>>      expression[Max]("max"),
>>>> +    expression[Average]("mean"),
>>>>      expression[Min]("min"),
>>>>      expression[Stddev]("stddev"),
>>>>      expression[StddevPop]("stddev_pop"),
>>>>
>>>> FYI
>>>>
>>>> On Wed, Oct 28, 2015 at 2:07 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>
>>>>> I tried adding the aggregate functions in the registry and they work,
>>>>> other than mean, for which Ted has forwarded some code changes. I will
>>>>> try out those changes and update the status here.
>>>>>
>>>>> On Wed, Oct 28, 2015 at 9:03 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>>
>>>>>> Yup, avg works fine. So we have alternate functions to use in place
>>>>>> of the functions pointed out earlier. But my point is: are those
>>>>>> original aggregate functions not supposed to be used, am I using them
>>>>>> the wrong way, or is it a bug, as I asked in my first mail?
>>>>>>
>>>>>> On Wed, Oct 28, 2015 at 3:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>>>>
>>>>>>> Have you tried using avg in place of mean?
>>>>>>>
>>>>>>> (1 to 5).foreach { i => val df = (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i") }
>>>>>>> sqlContext.sql("""
>>>>>>> CREATE TEMPORARY TABLE partitionedParquet
>>>>>>> USING org.apache.spark.sql.parquet
>>>>>>> OPTIONS (
>>>>>>>   path '/tmp/partitioned'
>>>>>>> )""")
>>>>>>> sqlContext.sql("""select avg(a) from partitionedParquet""").show()
>>>>>>>
>>>>>>> Cheers
>>>>>>>
>>>>>>> On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>>>>
>>>>>>>> So I tried @Reynold's suggestion. I could get countDistinct and
>>>>>>>> sumDistinct running, but mean and approxCountDistinct do not work.
>>>>>>>> (I guess I am using the wrong syntax for approxCountDistinct.) For
>>>>>>>> mean, I think the registry entry is missing. Can someone clarify
>>>>>>>> that as well?
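(For reference, a sketch of the SQL-standard spellings that the thread confirms, or suggests, do resolve, reusing Ted's partitionedParquet table from above. No working SQL spelling for approxCountDistinct is confirmed in this thread, so it is omitted.)

    sqlContext.sql("SELECT avg(a) FROM partitionedParquet").show()             // instead of mean(a)
    sqlContext.sql("SELECT count(distinct a) FROM partitionedParquet").show()  // instead of countDistinct(a)
    sqlContext.sql("SELECT sum(distinct a) FROM partitionedParquet").show()    // instead of sumDistinct(a)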
>>>>>>>>
>>>>>>>> On Tue, Oct 27, 2015 at 8:02 PM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>>>>>
>>>>>>>>> Will try in a while when I get back. I assume this applies to all
>>>>>>>>> functions other than mean. Also, countDistinct is defined along with
>>>>>>>>> all the other SQL functions, so I don't get the "distinct is not
>>>>>>>>> part of the function name" part.
>>>>>>>>>
>>>>>>>>> On 27 Oct 2015 19:58, "Reynold Xin" <r...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> Try
>>>>>>>>>>
>>>>>>>>>> count(distinct columnName)
>>>>>>>>>>
>>>>>>>>>> In SQL, distinct is not part of the function name.
>>>>>>>>>>
>>>>>>>>>> On Tuesday, October 27, 2015, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
>>>>>>>>>>
>>>>>>>>>>> Oops, seems I made a mistake. The error message is: Exception in
>>>>>>>>>>> thread "main" org.apache.spark.sql.AnalysisException: undefined
>>>>>>>>>>> function countDistinct
>>>>>>>>>>>
>>>>>>>>>>> On 27 Oct 2015 15:49, "Shagun Sodhani" <sshagunsodh...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>>> Hi! I was trying out some aggregate functions in SparkSql and I
>>>>>>>>>>>> noticed that certain aggregate operators are not working. These
>>>>>>>>>>>> include:
>>>>>>>>>>>>
>>>>>>>>>>>> approxCountDistinct
>>>>>>>>>>>> countDistinct
>>>>>>>>>>>> mean
>>>>>>>>>>>> sumDistinct
>>>>>>>>>>>>
>>>>>>>>>>>> For example, using countDistinct results in an error saying
>>>>>>>>>>>> *Exception in thread "main"
>>>>>>>>>>>> org.apache.spark.sql.AnalysisException: undefined function cosh;*
>>>>>>>>>>>>
>>>>>>>>>>>> I had a similar issue with the cosh operator
>>>>>>>>>>>> <http://apache-spark-developers-list.1001551.n3.nabble.com/Exception-when-using-cosh-td14724.html>
>>>>>>>>>>>> some time back, and it turned out that it was not registered in
>>>>>>>>>>>> the registry:
>>>>>>>>>>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>>>>>>>>>>
>>>>>>>>>>>> *I think it is the same issue again and would be glad to send
>>>>>>>>>>>> over a PR if someone can confirm that this is an actual bug and
>>>>>>>>>>>> not some mistake on my part.*
>>>>>>>>>>>>
>>>>>>>>>>>> Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
>>>>>>>>>>>> Spark Version: 10.4
>>>>>>>>>>>> SparkSql Version: 1.5.1
>>>>>>>>>>>>
>>>>>>>>>>>> I am using the standard example of a (name, age) schema (though
>>>>>>>>>>>> I am setting age as Double and not Int, as I am trying out maths
>>>>>>>>>>>> functions).
>>>>>>>>>>>>
>>>>>>>>>>>> The entire error stack can be found here
>>>>>>>>>>>> <http://pastebin.com/G6YzQXnn>.
>>>>>>>>>>>>
>>>>>>>>>>>> Thanks!
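(Putting the thread together: a sketch of the original repro and the workarounds discussed, assuming a Spark 1.5.1 shell with sqlContext in scope; the rows are made up to match the (name, age-as-Double) schema described above.)

    import org.apache.spark.sql.functions.countDistinct

    val people = sqlContext.createDataFrame(
      Seq(("alice", 21.0), ("bob", 30.0), ("carol", 21.0))).toDF("name", "age")
    people.registerTempTable("table")

    // Reported failure on 1.5.1:
    // sqlContext.sql("SELECT countDistinct(`age`) as `data` FROM `table`")
    // => org.apache.spark.sql.AnalysisException: undefined function countDistinct

    // Workaround 1: the SQL-standard spelling (Reynold's suggestion).
    sqlContext.sql("SELECT count(distinct `age`) as `data` FROM `table`").show()

    // Workaround 2: the DataFrame-API function of the same name.
    people.agg(countDistinct("age").as("data")).show()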