No, those are just functions for the DataFrame programming API.

On Wed, Oct 28, 2015 at 11:49 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
@Reynold I seem to be missing something. Aren't the functions listed here
<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$>
to be treated as SQL operators as well? I do see that these are mentioned as
functions available for DataFrame
<http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrame.html>,
but it would be great if you could clarify this.

On Wed, Oct 28, 2015 at 4:12 PM, Reynold Xin <r...@databricks.com> wrote:

I don't think these are bugs. The SQL standard for average is "avg", not
"mean". Similarly, a distinct count is supposed to be written as
"count(distinct col)", not "countDistinct(col)".

We can, however, make "mean" an alias for "avg" to improve compatibility
between DataFrame and SQL.

On Wed, Oct 28, 2015 at 11:38 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:

Also, are the other aggregate functions to be treated as bugs or not?

On Wed, Oct 28, 2015 at 4:08 PM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:

Wouldn't it be:

+ expression[Max]("avg"),

On Wed, Oct 28, 2015 at 4:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Since there is already Average, the simplest change is the following:

$ git diff sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Functi
index 3dce6c1..920f95b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -184,6 +184,7 @@ object FunctionRegistry {
     expression[Last]("last"),
     expression[Last]("last_value"),
     expression[Max]("max"),
+    expression[Average]("mean"),
     expression[Min]("min"),
     expression[Stddev]("stddev"),
     expression[StddevPop]("stddev_pop"),

FYI

On Wed, Oct 28, 2015 at 2:07 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:

I tried adding the aggregate functions in the registry and they work,
other than mean, for which Ted has forwarded some code changes. I will try
out those changes and update the status here.

On Wed, Oct 28, 2015 at 9:03 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:

Yup, avg works fine. So we have alternate functions to use in place of the
functions pointed out earlier. But my point is: are those original
aggregate functions not supposed to be used, am I using them the wrong
way, or is it a bug, as I asked in my first mail?

On Wed, Oct 28, 2015 at 3:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:

Have you tried using avg in place of mean?

(1 to 5).foreach { i =>
  val df = (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i")
}
sqlContext.sql("""
CREATE TEMPORARY TABLE partitionedParquet
USING org.apache.spark.sql.parquet
OPTIONS (
  path '/tmp/partitioned'
)""")
sqlContext.sql("""select avg(a) from partitionedParquet""").show()

Cheers

On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:

So I tried @Reynold's suggestion. I could get countDistinct and
sumDistinct running, but mean and approxCountDistinct do not work. (I
guess I am using the wrong syntax for approxCountDistinct.) For mean, I
think the registry entry is missing. Can someone clarify that as well?
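[Editor's note: the one-line registry change above can be modeled in miniature. The sketch below is plain Scala with no Spark dependency; MiniRegistry, its Seq[Double] => Double function type, and the registered names are illustrative toys, not Spark's actual FunctionRegistry API. It shows why an unregistered name fails SQL analysis and how adding "mean" as a second name for the existing average implementation fixes it.]

```scala
// Toy model (NOT Spark's real FunctionRegistry) of name-based SQL
// function resolution and the effect of registering an alias.
object MiniRegistry {
  private val builders =
    scala.collection.mutable.Map.empty[String, Seq[Double] => Double]

  def register(name: String, fn: Seq[Double] => Double): Unit =
    builders(name.toLowerCase) = fn

  // SQL function names resolve through this lookup at analysis time;
  // a name that was never registered fails with an error analogous to
  // "undefined function mean".
  def lookup(name: String): Seq[Double] => Double =
    builders.getOrElse(
      name.toLowerCase,
      throw new IllegalArgumentException(s"undefined function $name"))
}

// "avg" is registered, as in Spark 1.5.x:
MiniRegistry.register("avg", xs => xs.sum / xs.size)

// The fix: register "mean" as an alias sharing the same implementation,
// analogous to adding expression[Average]("mean") to the registry.
MiniRegistry.register("mean", MiniRegistry.lookup("avg"))

println(MiniRegistry.lookup("mean")(Seq(1.0, 2.0, 3.0))) // 2.0
```

Before the alias line, lookup("mean") would throw, which matches the AnalysisException reported downthread.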
On Tue, Oct 27, 2015 at 8:02 PM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:

Will try in a while when I get back. I assume this applies to all
functions other than mean. Also, countDistinct is defined along with all
the other SQL functions, so I don't get the "distinct is not part of the
function name" part.

On 27 Oct 2015 19:58, "Reynold Xin" <r...@databricks.com> wrote:

Try

count(distinct columnname)

In SQL, distinct is not part of the function name.

On Tuesday, October 27, 2015, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:

Oops, seems I made a mistake. The error message is: Exception in thread
"main" org.apache.spark.sql.AnalysisException: undefined function
countDistinct

On 27 Oct 2015 15:49, "Shagun Sodhani" <sshagunsodh...@gmail.com> wrote:

Hi! I was trying out some aggregate functions in Spark SQL and I noticed
that certain aggregate operators are not working.
This includes:

approxCountDistinct
countDistinct
mean
sumDistinct

For example, using countDistinct results in an error saying:

Exception in thread "main" org.apache.spark.sql.AnalysisException: undefined function cosh;

I had a similar issue with the cosh operator
<http://apache-spark-developers-list.1001551.n3.nabble.com/Exception-when-using-cosh-td14724.html>
some time back, and it turned out that it was not registered in the registry:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala

I think it is the same issue again and would be glad to send over a PR if
someone can confirm that this is an actual bug and not some mistake on my
part.

Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
Spark Version: 10.4
SparkSql Version: 1.5.1

I am using the standard example of a (name, age) schema (though I am
setting age as Double, not Int, as I am trying out math functions).

The entire error stack can be found here <http://pastebin.com/G6YzQXnn>.

Thanks!
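[Editor's note: for reference, here is what the corrected query from upthread, select count(distinct age) and avg(age), actually computes. This is a plain-Scala sketch with no Spark required; the Person case class and the sample rows are invented for illustration.]

```scala
// Plain-Scala illustration of the semantics of
//   SELECT count(distinct age), avg(age) FROM table
// Sample data is made up for the example.
case class Person(name: String, age: Double)

val table = Seq(Person("a", 30.0), Person("b", 30.0), Person("c", 40.0))

// count(distinct age): number of unique age values, not number of rows.
val countDistinctAge = table.map(_.age).distinct.size // 2

// avg(age): the SQL-standard name for what the DataFrame API calls mean.
val avgAge = table.map(_.age).sum / table.size // 100.0 / 3, about 33.33

println(countDistinctAge)
println(avgAge)
```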