Re: Exception when using some aggregate operators

2015-10-28 Thread Shagun Sodhani
I tried adding the aggregate functions in the registry and they work, other
than mean, for which Ted has forwarded some code changes. I will try out
those changes and update the status here.

On Wed, Oct 28, 2015 at 9:03 AM, Shagun Sodhani 
wrote:

> Yup, avg works well. So we have alternate functions to use in place of the
> functions pointed out earlier. But my point is: are those original
> aggregate functions not supposed to be used, am I using them in the wrong
> way, or is it a bug, as I asked in my first mail?
>
> On Wed, Oct 28, 2015 at 3:20 AM, Ted Yu  wrote:
>
>> Have you tried using avg in place of mean ?
>>
>> (1 to 5).foreach { i => val df = (1 to 1000).map(j => (j,
>> s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i") }
>> sqlContext.sql("""
>> CREATE TEMPORARY TABLE partitionedParquet
>> USING org.apache.spark.sql.parquet
>> OPTIONS (
>>   path '/tmp/partitioned'
>> )""")
>> sqlContext.sql("""select avg(a) from partitionedParquet""").show()
>>
>> Cheers
>>
>> On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sodhani <
>> sshagunsodh...@gmail.com> wrote:
>>
>>> So I tried @Reynold's suggestion. I could get countDistinct and
>>> sumDistinct running, but mean and approxCountDistinct do not work (I
>>> guess I am using the wrong syntax for approxCountDistinct). For mean, I
>>> think the registry entry is missing. Can someone clarify that as well?
>>>
>>> On Tue, Oct 27, 2015 at 8:02 PM, Shagun Sodhani <
>>> sshagunsodh...@gmail.com> wrote:
>>>
 Will try in a while when I get back. I assume this applies to all
 functions other than mean. Also, countDistinct is defined along with all
 the other SQL functions, so I don't get the "distinct is not part of the
 function name" part.
 On 27 Oct 2015 19:58, "Reynold Xin"  wrote:

> Try
>
> count(distinct column_name)
>
> In SQL distinct is not part of the function name.
>
> On Tuesday, October 27, 2015, Shagun Sodhani 
> wrote:
>
>> Oops, seems I made a mistake. The error message is: Exception in
>> thread "main" org.apache.spark.sql.AnalysisException: undefined function
>> countDistinct
>> On 27 Oct 2015 15:49, "Shagun Sodhani" 
>> wrote:
>>
>>> Hi! I was trying out some aggregate functions in SparkSql and I
>>> noticed that certain aggregate operators are not working. This includes:
>>>
>>> approxCountDistinct
>>> countDistinct
>>> mean
>>> sumDistinct
>>>
>>> For example using countDistinct results in an error saying
>>> *Exception in thread "main" org.apache.spark.sql.AnalysisException:
>>> undefined function cosh;*
>>>
>>> I had a similar issue with the cosh operator as well some time back,
>>> and it turned out that it was not registered in the registry:
>>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>>
>>>
>>> *I think it is the same issue again and would be glad to send
>>> over a PR if someone can confirm that this is an actual bug and not some
>>> mistake on my part.*
>>>
>>>
>>> Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
>>> Spark Version: 10.4
>>> SparkSql Version: 1.5.1
>>>
>>> I am using the standard example of (name, age) schema (though I am
>>> setting age as Double and not Int as I am trying out maths functions).
>>>
>>> The entire error stack can be found here.
>>>
>>> Thanks!
>>>
>>
>>>
>>
>
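The failure mode discussed above can be pictured as a name lookup: the analyzer resolves a SQL function name against the FunctionRegistry, and an unregistered name (like countDistinct or, at the time, mean) raises an AnalysisException. A minimal plain-Scala sketch of that idea, with an illustrative set of registered names rather than Spark's actual registry contents:

```scala
// Minimal sketch (plain Scala, no Spark dependency) of a function-registry
// lookup. The names below are a hypothetical subset, not the real
// FunctionRegistry; the point is that only registered SQL names resolve.
object RegistrySketch {
  // SQL-facing names that are registered (illustrative subset)
  val registered: Set[String] = Set("count", "sum", "avg", "max", "min")

  // Mimics the analyzer: an unregistered name raises an analysis error.
  def lookup(name: String): String =
    if (registered.contains(name.toLowerCase)) name.toLowerCase
    else throw new IllegalArgumentException(s"undefined function $name")
}
```

Under this sketch, `lookup("avg")` succeeds while `lookup("countDistinct")` throws, mirroring the "undefined function countDistinct" error in the thread.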


Re: Exception when using some aggregate operators

2015-10-28 Thread Ted Yu
Since there is already Average, the simplest change is the following:

$ git diff
sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
diff --git
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Functi
index 3dce6c1..920f95b 100644
---
a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++
b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -184,6 +184,7 @@ object FunctionRegistry {
 expression[Last]("last"),
 expression[Last]("last_value"),
 expression[Max]("max"),
+expression[Average]("mean"),
 expression[Min]("min"),
 expression[Stddev]("stddev"),
 expression[StddevPop]("stddev_pop"),

FYI
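The one-line diff above registers the existing Average expression under a second SQL name. A minimal plain-Scala sketch of that aliasing idea, using a function value as a stand-in for Spark's expression builder (the real registry constructs builders via reflection):

```scala
// Minimal sketch (plain Scala, no Spark) of registering one expression
// builder under two SQL names, as the diff above does for Average.
object AliasSketch {
  // Stand-in for the Average expression builder (illustrative only).
  val average: Seq[Double] => Double = xs => xs.sum / xs.size

  // After the change, both "avg" and "mean" resolve to the same builder.
  val registry: Map[String, Seq[Double] => Double] =
    Map("avg" -> average, "mean" -> average)
}
```

Both names resolve to the same underlying aggregate, so `mean` becomes a pure alias with no behavior change for `avg`.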



Re: Exception when using some aggregate operators

2015-10-28 Thread Shagun Sodhani
Wouldn't it be:

+expression[Max]("avg"),


Re: Exception when using some aggregate operators

2015-10-28 Thread Shagun Sodhani
Also are the other aggregate functions to be treated as bugs or not?


Re: Exception when using some aggregate operators

2015-10-28 Thread Reynold Xin
I don't think these are bugs. The SQL standard for average is "avg", not
"mean". Similarly, a distinct count is supposed to be written as
"count(distinct col)", not "countDistinct(col)".

We can, however, make "mean" an alias for "avg" to improve compatibility
between DataFrame and SQL.



Re: Exception when using some aggregate operators

2015-10-28 Thread Ted Yu
Created SPARK-11371 with a patch.

Will create PR soon.


Re: Exception when using some aggregate operators

2015-10-28 Thread Shagun Sodhani
@Reynold I seem to be missing something. Aren't the functions listed here
to be treated as SQL operators as well? I do see that these are mentioned
as functions available for DataFrame, but it would be great if you could
clarify this.


Re: Exception when using some aggregate operators

2015-10-28 Thread Reynold Xin
No those are just functions for the DataFrame programming API.

On Wed, Oct 28, 2015 at 11:49 AM, Shagun Sodhani 
wrote:

> @Reynold I seem to be missing something. Aren't the functions listed here
> 
>  to
> be treated as sql operators as well? I do see that these are mentioned as 
> Functions
> available for DataFrame
> 
>  but
> it would be great if you can clarify this.
>
> On Wed, Oct 28, 2015 at 4:12 PM, Reynold Xin  wrote:
>
>> I don't think these are bugs. The SQL standard for average is "avg", not
>> "mean". Similarly, a distinct count is supposed to be written as
>> "count(distinct col)", not "countDistinct(col)".
>>
>> We can, however, make "mean" an alias for "avg" to improve compatibility
>> between DataFrame and SQL.
>>
>>
>> On Wed, Oct 28, 2015 at 11:38 AM, Shagun Sodhani <
>> sshagunsodh...@gmail.com> wrote:
>>
>>> Also are the other aggregate functions to be treated as bugs or not?
>>>
>>> On Wed, Oct 28, 2015 at 4:08 PM, Shagun Sodhani <
>>> sshagunsodh...@gmail.com> wrote:
>>>
 Wouldnt it be:

 +expression[Max]("avg"),

 On Wed, Oct 28, 2015 at 4:06 PM, Ted Yu  wrote:

> Since there is already Average, the simplest change is the following:
>
> $ git diff
> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
> diff --git
> a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
> b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Functi
> index 3dce6c1..920f95b 100644
> ---
> a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
> +++
> b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
> @@ -184,6 +184,7 @@ object FunctionRegistry {
>  expression[Last]("last"),
>  expression[Last]("last_value"),
>  expression[Max]("max"),
> +expression[Average]("mean"),
>  expression[Min]("min"),
>  expression[Stddev]("stddev"),
>  expression[StddevPop]("stddev_pop"),
>
> FYI
>
> On Wed, Oct 28, 2015 at 2:07 AM, Shagun Sodhani <
> sshagunsodh...@gmail.com> wrote:
>
>> I tried adding the aggregate functions in the registry and they work,
>> other than mean, for which Ted has forwarded some code changes. I will 
>> try
>> out those changes and update the status here.
>>
>> On Wed, Oct 28, 2015 at 9:03 AM, Shagun Sodhani <
>> sshagunsodh...@gmail.com> wrote:
>>
>>> Yup avg works good. So we have alternate functions to use in place
>>> on the functions pointed out earlier. But my point is that are those
>>> original aggregate functions not supposed to be used or I am using them 
>>> in
>>> the wrong way or is it a bug as I asked in my first mail.
>>>
>>> On Wed, Oct 28, 2015 at 3:20 AM, Ted Yu  wrote:
>>>
 Have you tried using avg in place of mean ?

 (1 to 5).foreach { i => val df = (1 to 1000).map(j => (j,
 s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i") }
 sqlContext.sql("""
 CREATE TEMPORARY TABLE partitionedParquet
 USING org.apache.spark.sql.parquet
 OPTIONS (
   path '/tmp/partitioned'
 )""")
 sqlContext.sql("""select avg(a) from partitionedParquet""").show()

 Cheers
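
For reference, in the snippet above each of the five partitions holds a = 1..1000, so avg(a) over the whole table is simply the mean of 1..1000, i.e. 500.5. A plain-Scala check of that expected value (a sketch independent of Spark; the data shape mirrors Ted's example):

```scala
// Each of the 5 partitions written above contains a = 1..1000, so the
// table-wide avg(a) equals the mean of 1..1000, verified here without Spark.
val perPartition = (1 to 1000).map(_.toDouble)
val allRows = (1 to 5).flatMap(_ => perPartition)

val expectedAvg = allRows.sum / allRows.size
// (1 + 1000) / 2 = 500.5, independent of how many partitions repeat the data
```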

 On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sodhani <
 sshagunsodh...@gmail.com> wrote:

> So I tried @Reynold's suggestion. I could get countDistinct and
> sumDistinct running, but mean and approxCountDistinct do not
> work. (I guess I am using the wrong syntax for approxCountDistinct.)
> For mean, I think the registry entry is missing. Can someone clarify
> that as well?
>
> On Tue, Oct 27, 2015 at 8:02 PM, Shagun Sodhani <
> sshagunsodh...@gmail.com> wrote:
>
>> Will try in a while when I get back. I assume this applies to all
>> functions other than mean. Also, countDistinct is defined along with
>> all other SQL functions. So I don't get the "distinct is not part of
>> the function name" part.
>> On 27 Oct 2015 19:58, "Reynold Xin"  wrote:
>>
>>> Try
>>>
>>> count(distinct column_name)
>>>
>>> In SQL distinct is not part of the function name.
>>>
>>> On Tuesday, October 27, 2015, Shagun Sodhani <
>>> sshagunsodh...@gmail.com> wrote:
>>>
 Oops seems I made a mistake. 

Re: Exception when using some aggregate operators

2015-10-28 Thread Shagun Sodhani
Ohh great! Thanks for the clarification.

On Wed, Oct 28, 2015 at 4:21 PM, Reynold Xin  wrote:

> No, those are just functions for the DataFrame programming API.
>
> On Wed, Oct 28, 2015 at 11:49 AM, Shagun Sodhani  > wrote:
>
>> @Reynold I seem to be missing something. Aren't the functions listed here
>> 
>> to be treated as SQL operators as well? I do see that these are mentioned
>> as Functions available for DataFrame
>> 
>> but it would be great if you could clarify this.
>>
>> On Wed, Oct 28, 2015 at 4:12 PM, Reynold Xin  wrote:
>>
>>> I don't think these are bugs. The SQL standard for average is "avg", not
>>> "mean". Similarly, a distinct count is supposed to be written as
>>> "count(distinct col)", not "countDistinct(col)".
>>>
>>> We can, however, make "mean" an alias for "avg" to improve compatibility
>>> between DataFrame and SQL.
>>>
>>>
>>> On Wed, Oct 28, 2015 at 11:38 AM, Shagun Sodhani <
>>> sshagunsodh...@gmail.com> wrote:
>>>
 Also are the other aggregate functions to be treated as bugs or not?

 On Wed, Oct 28, 2015 at 4:08 PM, Shagun Sodhani <
 sshagunsodh...@gmail.com> wrote:

> Wouldn't it be:
>
> +expression[Max]("avg"),
>
> On Wed, Oct 28, 2015 at 4:06 PM, Ted Yu  wrote:
>
>> Since there is already Average, the simplest change is the following:
>>
>> $ git diff
>> sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>> diff --git
>> a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>> b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Functi
>> index 3dce6c1..920f95b 100644
>> ---
>> a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>> +++
>> b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>> @@ -184,6 +184,7 @@ object FunctionRegistry {
>>  expression[Last]("last"),
>>  expression[Last]("last_value"),
>>  expression[Max]("max"),
>> +expression[Average]("mean"),
>>  expression[Min]("min"),
>>  expression[Stddev]("stddev"),
>>  expression[StddevPop]("stddev_pop"),
>>
>> FYI
>>
>> On Wed, Oct 28, 2015 at 2:07 AM, Shagun Sodhani <
>> sshagunsodh...@gmail.com> wrote:
>>
>>> I tried adding the aggregate functions in the registry and they
>>> work, other than mean, for which Ted has forwarded some code changes. I
>>> will try out those changes and update the status here.
>>>
>>> On Wed, Oct 28, 2015 at 9:03 AM, Shagun Sodhani <
>>> sshagunsodh...@gmail.com> wrote:
>>>
 Yup avg works well. So we have alternate functions to use in place
 of the functions pointed out earlier. But my point is: are those
 original aggregate functions not supposed to be used, or am I using
 them in the wrong way, or is it a bug, as I asked in my first mail?

 On Wed, Oct 28, 2015 at 3:20 AM, Ted Yu 
 wrote:

> Have you tried using avg in place of mean?
>
> (1 to 5).foreach { i => val df = (1 to 1000).map(j => (j,
> s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i") }
> sqlContext.sql("""
> CREATE TEMPORARY TABLE partitionedParquet
> USING org.apache.spark.sql.parquet
> OPTIONS (
>   path '/tmp/partitioned'
> )""")
> sqlContext.sql("""select avg(a) from partitionedParquet""").show()
>
> Cheers
>
> On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sodhani <
> sshagunsodh...@gmail.com> wrote:
>
>> So I tried @Reynold's suggestion. I could get countDistinct and
>> sumDistinct running, but mean and approxCountDistinct do not
>> work. (I guess I am using the wrong syntax for approxCountDistinct.)
>> For mean, I think the registry entry is missing. Can someone clarify
>> that as well?
>>
>> On Tue, Oct 27, 2015 at 8:02 PM, Shagun Sodhani <
>> sshagunsodh...@gmail.com> wrote:
>>
>>> Will try in a while when I get back. I assume this applies to
>>> all functions other than mean. Also, countDistinct is defined along
>>> with all other SQL functions. So I don't get the "distinct is not
>>> part of the function name" part.
>>> On 27 Oct 2015 19:58, "Reynold Xin"  wrote:
>>>
 Try

 count(distinct column_name)

Re: Exception when using some aggregate operators

2015-10-27 Thread Shagun Sodhani
Yup avg works well. So we have alternate functions to use in place of the
functions pointed out earlier. But my point is: are those original
aggregate functions not supposed to be used, or am I using them in the
wrong way, or is it a bug, as I asked in my first mail?

On Wed, Oct 28, 2015 at 3:20 AM, Ted Yu  wrote:

> Have you tried using avg in place of mean?
>
> (1 to 5).foreach { i => val df = (1 to 1000).map(j => (j,
> s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i") }
> sqlContext.sql("""
> CREATE TEMPORARY TABLE partitionedParquet
> USING org.apache.spark.sql.parquet
> OPTIONS (
>   path '/tmp/partitioned'
> )""")
> sqlContext.sql("""select avg(a) from partitionedParquet""").show()
>
> Cheers
>
> On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sodhani  > wrote:
>
>> So I tried @Reynold's suggestion. I could get countDistinct and
>> sumDistinct running, but mean and approxCountDistinct do not work. (I
>> guess I am using the wrong syntax for approxCountDistinct.) For mean, I
>> think the registry entry is missing. Can someone clarify that as well?
>>
>> On Tue, Oct 27, 2015 at 8:02 PM, Shagun Sodhani > > wrote:
>>
>>> Will try in a while when I get back. I assume this applies to all
>>> functions other than mean. Also, countDistinct is defined along with
>>> all other SQL functions. So I don't get the "distinct is not part of
>>> the function name" part.
>>> On 27 Oct 2015 19:58, "Reynold Xin"  wrote:
>>>
 Try

 count(distinct column_name)

 In SQL distinct is not part of the function name.

 On Tuesday, October 27, 2015, Shagun Sodhani 
 wrote:

> Oops, seems I made a mistake. The error message is: Exception in
> thread "main" org.apache.spark.sql.AnalysisException: undefined function
> countDistinct
> On 27 Oct 2015 15:49, "Shagun Sodhani" 
> wrote:
>
>> Hi! I was trying out some aggregate functions in SparkSql and I
>> noticed that certain aggregate operators are not working. This includes:
>>
>> approxCountDistinct
>> countDistinct
>> mean
>> sumDistinct
>>
>> For example using countDistinct results in an error saying
>> *Exception in thread "main" org.apache.spark.sql.AnalysisException:
>> undefined function cosh;*
>>
>> I had a similar issue with the cosh operator
>> 
>> as well some time back, and it turned out that it was not registered
>> in the registry:
>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>
>>
>> *I think it is the same issue again and would be glad to send over
>> a PR if someone can confirm if this is an actual bug and not some mistake
>> on my part.*
>>
>>
>> Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
>> Spark Version: 10.4
>> SparkSql Version: 1.5.1
>>
>> I am using the standard example of (name, age) schema (though I am
>> setting age as Double and not Int as I am trying out maths functions).
>>
>> The entire error stack can be found here
>> .
>>
>> Thanks!
>>
>
>>
>


Re: Exception when using some aggregate operators

2015-10-27 Thread Shagun Sodhani
Oops, seems I made a mistake. The error message is: Exception in thread
"main" org.apache.spark.sql.AnalysisException: undefined function
countDistinct
On 27 Oct 2015 15:49, "Shagun Sodhani"  wrote:

> Hi! I was trying out some aggregate functions in SparkSql and I noticed
> that certain aggregate operators are not working. This includes:
>
> approxCountDistinct
> countDistinct
> mean
> sumDistinct
>
> For example using countDistinct results in an error saying
> *Exception in thread "main" org.apache.spark.sql.AnalysisException:
> undefined function cosh;*
>
> I had a similar issue with the cosh operator
> 
> as well some time back and it turned out that it was not registered in the
> registry:
> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>
>
> *I think it is the same issue again and would be glad to send over a PR
> if someone can confirm if this is an actual bug and not some mistake on my
> part.*
>
>
> Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
> Spark Version: 10.4
> SparkSql Version: 1.5.1
>
> I am using the standard example of (name, age) schema (though I am setting
> age as Double and not Int as I am trying out maths functions).
>
> The entire error stack can be found here .
>
> Thanks!
>


Re: Exception when using some aggregate operators

2015-10-27 Thread Reynold Xin
Try

count(distinct column_name)

In SQL distinct is not part of the function name.
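
In other words, the SQL form puts the distinct modifier inside the call: count(distinct age). A plain-Scala sketch of the value such a query computes, over made-up sample rows (illustrative only, not Spark code; the Person data is hypothetical):

```scala
// count(distinct age) counts the unique values of a column.
// The rows below are hypothetical sample data for illustration.
case class Person(name: String, age: Double)

val table = Seq(
  Person("a", 25.0), Person("b", 30.0),
  Person("c", 25.0), Person("d", 40.0)
)

// SQL: SELECT count(distinct age) FROM table
val distinctAgeCount = table.map(_.age).distinct.size // 3
```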

On Tuesday, October 27, 2015, Shagun Sodhani 
wrote:

> Oops, seems I made a mistake. The error message is: Exception in thread
> "main" org.apache.spark.sql.AnalysisException: undefined function
> countDistinct
> On 27 Oct 2015 15:49, "Shagun Sodhani"  > wrote:
>
>> Hi! I was trying out some aggregate functions in SparkSql and I noticed
>> that certain aggregate operators are not working. This includes:
>>
>> approxCountDistinct
>> countDistinct
>> mean
>> sumDistinct
>>
>> For example using countDistinct results in an error saying
>> *Exception in thread "main" org.apache.spark.sql.AnalysisException:
>> undefined function cosh;*
>>
>> I had a similar issue with the cosh operator
>> 
>> as well some time back and it turned out that it was not registered in the
>> registry:
>> https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
>>
>>
>> *I think it is the same issue again and would be glad to send over a
>> PR if someone can confirm if this is an actual bug and not some mistake on
>> my part.*
>>
>>
>> Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
>> Spark Version: 10.4
>> SparkSql Version: 1.5.1
>>
>> I am using the standard example of (name, age) schema (though I am
>> setting age as Double and not Int as I am trying out maths functions).
>>
>> The entire error stack can be found here .
>>
>> Thanks!
>>
>