No, those are just functions for the DataFrame programming API.

On Wed, Oct 28, 2015 at 11:49 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:
@Reynold I seem to be missing something. Aren't the functions listed here
<http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$>
to be treated as SQL operators as well? I do see that these are mentioned as
functions available for DataFrame
<http://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/DataFrame.html>,
but it would be great if you could clarify this.

On Wed, Oct 28, 2015 at 4:12 PM, Reynold Xin <r...@databricks.com> wrote:

I don't think these are bugs. The SQL standard for average is "avg", not
"mean". Similarly, a distinct count is supposed to be written as
"count(distinct col)", not "countDistinct(col)".

We can, however, make "mean" an alias for "avg" to improve compatibility
between DataFrame and SQL.

On Wed, Oct 28, 2015 at 11:38 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:

Also, are the other aggregate functions to be treated as bugs or not?

On Wed, Oct 28, 2015 at 4:08 PM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:

Wouldn't it be:

+ expression[Max]("avg"),

On Wed, Oct 28, 2015 at 4:06 PM, Ted Yu <yuzhih...@gmail.com> wrote:

Since there is already Average, the simplest change is the following:

$ git diff sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/Functi
index 3dce6c1..920f95b 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala
@@ -184,6 +184,7 @@ object FunctionRegistry {
     expression[Last]("last"),
     expression[Last]("last_value"),
     expression[Max]("max"),
+    expression[Average]("mean"),
     expression[Min]("min"),
     expression[Stddev]("stddev"),
     expression[StddevPop]("stddev_pop"),

FYI

On Wed, Oct 28, 2015 at 2:07 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:

I tried adding the aggregate functions in the registry and they work,
other than mean, for which Ted has forwarded some code changes. I will try
out those changes and update the status here.

On Wed, Oct 28, 2015 at 9:03 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:

Yup, avg works fine. So we have alternate functions to use in place of the
functions pointed out earlier. But my point is: are those original
aggregate functions not supposed to be used, am I using them the wrong
way, or is it a bug, as I asked in my first mail?

On Wed, Oct 28, 2015 at 3:20 AM, Ted Yu <yuzhih...@gmail.com> wrote:

Have you tried using avg in place of mean?

(1 to 5).foreach { i =>
  val df = (1 to 1000).map(j => (j, s"str$j")).toDF("a", "b").save(s"/tmp/partitioned/i=$i")
}
sqlContext.sql("""
CREATE TEMPORARY TABLE partitionedParquet
USING org.apache.spark.sql.parquet
OPTIONS (
  path '/tmp/partitioned'
)""")
sqlContext.sql("""select avg(a) from partitionedParquet""").show()

Cheers

On Tue, Oct 27, 2015 at 10:12 AM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:

So I tried @Reynold's suggestion. I could get countDistinct and
sumDistinct running, but mean and approxCountDistinct do not work. (I
guess I am using the wrong syntax for approxCountDistinct.) For mean, I
think the registry entry is missing. Can someone clarify that as well?
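[Editor's note: the one-line registry change above can be modeled in miniature. The sketch below is plain Scala with no Spark dependency; MiniRegistry, its Seq[Double] => Double function type, and the registered names are illustrative toys, not Spark's actual FunctionRegistry API. It shows why an unregistered name fails SQL analysis and how adding "mean" as a second name for the existing average implementation fixes it.]

```scala
// Toy model (NOT Spark's real FunctionRegistry) of name-based SQL
// function resolution and the effect of registering an alias.
object MiniRegistry {
  private val builders =
    scala.collection.mutable.Map.empty[String, Seq[Double] => Double]

  def register(name: String, fn: Seq[Double] => Double): Unit =
    builders(name.toLowerCase) = fn

  // SQL function names resolve through this lookup at analysis time;
  // a name that was never registered fails with an error analogous to
  // "undefined function mean".
  def lookup(name: String): Seq[Double] => Double =
    builders.getOrElse(
      name.toLowerCase,
      throw new IllegalArgumentException(s"undefined function $name"))
}

// "avg" is registered, as in Spark 1.5.x:
MiniRegistry.register("avg", xs => xs.sum / xs.size)

// The fix: register "mean" as an alias sharing the same implementation,
// analogous to adding expression[Average]("mean") to the registry.
MiniRegistry.register("mean", MiniRegistry.lookup("avg"))

println(MiniRegistry.lookup("mean")(Seq(1.0, 2.0, 3.0))) // 2.0
```

Before the alias line, lookup("mean") would throw, which matches the AnalysisException reported downthread.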
On Tue, Oct 27, 2015 at 8:02 PM, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:

Will try in a while when I get back. I assume this applies to all
functions other than mean. Also, countDistinct is defined along with all
the other SQL functions, so I don't get the "distinct is not part of the
function name" part.

On 27 Oct 2015 19:58, "Reynold Xin" <r...@databricks.com> wrote:

Try

count(distinct columnname)

In SQL, distinct is not part of the function name.

On Tuesday, October 27, 2015, Shagun Sodhani <sshagunsodh...@gmail.com> wrote:

Oops, seems I made a mistake. The error message is: Exception in thread
"main" org.apache.spark.sql.AnalysisException: undefined function
countDistinct

On 27 Oct 2015 15:49, "Shagun Sodhani" <sshagunsodh...@gmail.com> wrote:

Hi! I was trying out some aggregate functions in Spark SQL and I noticed
that certain aggregate operators are not working.
This includes:

approxCountDistinct
countDistinct
mean
sumDistinct

For example, using countDistinct results in an error saying:

Exception in thread "main" org.apache.spark.sql.AnalysisException: undefined function cosh;

I had a similar issue with the cosh operator
<http://apache-spark-developers-list.1001551.n3.nabble.com/Exception-when-using-cosh-td14724.html>
some time back, and it turned out that it was not registered in the registry:
https://github.com/apache/spark/blob/master/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/analysis/FunctionRegistry.scala

I think it is the same issue again and would be glad to send over a PR if
someone can confirm that this is an actual bug and not some mistake on my
part.

Query I am using: SELECT countDistinct(`age`) as `data` FROM `table`
Spark Version: 10.4
SparkSql Version: 1.5.1

I am using the standard example of a (name, age) schema (though I am
setting age as Double, not Int, as I am trying out math functions).

The entire error stack can be found here <http://pastebin.com/G6YzQXnn>.

Thanks!
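[Editor's note: for reference, here is what the corrected query from upthread, select count(distinct age) and avg(age), actually computes. This is a plain-Scala sketch with no Spark required; the Person case class and the sample rows are invented for illustration.]

```scala
// Plain-Scala illustration of the semantics of
//   SELECT count(distinct age), avg(age) FROM table
// Sample data is made up for the example.
case class Person(name: String, age: Double)

val table = Seq(Person("a", 30.0), Person("b", 30.0), Person("c", 40.0))

// count(distinct age): number of unique age values, not number of rows.
val countDistinctAge = table.map(_.age).distinct.size // 2

// avg(age): the SQL-standard name for what the DataFrame API calls mean.
val avgAge = table.map(_.age).sum / table.size // 100.0 / 3, about 33.33

println(countDistinctAge)
println(avgAge)
```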