Nice reactions. My comments:

@Ted.Yu: I see now that count(*) does what I want.

@Takeshi: I understand this is the syntax, but it was not clear to me what the $"b" column would be used for.
My line of thinking was this: I started with

1) someDF.groupBy("colA").count()

and then I realized I needed an average of colB per group, so I tried

2) someDF.groupBy("colA").agg(avg("colB"), count())

but it failed because count needs an argument. I understand the situation now. Thank you both for the clarification!

However, with future generations in mind :) I still want to poke around:

- The usages of count in 1) and 2) are still a bit inconsistent to me. If 2) works this way, why is there no column argument in 1)?
- I would expect a glimpse of all of this to be in the scaladoc for the methods. The difference between their scaladoc strings is hard to catch:
  - usage 1), in org.apache.spark.sql.GroupedData: "Count the number of *rows* for each group..."
  - usage 2), in org.apache.spark.sql.functions: "...returns the number of *items* in a group..."

Thanks

On Wed, Jun 22, 2016 at 6:31 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:

> Hi,
>
> An argument for `functions.count` is needed for per-column counting:
> df.groupBy($"a").agg(count($"b"))
>
> // maropu
>
> On Thu, Jun 23, 2016 at 1:27 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> See the first example in:
>>
>> http://www.w3schools.com/sql/sql_func_count.asp
>>
>> On Wed, Jun 22, 2016 at 9:21 AM, Jakub Dubovsky <
>> spark.dubovsky.ja...@gmail.com> wrote:
>>
>>> Hey Ted,
>>>
>>> thanks for reacting.
>>>
>>> I am referring to both of them. They both take a column as a parameter
>>> regardless of its type. My intuition is that count should take no
>>> parameter. Or am I missing something?
>>>
>>> Jakub
>>>
>>> On Wed, Jun 22, 2016 at 6:19 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Are you referring to the following method in
>>>> sql/core/src/main/scala/org/apache/spark/sql/functions.scala:
>>>>
>>>> def count(e: Column): Column = withAggregateFunction {
>>>>
>>>> Did you notice this method?
>>>>
>>>> def count(columnName: String): TypedColumn[Any, Long] =
>>>>
>>>> On Wed, Jun 22, 2016 at 9:06 AM, Jakub Dubovsky <
>>>> spark.dubovsky.ja...@gmail.com> wrote:
>>>>
>>>>> Hey sparkers,
>>>>>
>>>>> the aggregate function *count* in the *org.apache.spark.sql.functions*
>>>>> package takes a *column* as an argument. Is this needed for
>>>>> something? I find it confusing that I need to supply a column there. It
>>>>> feels like it might be a distinct count or something. This can be seen
>>>>> in the latest documentation
>>>>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$>.
>>>>>
>>>>> I am considering filing this in the Spark bug tracker. Any opinions on
>>>>> this?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Jakub
>>>
>>
>
> --
> ---
> Takeshi Yamamuro
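P.S. For future readers, a minimal sketch of the difference discussed above. This is untested and assumes a Scala shell or application with a SparkSession named `spark` and `import spark.implicits._` in scope (for `toDF`); the data is hypothetical:

```scala
import org.apache.spark.sql.functions.{avg, count}

// Hypothetical example data; note the null in colB for group "x".
val someDF = Seq(("x", Some(1)), ("x", None), ("y", Some(3)))
  .toDF("colA", "colB")

// Usage 1) GroupedData.count: adds a single "count" column holding the
// number of *rows* per group, nulls included. No column argument, and it
// cannot be combined with other aggregates.
someDF.groupBy("colA").count().show()

// Usage 2) functions.count(col): counts the non-null *items* of that
// column per group, and composes with other aggregates inside agg().
someDF.groupBy("colA").agg(avg("colB"), count("colB")).show()

// For a plain per-group row count inside agg(), count("*") matches
// usage 1), as Ted's SQL link suggests.
someDF.groupBy("colA").agg(count("*")).show()
```

So the "rows" vs. "items" wording in the two scaladocs reflects a real semantic difference: only the `functions.count(column)` form skips nulls.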