Nice reactions. My comments:

@Ted.Yu: I see now that count(*) does what I want.

@Takeshi: I understand this is the syntax, but it was not clear to me what the $"b" column would be used for.
My line of thinking was this: I started with

1) someDF.groupBy("colA").count()

and then I realized I needed an average of colB per group, so I tried

2) someDF.groupBy("colA").agg(avg("colB"), count())

but it failed because count needs an argument. I understand the situation now. Thank you both for the clarification!

However, with future generations in mind :) I still want to poke around:

- The usages of count in 1) and 2) are still a bit inconsistent to me. If 2) works this way, why is there no column argument in 1)?
- I would expect a glimpse of all of this to be in the scaladoc for the methods. The difference between their scaladoc strings is hard to catch:
  - usage 1), in org.apache.spark.sql.GroupedData: "Count the number of *rows* for each group..."
  - usage 2), in org.apache.spark.sql.functions: "...returns the number of *items* in a group..."

Thanks

On Wed, Jun 22, 2016 at 6:31 PM, Takeshi Yamamuro <linguin....@gmail.com> wrote:

> Hi,
>
> An argument for `functions.count` is needed for per-column counting:
> df.groupBy($"a").agg(count($"b"))
>
> // maropu
>
> On Thu, Jun 23, 2016 at 1:27 AM, Ted Yu <yuzhih...@gmail.com> wrote:
>
>> See the first example in:
>>
>> http://www.w3schools.com/sql/sql_func_count.asp
>>
>> On Wed, Jun 22, 2016 at 9:21 AM, Jakub Dubovsky <
>> spark.dubovsky.ja...@gmail.com> wrote:
>>
>>> Hey Ted,
>>>
>>> thanks for reacting.
>>>
>>> I am referring to both of them. They both take a column as a parameter
>>> regardless of its type. My intuition is that count should take no
>>> parameter. Or am I missing something?
>>>
>>> Jakub
>>>
>>> On Wed, Jun 22, 2016 at 6:19 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>>>
>>>> Are you referring to the following method in
>>>> sql/core/src/main/scala/org/apache/spark/sql/functions.scala:
>>>>
>>>> def count(e: Column): Column = withAggregateFunction {
>>>>
>>>> Did you notice this method?
>>>>
>>>> def count(columnName: String): TypedColumn[Any, Long] =
>>>>
>>>> On Wed, Jun 22, 2016 at 9:06 AM, Jakub Dubovsky <
>>>> spark.dubovsky.ja...@gmail.com> wrote:
>>>>
>>>>> Hey sparkers,
>>>>>
>>>>> the aggregate function *count* in the *org.apache.spark.sql.functions*
>>>>> package takes a *column* as an argument. Is this needed for
>>>>> something? I find it confusing that I need to supply a column there. It
>>>>> feels like it might be a distinct count or something. This can be seen
>>>>> in the latest documentation
>>>>> <http://spark.apache.org/docs/latest/api/scala/index.html#org.apache.spark.sql.functions$>.
>>>>>
>>>>> I am considering filing this in the Spark bug tracker. Any opinions on
>>>>> this?
>>>>>
>>>>> Thanks
>>>>>
>>>>> Jakub
>>>
>>
>
> --
> ---
> Takeshi Yamamuro
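P.S. For future readers, a minimal sketch of the difference discussed above. This is untested and assumes a Scala shell or application with a SparkSession named `spark` and `import spark.implicits._` in scope (for `toDF`); the data is hypothetical:

```scala
import org.apache.spark.sql.functions.{avg, count}

// Hypothetical example data; note the null in colB for group "x".
val someDF = Seq(("x", Some(1)), ("x", None), ("y", Some(3)))
  .toDF("colA", "colB")

// Usage 1) GroupedData.count: adds a single "count" column holding the
// number of *rows* per group, nulls included. No column argument, and it
// cannot be combined with other aggregates.
someDF.groupBy("colA").count().show()

// Usage 2) functions.count(col): counts the non-null *items* of that
// column per group, and composes with other aggregates inside agg().
someDF.groupBy("colA").agg(avg("colB"), count("colB")).show()

// For a plain per-group row count inside agg(), count("*") matches
// usage 1), as Ted's SQL link suggests.
someDF.groupBy("colA").agg(count("*")).show()
```

So the "rows" vs. "items" wording in the two scaladocs reflects a real semantic difference: only the `functions.count(column)` form skips nulls.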