Yop, actually the generic version does not do what I want: countByValue on one column gives you the count for each value seen in that column. I would like a generic (multi-column) countByValue that produces the same kind of output for each column independently, rather than treating each n-tuple of column values as the key (which is what groupBy does by default).
Regards,

Olivier

2015-07-20 14:18 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:

> Ahoy !
>
> Maybe you can get countByValue by using sql.GroupedData :
>
> // some DF
> val df: DataFrame =
>   sqlContext.createDataFrame(
>     sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
>     StructType(List(StructField("n", StringType))))
>
> df.groupBy("n").count().show()
>
> // generic
> def countByValueDf(df: DataFrame) = {
>   val (h :: r) = df.columns.toList
>   df.groupBy(h, r: _*).count()
> }
>
> countByValueDf(df).show()
>
> Cheers,
> Jon
>
> On 20 July 2015 at 11:28, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>
>> Hi,
>> Is there any plan to add the countByValue function to Spark SQL DataFrame?
>> Even
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>> is using the RDD part right now, but for ML purposes, being able to get the
>> most frequent categorical value on multiple columns would be very useful.
>>
>> Regards,
>>
>> --
>> *Olivier Girardot* | AssociƩ
>> o.girar...@lateral-thoughts.com
>> +33 6 24 09 17 94

--
*Olivier Girardot* | AssociƩ
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
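To make the distinction concrete: here is a minimal sketch, using plain Scala collections (no Spark) so it is self-contained, of the per-column semantics Olivier is asking for. Each column is counted independently, instead of grouping on the n-tuple of all column values. The object and method names are hypothetical, chosen for illustration only.

```scala
// Hypothetical sketch: per-column countByValue over in-memory rows.
// `rows` holds one value per column, in the order given by `columns`.
object PerColumnCountByValue {

  // Returns, for each column name, a map from value to its occurrence count.
  def countByValuePerColumn(
      columns: Seq[String],
      rows: Seq[Seq[String]]): Map[String, Map[String, Long]] =
    columns.zipWithIndex.map { case (name, i) =>
      // Project the i-th column out of every row, then count occurrences.
      name -> rows
        .map(_(i))
        .groupBy(identity)
        .map { case (value, occurrences) => value -> occurrences.size.toLong }
    }.toMap

  def main(args: Array[String]): Unit = {
    val cols = Seq("n", "m")
    val rows = Seq(
      Seq("A", "X"),
      Seq("B", "X"),
      Seq("B", "Y"),
      Seq("A", "X"))
    // Column "n" counts A and B separately; column "m" counts X and Y --
    // no ("A","X")-style tuple keys are involved.
    println(countByValuePerColumn(cols, rows))
  }
}
```

With the DataFrame API, this would presumably correspond to running `df.groupBy(c).count()` once per column `c` (one aggregation per column), rather than a single `df.groupBy(h, r: _*)` over all columns as in the quoted `countByValueDf`.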