Yop, actually the generic version does not do what I want: countByValue on one column gives you the count for each value seen in that column. I would like a generic (multi-column) countByValue that produces the same kind of output for each column independently, rather than treating each n-tuple of column values as the key (which is what groupBy does by default).
Regards,

Olivier

2015-07-20 14:18 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:

> Ahoy !
>
> Maybe you can get countByValue by using sql.GroupedData :
>
> // some DF
> val df: DataFrame =
>   sqlContext.createDataFrame(
>     sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
>     StructType(List(StructField("n", StringType))))
>
> df.groupBy("n").count().show()
>
> // generic
> def countByValueDf(df: DataFrame) = {
>   val (h :: r) = df.columns.toList
>   df.groupBy(h, r: _*).count()
> }
>
> countByValueDf(df).show()
>
> Cheers,
> Jon
>
> On 20 July 2015 at 11:28, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>
>> Hi,
>> Is there any plan to add the countByValue function to Spark SQL DataFrame?
>> Even
>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>> is using the RDD part right now, but for ML purposes, being able to get the
>> most frequent categorical value on multiple columns would be very useful.
>>
>> Regards,
>>
>> --
>> *Olivier Girardot* | AssociƩ
>> o.girar...@lateral-thoughts.com
>> +33 6 24 09 17 94

--
*Olivier Girardot* | AssociƩ
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
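To make the distinction concrete: here is a minimal sketch, using plain Scala collections (no Spark) so it is self-contained, of the per-column semantics Olivier is asking for. Each column is counted independently, instead of grouping on the n-tuple of all column values. The object and method names are hypothetical, chosen for illustration only.

```scala
// Hypothetical sketch: per-column countByValue over in-memory rows.
// `rows` holds one value per column, in the order given by `columns`.
object PerColumnCountByValue {

  // Returns, for each column name, a map from value to its occurrence count.
  def countByValuePerColumn(
      columns: Seq[String],
      rows: Seq[Seq[String]]): Map[String, Map[String, Long]] =
    columns.zipWithIndex.map { case (name, i) =>
      // Project the i-th column out of every row, then count occurrences.
      name -> rows
        .map(_(i))
        .groupBy(identity)
        .map { case (value, occurrences) => value -> occurrences.size.toLong }
    }.toMap

  def main(args: Array[String]): Unit = {
    val cols = Seq("n", "m")
    val rows = Seq(
      Seq("A", "X"),
      Seq("B", "X"),
      Seq("B", "Y"),
      Seq("A", "X"))
    // Column "n" counts A and B separately; column "m" counts X and Y --
    // no ("A","X")-style tuple keys are involved.
    println(countByValuePerColumn(cols, rows))
  }
}
```

With the DataFrame API, this would presumably correspond to running `df.groupBy(c).count()` once per column `c` (one aggregation per column), rather than a single `df.groupBy(h, r: _*)` over all columns as in the quoted `countByValueDf`.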