Ha ok! Then the generic part would have this signature:

def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]

+1 for more work (blog / API) on data quality checks. TopCMSParams and some
other monoids from Algebird are really cool for that:
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590
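For the exact counts, a rough sketch of that signature (untested) — the idea
is simply one groupBy per column, so each column's values are counted
independently instead of grouping on the n-tuple of all columns:

import org.apache.spark.sql.DataFrame

def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame] =
  df.columns.map { c =>
    // each column gets its own aggregation: value -> count
    c -> df.groupBy(c).count()
  }.toMap

It fires one aggregation job per column though, so a df.cache() beforehand is
probably worth it.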
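And if approximate counts are enough, something like this could work with the
CMS monoids — totally hypothetical and untested, parameter values are just
for illustration, and it assumes algebird-core on the classpath with a
CMSHasher for the key type in scope:

import com.twitter.algebird.{TopPctCMS, TopCMS}

val cmsMonoid = TopPctCMS.monoid[String](
  0.001, // eps: relative error on counts
  1e-8,  // delta: probability the error bound is exceeded
  1,     // seed
  0.01)  // heavyHittersPct: track values in >= 1% of rows

// fold one column (here df is assumed to have a single string column "n")
// into a single sketch
val sketch: TopCMS[String] =
  df.rdd
    .map(row => cmsMonoid.create(row.getString(0)))
    .reduce(cmsMonoid.plus)

// approximate count for each heavy hitter
sketch.heavyHitters.foreach { v =>
  println(s"$v ~ ${sketch.frequency(v).estimate}")
}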
Cheers,
Jonathan

On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com> wrote:

> I'm guessing you want something like what I put in this blog post:
>
> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
>
> This is a very common use case. If there is a +1, I would love to add it
> to DataFrames.
>
> Let me know,
> Ted Malaska
>
> On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot
> <o.girar...@lateral-thoughts.com> wrote:
>
>> Yop,
>> actually the generic part does not work: countByValue on one column
>> gives you the count for each value seen in that column. I would like a
>> generic (multi-column) countByValue that gives the same kind of output
>> for each column separately, instead of treating each n-tuple of column
>> values as the key (which is what groupBy does by default).
>>
>> Regards,
>>
>> Olivier
>>
>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:
>>
>>> Ahoy!
>>>
>>> Maybe you can get countByValue by using sql.GroupedData:
>>>
>>> // some DF
>>> val df: DataFrame = sqlContext.createDataFrame(
>>>   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
>>>   StructType(List(StructField("n", StringType))))
>>>
>>> df.groupBy("n").count().show()
>>>
>>> // generic
>>> def countByValueDf(df: DataFrame) = {
>>>   val (h :: r) = df.columns.toList
>>>   df.groupBy(h, r: _*).count()
>>> }
>>>
>>> countByValueDf(df).show()
>>>
>>> Cheers,
>>> Jon
>>>
>>> On 20 July 2015 at 11:28, Olivier Girardot
>>> <o.girar...@lateral-thoughts.com> wrote:
>>>
>>>> Hi,
>>>> Is there any plan to add a countByValue function to Spark SQL
>>>> DataFrames? Even
>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>>> is using the RDD API right now, but for ML purposes, being able to get
>>>> the most frequent categorical value on multiple columns would be very
>>>> useful.
>>>>
>>>> Regards,
>>>>
>>>> --
>>>> *Olivier Girardot* | Associé
>>>> o.girar...@lateral-thoughts.com
>>>> +33 6 24 09 17 94
>>
>> --
>> *Olivier Girardot* | Associé
>> o.girar...@lateral-thoughts.com
>> +33 6 24 09 17 94