100%, I would love to do it. Who would be a good person to review the design with? All I need is a quick chat about the design and approach, and then I'll create the JIRA and push a patch.
Ted Malaska

On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

> Hi Ted,
> The TopNList would be great to see directly in the DataFrame API, and my
> wish would be to be able to apply it on multiple columns at the same time
> and get all these statistics.
> The .describe() function is close to what we want to achieve; maybe we
> could try to enrich its output.
> Anyway, even as a spark-package, if you could package your code for
> DataFrames, that would be great.
>
> Regards,
>
> Olivier.
>
> 2015-07-21 15:08 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:
>
>> Ha, OK!
>>
>> Then the generic part would have this signature:
>>
>> def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
>>
>> +1 for more work (blog / API) on data quality checks.
>>
>> Cheers,
>> Jonathan
>>
>> TopCMSParams and some other monoids from Algebird are really cool for
>> that:
>> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590
>>
>> On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com> wrote:
>>
>>> I'm guessing you want something like what I put in this blog post:
>>>
>>> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
>>>
>>> This is a very common use case. If there is a +1, I would love to add it
>>> to DataFrames.
>>>
>>> Let me know,
>>> Ted Malaska
>>>
>>> On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot
>>> <o.girar...@lateral-thoughts.com> wrote:
>>>
>>>> Yop,
>>>> Actually the generic part does not work: countByValue on one column
>>>> gives you the count for each value seen in that column. I would like a
>>>> generic (multi-column) countByValue to give me the same kind of output
>>>> for each column, not considering each n-tuple of column values as the
>>>> key (which is what the groupBy is doing by default).
>>>>
>>>> Regards,
>>>>
>>>> Olivier
>>>>
>>>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:
>>>>
>>>>> Ahoy!
>>>>>
>>>>> Maybe you can get countByValue by using sql.GroupedData:
>>>>>
>>>>> // some DF
>>>>> val df: DataFrame = sqlContext.createDataFrame(
>>>>>   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
>>>>>   StructType(List(StructField("n", StringType))))
>>>>>
>>>>> df.groupBy("n").count().show()
>>>>>
>>>>> // generic
>>>>> def countByValueDf(df: DataFrame) = {
>>>>>   val (h :: r) = df.columns.toList
>>>>>   df.groupBy(h, r: _*).count()
>>>>> }
>>>>>
>>>>> countByValueDf(df).show()
>>>>>
>>>>> Cheers,
>>>>> Jon
>>>>>
>>>>> On 20 July 2015 at 11:28, Olivier Girardot
>>>>> <o.girar...@lateral-thoughts.com> wrote:
>>>>>
>>>>>> Hi,
>>>>>> Is there any plan to add the countByValue function to the Spark SQL
>>>>>> DataFrame API? Even
>>>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>>>>> is using the RDD API right now, but for ML purposes, being able to
>>>>>> get the most frequent categorical value on multiple columns would be
>>>>>> very useful.
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> --
>>>>>> *Olivier Girardot* | Associé
>>>>>> o.girar...@lateral-thoughts.com
>>>>>> +33 6 24 09 17 94
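For readers of the archive: the distinction Olivier is drawing (per-column value counts vs. counts keyed by the tuple of all column values) can be sketched in plain Scala collections, without Spark. The names `rows`, `tupleCounts`, `col1Counts`, and `col2Counts` are illustrative, not an existing API:

```scala
// Two "columns" worth of data, as a list of rows.
val rows = List(("A", 1), ("B", 1), ("B", 2), ("A", 1))

// What countByValueDf above computes: counts keyed by the full tuple
// of column values (each distinct row is a key).
val tupleCounts: Map[(String, Int), Int] =
  rows.groupBy(identity).map { case (k, v) => k -> v.size }
// Map((A,1) -> 2, (B,1) -> 1, (B,2) -> 1)

// What a multi-column countByValue would return: one independent
// value -> count map per column.
val col1Counts: Map[String, Int] =
  rows.map(_._1).groupBy(identity).map { case (k, v) => k -> v.size }
// Map(A -> 2, B -> 2)

val col2Counts: Map[Int, Int] =
  rows.map(_._2).groupBy(identity).map { case (k, v) => k -> v.size }
// Map(1 -> 3, 2 -> 1)
```

In DataFrame terms, the per-column version amounts to running `df.groupBy(c).count()` once for each column `c` and collecting the results, which is the `Map[String, DataFrame]` shape of Jonathan's proposed `countColsByValue` signature.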