Yes, and freqItems does not give you an ordered count (right?). The threshold also makes it difficult to calibrate, and we noticed some strange behaviour when testing it on small datasets.
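For comparison, an exact and ordered count on a single column can be had with a plain groupBy followed by a sort. This is only a minimal sketch, assuming the standard DataFrame API (`groupBy`, `count`, `orderBy`, `limit`); `OrderedCounts` and `topN` are hypothetical names used for illustration, not part of Spark:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, desc}

object OrderedCounts {
  // Hypothetical helper (not a Spark API): exact per-value counts for one
  // column, most frequent value first, truncated to the n largest counts.
  def topN(df: DataFrame, column: String, n: Int): DataFrame =
    df.groupBy(col(column))     // one group per distinct value
      .count()                  // adds a Long column named "count"
      .orderBy(desc("count"))   // highest count first; ties in no fixed order
      .limit(n)
}
```

Something like `OrderedCounts.topN(df, "n", 10).show()` would then print the ten most frequent values of column `n`. Unlike freqItems this is an exact aggregation with no support threshold to tune, at the cost of a full shuffle.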
2015-07-21 20:30 GMT+02:00 Ted Malaska <ted.mala...@cloudera.com>:

> Look at the implementation for frequent items. It is different from a
> true count.
> On Jul 21, 2015 1:19 PM, "Reynold Xin" <r...@databricks.com> wrote:
>
>> Is this just frequent items?
>>
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97
>>
>> On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska <ted.mala...@cloudera.com>
>> wrote:
>>
>>> 100% I would love to do it. Who is a good person to review the design
>>> with? All I need is a quick chat about the design and approach and I'll
>>> create the JIRA and push a patch.
>>>
>>> Ted Malaska
>>>
>>> On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot <
>>> o.girar...@lateral-thoughts.com> wrote:
>>>
>>>> Hi Ted,
>>>> The TopNList would be great to see directly in the DataFrame API, and my
>>>> wish would be to be able to apply it on multiple columns at the same time
>>>> and get all these statistics.
>>>> The .describe() function is close to what we want to achieve, maybe we
>>>> could try to enrich its output.
>>>> Anyway, even as a spark-package, if you could package your code for
>>>> DataFrames, that would be great.
>>>>
>>>> Regards,
>>>>
>>>> Olivier.
>>>>
>>>> 2015-07-21 15:08 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:
>>>>
>>>>> Ha ok !
>>>>>
>>>>> Then the generic part would have this signature:
>>>>>
>>>>> def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
>>>>>
>>>>>
>>>>> +1 for more work (blog / api) for data quality checks.
>>>>>
>>>>> Cheers,
>>>>> Jonathan
>>>>>
>>>>>
>>>>> TopCMSParams and some other monoids from Algebird are really cool for
>>>>> that:
>>>>>
>>>>> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590
>>>>>
>>>>>
>>>>> On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> I'm guessing you want something like what I put in this blog post.
>>>>>>
>>>>>> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
>>>>>>
>>>>>> This is a very common use case. If there is a +1 I would love to add
>>>>>> it to DataFrames.
>>>>>>
>>>>>> Let me know
>>>>>> Ted Malaska
>>>>>>
>>>>>> On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot <
>>>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>>>
>>>>>>> Yop,
>>>>>>> actually the generic part does not work: the countByValue on one
>>>>>>> column gives you the count for each value seen in the column.
>>>>>>> I would like a generic (multi-column) countByValue to give me the
>>>>>>> same kind of output for each column, not considering each n-tuple of
>>>>>>> column values as the key (which is what the groupBy is doing by default).
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Olivier
>>>>>>>
>>>>>>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <
>>>>>>> jonathan.wina...@gmail.com>:
>>>>>>>
>>>>>>>> Ahoy !
>>>>>>>>
>>>>>>>> Maybe you can get countByValue by using sql.GroupedData :
>>>>>>>>
>>>>>>>> // some DF
>>>>>>>> val df: DataFrame =
>>>>>>>>   sqlContext.createDataFrame(sc.parallelize(List("A","B", "B",
>>>>>>>>   "A")).map(Row.apply(_)), StructType(List(StructField("n",
>>>>>>>>   StringType))))
>>>>>>>>
>>>>>>>> df.groupBy("n").count().show()
>>>>>>>>
>>>>>>>> // generic
>>>>>>>> def countByValueDf(df:DataFrame) = {
>>>>>>>>   val (h :: r) = df.columns.toList
>>>>>>>>   df.groupBy(h, r:_*).count()
>>>>>>>> }
>>>>>>>>
>>>>>>>> countByValueDf(df).show()
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Jon
>>>>>>>>
>>>>>>>> On 20 July 2015 at 11:28, Olivier Girardot <
>>>>>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> Is there any plan to add the countByValue function to Spark SQL
>>>>>>>>> Dataframe ?
>>>>>>>>> Even
>>>>>>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>>>>>>>> is using the RDD part right now, but for ML purposes, being able to
>>>>>>>>> get the
>>>>>>>>> most frequent categorical value on multiple columns would be very
>>>>>>>>> useful.
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Olivier Girardot* | Associé
>>>>>>>>> o.girar...@lateral-thoughts.com
>>>>>>>>> +33 6 24 09 17 94

--
*Olivier Girardot* | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
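
The per-column output asked for earlier in the thread (one count table per column, rather than one count per n-tuple of values) could be sketched along the lines of the `countColsByValue` signature proposed above. This is only an illustration, not an existing Spark API; the `ColumnValueCounts` object name is hypothetical, and the sketch assumes column names without dots or other characters that `col` would interpret specially:

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.{col, desc}

object ColumnValueCounts {
  // Illustrative only: builds one (value, count) DataFrame per column,
  // keyed by column name, each sorted with the most frequent value first.
  // Each column is grouped independently, so values are never combined
  // into tuples across columns.
  def countColsByValue(df: DataFrame): Map[String, DataFrame] =
    df.columns.map { name =>
      name -> df.groupBy(col(name)).count().orderBy(desc("count"))
    }.toMap
}
```

Calling `ColumnValueCounts.countColsByValue(df)("n").show()` would display the counts for column `n`. Note that every resulting DataFrame is a separate query, so the input is scanned once per column when the results are materialized; caching `df` first, or a sketch-based approach such as the Algebird one mentioned in the thread, may be preferable for wide tables.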