Cool, I will create a JIRA after I check in to my hotel, and I'll try to get a patch up early next week.

On Jul 21, 2015 5:15 PM, "Olivier Girardot" <o.girar...@lateral-thoughts.com> wrote:
> Yes, and freqItems does not give you an ordered count (right?), the
> threshold makes it difficult to calibrate, and we noticed some strange
> behaviour when testing it on small datasets.
>
> 2015-07-21 20:30 GMT+02:00 Ted Malaska <ted.mala...@cloudera.com>:
>
>> Look at the implementation for frequent items. It is different from a
>> true count.
>>
>> On Jul 21, 2015 1:19 PM, "Reynold Xin" <r...@databricks.com> wrote:
>>
>>> Is this just frequent items?
>>>
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97
>>>
>>> On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska <ted.mala...@cloudera.com>
>>> wrote:
>>>
>>>> 100% I would love to do it. Who is a good person to review the design
>>>> with? All I need is a quick chat about the design and approach and I'll
>>>> create the JIRA and push a patch.
>>>>
>>>> Ted Malaska
>>>>
>>>> On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot <
>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>
>>>>> Hi Ted,
>>>>> The TopNList would be great to see directly in the DataFrame API, and
>>>>> my wish would be to be able to apply it on multiple columns at the same
>>>>> time and get all these statistics.
>>>>> The .describe() function is close to what we want to achieve; maybe we
>>>>> could try to enrich its output.
>>>>> Anyway, even as a spark-package, if you could package your code for
>>>>> DataFrames, that would be great.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Olivier.
>>>>>
>>>>> 2015-07-21 15:08 GMT+02:00 Jonathan Winandy <
>>>>> jonathan.wina...@gmail.com>:
>>>>>
>>>>>> Ah, OK!
>>>>>>
>>>>>> Then the generic part would have this signature:
>>>>>>
>>>>>> def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
>>>>>>
>>>>>> +1 for more work (blog / API) on data quality checks.
>>>>>>
>>>>>> Cheers,
>>>>>> Jonathan
>>>>>>
>>>>>> TopCMSParams and some other monoids from Algebird are really cool for
>>>>>> that:
>>>>>>
>>>>>> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590
>>>>>>
>>>>>> On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm guessing you want something like what I put in this blog post:
>>>>>>>
>>>>>>> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
>>>>>>>
>>>>>>> This is a very common use case. If there is a +1 I would love to
>>>>>>> add it to DataFrames.
>>>>>>>
>>>>>>> Let me know
>>>>>>> Ted Malaska
>>>>>>>
>>>>>>> On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot <
>>>>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>>>>
>>>>>>>> Yop,
>>>>>>>> actually the generic part does not work: countByValue on one
>>>>>>>> column gives you the count for each value seen in that column.
>>>>>>>> I would like a generic (multi-column) countByValue to give me the
>>>>>>>> same kind of output for each column, without treating each n-tuple of
>>>>>>>> column values as the key (which is what groupBy does by default).
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Olivier
>>>>>>>>
>>>>>>>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <
>>>>>>>> jonathan.wina...@gmail.com>:
>>>>>>>>
>>>>>>>>> Ahoy!
>>>>>>>>>
>>>>>>>>> Maybe you can get countByValue by using sql.GroupedData:
>>>>>>>>>
>>>>>>>>> import org.apache.spark.sql.{DataFrame, Row}
>>>>>>>>> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>>>>>>>>>
>>>>>>>>> // some DF: a single string column "n"
>>>>>>>>> val df: DataFrame = sqlContext.createDataFrame(
>>>>>>>>>   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
>>>>>>>>>   StructType(List(StructField("n", StringType))))
>>>>>>>>>
>>>>>>>>> // count per value of column "n"
>>>>>>>>> df.groupBy("n").count().show()
>>>>>>>>>
>>>>>>>>> // generic: groups by all columns at once
>>>>>>>>> def countByValueDf(df: DataFrame) = {
>>>>>>>>>   val (h :: r) = df.columns.toList
>>>>>>>>>   df.groupBy(h, r: _*).count()
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> countByValueDf(df).show()
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Jon
>>>>>>>>>
>>>>>>>>> On 20 July 2015 at 11:28, Olivier Girardot <
>>>>>>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> Is there any plan to add the countByValue function to Spark SQL
>>>>>>>>>> DataFrames?
>>>>>>>>>> Even
>>>>>>>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>>>>>>>>> is using the RDD part right now, but for ML purposes, being able to
>>>>>>>>>> get the most frequent categorical value on multiple columns would be
>>>>>>>>>> very useful.
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> *Olivier Girardot* | Associé
>>>>>>>>>> o.girar...@lateral-thoughts.com
>>>>>>>>>> +33 6 24 09 17 94
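
For reference, the per-column behaviour requested in the thread (a countColsByValue that returns one count table per column, rather than one count per tuple of values) can be sketched roughly as below. This is only a minimal sketch and not the patch discussed above. It assumes Spark 2.x or later (SparkSession instead of the sqlContext used in the thread); the object name, the local master setting, and the sample data are made up for illustration.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.desc

// Hypothetical helper object; the name is an assumption, not a Spark API.
object CountColsByValue {

  // One groupBy per column, so each column gets its own (value, count)
  // DataFrame, ordered by descending count.
  def countColsByValue(df: DataFrame): Map[String, DataFrame] =
    df.columns.map { name =>
      name -> df.groupBy(name).count().orderBy(desc("count"))
    }.toMap

  def main(args: Array[String]): Unit = {
    // Local session for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("countColsByValue-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with two categorical columns.
    val df = Seq(("A", "x"), ("B", "x"), ("B", "y"), ("A", "x")).toDF("n", "m")

    countColsByValue(df).foreach { case (name, counts) =>
      println(s"column: $name")
      counts.show()
    }

    spark.stop()
  }
}
```

Note that this triggers a separate aggregation for every column when the results are materialized, so it may be slow on wide tables; an approximate sketch-based approach such as the Algebird one mentioned in the thread trades exactness for a single pass.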