Re: countByValue on dataframe with multiple columns

Olivier Girardot Tue, 21 Jul 2015 14:16:42 -0700

Yes, and freqItems does not give you an ordered count (right?). The
threshold also makes it difficult to calibrate, and we noticed some strange
behaviour when testing it on small datasets.
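
To make the difference concrete, here is a minimal sketch of an exact, ordered count on one column next to a freqItems call. It assumes Spark 2.x or later for SparkSession (on 1.4 the same DataFrame calls should work from a sqlContext); the names OrderedCounts and orderedCounts, the sample data and the 0.4 support value are illustrative assumptions, not anything from the Spark code base.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.desc

object OrderedCounts {
  // Hypothetical helper (not part of the Spark API): exact count per
  // distinct value of one column, most frequent value first.
  def orderedCounts(df: DataFrame, col: String): DataFrame =
    df.groupBy(col).count().orderBy(desc("count"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("ordered-counts")
      .getOrCreate()
    import spark.implicits._

    // Illustrative sample data only.
    val df = Seq("A", "B", "B", "A", "B").toDF("n")

    // Exact counts: one row per distinct value, sorted by descending count.
    orderedCounts(df, "n").show()

    // freqItems: a single row holding an array of candidate frequent items
    // for the column, without counts; the support threshold (0.4 here) has
    // to be chosen up front.
    df.stat.freqItems(Seq("n"), 0.4).show()

    spark.stop()
  }
}
```

The groupBy version costs a full shuffle per column but returns true counts, which is the trade-off against the approximate freqItems approach.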


2015-07-21 20:30 GMT+02:00 Ted Malaska <ted.mala...@cloudera.com>:

> Look at the implementation for frequent items.  It is different from a
> true count.
> On Jul 21, 2015 1:19 PM, "Reynold Xin" <r...@databricks.com> wrote:
>
>> Is this just frequent items?
>>
>>
>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97
>>
>>
>>
>> On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska <ted.mala...@cloudera.com>
>> wrote:
>>
>>> 100% I would love to do it.  Who is a good person to review the design
>>> with?  All I need is a quick chat about the design and approach and I'll
>>> create the jira and push a patch.
>>>
>>> Ted Malaska
>>>
>>> On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot <
>>> o.girar...@lateral-thoughts.com> wrote:
>>>
>>>> Hi Ted,
>>>> The TopNList would be great to see directly in the DataFrame API, and my
>>>> wish would be to be able to apply it on multiple columns at the same time
>>>> and get all these statistics.
>>>> The .describe() function is close to what we want to achieve; maybe we
>>>> could try to enrich its output.
>>>> Anyway, even as a spark-package, if you could package your code for
>>>> DataFrames, that would be great.
>>>>
>>>> Regards,
>>>>
>>>> Olivier.
>>>>
>>>> 2015-07-21 15:08 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:
>>>>
>>>>> Ha ok !
>>>>>
>>>>> Then the generic part would have this signature:
>>>>>
>>>>> def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
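
One possible body for that signature, as a minimal sketch: it runs an independent groupBy per column, so each value is counted within its own column rather than as part of a row-wide tuple. The enclosing object name CountByValue and the descending sort are assumptions added for illustration; nothing in the thread specifies them.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

object CountByValue {
  // Sketch only: one grouped count per column, keyed by column name.
  // Each resulting DataFrame has two columns: the original column and "count".
  def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame] =
    df.columns.map { c =>
      // Sorting by descending count is an added assumption, not a requirement.
      c -> df.groupBy(c).count().orderBy(desc("count"))
    }.toMap
}
```

Usage would look like `CountByValue.countColsByValue(df).foreach { case (name, counts) => counts.show() }`. Note that each entry is evaluated as a separate job over the input, so caching df first is probably worthwhile on wide tables.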
>>>>>
>>>>>
>>>>> +1 for more work (blog / api) for data quality checks.
>>>>>
>>>>> Cheers,
>>>>> Jonathan
>>>>>
>>>>>
>>>>> TopCMSParams and some other monoids from Algebird are really cool for
>>>>> that :
>>>>>
>>>>> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590
>>>>>
>>>>>
>>>>> On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com>
>>>>> wrote:
>>>>>
>>>>>> I'm guessing you want something like what I put in this blog post.
>>>>>>
>>>>>>
>>>>>> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
>>>>>>
>>>>>> This is a very common use case.  If there is a +1 I would love to add
>>>>>> it to dataframes.
>>>>>>
>>>>>> Let me know
>>>>>> Ted Malaska
>>>>>>
>>>>>> On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot <
>>>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>>>
>>>>>>> Yop,
>>>>>>> actually the generic part does not work: the countByValue on one
>>>>>>> column gives you the count for each value seen in the column.
>>>>>>> I would like a generic (multi-column) countByValue to give me the
>>>>>>> same kind of output for each column, not considering each n-tuple of
>>>>>>> column values as the key (which is what the groupBy is doing by default).
>>>>>>>
>>>>>>> Regards,
>>>>>>>
>>>>>>> Olivier
>>>>>>>
>>>>>>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <
>>>>>>> jonathan.wina...@gmail.com>:
>>>>>>>
>>>>>>>> Ahoy !
>>>>>>>>
>>>>>>>> Maybe you can get countByValue by using sql.GroupedData :
>>>>>>>>
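>>>>>>>> import org.apache.spark.sql.{DataFrame, Row}
>>>>>>>> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>>>>>>>>
>>>>>>>> // some DF
>>>>>>>> val df: DataFrame = sqlContext.createDataFrame(
>>>>>>>>   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
>>>>>>>>   StructType(List(StructField("n", StringType))))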
>>>>>>>>
>>>>>>>>
>>>>>>>> df.groupBy("n").count().show()
>>>>>>>>
>>>>>>>>
>>>>>>>> // generic
>>>>>>>> def countByValueDf(df:DataFrame) = {
>>>>>>>>
>>>>>>>>   val (h :: r) = df.columns.toList
>>>>>>>>
>>>>>>>>   df.groupBy(h, r:_*).count()
>>>>>>>> }
>>>>>>>>
>>>>>>>> countByValueDf(df).show()
>>>>>>>>
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>> Jon
>>>>>>>>
>>>>>>>> On 20 July 2015 at 11:28, Olivier Girardot <
>>>>>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>> Is there any plan to add the countByValue function to Spark SQL
>>>>>>>>> Dataframe ?
>>>>>>>>> Even
>>>>>>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>>>>>>>> is using the RDD part right now, but for ML purposes, being able to
>>>>>>>>> get the most frequent categorical value on multiple columns would be
>>>>>>>>> very useful.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Regards,
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> *Olivier Girardot* | Associé
>>>>>>>>> o.girar...@lateral-thoughts.com
>>>>>>>>> +33 6 24 09 17 94
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> *Olivier Girardot* | Associé
>>>>>>> o.girar...@lateral-thoughts.com
>>>>>>> +33 6 24 09 17 94
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> *Olivier Girardot* | Associé
>>>> o.girar...@lateral-thoughts.com
>>>> +33 6 24 09 17 94
>>>>
>>>
>>>
>>


-- 
*Olivier Girardot* | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
