Re: countByValue on dataframe with multiple columns

Ted Malaska Tue, 21 Jul 2015 16:54:06 -0700

Cool, I will create a JIRA after I check in to my hotel, and I'll try to
get a patch up early next week.
On Jul 21, 2015 5:15 PM, "Olivier Girardot" <o.girar...@lateral-thoughts.com>
wrote:


> Yes, and freqItems does not give you an ordered count (right?). The
> threshold also makes it difficult to calibrate, and we noticed some
> strange behaviour when testing it on small datasets.
>
> 2015-07-21 20:30 GMT+02:00 Ted Malaska <ted.mala...@cloudera.com>:
>
>> Look at the implementation of frequent items.  It is different from a
>> true count.
>> On Jul 21, 2015 1:19 PM, "Reynold Xin" <r...@databricks.com> wrote:
>>
>>> Is this just frequent items?
>>>
>>>
>>> https://github.com/apache/spark/blob/master/sql/core/src/main/scala/org/apache/spark/sql/DataFrameStatFunctions.scala#L97
>>>
>>>
>>>
>>> On Tue, Jul 21, 2015 at 7:39 AM, Ted Malaska <ted.mala...@cloudera.com>
>>> wrote:
>>>
>>>> 100% I would love to do it.  Who is a good person to review the design
>>>> with?  All I need is a quick chat about the design and approach, and I'll
>>>> create the JIRA and push a patch.
>>>>
>>>> Ted Malaska
>>>>
>>>> On Tue, Jul 21, 2015 at 10:19 AM, Olivier Girardot <
>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>
>>>>> Hi Ted,
>>>>> The TopNList would be great to see directly in the DataFrame API, and
>>>>> my wish would be to be able to apply it to multiple columns at the same
>>>>> time and get all these statistics.
>>>>> The .describe() function is close to what we want to achieve; maybe we
>>>>> could try to enrich its output.
>>>>> Anyway, even as a spark-package, if you could package your code for
>>>>> DataFrames, that would be great.
>>>>>
>>>>> Regards,
>>>>>
>>>>> Olivier.
>>>>>
>>>>> 2015-07-21 15:08 GMT+02:00 Jonathan Winandy <
>>>>> jonathan.wina...@gmail.com>:
>>>>>
>>>>>> Ha ok !
>>>>>>
>>>>>> Then the generic part would have this signature:
>>>>>>
>>>>>> def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
>>>>>>
>>>>>>
>>>>>> +1 for more work (blog / api) for data quality checks.
>>>>>>
>>>>>> Cheers,
>>>>>> Jonathan
>>>>>>
>>>>>>
>>>>>> TopCMSParams and some other monoids from Algebird are really cool for
>>>>>> that :
>>>>>>
>>>>>> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590
>>>>>>
>>>>>>
>>>>>> On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com>
>>>>>> wrote:
>>>>>>
>>>>>>> I'm guessing you want something like what I put in this blog post.
>>>>>>>
>>>>>>>
>>>>>>> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
>>>>>>>
>>>>>>> This is a very common use case.  If there is a +1 I would love to
>>>>>>> add it to dataframes.
>>>>>>>
>>>>>>> Let me know
>>>>>>> Ted Malaska
>>>>>>>
>>>>>>> On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot <
>>>>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>>>>
>>>>>>>> Yop,
>>>>>>>> Actually, the generic part does not work: countByValue on one
>>>>>>>> column gives you the count for each value seen in that column.
>>>>>>>> I would like a generic (multi-column) countByValue to give me the
>>>>>>>> same kind of output for each column, rather than treating each
>>>>>>>> n-tuple of column values as the key (which is what groupBy does
>>>>>>>> by default).
>>>>>>>>
>>>>>>>> Regards,
>>>>>>>>
>>>>>>>> Olivier
>>>>>>>>
>>>>>>>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <
>>>>>>>> jonathan.wina...@gmail.com>:
>>>>>>>>
>>>>>>>>> Ahoy !
>>>>>>>>>
>>>>>>>>> Maybe you can get countByValue by using sql.GroupedData :
>>>>>>>>>
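>>>>>>>>> import org.apache.spark.sql.{DataFrame, Row}
>>>>>>>>> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>>>>>>>>>
>>>>>>>>> // some DF
>>>>>>>>> val df: DataFrame = sqlContext.createDataFrame(
>>>>>>>>>   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
>>>>>>>>>   StructType(List(StructField("n", StringType))))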
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> df.groupBy("n").count().show()
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> // generic
>>>>>>>>> def countByValueDf(df:DataFrame) = {
>>>>>>>>>
>>>>>>>>>   val (h :: r) = df.columns.toList
>>>>>>>>>
>>>>>>>>>   df.groupBy(h, r:_*).count()
>>>>>>>>> }
>>>>>>>>>
>>>>>>>>> countByValueDf(df).show()
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Cheers,
>>>>>>>>> Jon
>>>>>>>>>
>>>>>>>>> On 20 July 2015 at 11:28, Olivier Girardot <
>>>>>>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>>>>>>
>>>>>>>>>> Hi,
>>>>>>>>>> Is there any plan to add the countByValue function to the Spark SQL
>>>>>>>>>> DataFrame?
>>>>>>>>>> Even
>>>>>>>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>>>>>>>>> is using the RDD part right now, but for ML purposes, being able to
>>>>>>>>>> get the most frequent categorical value on multiple columns would be
>>>>>>>>>> very useful.
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> Regards,
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>> --
>>>>>>>>>> *Olivier Girardot* | Associé
>>>>>>>>>> o.girar...@lateral-thoughts.com
>>>>>>>>>> +33 6 24 09 17 94
>>>>>>>>>>
>>>>>>>>>
>>>>>>>>>
>>>>>>>>
>>>>>>>>
>>>>>>>> --
>>>>>>>> *Olivier Girardot* | Associé
>>>>>>>> o.girar...@lateral-thoughts.com
>>>>>>>> +33 6 24 09 17 94
>>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> *Olivier Girardot* | Associé
>>>>> o.girar...@lateral-thoughts.com
>>>>> +33 6 24 09 17 94
>>>>>
>>>>
>>>>
>>>
>
>
> --
> *Olivier Girardot* | Associé
> o.girar...@lateral-thoughts.com
> +33 6 24 09 17 94
>
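
For anyone reading this thread later: the per-column behaviour asked for above (one value/count table per column, instead of groupBy keyed on the whole row) can be sketched roughly as below. This is only an illustration, not the patch discussed in the thread. It follows the countColsByValue signature proposed earlier; topNByCol and the PerColumnCounts object are hypothetical names, and the demo uses SparkSession, which assumes Spark 2.0 or later rather than the sqlContext used in the original messages.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.desc

object PerColumnCounts {

  // One exact (value, count) DataFrame per column, keyed by column name.
  // Each entry is a separate groupBy, so N columns means N jobs when evaluated.
  def countColsByValue(df: DataFrame): Map[String, DataFrame] =
    df.columns.map(c => c -> df.groupBy(c).count()).toMap

  // Hypothetical helper: keep only the n most frequent values of each column.
  def topNByCol(df: DataFrame, n: Int): Map[String, DataFrame] =
    countColsByValue(df).map { case (c, counts) =>
      c -> counts.orderBy(desc("count")).limit(n)
    }

  def main(args: Array[String]): Unit = {
    // Assumption: a local session for demonstration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("per-column-counts")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(("A", "x"), ("B", "x"), ("B", "y"), ("A", "x")).toDF("n", "m")

    topNByCol(df, 2).foreach { case (c, counts) =>
      println(s"column $c")
      counts.show()
    }
    spark.stop()
  }
}
```

Unlike freqItems, these counts are exact and can be ordered, at the cost of one aggregation per column. A single-pass version would need a different approach, such as the sketch-based monoids mentioned earlier in the thread.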
