Ha ok! Then the generic part would have this signature:

def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]

+1 for more work (blog / API) on data quality checks. TopCMSParams and some
other monoids from Algebird are really cool for that:
https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590
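For the exact counts, a rough sketch of that signature (untested) — the idea
is simply one groupBy per column, so each column's values are counted
independently instead of grouping on the n-tuple of all columns:

import org.apache.spark.sql.DataFrame

def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame] =
  df.columns.map { c =>
    // each column gets its own aggregation: value -> count
    c -> df.groupBy(c).count()
  }.toMap

It fires one aggregation job per column though, so a df.cache() beforehand is
probably worth it.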
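And if approximate counts are enough, something like this could work with the
CMS monoids — totally hypothetical and untested, parameter values are just
for illustration, and it assumes algebird-core on the classpath with a
CMSHasher for the key type in scope:

import com.twitter.algebird.{TopPctCMS, TopCMS}

val cmsMonoid = TopPctCMS.monoid[String](
  0.001, // eps: relative error on counts
  1e-8,  // delta: probability the error bound is exceeded
  1,     // seed
  0.01)  // heavyHittersPct: track values in >= 1% of rows

// fold one column (here df is assumed to have a single string column "n")
// into a single sketch
val sketch: TopCMS[String] =
  df.rdd
    .map(row => cmsMonoid.create(row.getString(0)))
    .reduce(cmsMonoid.plus)

// approximate count for each heavy hitter
sketch.heavyHitters.foreach { v =>
  println(s"$v ~ ${sketch.frequency(v).estimate}")
}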
Cheers,
Jonathan

On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com> wrote:

> I'm guessing you want something like what I put in this blog post:
>
> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
>
> This is a very common use case. If there is a +1, I would love to add it
> to DataFrames.
>
> Let me know,
> Ted Malaska
>
> On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot
> <o.girar...@lateral-thoughts.com> wrote:
>
>> Yop,
>> actually the generic part does not work: countByValue on one column
>> gives you the count for each value seen in that column. I would like a
>> generic (multi-column) countByValue that gives the same kind of output
>> for each column separately, instead of treating each n-tuple of column
>> values as the key (which is what groupBy does by default).
>>
>> Regards,
>>
>> Olivier
>>
>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:
>>
>>> Ahoy!
>>>
>>> Maybe you can get countByValue by using sql.GroupedData:
>>>
>>> // some DF
>>> val df: DataFrame = sqlContext.createDataFrame(
>>>   sc.parallelize(List("A", "B", "B", "A")).map(Row.apply(_)),
>>>   StructType(List(StructField("n", StringType))))
>>>
>>> df.groupBy("n").count().show()
>>>
>>> // generic
>>> def countByValueDf(df: DataFrame) = {
>>>   val (h :: r) = df.columns.toList
>>>   df.groupBy(h, r: _*).count()
>>> }
>>>
>>> countByValueDf(df).show()
>>>
>>> Cheers,
>>> Jon
>>>
>>> On 20 July 2015 at 11:28, Olivier Girardot
>>> <o.girar...@lateral-thoughts.com> wrote:
>>>
>>>> Hi,
>>>> Is there any plan to add a countByValue function to Spark SQL
>>>> DataFrames? Even
>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>>> is using the RDD API right now, but for ML purposes, being able to get
>>>> the most frequent categorical value on multiple columns would be very
>>>> useful.
>>>>
>>>> Regards,
>>>>
>>>> --
>>>> *Olivier Girardot* | Associé
>>>> o.girar...@lateral-thoughts.com
>>>> +33 6 24 09 17 94
>>
>> --
>> *Olivier Girardot* | Associé
>> o.girar...@lateral-thoughts.com
>> +33 6 24 09 17 94