Hi Ted,
The TopNList would be great to see directly in the DataFrame API, and my
wish would be to be able to apply it to multiple columns at the same time
and get all these statistics.
The .describe() function is close to what we want to achieve; maybe we
could try to enrich its output.
Anyway, even as a spark-package, if you could package your code for
DataFrames, that would be great.
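
Something like this is what I have in mind, as a rough sketch against the
current DataFrame API (topNByColumn is a hypothetical helper, not an
existing function, and the Map return type is just illustrative):

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// hypothetical: one small (value, count) DataFrame per column,
// ordered by descending frequency and truncated to the top N
def topNByColumn(df: DataFrame, n: Int): Map[String, DataFrame] =
  df.columns.map { c =>
    c -> df.groupBy(c).count().orderBy(desc("count")).limit(n)
  }.toMap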

Regards,

Olivier.

2015-07-21 15:08 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:

> Ha, ok!
>
> Then the generic part would have this signature:
>
> def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame]
>
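> A minimal sketch of that signature could be (assuming plain per-column
> groupBy counts; "count" is the column name produced by count()):
>
> def countColsByValue(df: DataFrame): Map[String /* colname */, DataFrame] =
>   df.columns.map(c => c -> df.groupBy(c).count()).toMap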
>
> +1 for more work (blog / API) on data quality checks.
>
> Cheers,
> Jonathan
>
>
> TopCMSParams and some other monoids from Algebird are really cool for that:
>
> https://github.com/twitter/algebird/blob/develop/algebird-core/src/main/scala/com/twitter/algebird/CountMinSketch.scala#L590
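>
> For instance, an approximate top-N on a single string column, reusing the
> df from my earlier mail, might look like this (a sketch only: the
> eps/delta/seed values are arbitrary, and an implicit CMSHasher[String]
> must be in scope, which Algebird provides):
>
> import com.twitter.algebird.TopNCMS
>
> val monoid = TopNCMS.monoid[String](0.001, 1E-8, 1, 10) // eps, delta, seed, N
> val cms = df.rdd
>   .map(row => monoid.create(row.getString(0)))
>   .reduce(monoid.plus)
> cms.heavyHitters // approximate top-10 values of the column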
>
>
> On 21 July 2015 at 13:40, Ted Malaska <ted.mala...@cloudera.com> wrote:
>
>> I'm guessing you want something like what I put in this blog post.
>>
>>
>> http://blog.cloudera.com/blog/2015/07/how-to-do-data-quality-checks-using-apache-spark-dataframes/
>>
>> This is a very common use case. If there is a +1, I would love to add it
>> to DataFrames.
>>
>> Let me know
>> Ted Malaska
>>
>> On Tue, Jul 21, 2015 at 7:24 AM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> Yop,
>>> actually the generic part does not work for this: countByValue on one
>>> column gives you the count for each value seen in that column.
>>> I would like a generic (multi-column) countByValue to give me the same
>>> kind of output for each column, rather than treating each n-tuple of
>>> column values as the key (which is what the groupBy does by default).
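>>>
>>> To make the difference concrete, with columns a and b:
>>>
>>> df.groupBy("a", "b").count() // one row per distinct (a, b) pair
>>> // vs. what I am after: one (value, count) table per column, e.g.
>>> // Map("a" -> df.groupBy("a").count(), "b" -> df.groupBy("b").count())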
>>>
>>> Regards,
>>>
>>> Olivier
>>>
>>> 2015-07-20 14:18 GMT+02:00 Jonathan Winandy <jonathan.wina...@gmail.com>:
>>>
>>>> Ahoy!
>>>>
>>>> Maybe you can get countByValue by using sql.GroupedData:
>>>>
>>>> import org.apache.spark.sql.{DataFrame, Row}
>>>> import org.apache.spark.sql.types.{StringType, StructField, StructType}
>>>>
>>>> // some DF
>>>> val df: DataFrame = sqlContext.createDataFrame(
>>>>   sc.parallelize(List("A", "B", "B", "A")).map(Row(_)),
>>>>   StructType(List(StructField("n", StringType))))
>>>>
>>>> df.groupBy("n").count().show()
>>>>
>>>>
>>>> // generic: groups by *all* columns at once, so the key is the full
>>>> // tuple of column values rather than each column taken separately
>>>> def countByValueDf(df: DataFrame) = {
>>>>   val (h :: r) = df.columns.toList
>>>>   df.groupBy(h, r: _*).count()
>>>> }
>>>>
>>>> countByValueDf(df).show()
>>>>
>>>>
>>>> Cheers,
>>>> Jon
>>>>
>>>> On 20 July 2015 at 11:28, Olivier Girardot <
>>>> o.girar...@lateral-thoughts.com> wrote:
>>>>
>>>>> Hi,
>>>>> Is there any plan to add the countByValue function to the Spark SQL
>>>>> DataFrame API?
>>>>> Even
>>>>> https://github.com/apache/spark/blob/master/mllib/src/main/scala/org/apache/spark/ml/feature/StringIndexer.scala#L78
>>>>> is using the RDD API for this right now, but for ML purposes, being able
>>>>> to get the most frequent categorical value on multiple columns would be
>>>>> very useful.
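>>>>>
>>>>> In other words, the DataFrame counterpart of what we write with RDDs
>>>>> today (a sketch, assuming a DataFrame df with a string column "n"):
>>>>>
>>>>> // what we can do today, going through the RDD API:
>>>>> df.select("n").rdd.map(_.getString(0)).countByValue()
>>>>> // the hoped-for DataFrame-only equivalent, per column:
>>>>> df.groupBy("n").count()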
>>>>>
>>>>>
>>>>> Regards,
>>>>>
>>>>>
>>>>> --
>>>>> *Olivier Girardot* | Associé
>>>>> o.girar...@lateral-thoughts.com
>>>>> +33 6 24 09 17 94
>>>>>
>>>>
>>>>
>>>
>>>
>>> --
>>> *Olivier Girardot* | Associé
>>> o.girar...@lateral-thoughts.com
>>> +33 6 24 09 17 94
>>>
>>
>>
>


-- 
*Olivier Girardot* | Associé
o.girar...@lateral-thoughts.com
+33 6 24 09 17 94
