Not super easily: the GroupedData class uses a strToExpr function with a
pretty limited set of functions, so we can't pass in the name of an
arbitrary Hive UDAF (unless I'm missing something). We can instead
construct a Column with the expression you want and then pass it in to
agg() that way (although then you need to call the Hive UDAF there). There
are some classes in hiveUdfs.scala which expose Hive UDAFs as Spark SQL
AggregateExpressions, but they are private.
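A rough sketch of that workaround (assumes sqlContext is a HiveContext so
percentile() resolves to the Hive UDAF, and that sc is your SparkContext --
selectExpr parses an arbitrary SQL expression, so you can embed the UDAF
call there rather than passing a function name string to agg()):

```scala
// Sketch: express the Hive UDAF call as a SQL expression string instead of
// passing a function name to agg(). Assumes a HiveContext so percentile()
// resolves to the Hive UDAF.
case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF

// Median of key via the Hive percentile UDAF, no temp table needed
df.selectExpr("percentile(key, 0.5)").show()
```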

On Tue, Jun 2, 2015 at 8:28 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> I've finally come to the same conclusion, but isn't there any way to call
> these Hive UDAFs from agg("percentile(key,0.5)")?
>
> On Tue, Jun 2, 2015 at 3:37 PM, Yana Kadiyska <yana.kadiy...@gmail.com>
> wrote:
>
>> Like this...sqlContext should be a HiveContext instance
>>
>> case class KeyValue(key: Int, value: String)
>> val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF
>> df.registerTempTable("table")
>> sqlContext.sql("select percentile(key,0.5) from table").show()
>>
>>
>> On Tue, Jun 2, 2015 at 8:07 AM, Olivier Girardot <
>> o.girar...@lateral-thoughts.com> wrote:
>>
>>> Hi everyone,
>>> Is there any way to compute a median on a column using Spark's
>>> DataFrame? I know you can use stats on an RDD, but I'd rather stay
>>> within a DataFrame.
>>> Hive seems to imply that using ntile one can compute percentiles,
>>> quartiles, and therefore a median.
>>> Does anyone have experience with this?
>>>
>>> Regards,
>>>
>>> Olivier.
>>>
>>
>>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau
Linked In: https://www.linkedin.com/in/holdenkarau
