Hi Holden, Olivier
> So for column you need to pass in a Java function, I have some sample
> code which does this but it does terrible things to access Spark
> internals.

I also need to call a Hive UDAF in a DataFrame agg function. Are there
any examples of what Column expects?

Deenar

On 2 June 2015 at 21:13, Holden Karau <hol...@pigscanfly.ca> wrote:

> So for column you need to pass in a Java function. I have some sample
> code which does this, but it does terrible things to access Spark
> internals.
>
> On Tuesday, June 2, 2015, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>
>> Nice to hear from you, Holden! I ended up trying exactly that (a
>> Column), but I may have done it wrong:
>>
>> In [5]: g.agg(Column("percentile(value, 0.5)"))
>> Py4JError: An error occurred while calling o97.agg. Trace:
>> py4j.Py4JException: Method agg([class java.lang.String, class
>> scala.collection.immutable.Nil$]) does not exist
>>     at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>>
>> Any idea?
>>
>> Olivier.
>>
>> On Tue, 2 June 2015 at 18:02, Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> Not super easily: the GroupedData class uses a strToExpr function
>>> which has a pretty limited set of functions, so we can't pass in the
>>> name of an arbitrary Hive UDAF (unless I'm missing something). We can
>>> instead construct a Column with the expression you want and then pass
>>> it in to agg() that way (although then you need to call the Hive UDAF
>>> there). There are some private classes in hiveUdfs.scala which expose
>>> Hive UDAFs as Spark SQL AggregateExpressions, but they are private.
>>>
>>> On Tue, Jun 2, 2015 at 8:28 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>>
>>>> I've finally come to the same conclusion, but isn't there any way to
>>>> call this Hive UDAF from agg("percentile(key, 0.5)")?
>>>>
>>>> On Tue, 2 June 2015 at 15:37, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
>>>>
>>>>> Like this... sqlContext should be a HiveContext instance:
>>>>>
>>>>> case class KeyValue(key: Int, value: String)
>>>>> val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF
>>>>> df.registerTempTable("table")
>>>>> sqlContext.sql("select percentile(key, 0.5) from table").show()
>>>>>
>>>>> On Tue, Jun 2, 2015 at 8:07 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>> Is there any way to compute a median on a column using Spark's
>>>>>> DataFrame? I know you can use stats on an RDD, but I'd rather stay
>>>>>> within a DataFrame. Hive seems to imply that using ntile one can
>>>>>> compute percentiles, quartiles, and therefore a median.
>>>>>> Does anyone have experience with this?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Olivier.
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>> Linked In: https://www.linkedin.com/in/holdenkarau
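As a sanity check on what Yana's `percentile(key, 0.5)` query should return for keys 1 to 50: Hive's percentile UDAF computes an exact percentile over integer columns using linear interpolation between the two nearest sorted values, which a few lines of plain Python can mirror (no Spark needed; the `percentile` helper below is illustrative only, not part of any Spark or Hive API):

```python
def percentile(values, p):
    """Exact p-th percentile with linear interpolation,
    mirroring the semantics of Hive's percentile() UDAF."""
    xs = sorted(values)
    # Fractional rank into the sorted list (0-based).
    rank = p * (len(xs) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    frac = rank - lo
    # Interpolate between the two neighbouring values.
    return xs[lo] + (xs[hi] - xs[lo]) * frac

# Median of the keys 1..50 from the example DataFrame.
print(percentile(range(1, 51), 0.5))  # -> 25.5
```

The even-sized input 1..50 lands halfway between 25 and 26, so the query above should print 25.5.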
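For later readers: the workaround discussed above became unnecessary in newer Spark releases. `pyspark.sql.functions.expr` (added in Spark 1.5) parses an arbitrary SQL expression string into a Column that `agg()` accepts, which is exactly Holden's "construct a Column with the expression" route, and `percentile` is available as a built-in SQL aggregate in recent Spark versions, so no HiveContext is required. A minimal sketch, assuming a local Spark 2.x+ installation:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.master("local[1]").appName("percentile-demo").getOrCreate()

# Same shape as the KeyValue data in the Scala example: keys 1..50.
df = spark.createDataFrame([(i, str(i)) for i in range(1, 51)], ["key", "value"])

# expr() turns the SQL string into a Column, which agg() accepts --
# unlike Column("percentile(...)"), which is just a column *name*.
median = df.agg(expr("percentile(key, 0.5)").alias("median")).collect()[0]["median"]
print(median)
```

This is also why Olivier's `Column("percentile(value, 0.5)")` failed: the Column constructor names a column, it does not parse an expression.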