Hi Holden, Olivier
> So for column you need to pass in a Java function, I have some sample
> code which does this but it does terrible things to access Spark
> internals.

I also need to call a Hive UDAF in a DataFrame agg function. Are there
any examples of what Column expects?

Deenar

On 2 June 2015 at 21:13, Holden Karau <hol...@pigscanfly.ca> wrote:

> So for column you need to pass in a Java function. I have some sample
> code which does this, but it does terrible things to access Spark
> internals.
>
> On Tuesday, June 2, 2015, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>
>> Nice to hear from you, Holden! I ended up trying exactly that (a
>> Column), but I may have done it wrong:
>>
>> In [5]: g.agg(Column("percentile(value, 0.5)"))
>> Py4JError: An error occurred while calling o97.agg. Trace:
>> py4j.Py4JException: Method agg([class java.lang.String, class
>> scala.collection.immutable.Nil$]) does not exist
>>     at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>>
>> Any idea?
>>
>> Olivier.
>>
>> On Tue, 2 June 2015 at 18:02, Holden Karau <hol...@pigscanfly.ca> wrote:
>>
>>> Not super easily: the GroupedData class uses a strToExpr function
>>> which has a pretty limited set of functions, so we can't pass in the
>>> name of an arbitrary Hive UDAF (unless I'm missing something). We can
>>> instead construct a Column with the expression you want and then pass
>>> it in to agg() that way (although then you need to call the Hive UDAF
>>> there). There are some private classes in hiveUdfs.scala which expose
>>> Hive UDAFs as Spark SQL AggregateExpressions, but they are private.
>>>
>>> On Tue, Jun 2, 2015 at 8:28 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>>
>>>> I've finally come to the same conclusion, but isn't there any way to
>>>> call this Hive UDAF from agg("percentile(key, 0.5)")?
>>>>
>>>> On Tue, 2 June 2015 at 15:37, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
>>>>
>>>>> Like this... sqlContext should be a HiveContext instance:
>>>>>
>>>>> case class KeyValue(key: Int, value: String)
>>>>> val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF
>>>>> df.registerTempTable("table")
>>>>> sqlContext.sql("select percentile(key, 0.5) from table").show()
>>>>>
>>>>> On Tue, Jun 2, 2015 at 8:07 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:
>>>>>
>>>>>> Hi everyone,
>>>>>> Is there any way to compute a median on a column using Spark's
>>>>>> DataFrame? I know you can use stats on an RDD, but I'd rather stay
>>>>>> within a DataFrame. Hive seems to imply that using ntile one can
>>>>>> compute percentiles, quartiles, and therefore a median.
>>>>>> Does anyone have experience with this?
>>>>>>
>>>>>> Regards,
>>>>>>
>>>>>> Olivier.
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>> Linked In: https://www.linkedin.com/in/holdenkarau
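As a sanity check on what Yana's `percentile(key, 0.5)` query should return for keys 1 to 50: Hive's percentile UDAF computes an exact percentile over integer columns using linear interpolation between the two nearest sorted values, which a few lines of plain Python can mirror (no Spark needed; the `percentile` helper below is illustrative only, not part of any Spark or Hive API):

```python
def percentile(values, p):
    """Exact p-th percentile with linear interpolation,
    mirroring the semantics of Hive's percentile() UDAF."""
    xs = sorted(values)
    # Fractional rank into the sorted list (0-based).
    rank = p * (len(xs) - 1)
    lo = int(rank)
    hi = min(lo + 1, len(xs) - 1)
    frac = rank - lo
    # Interpolate between the two neighbouring values.
    return xs[lo] + (xs[hi] - xs[lo]) * frac

# Median of the keys 1..50 from the example DataFrame.
print(percentile(range(1, 51), 0.5))  # -> 25.5
```

The even-sized input 1..50 lands halfway between 25 and 26, so the query above should print 25.5.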
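For later readers: the workaround discussed above became unnecessary in newer Spark releases. `pyspark.sql.functions.expr` (added in Spark 1.5) parses an arbitrary SQL expression string into a Column that `agg()` accepts, which is exactly Holden's "construct a Column with the expression" route, and `percentile` is available as a built-in SQL aggregate in recent Spark versions, so no HiveContext is required. A minimal sketch, assuming a local Spark 2.x+ installation:

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import expr

spark = SparkSession.builder.master("local[1]").appName("percentile-demo").getOrCreate()

# Same shape as the KeyValue data in the Scala example: keys 1..50.
df = spark.createDataFrame([(i, str(i)) for i in range(1, 51)], ["key", "value"])

# expr() turns the SQL string into a Column, which agg() accepts --
# unlike Column("percentile(...)"), which is just a column *name*.
median = df.agg(expr("percentile(key, 0.5)").alias("median")).collect()[0]["median"]
print(median)
```

This is also why Olivier's `Column("percentile(value, 0.5)")` failed: the Column constructor names a column, it does not parse an expression.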