So for Column you need to pass in a Java function; I have some sample code which does this, but it does terrible things to access Spark internals.
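[Editor's note: for readers finding this thread later - `DataFrame.selectExpr` (available since Spark 1.3) parses a full SQL expression string, so with a HiveContext the Hive `percentile` UDAF can be reached without the internals hack, and Spark 1.5 later added `pyspark.sql.functions.expr` for use inside `agg()`. A sketch only, assuming a HiveContext and the `table` registered below; not tested against the thread's Spark version:]

```python
# Sketch (assumes `sqlContext` is a HiveContext and "table" has an int
# column `key`, as in Yana's example below). selectExpr goes through the
# SQL parser rather than GroupedData's limited strToExpr, so the Hive
# UDAF should resolve.
df = sqlContext.table("table")
df.selectExpr("percentile(key, 0.5)").show()

# On Spark 1.5+ the same expression can be used inside agg():
#   from pyspark.sql.functions import expr
#   df.groupBy().agg(expr("percentile(key, 0.5)")).show()
```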
On Tuesday, June 2, 2015, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

> Nice to hear from you Holden! I ended up trying exactly that (Column) -
> but I may have done it wrong:
>
> In [5]: g.agg(Column("percentile(value, 0.5)"))
> Py4JError: An error occurred while calling o97.agg. Trace:
> py4j.Py4JException: Method agg([class java.lang.String, class
> scala.collection.immutable.Nil$]) does not exist
>         at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)
>
> Any idea?
>
> Olivier.
>
> On Tue, Jun 2, 2015 at 18:02, Holden Karau <hol...@pigscanfly.ca> wrote:
>
>> Not super easily: the GroupedData class uses a strToExpr function which
>> has a pretty limited set of functions, so we can't pass in the name of an
>> arbitrary Hive UDAF (unless I'm missing something). We can instead
>> construct a Column with the expression you want and then pass it in to
>> agg() that way (although then you need to call the Hive UDAF there).
>> There are some private classes in hiveUdfs.scala which expose Hive UDAFs
>> as Spark SQL AggregateExpressions, but they are private.
>>
>> On Tue, Jun 2, 2015 at 8:28 AM, Olivier Girardot
>> <o.girar...@lateral-thoughts.com> wrote:
>>
>>> I've finally come to the same conclusion, but isn't there any way to
>>> call this Hive UDAF from agg("percentile(key,0.5)")?
>>>
>>> On Tue, Jun 2, 2015 at 15:37, Yana Kadiyska <yana.kadiy...@gmail.com> wrote:
>>>
>>>> Like this... sqlContext should be a HiveContext instance:
>>>>
>>>>     case class KeyValue(key: Int, value: String)
>>>>     val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF
>>>>     df.registerTempTable("table")
>>>>     sqlContext.sql("select percentile(key,0.5) from table").show()
>>>>
>>>> On Tue, Jun 2, 2015 at 8:07 AM, Olivier Girardot
>>>> <o.girar...@lateral-thoughts.com> wrote:
>>>>
>>>>> Hi everyone,
>>>>> Is there any way to compute a median on a column using Spark's
>>>>> DataFrames? I know you can use stats on an RDD, but I'd rather stay
>>>>> within a DataFrame. Hive seems to imply that using ntile one can
>>>>> compute percentiles, quartiles, and therefore a median.
>>>>> Does anyone have experience with this?
>>>>>
>>>>> Regards,
>>>>>
>>>>> Olivier.
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>> Linked In: https://www.linkedin.com/in/holdenkarau
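[Editor's note: as a sanity check on Yana's query above - Hive's `percentile` UDAF computes an exact percentile over an integer column using linear interpolation between order statistics, so for keys 1..50 the 0.5 percentile (median) is 25.5. A minimal pure-Python version of that calculation, not Spark code:]

```python
import math

def percentile_exact(values, p):
    """Linear-interpolation percentile over the full data set, mirroring
    the rule Hive's exact percentile UDAF applies to integer columns."""
    xs = sorted(values)
    rank = p * (len(xs) - 1)       # fractional position in the sorted list
    lo = math.floor(rank)
    frac = rank - lo
    if frac == 0:
        return float(xs[lo])
    return xs[lo] + frac * (xs[lo + 1] - xs[lo])

print(percentile_exact(range(1, 51), 0.5))  # -> 25.5
```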