Re: Compute Median in Spark Dataframe
Many thanks, will look into this. I don't particularly want to reuse the custom Hive UDAF I have; I would prefer writing a new one if that is cleaner. I am just using the JVM.
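For anyone taking the same route: once a custom Hive UDAF class is on the classpath, it can be registered and called through HiveContext SQL without touching Spark internals. A minimal sketch, assuming a hypothetical implementation class com.example.MyMedianUDAF built against the Hive UDAF interfaces:

// com.example.MyMedianUDAF is a placeholder name for your own UDAF class
hiveContext.sql("CREATE TEMPORARY FUNCTION my_median AS 'com.example.MyMedianUDAF'")
// Once registered, it behaves like any built-in aggregate
hiveContext.sql("SELECT my_median(key) FROM table").show()

CREATE TEMPORARY FUNCTION lives in the Hive session, so this needs a HiveContext rather than a plain SQLContext.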
Re: Compute Median in Spark Dataframe
My current example doesn't use a Hive UDAF, but you would do something pretty similar (it calls a new user-defined UDAF, and there are wrappers to make Spark SQL UDAFs from Hive UDAFs, but they are private). So this is doable, but since it pokes at internals it will likely break between versions of Spark. If you want to see the WIP PR I have with Sparkling Pandas, it's at https://github.com/sparklingpandas/sparklingpandas/pull/90/files . If you're doing this in the JVM and just want to know how to wrap the Hive UDAF, you can grep/look in sql/hive/ in Spark, but I'd encourage you to see if there is another way to accomplish what you want (since poking at the internals is kind of dangerous).
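One such non-internals route, as a hedged sketch: DataFrame.selectExpr parses the string into an unresolved expression, and with a HiveContext the analyzer can resolve function names like percentile through the Hive function registry (a plain SQLContext will not find them):

// Assumes df was created via a HiveContext and has a numeric column "key"
val median = df.selectExpr("percentile(key, 0.5) AS median")
median.show()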
Re: Compute Median in Spark Dataframe
Hi Holden, Olivier

>> So for column you need to pass in a Java function, I have some sample code which does this but it does terrible things to access Spark internals.

I also need to call a Hive UDAF in a DataFrame agg function. Are there any examples of what Column expects?

Deenar
Re: Compute Median in Spark Dataframe
So for Column you need to pass in a Java function; I have some sample code which does this, but it does terrible things to access Spark internals.
Re: Compute Median in Spark Dataframe
Nice to hear from you, Holden! I ended up trying exactly that (Column), but I may have done it wrong:

In [5]: g.agg(Column("percentile(value, 0.5)"))
Py4JError: An error occurred while calling o97.agg. Trace:
py4j.Py4JException: Method agg([class java.lang.String, class scala.collection.immutable.Nil$]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)

Any idea?

Olivier.
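For what it's worth, the trace suggests the PySpark Column wrapper ended up holding the raw Python string, so the JVM saw agg(String, Nil) and found no matching overload; agg needs a Column backed by a JVM expression, not the text of one. Until that route works, a grouped median can go through Hive SQL directly. A sketch in Scala, where events, grp and x are placeholder names for a registered temp table, its grouping column, and a numeric column:

// events/grp/x are hypothetical names; percentile is Hive's exact-percentile UDAF
val medians = sqlContext.sql(
  "SELECT grp, percentile(x, 0.5) AS median FROM events GROUP BY grp")
medians.show()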
Re: Compute Median in Spark Dataframe
Not super easily; the GroupedData class uses a strToExpr function which has a pretty limited set of functions, so we can't pass in the name of an arbitrary Hive UDAF (unless I'm missing something). We can instead construct a Column with the expression you want and then pass it in to agg() that way (although then you need to call the Hive UDAF there). There are some classes in hiveUdfs.scala which expose Hive UDAFs as Spark SQL AggregateExpressions, but they are private.
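A sketch of that construct-a-Column route: org.apache.spark.sql.functions.expr parses an expression string into a Column that agg() accepts. The caveats are assumptions worth flagging: expr only appears from Spark 1.5, grp and x below are placeholder column names, and resolving percentile still requires a HiveContext:

import org.apache.spark.sql.functions.{col, expr}

// Hand agg() an expression-backed Column instead of a bare string
val medians = df.groupBy(col("grp"))
  .agg(expr("percentile(x, 0.5)").as("median"))
medians.show()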
Re: Compute Median in Spark Dataframe
I've finally come to the same conclusion, but isn't there any way to call these Hive UDAFs from agg("percentile(key,0.5)")?
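Another hedged variant of the same idea: functions.callUDF builds a function call by name that the analyzer resolves later, which with a HiveContext can land on the Hive UDAF. Version caveat: the helper is spelled callUdf in Spark 1.4 and callUDF from 1.5, and grp/x are again placeholder column names:

import org.apache.spark.sql.functions.{callUDF, col, lit}

// "percentile" is resolved by name at analysis time via the Hive registry
val medians = df.groupBy(col("grp"))
  .agg(callUDF("percentile", col("x"), lit(0.5)).as("median"))
medians.show()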
Re: Compute Median in Spark Dataframe
Like this... sqlContext should be a HiveContext instance:

case class KeyValue(key: Int, value: String)
// Build a small test DataFrame (outside the shell, toDF needs import sqlContext.implicits._)
val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF
// Register it so SQL queries can see it
df.registerTempTable("table")
// percentile is a Hive UDAF; the 0.5 percentile is the median
sqlContext.sql("select percentile(key, 0.5) from table").show()

On Tue, Jun 2, 2015 at 8:07 AM, Olivier Girardot <o.girar...@lateral-thoughts.com> wrote:

> Hi everyone,
> Is there any way to compute a median on a column using Spark's DataFrame? I know you can use stats on an RDD, but I'd rather stay within a DataFrame.
> Hive seems to imply that using ntile one can compute percentiles, quartiles, and therefore a median.
> Does anyone have experience with this?
>
> Regards,
>
> Olivier.
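On the ntile idea from the original question: HiveContext SQL also supports Hive window functions (from Spark 1.4, as far as I know), so quartile buckets over the same temp table can be sketched as:

// Assign each row to one of four quartile buckets by key order
val quartiles = sqlContext.sql(
  "SELECT key, ntile(4) OVER (ORDER BY key) AS quartile FROM table")
quartiles.show()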