Re: Compute Median in Spark Dataframe

2015-06-22 Thread Deenar Toraskar
Many thanks, will look into this. I don't particularly want to reuse the
custom Hive UDAF I have; I would prefer writing a new one if that is
cleaner. I am just using the JVM.
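
(For reference: the upcoming Spark 1.5 release adds a public
UserDefinedAggregateFunction API, which would let you write this kind of
aggregate on the JVM without touching Hive or Spark internals. A minimal,
untested sketch of a median UDAF against that API; it buffers every value
for a group, so it only suits modest group sizes:)

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Exact median by buffering all of a group's values and sorting them.
class MedianUDAF extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("values", ArrayType(DoubleType)) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.empty[Double]

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0))
      buffer(0) = buffer.getSeq[Double](0) :+ input.getDouble(0)

  def merge(buffer1: MutableAggregationBuffer, buffer2: Row): Unit =
    buffer1(0) = buffer1.getSeq[Double](0) ++ buffer2.getSeq[Double](0)

  def evaluate(buffer: Row): Double = {
    val sorted = buffer.getSeq[Double](0).sorted
    val n = sorted.length
    if (n == 0) Double.NaN
    else if (n % 2 == 1) sorted(n / 2)
    else (sorted(n / 2 - 1) + sorted(n / 2)) / 2.0
  }
}

// Usage: df.groupBy("key").agg(new MedianUDAF()(df("value")))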



Re: Compute Median in Spark Dataframe

2015-06-04 Thread Holden Karau
My current example doesn't use a Hive UDAF, but you would do something
pretty similar (it calls a new user-defined UDAF, and there are wrappers to
make Spark SQL UDAFs from Hive UDAFs, but they are private). So this is
doable, but since it pokes at internals it will likely break between
versions of Spark. If you want to see the WIP PR I have with Sparkling
Pandas, it's at
https://github.com/sparklingpandas/sparklingpandas/pull/90/files . If you're
doing this on the JVM and just want to know how to wrap the Hive UDAF, you
can grep/look in sql/hive/ in Spark, but I'd encourage you to see if there
is another way to accomplish what you want (since poking at the internals
is kind of dangerous).
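
If you just want the non-internals route on the JVM, something along these
lines should work against a HiveContext (untested sketch; selectExpr parses
the string, and the analyzer then resolves percentile through Hive's
function registry):

import org.apache.spark.sql.hive.HiveContext

val hiveCtx = new HiveContext(sc)
import hiveCtx.implicits._

// Any Hive UDAF can be invoked by name this way; no wrapping needed.
val df = sc.parallelize(1 to 50).map(i => (i, i.toString)).toDF("key", "value")
df.selectExpr("percentile(key, 0.5) as median").show()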



Re: Compute Median in Spark Dataframe

2015-06-04 Thread Deenar Toraskar
Hi Holden, Olivier


>> So for Column you need to pass in a Java function. I have some sample
>> code which does this, but it does terrible things to access Spark internals.

I also need to call a Hive UDAF in a DataFrame agg function. Are there any
examples of what Column expects?

Deenar
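
(On what Column expects: a Column is essentially a thin wrapper over a
Catalyst Expression. If I remember the 1.4 API right, functions.callUdf,
renamed callUDF in later releases, builds one from a function name; with a
HiveContext the analyzer should then resolve it to the Hive percentile
UDAF. Untested sketch:)

import org.apache.spark.sql.functions.{callUdf, lit}

// Builds an unresolved function call as a Column and hands it to agg();
// resolution against Hive's registry happens at analysis time.
df.groupBy("key").agg(callUdf("percentile", df("value"), lit(0.5))).show()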



Re: Compute Median in Spark Dataframe

2015-06-02 Thread Holden Karau
So for Column you need to pass in a Java function. I have some sample code
which does this, but it does terrible things to access Spark internals.



Re: Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
Nice to hear from you, Holden! I ended up trying exactly that (Column),
but I may have done it wrong:

In [5]: g.agg(Column("percentile(value, 0.5)"))
Py4JError: An error occurred while calling o97.agg. Trace:
py4j.Py4JException: Method agg([class java.lang.String, class
scala.collection.immutable.Nil$]) does not exist
at py4j.reflection.ReflectionEngine.getMethod(ReflectionEngine.java:333)

Any idea?

Olivier.


Re: Compute Median in Spark Dataframe

2015-06-02 Thread Holden Karau
Not super easily: the GroupedData class uses a strToExpr function which has
a pretty limited set of functions, so we can't pass in the name of an
arbitrary Hive UDAF (unless I'm missing something). We can instead
construct a Column with the expression you want and then pass it in to
agg() that way (although then you need to call the Hive UDAF there). There
are some classes in hiveUdfs.scala which expose Hive UDAFs as Spark SQL
AggregateExpressions, but they are private.
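
(Later Spark releases make this public: from 1.5 on,
org.apache.spark.sql.functions.expr parses a string into a Column you can
hand straight to agg(). That also explains the Py4JError above: PySpark's
Column("...") wraps the raw string rather than parsing it, so the JVM side
saw a bare String that matched no agg() overload. A sketch assuming Spark
1.5+ and a HiveContext so percentile resolves:)

import org.apache.spark.sql.functions.expr

// expr() parses the string into a Column; percentile is then resolved
// by name against Hive's function registry during analysis.
df.groupBy("key").agg(expr("percentile(value, 0.5)")).show()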



Re: Compute Median in Spark Dataframe

2015-06-02 Thread Olivier Girardot
I've finally come to the same conclusion, but isn't there any way to call
these Hive UDAFs from agg("percentile(key,0.5)")?



Re: Compute Median in Spark Dataframe

2015-06-02 Thread Yana Kadiyska
Like this... sqlContext should be a HiveContext instance:

case class KeyValue(key: Int, value: String)
val df = sc.parallelize(1 to 50).map(i => KeyValue(i, i.toString)).toDF
df.registerTempTable("table")
sqlContext.sql("select percentile(key,0.5) from table").show()

On Tue, Jun 2, 2015 at 8:07 AM, Olivier Girardot <
o.girar...@lateral-thoughts.com> wrote:

> Hi everyone,
> Is there any way to compute a median on a column using Spark's DataFrame?
> I know you can use stats on an RDD, but I'd rather stay within a DataFrame.
> Hive seems to imply that using ntile one can compute percentiles,
> quartiles, and therefore a median.
> Does anyone have experience with this?
>
> Regards,
>
> Olivier.
>
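
(On the ntile idea from the question: with a HiveContext, Spark 1.4+
supports Hive window functions, so quartile buckets can be had directly.
Untested sketch against the table registered above:)

sqlContext.sql(
  "select key, ntile(4) over (order by key) as quartile from table").show()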