Re: Using Percentile in Spark SQL

2019-11-11 Thread Jerry Vinokurov
I don't think the Spark configuration is what you want to focus on. It's
hard to say without knowing the specifics of the job or the data volume,
but you should be able to accomplish this with the percent_rank function in
Spark SQL and smart partitioning of the data. If your data has a lot of
skew, some executors can end up waiting around for work while others are
stuck processing larger partitions, so you'll need to take a look at the
actual stats of your data and figure out whether there's a more efficient
partitioning strategy you can use.
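
For reference, a minimal sketch of that approach with the DataFrame API
(the "group" and "value" column names are placeholders, not anything from
your job):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.percent_rank

// Rank within each group independently; note that a skewed grouping
// column will skew the window work in exactly the way described above.
val w = Window.partitionBy("group").orderBy("value")
val ranked = df.withColumn("pct_rank", percent_rank().over(w))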



Re: Using Percentile in Spark SQL

2019-11-11 Thread Tzahi File
Currently, I'm using the percentile_approx function with Hive.
I'm looking for a better way to run this function, or another way to get the
same result with Spark, faster and without using gigantic instances.

I'm trying to optimize this job by changing the Spark configuration. If you
have any ideas on how to approach this (instance type, number of instances,
number of executors, etc.), that would be great.
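
Purely as an illustrative starting point for the resource side (these values
are placeholders, not recommendations; the right numbers depend on the data
volume and skew):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("percentile-job")
  .config("spark.executor.instances", "40")      // hypothetical sizing
  .config("spark.executor.cores", "5")
  .config("spark.executor.memory", "16g")
  .config("spark.sql.shuffle.partitions", "400") // match to the data volume
  .getOrCreate()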






Re: Using Percentile in Spark SQL

2019-11-11 Thread Muthu Jayakumar
If you require higher precision, you may have to write a custom UDAF.
In my case, I ended up storing the data as a key-value ordered list of
histograms.

Thanks
Muthu
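
A rough sketch of the UDAF route (this is an assumed shape, not the
histogram-based implementation described above; it buffers raw values, so
it only works when each group fits in executor memory):

import org.apache.spark.sql.Row
import org.apache.spark.sql.expressions.{MutableAggregationBuffer, UserDefinedAggregateFunction}
import org.apache.spark.sql.types._

// Exact percentile via a UDAF that collects all values per group.
class ExactPercentile(p: Double) extends UserDefinedAggregateFunction {
  def inputSchema: StructType = StructType(StructField("value", DoubleType) :: Nil)
  def bufferSchema: StructType = StructType(StructField("values", ArrayType(DoubleType)) :: Nil)
  def dataType: DataType = DoubleType
  def deterministic: Boolean = true

  def initialize(buffer: MutableAggregationBuffer): Unit =
    buffer(0) = Seq.empty[Double]

  def update(buffer: MutableAggregationBuffer, input: Row): Unit =
    if (!input.isNullAt(0))
      buffer(0) = buffer.getSeq[Double](0) :+ input.getDouble(0)

  def merge(b1: MutableAggregationBuffer, b2: Row): Unit =
    b1(0) = b1.getSeq[Double](0) ++ b2.getSeq[Double](0)

  def evaluate(buffer: Row): Double = {
    // nearest-rank percentile; assumes the group is non-empty
    val sorted = buffer.getSeq[Double](0).sorted
    sorted(math.max(math.ceil(p * sorted.length).toInt - 1, 0))
  }
}

// usage sketch: df.groupBy("group").agg(new ExactPercentile(0.95)(df("value")))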



Re: Using Percentile in Spark SQL

2019-11-11 Thread Patrick McCarthy
Depending on your tolerance for error, you could also use
percentile_approx().
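
For example (a minimal sketch assuming Spark 2.1+, where percentile_approx
is a built-in SQL aggregate; "value" is a placeholder column name):

// The optional third argument trades memory for accuracy.
df.selectExpr("percentile_approx(value, 0.95, 10000) AS p95").show()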





Re: Using Percentile in Spark SQL

2019-11-11 Thread Jerry Vinokurov
Do you mean that you are trying to compute the percent rank of some data?
You can use the Spark SQL percent_rank function for that, but I don't think
that's going to give you any improvement over calling the percentRank
function on the DataFrame. Are you currently using a user-defined function
for this task? Because I bet that's what's slowing you down.





Using Percentile in Spark SQL

2019-11-11 Thread Tzahi File
Hi,

Currently, I'm using a huge Hive cluster (m5.24xl * 40 workers) to run a
percentile function. I'm trying to improve this job by moving it to run
with Spark SQL.

Any suggestions on how to use a percentile function in Spark?
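
For reference, a minimal sketch of the exact built-in (assuming Spark 2.1+,
where percentile is a built-in SQL aggregate; percentile_approx is the
cheaper approximate alternative; "events" and "latency" are placeholders):

df.createOrReplaceTempView("events")
spark.sql("SELECT percentile(latency, 0.5) AS median FROM events").show()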


Thanks,


Re: Percentile calculation in spark 1.6

2016-02-23 Thread Ted Yu
Please take a look at the following if you can utilize a Hive UDF:
sql/hive/src/test/scala/org/apache/spark/sql/hive/execution/HiveUdfSuite.scala
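
A minimal sketch of that route (assuming Spark 1.6 with Hive support and an
existing SparkContext sc; the table and column names are placeholders):

import org.apache.spark.sql.hive.HiveContext

// percentile_approx is a Hive UDAF, so it needs a HiveContext in 1.6.
val hiveContext = new HiveContext(sc)
hiveContext.sql("SELECT percentile_approx(mycol, 0.5) FROM mytable").show()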



Re: Percentile calculation in spark 1.6

2016-02-23 Thread Chandeep Singh
This should help -
http://stackoverflow.com/questions/28805602/how-to-compute-percentiles-in-apache-spark
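
The approaches there boil down to sort-and-index; a minimal sketch of an
exact percentile over an RDD, under that assumption:

// Exact nearest-rank percentile over an RDD[Double]; this does a full
// sort, so it is only a sketch; the linked answers discuss cheaper options.
def percentile(rdd: org.apache.spark.rdd.RDD[Double], p: Double): Double = {
  val indexed = rdd.sortBy(identity).zipWithIndex().map(_.swap)
  val n = indexed.count()
  indexed.lookup(math.max(math.ceil(p * n).toLong - 1, 0L)).head
}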



Percentile calculation in spark 1.6

2016-02-23 Thread Arunkumar Pillai
How do I calculate a percentile in Spark 1.6?


-- 
Thanks and Regards
Arun


[Spark 1.5.1] percentile in spark

2016-02-08 Thread Arunkumar Pillai
Hi

I'm using a SQL query to find the percentile value. Are there any predefined
functions for percentile calculation?
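
A minimal sketch of the usual answer for 1.5.x, going through the Hive UDAF
(this needs a HiveContext-backed DataFrame; "mycol" is a placeholder):

import org.apache.spark.sql.functions.{callUDF, col, lit}

df.agg(callUDF("percentile_approx", col("mycol"), lit(0.5))).show()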

-- 
Thanks and Regards
Arun




Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
>>>>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha 
>>>>>>> wrote:
>>>>>>>
>>>>>>>> I have a doubt, Michael. I tried to use callUDF in the following
>>>>>>>> code and it does not work.
>>>>>>>>
>>>>>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>>>>>
>>>>>>>> The above code does not compile because callUdf() takes only two
>>>>>>>> argument types: a function name as a String and Columns. Please guide.
>>>>>>>>
>>>>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha >>>>>>> > wrote:
>>>>>>>>
>>>>>>>>> thanks much Michael let me try.
>>>>>>>>>
>>>>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>>>>>> mich...@databricks.com> wrote:
>>>>>>>>>
>>>>>>>>>> This is confusing because I made a typo...
>>>>>>>>>>
>>>>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>>>>
>>>>>>>>>> The first argument is the name of the UDF; all other arguments
>>>>>>>>>> need to be columns that are passed in as arguments. lit is just
>>>>>>>>>> saying to make a literal column that always has the value 0.25.
>>>>>>>>>>
>>>>>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, 
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>> Yes, but I mean, this is rather curious. How is it that def
>>>>>>>>>>> lit(literal: Any) becomes a percentile argument, lit(25)?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks for clarification
>>>>>>>>>>>
>>>>>>>>>>> Saif
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: Umesh Kacha [mailto:umesh.ka...@gmail.com]
>>>>>>>>>>> Sent: Friday, October 09, 2015 4:10 PM
>>>>>>>>>>> To: Ellafi, Saif A.
>>>>>>>>>>> Cc: Michael Armbrust; user
>>>>>>>>>>>
>>>>>>>>>>> Subject: Re: How to calculate percentile of a column of DataFrame?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> I found it in the 1.3 documentation; lit says something else, not
>>>>>>>>>>> percent:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> public static Column 
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>>>>>>  lit(Object literal)
>>>>>>>>>>>
>>>>>>>>>>> Creates a Column
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>>>>>>  of
>>>>>>>>>>> literal value.
>>>>>>>>>>>
>>>>>>>>>>> The passed in object is returned directly if it is already a
>>>>>>>>>>> Column
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>>>>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>>>>>>  also.
>>>>>>>>>>> Otherwise, a new Column
>>>>>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>>>>>>  is
>>>>>>>>>>> created to represent the literal value.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Where can we find other available functions such as lit()? I
>>>>>>>>>>> can’t find lit in the API.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> Thanks
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> From: Michael Armbrust [mailto:mich...@databricks.com]
>>>>>>>>>>> Sent: Friday, October 09, 2015 4:04 PM
>>>>>>>>>>> To: unk1102
>>>>>>>>>>> Cc: user
>>>>>>>>>>> Subject: Re: How to calculate percentile of a column of DataFrame?
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs
>>>>>>>>>>> from dataframes.
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 
>>>>>>>>>>> wrote:
>>>>>>>>>>>
>>>>>>>>>>> Hi, how do I calculate the percentile of a column in a DataFrame?
>>>>>>>>>>> I can't find any percentile_approx function among the Spark
>>>>>>>>>>> aggregation functions. E.g., in Hive we have percentile_approx and
>>>>>>>>>>> we can use it in the following way:
>>>>>>>>>>>
>>>>>>>>>>> hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable");
>>>>>>>>>>>
>>>>>>>>>>> I can see the ntile function, but I'm not sure how it would give
>>>>>>>>>>> the same results as the above query. Please guide.


Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
Hi Ted, I am using the following line of code; I can't paste the entire code,
sorry, but this single line doesn't compile in my Spark job:

sourceframe.select(callUDF("percentile_approx", col("mycol"), lit(0.25)))

I am using the IntelliJ editor, Java, and Maven dependencies of spark-core,
spark-sql, and spark-hive, version 1.5.1.

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
Can you pastebin your Java code and the command you used to compile?

Thanks


Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
OK, thanks much Ted. Looks like some issue with the Maven dependencies in the
Java code for 1.5.1. I still can't understand: if the Spark 1.5.1 binary in
spark-shell can recognize callUdf, why does callUdf not compile when using
the Maven build?

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
Pardon me.
I didn't read your previous response clearly.

I will try to reproduce the compilation error on the master branch.
Right now, I have another high-priority task on hand.

BTW I was looking at SPARK-10671

FYI


Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
Hi Ted, if the fix went in after the 1.5.1 release, then how come it's
working with the 1.5.1 binary in spark-shell?

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
Looks like the fix went in after 1.5.1 was released. 

You may verify using a master branch build.

Cheers


Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
Hi Ted, thanks much. I tried using percentile_approx in spark-shell like you
mentioned, and it works using 1.5.1, but it doesn't compile in Java using the
1.5.1 Maven libraries; it still complains that callUdf can take only String
and Column types. Please guide.

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
I would suggest using http://search-hadoop.com/ to find literature on the empty
partitions directory problem.

If there is no answer there, please start a new thread with the following
information:

release of Spark
release of Hadoop
code snippet
symptom

Cheers

On Mon, Oct 12, 2015 at 12:08 PM, Umesh Kacha  wrote:

> Hi Ted thanks much are you saying above code will work in only 1.5.1? I
> tried upgrading to 1.5.1 but I have found potential bug my Spark job
> creates hive partitions using hiveContext.sql("insert into partitions")
> when I use Spark 1.5.1 I cant see any partitions files orc files getting
> created in HDFS I can see empty partitions directory under Hive table along
> with many staging files created by spark.
>
> On Tue, Oct 13, 2015 at 12:34 AM, Ted Yu  wrote:
>
>> SQL context available as sqlContext.
>>
>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>> "value")
>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>
>> scala> df.select(callUDF("percentile_approx",col("value"),
>> lit(0.25))).show()
>> +------------------------------+
>> |'percentile_approx(value,0.25)|
>> +------------------------------+
>> |                           1.0|
>> +------------------------------+
>>
>> Can you upgrade to 1.5.1 ?
>>
>> Cheers
>>
>> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha 
>> wrote:
>>
>>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is available
>>> in Spark 1.4.0 as per JAvadocx
>>>
>>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha 
>>> wrote:
>>>
>>>> Hi Ted thanks much for the detailed answer and appreciate your efforts.
>>>> Do we need to register Hive UDFs?
>>>>
>>>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>>>
>>>> I am calling Hive UDF percentile_approx in the following manner which
>>>> gives compilation error
>>>>
>>>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>>>> error
>>>>
>>>> //compile error because callUdf() takes String and Column* as arguments.
>>>>
>>>> Please guide. Thanks much.
>>>>
>>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu  wrote:
>>>>
>>>>> Using spark-shell, I did the following exercise (master branch) :
>>>>>
>>>>>
>>>>> SQL context available as sqlContext.
>>>>>
>>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>>> "value")
>>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>>
>>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v *
>>>>> v + cnst)
>>>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>>>> UserDefinedFunction(<function2>,IntegerType,List())
>>>>>
>>>>> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
>>>>> +---+--------------------+
>>>>> | id|'simpleUDF(value,25)|
>>>>> +---+--------------------+
>>>>> |id1|                  26|
>>>>> |id2|                  41|
>>>>> |id3|                  50|
>>>>> +---+--------------------+
>>>>>
>>>>> Which Spark release are you using ?
>>>>>
>>>>> Can you pastebin the full stack trace where you got the error ?
>>>>>
>>>>> Cheers
>>>>>
>>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha 
>>>>> wrote:
>>>>>
>>>>>> I have a doubt Michael I tried to use callUDF in  the following code
>>>>>> it does not work.
>>>>>>
>>>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>>>
>>>>>> Above code does not compile because callUdf() takes only two
>>>>>> arguments function name in String and Column class type. Please guide.
>>>>>>
>>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
>>>>>> wrote:
>>>>>>
>>>>>>> thanks much Michael let me try.

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
Hi Ted, thanks much. Are you saying the above code will work only on 1.5.1? I
tried upgrading to 1.5.1, but I found a potential bug: my Spark job creates
Hive partitions using hiveContext.sql("insert into partitions"), and with
Spark 1.5.1 I can't see any partition ORC files getting created in HDFS. I see
only empty partition directories under the Hive table, along with many staging
files created by Spark.

On Tue, Oct 13, 2015 at 12:34 AM, Ted Yu  wrote:

> SQL context available as sqlContext.
>
> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>
> scala> df.select(callUDF("percentile_approx",col("value"),
> lit(0.25))).show()
> +------------------------------+
> |'percentile_approx(value,0.25)|
> +------------------------------+
> |                           1.0|
> +------------------------------+
>
> Can you upgrade to 1.5.1 ?
>
> Cheers
>
> On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha 
> wrote:
>
>> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is available
>> in Spark 1.4.0 as per JAvadocx
>>
>> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha 
>> wrote:
>>
>>> Hi Ted thanks much for the detailed answer and appreciate your efforts.
>>> Do we need to register Hive UDFs?
>>>
>>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>>
>>> I am calling Hive UDF percentile_approx in the following manner which
>>> gives compilation error
>>>
>>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>>> error
>>>
>>> //compile error because callUdf() takes String and Column* as arguments.
>>>
>>> Please guide. Thanks much.
>>>
>>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu  wrote:
>>>
>>>> Using spark-shell, I did the following exercise (master branch) :
>>>>
>>>>
>>>> SQL context available as sqlContext.
>>>>
>>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>>> "value")
>>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>>
>>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v *
>>>> v + cnst)
>>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>>> UserDefinedFunction(<function2>,IntegerType,List())
>>>>
>>>> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
>>>> +---+--------------------+
>>>> | id|'simpleUDF(value,25)|
>>>> +---+--------------------+
>>>> |id1|                  26|
>>>> |id2|                  41|
>>>> |id3|                  50|
>>>> +---+--------------------+
>>>>
>>>> Which Spark release are you using ?
>>>>
>>>> Can you pastebin the full stack trace where you got the error ?
>>>>
>>>> Cheers
>>>>
>>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha 
>>>> wrote:
>>>>
>>>>> I have a doubt Michael I tried to use callUDF in  the following code
>>>>> it does not work.
>>>>>
>>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>>
>>>>> Above code does not compile because callUdf() takes only two arguments
>>>>> function name in String and Column class type. Please guide.
>>>>>
>>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
>>>>> wrote:
>>>>>
>>>>>> thanks much Michael let me try.
>>>>>>
>>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>>> mich...@databricks.com> wrote:
>>>>>>
>>>>>>> This is confusing because I made a typo...
>>>>>>>
>>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>>
>>>>>>> The first argument is the name of the UDF, all other arguments need
>>>>>>> to be columns that are passed in as arguments.  lit is just saying to 
>>>>>>> make
>>>>>>> a literal column that always has the value 0.25.

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
SQL context available as sqlContext.

scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
df: org.apache.spark.sql.DataFrame = [id: string, value: int]

scala> df.select(callUDF("percentile_approx",col("value"),
lit(0.25))).show()
+------------------------------+
|'percentile_approx(value,0.25)|
+------------------------------+
|                           1.0|
+------------------------------+

Can you upgrade to 1.5.1 ?

Cheers
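
Hive's percentile_approx also accepts an array of percentages and returns an
array<double>, so several quantiles can be computed in one pass. A sketch,
assuming the same df and a HiveContext-backed sqlContext as above:

import org.apache.spark.sql.functions.{array, callUDF, col, lit}

df.select(
  callUDF("percentile_approx", col("value"),
    array(lit(0.25), lit(0.5), lit(0.75)))).show(false)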

On Mon, Oct 12, 2015 at 11:55 AM, Umesh Kacha  wrote:

> Sorry forgot to tell that I am using Spark 1.4.1 as callUdf is available
> in Spark 1.4.0 as per JAvadocx
>
> On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha 
> wrote:
>
>> Hi Ted thanks much for the detailed answer and appreciate your efforts.
>> Do we need to register Hive UDFs?
>>
>> sqlContext.udf.register("percentile_approx");???//is it valid?
>>
>> I am calling Hive UDF percentile_approx in the following manner which
>> gives compilation error
>>
>> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
>> error
>>
>> //compile error because callUdf() takes String and Column* as arguments.
>>
>> Please guide. Thanks much.
>>
>> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu  wrote:
>>
>>> Using spark-shell, I did the following exercise (master branch) :
>>>
>>>
>>> SQL context available as sqlContext.
>>>
>>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>>> "value")
>>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>>
>>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v * v
>>> + cnst)
>>> res0: org.apache.spark.sql.UserDefinedFunction =
>>> UserDefinedFunction(<function2>,IntegerType,List())
>>>
>>> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
>>> +---+--------------------+
>>> | id|'simpleUDF(value,25)|
>>> +---+--------------------+
>>> |id1|                  26|
>>> |id2|                  41|
>>> |id3|                  50|
>>> +---+--------------------+
>>>
>>> Which Spark release are you using ?
>>>
>>> Can you pastebin the full stack trace where you got the error ?
>>>
>>> Cheers
>>>
>>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha 
>>> wrote:
>>>
>>>> I have a doubt Michael I tried to use callUDF in  the following code it
>>>> does not work.
>>>>
>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>
>>>> Above code does not compile because callUdf() takes only two arguments
>>>> function name in String and Column class type. Please guide.
>>>>
>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
>>>> wrote:
>>>>
>>>>> thanks much Michael let me try.
>>>>>
>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>> mich...@databricks.com> wrote:
>>>>>
>>>>>> This is confusing because I made a typo...
>>>>>>
>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>
>>>>>> The first argument is the name of the UDF, all other arguments need
>>>>>> to be columns that are passed in as arguments.  lit is just saying to 
>>>>>> make
>>>>>> a literal column that always has the value 0.25.
>>>>>>
>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, 
>>>>>> wrote:
>>>>>>
>>>>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any)
>>>>>>> --> becomes a percentile function lit(25)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks for clarification
>>>>>>>
>>>>>>> Saif
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>>> *To:* Ellafi, Saif A.
>>>>>>> *Cc:* Michael Armbrust; user

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
Hi Ted, thanks much for the detailed answer; I appreciate your efforts. Do we
need to register Hive UDFs first, e.g.

sqlContext.udf.register("percentile_approx"); // is this valid?

I am calling the Hive UDF percentile_approx in the following manner, which
gives a compilation error:

df.select("col1").groupBy("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25))); // compile error

// compile error because callUdf() takes String and Column* as arguments.

Please guide. Thanks much.
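
A minimal sketch of the distinction, assuming a HiveContext-backed sqlContext
and the df from Ted's example (table and function names are illustrative):
Hive built-ins such as percentile_approx resolve by name without registration;
sqlContext.udf.register is only for your own Scala/Java functions.

// Hive built-in: no registration needed, the HiveContext resolves the name.
df.registerTempTable("myTable")
sqlContext.sql("SELECT percentile_approx(value, 0.25) FROM myTable").show()

// User-defined function: this is what udf.register is for.
sqlContext.udf.register("squarePlus", (v: Int, c: Int) => v * v + c)
sqlContext.sql("SELECT squarePlus(value, 25) FROM myTable").show()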

On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu  wrote:

> Using spark-shell, I did the following exercise (master branch) :
>
>
> SQL context available as sqlContext.
>
> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>
> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v * v +
> cnst)
> res0: org.apache.spark.sql.UserDefinedFunction =
> UserDefinedFunction(<function2>,IntegerType,List())
>
> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
> +---+--------------------+
> | id|'simpleUDF(value,25)|
> +---+--------------------+
> |id1|                  26|
> |id2|                  41|
> |id3|                  50|
> +---+--------------------+
>
> Which Spark release are you using ?
>
> Can you pastebin the full stack trace where you got the error ?
>
> Cheers
>
> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha  wrote:
>
>> I have a doubt Michael I tried to use callUDF in  the following code it
>> does not work.
>>
>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>
>> Above code does not compile because callUdf() takes only two arguments
>> function name in String and Column class type. Please guide.
>>
>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
>> wrote:
>>
>>> thanks much Michael let me try.
>>>
>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> This is confusing because I made a typo...
>>>>
>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>
>>>> The first argument is the name of the UDF, all other arguments need to
>>>> be columns that are passed in as arguments.  lit is just saying to make a
>>>> literal column that always has the value 0.25.
>>>>
>>>> On Fri, Oct 9, 2015 at 12:16 PM,  wrote:
>>>>
>>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any)
>>>>> --> becomes a percentile function lit(25)
>>>>>
>>>>>
>>>>>
>>>>> Thanks for clarification
>>>>>
>>>>> Saif
>>>>>
>>>>>
>>>>>
>>>>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>> *To:* Ellafi, Saif A.
>>>>> *Cc:* Michael Armbrust; user
>>>>>
>>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>>
>>>>>
>>>>>
>>>>> I found it in 1.3 documentation lit says something else not percent
>>>>>
>>>>>
>>>>>
>>>>> public static Column 
>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>  lit(Object literal)
>>>>>
>>>>> Creates a Column
>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>  of
>>>>> literal value.
>>>>>
>>>>> The passed in object is returned directly if it is already a Column
>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>  also.
>>>>> Otherwise, a new Column
>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>  is
>>>>> created to represent the literal value.
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Oct 10, 2015 at 12:39 AM, 
>>>>> wrote:

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
Sorry, I forgot to mention that I am using Spark 1.4.1, since callUdf is
available from Spark 1.4.0 as per the Javadoc.

On Tue, Oct 13, 2015 at 12:22 AM, Umesh Kacha  wrote:

> Hi Ted thanks much for the detailed answer and appreciate your efforts. Do
> we need to register Hive UDFs?
>
> sqlContext.udf.register("percentile_approx");???//is it valid?
>
> I am calling Hive UDF percentile_approx in the following manner which
> gives compilation error
>
> df.select("col1").groupby("col1").agg(callUdf("percentile_approx",col("col1"),lit(0.25)));//compile
> error
>
> //compile error because callUdf() takes String and Column* as arguments.
>
> Please guide. Thanks much.
>
> On Mon, Oct 12, 2015 at 11:44 PM, Ted Yu  wrote:
>
>> Using spark-shell, I did the following exercise (master branch) :
>>
>>
>> SQL context available as sqlContext.
>>
>> scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id",
>> "value")
>> df: org.apache.spark.sql.DataFrame = [id: string, value: int]
>>
>> scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v * v
>> + cnst)
>> res0: org.apache.spark.sql.UserDefinedFunction =
>> UserDefinedFunction(<function2>,IntegerType,List())
>>
>> scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
>> +---+--------------------+
>> | id|'simpleUDF(value,25)|
>> +---+--------------------+
>> |id1|                  26|
>> |id2|                  41|
>> |id3|                  50|
>> +---+--------------------+
>>
>> Which Spark release are you using ?
>>
>> Can you pastebin the full stack trace where you got the error ?
>>
>> Cheers
>>
>> On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha 
>> wrote:
>>
>>> I have a doubt Michael I tried to use callUDF in  the following code it
>>> does not work.
>>>
>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>
>>> Above code does not compile because callUdf() takes only two arguments
>>> function name in String and Column class type. Please guide.
>>>
>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
>>> wrote:
>>>
>>>> thanks much Michael let me try.
>>>>
>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> This is confusing because I made a typo...
>>>>>
>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>
>>>>> The first argument is the name of the UDF, all other arguments need to
>>>>> be columns that are passed in as arguments.  lit is just saying to make a
>>>>> literal column that always has the value 0.25.
>>>>>
>>>>> On Fri, Oct 9, 2015 at 12:16 PM,  wrote:
>>>>>
>>>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any)
>>>>>> --> becomes a percentile function lit(25)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks for clarification
>>>>>>
>>>>>> Saif
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>> *To:* Ellafi, Saif A.
>>>>>> *Cc:* Michael Armbrust; user
>>>>>>
>>>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>>>
>>>>>>
>>>>>>
>>>>>> I found it in 1.3 documentation lit says something else not percent
>>>>>>
>>>>>>
>>>>>>
>>>>>> public static Column 
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>  lit(Object literal)
>>>>>>
>>>>>> Creates a Column
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>  of
>>>>>> literal value.
>>>>>>
>>>>>> The passed in object is returned directly if it is already a Column
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
Using spark-shell, I did the following exercise (master branch) :


SQL context available as sqlContext.

scala> val df = Seq(("id1", 1), ("id2", 4), ("id3", 5)).toDF("id", "value")
df: org.apache.spark.sql.DataFrame = [id: string, value: int]

scala> sqlContext.udf.register("simpleUDF", (v: Int, cnst: Int) => v * v +
cnst)
res0: org.apache.spark.sql.UserDefinedFunction =
UserDefinedFunction(<function2>,IntegerType,List())

scala> df.select($"id", callUDF("simpleUDF", $"value", lit(25))).show()
+---+--------------------+
| id|'simpleUDF(value,25)|
+---+--------------------+
|id1|                  26|
|id2|                  41|
|id3|                  50|
+---+--------------------+

Which Spark release are you using ?

Can you pastebin the full stack trace where you got the error ?

Cheers
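
For a UDF that is your own Scala function, a hedged alternative to the
name-based lookup is org.apache.spark.sql.functions.udf, which returns a
handle you can apply directly (same df as above):

import org.apache.spark.sql.functions.{lit, udf}

// No callUDF/callUdf involved: the function is applied as a Column expression.
val simpleUDF = udf((v: Int, cnst: Int) => v * v + cnst)
df.select(df("id"), simpleUDF(df("value"), lit(25))).show()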

On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha  wrote:

> I have a doubt Michael I tried to use callUDF in  the following code it
> does not work.
>
> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>
> Above code does not compile because callUdf() takes only two arguments
> function name in String and Column class type. Please guide.
>
> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
> wrote:
>
>> thanks much Michael let me try.
>>
>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> This is confusing because I made a typo...
>>>
>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>
>>> The first argument is the name of the UDF, all other arguments need to
>>> be columns that are passed in as arguments.  lit is just saying to make a
>>> literal column that always has the value 0.25.
>>>
>>> On Fri, Oct 9, 2015 at 12:16 PM,  wrote:
>>>
>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any) -->
>>>> becomes a percentile function lit(25)
>>>>
>>>>
>>>>
>>>> Thanks for clarification
>>>>
>>>> Saif
>>>>
>>>>
>>>>
>>>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>> *To:* Ellafi, Saif A.
>>>> *Cc:* Michael Armbrust; user
>>>>
>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>
>>>>
>>>>
>>>> I found it in 1.3 documentation lit says something else not percent
>>>>
>>>>
>>>>
>>>> public static Column 
>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>  lit(Object literal)
>>>>
>>>> Creates a Column
>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>  of
>>>> literal value.
>>>>
>>>> The passed in object is returned directly if it is already a Column
>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>> If the object is a Scala Symbol, it is converted into a Column
>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>  also.
>>>> Otherwise, a new Column
>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>  is
>>>> created to represent the literal value.
>>>>
>>>>
>>>>
>>>> On Sat, Oct 10, 2015 at 12:39 AM,  wrote:
>>>>
>>>> Where can we find other available functions such as lit() ? I can’t
>>>> find lit in the api.
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>> *To:* unk1102
>>>> *Cc:* user
>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>
>>>>
>>>>
>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from
>>>> dataframes.
>>>>
>>>>
>>>>
>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102  wrote:
>>>>
>>>> Hi how to calculate percentile of a column in a DataFrame? I cant find
>>>> any
>>>> percentile_approx function in Spark aggregation functions. For e.g. in
>>>> Hive
>>>> we have percentile_approx and we can use it in the following way
>>>>
>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from myTable);
>>>>
>>>> I can see ntile function but not sure how it is gonna give results same
>>>> as
>>>> above query please guide.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>


Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
Hi Ted, thanks. If I don't pass the lit function, then how can I tell the
percentile_approx function to give me the 25th or 50th percentile, as we do in
Hive with percentile_approx(mycol, 0.25)?

Regards
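
The reason lit is needed, sketched under the same imports as Ted's examples:
every trailing argument of callUDF must be a Column, so a bare 0.25 does not
type-check, while lit(0.25) wraps the percentage into a constant Column that
percentile_approx receives as its second argument.

df.agg(callUDF("percentile_approx", col("value"), lit(0.25)))   // compiles
// df.agg(callUDF("percentile_approx", col("value"), 0.25))     // type error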

On Mon, Oct 12, 2015 at 7:20 PM, Ted Yu  wrote:

> Umesh:
> Have you tried calling callUdf without the lit() parameter ?
>
> Cheers
>
> On Mon, Oct 12, 2015 at 6:27 AM, Umesh Kacha 
> wrote:
>
>> Hi if you can help it would be great as I am stuck don't know how to
>> remove compilation error in callUdf when we pass three parameters function
>> name string column name as col and lit function please guide
>> On Oct 11, 2015 1:05 AM, "Umesh Kacha"  wrote:
>>
>>> Hi any idea? how do I call percentile_approx using callUdf() please
>>> guide.
>>>
>>> On Sat, Oct 10, 2015 at 1:39 AM, Umesh Kacha 
>>> wrote:
>>>
>>>> I have a doubt Michael I tried to use callUDF in  the following code it
>>>> does not work.
>>>>
>>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>>
>>>> Above code does not compile because callUdf() takes only two arguments
>>>> function name in String and Column class type. Please guide.
>>>>
>>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
>>>> wrote:
>>>>
>>>>> thanks much Michael let me try.
>>>>>
>>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>>> mich...@databricks.com> wrote:
>>>>>
>>>>>> This is confusing because I made a typo...
>>>>>>
>>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>>
>>>>>> The first argument is the name of the UDF, all other arguments need
>>>>>> to be columns that are passed in as arguments.  lit is just saying to 
>>>>>> make
>>>>>> a literal column that always has the value 0.25.
>>>>>>
>>>>>> On Fri, Oct 9, 2015 at 12:16 PM, 
>>>>>> wrote:
>>>>>>
>>>>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any)
>>>>>>> --> becomes a percentile function lit(25)
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks for clarification
>>>>>>>
>>>>>>> Saif
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>>> *To:* Ellafi, Saif A.
>>>>>>> *Cc:* Michael Armbrust; user
>>>>>>>
>>>>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> I found it in 1.3 documentation lit says something else not percent
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> public static Column 
>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>>  lit(Object literal)
>>>>>>>
>>>>>>> Creates a Column
>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>>  of
>>>>>>> literal value.
>>>>>>>
>>>>>>> The passed in object is returned directly if it is already a Column
>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>>  also.
>>>>>>> Otherwise, a new Column
>>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>>  is
>>>>>>> created to represent the literal value.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, 
>>>>>>> wrote:
>>>>>>>
>>>>>>> Where can we find other available functions such as lit() ? I can’t
>>>>>>> find lit in the api.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> Thanks
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>>> *To:* unk1102
>>>>>>> *Cc:* user
>>>>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from
>>>>>>> dataframes.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 
>>>>>>> wrote:
>>>>>>>
>>>>>>> Hi how to calculate percentile of a column in a DataFrame? I cant
>>>>>>> find any
>>>>>>> percentile_approx function in Spark aggregation functions. For e.g.
>>>>>>> in Hive
>>>>>>> we have percentile_approx and we can use it in the following way
>>>>>>>
>>>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from
>>>>>>> myTable);
>>>>>>>
>>>>>>> I can see ntile function but not sure how it is gonna give results
>>>>>>> same as
>>>>>>> above query please guide.
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> --
>>>>>>> View this message in context:
>>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>>> Nabble.com.
>>>>>>>
>>>>>>> -
>>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>
>>>
>


Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
Umesh:
Have you tried calling callUdf without the lit() parameter ?

Cheers

On Mon, Oct 12, 2015 at 6:27 AM, Umesh Kacha  wrote:

> Hi if you can help it would be great as I am stuck don't know how to
> remove compilation error in callUdf when we pass three parameters function
> name string column name as col and lit function please guide
> On Oct 11, 2015 1:05 AM, "Umesh Kacha"  wrote:
>
>> Hi any idea? how do I call percentile_approx using callUdf() please guide.
>>
>> On Sat, Oct 10, 2015 at 1:39 AM, Umesh Kacha 
>> wrote:
>>
>>> I have a doubt Michael I tried to use callUDF in  the following code it
>>> does not work.
>>>
>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>
>>> Above code does not compile because callUdf() takes only two arguments
>>> function name in String and Column class type. Please guide.
>>>
>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
>>> wrote:
>>>
>>>> thanks much Michael let me try.
>>>>
>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> This is confusing because I made a typo...
>>>>>
>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>
>>>>> The first argument is the name of the UDF, all other arguments need to
>>>>> be columns that are passed in as arguments.  lit is just saying to make a
>>>>> literal column that always has the value 0.25.
>>>>>
>>>>> On Fri, Oct 9, 2015 at 12:16 PM,  wrote:
>>>>>
>>>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any)
>>>>>> --> becomes a percentile function lit(25)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks for clarification
>>>>>>
>>>>>> Saif
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>> *To:* Ellafi, Saif A.
>>>>>> *Cc:* Michael Armbrust; user
>>>>>>
>>>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>>>
>>>>>>
>>>>>>
>>>>>> I found it in 1.3 documentation lit says something else not percent
>>>>>>
>>>>>>
>>>>>>
>>>>>> public static Column 
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>  lit(Object literal)
>>>>>>
>>>>>> Creates a Column
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>  of
>>>>>> literal value.
>>>>>>
>>>>>> The passed in object is returned directly if it is already a Column
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>  also.
>>>>>> Otherwise, a new Column
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>  is
>>>>>> created to represent the literal value.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, 
>>>>>> wrote:
>>>>>>
>>>>>> Where can we find other available functions such as lit() ? I can’t
>>>>>> find lit in the api.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>> *To:* unk1102
>>>>>> *Cc:* user
>>>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>>>
>>>>>>
>>>>>>
>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from
>>>>>> dataframes.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 
>>>>>> wrote:
>>>>>>
>>>>>> Hi how to calculate percentile of a column in a DataFrame? I cant
>>>>>> find any
>>>>>> percentile_approx function in Spark aggregation functions. For e.g.
>>>>>> in Hive
>>>>>> we have percentile_approx and we can use it in the following way
>>>>>>
>>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from myTable);
>>>>>>
>>>>>> I can see ntile function but not sure how it is gonna give results
>>>>>> same as
>>>>>> above query please guide.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>> -
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>


Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Richard Eggert
I think the problem may be that callUDF takes a DataType indicating the
return type of the UDF as its second argument.
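
A sketch of the 1.4-era overload Richard is referring to, under the assumption
that the function object comes first, then its return DataType, then the input
Columns (this family of overloads was deprecated in 1.5 in favour of
callUDF(name, cols...)); squarePlus is a hypothetical example function:

import org.apache.spark.sql.functions.{callUDF, lit}
import org.apache.spark.sql.types.DoubleType

// DoubleType declares the UDF's return type, as the second argument.
val squarePlus = (v: Int, c: Int) => (v * v + c).toDouble
df.select(callUDF(squarePlus, DoubleType, df("value"), lit(25)))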
On Oct 12, 2015 9:27 AM, "Umesh Kacha"  wrote:

> Hi if you can help it would be great as I am stuck don't know how to
> remove compilation error in callUdf when we pass three parameters function
> name string column name as col and lit function please guide
> On Oct 11, 2015 1:05 AM, "Umesh Kacha"  wrote:
>
>> Hi any idea? how do I call percentile_approx using callUdf() please guide.
>>
>> On Sat, Oct 10, 2015 at 1:39 AM, Umesh Kacha 
>> wrote:
>>
>>> I have a doubt Michael I tried to use callUDF in  the following code it
>>> does not work.
>>>
>>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>>
>>> Above code does not compile because callUdf() takes only two arguments
>>> function name in String and Column class type. Please guide.
>>>
>>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
>>> wrote:
>>>
>>>> thanks much Michael let me try.
>>>>
>>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>>> mich...@databricks.com> wrote:
>>>>
>>>>> This is confusing because I made a typo...
>>>>>
>>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>>
>>>>> The first argument is the name of the UDF, all other arguments need to
>>>>> be columns that are passed in as arguments.  lit is just saying to make a
>>>>> literal column that always has the value 0.25.
>>>>>
>>>>> On Fri, Oct 9, 2015 at 12:16 PM,  wrote:
>>>>>
>>>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any)
>>>>>> --> becomes a percentile function lit(25)
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks for clarification
>>>>>>
>>>>>> Saif
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>>> *To:* Ellafi, Saif A.
>>>>>> *Cc:* Michael Armbrust; user
>>>>>>
>>>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>>>
>>>>>>
>>>>>>
>>>>>> I found it in 1.3 documentation lit says something else not percent
>>>>>>
>>>>>>
>>>>>>
>>>>>> public static Column 
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>  lit(Object literal)
>>>>>>
>>>>>> Creates a Column
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>  of
>>>>>> literal value.
>>>>>>
>>>>>> The passed in object is returned directly if it is already a Column
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>  also.
>>>>>> Otherwise, a new Column
>>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>>  is
>>>>>> created to represent the literal value.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Sat, Oct 10, 2015 at 12:39 AM, 
>>>>>> wrote:
>>>>>>
>>>>>> Where can we find other available functions such as lit() ? I can’t
>>>>>> find lit in the api.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Thanks
>>>>>>
>>>>>>
>>>>>>
>>>>>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>>> *To:* unk1102
>>>>>> *Cc:* user
>>>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>>>
>>>>>>
>>>>>>
>>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from
>>>>>> dataframes.
>>>>>>
>>>>>>
>>>>>>
>>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 
>>>>>> wrote:
>>>>>>
>>>>>> Hi how to calculate percentile of a column in a DataFrame? I cant
>>>>>> find any
>>>>>> percentile_approx function in Spark aggregation functions. For e.g.
>>>>>> in Hive
>>>>>> we have percentile_approx and we can use it in the following way
>>>>>>
>>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from myTable);
>>>>>>
>>>>>> I can see ntile function but not sure how it is gonna give results
>>>>>> same as
>>>>>> above query please guide.
>>>>>>
>>>>>>
>>>>>>
>>>>>> --
>>>>>> View this message in context:
>>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>>> Nabble.com.
>>>>>>
>>>>>> -
>>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>
>>


Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
Hi, if you can help it would be great, as I am stuck: I don't know how to fix
the compilation error in callUdf when we pass three parameters (the function
name as a String, the column as col, and the lit function). Please guide.
On Oct 11, 2015 1:05 AM, "Umesh Kacha"  wrote:

> Hi any idea? how do I call percentile_approx using callUdf() please guide.
>
> On Sat, Oct 10, 2015 at 1:39 AM, Umesh Kacha 
> wrote:
>
>> I have a doubt Michael I tried to use callUDF in  the following code it
>> does not work.
>>
>> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>>
>> Above code does not compile because callUdf() takes only two arguments
>> function name in String and Column class type. Please guide.
>>
>> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
>> wrote:
>>
>>> thanks much Michael let me try.
>>>
>>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <
>>> mich...@databricks.com> wrote:
>>>
>>>> This is confusing because I made a typo...
>>>>
>>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>>
>>>> The first argument is the name of the UDF, all other arguments need to
>>>> be columns that are passed in as arguments.  lit is just saying to make a
>>>> literal column that always has the value 0.25.
>>>>
>>>> On Fri, Oct 9, 2015 at 12:16 PM,  wrote:
>>>>
>>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any)
>>>>> --> becomes a percentile function lit(25)
>>>>>
>>>>>
>>>>>
>>>>> Thanks for clarification
>>>>>
>>>>> Saif
>>>>>
>>>>>
>>>>>
>>>>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>>> *To:* Ellafi, Saif A.
>>>>> *Cc:* Michael Armbrust; user
>>>>>
>>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>>
>>>>>
>>>>>
>>>>> I found it in 1.3 documentation lit says something else not percent
>>>>>
>>>>>
>>>>>
>>>>> public static Column 
>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>  lit(Object literal)
>>>>>
>>>>> Creates a Column
>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>  of
>>>>> literal value.
>>>>>
>>>>> The passed in object is returned directly if it is already a Column
>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>>> If the object is a Scala Symbol, it is converted into a Column
>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>  also.
>>>>> Otherwise, a new Column
>>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>>  is
>>>>> created to represent the literal value.
>>>>>
>>>>>
>>>>>
>>>>> On Sat, Oct 10, 2015 at 12:39 AM, 
>>>>> wrote:
>>>>>
>>>>> Where can we find other available functions such as lit() ? I can’t
>>>>> find lit in the api.
>>>>>
>>>>>
>>>>>
>>>>> Thanks
>>>>>
>>>>>
>>>>>
>>>>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>>> *To:* unk1102
>>>>> *Cc:* user
>>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>>
>>>>>
>>>>>
>>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from
>>>>> dataframes.
>>>>>
>>>>>
>>>>>
>>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102 
>>>>> wrote:
>>>>>
>>>>> Hi how to calculate percentile of a column in a DataFrame? I cant find
>>>>> any
>>>>> percentile_approx function in Spark aggregation functions. For e.g. in
>>>>> Hive
>>>>> we have percentile_approx and we can use it in the following way
>>>>>
>>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from myTable);
>>>>>
>>>>> I can see ntile function but not sure how it is gonna give results
>>>>> same as
>>>>> above query please guide.
>>>>>
>>>>>
>>>>>
>>>>> --
>>>>> View this message in context:
>>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>>> Sent from the Apache Spark User List mailing list archive at
>>>>> Nabble.com.
>>>>>
>>>>> -
>>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>
>


Re: How to calculate percentile of a column of DataFrame?

2015-10-10 Thread Umesh Kacha
Hi, any idea how I can call percentile_approx using callUdf()? Please guide.

On Sat, Oct 10, 2015 at 1:39 AM, Umesh Kacha  wrote:

> I have a doubt Michael I tried to use callUDF in  the following code it
> does not work.
>
> sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))
>
> Above code does not compile because callUdf() takes only two arguments
> function name in String and Column class type. Please guide.
>
> On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha 
> wrote:
>
>> thanks much Michael let me try.
>>
>> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <mich...@databricks.com> wrote:
>>
>>> This is confusing because I made a typo...
>>>
>>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>>
>>> The first argument is the name of the UDF, all other arguments need to
>>> be columns that are passed in as arguments.  lit is just saying to make a
>>> literal column that always has the value 0.25.
>>>
>>> On Fri, Oct 9, 2015 at 12:16 PM,  wrote:
>>>
>>>> Yes but I mean, this is rather curious. How is def lit(literal:Any) -->
>>>> becomes a percentile function lit(25)
>>>>
>>>>
>>>>
>>>> Thanks for clarification
>>>>
>>>> Saif
>>>>
>>>>
>>>>
>>>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>>> *To:* Ellafi, Saif A.
>>>> *Cc:* Michael Armbrust; user
>>>>
>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>
>>>>
>>>>
>>>> I found it in 1.3 documentation lit says something else not percent
>>>>
>>>>
>>>>
>>>> public static Column 
>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>  lit(Object literal)
>>>>
>>>> Creates a Column
>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>  of
>>>> literal value.
>>>>
>>>> The passed in object is returned directly if it is already a Column
>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>>> If the object is a Scala Symbol, it is converted into a Column
>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>  also.
>>>> Otherwise, a new Column
>>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>>  is
>>>> created to represent the literal value.
>>>>
>>>>
>>>>
>>>> On Sat, Oct 10, 2015 at 12:39 AM,  wrote:
>>>>
>>>> Where can we find other available functions such as lit() ? I can’t
>>>> find lit in the api.
>>>>
>>>>
>>>>
>>>> Thanks
>>>>
>>>>
>>>>
>>>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>>> *To:* unk1102
>>>> *Cc:* user
>>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>>
>>>>
>>>>
>>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from
>>>> dataframes.
>>>>
>>>>
>>>>
>>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102  wrote:
>>>>
>>>> Hi how to calculate percentile of a column in a DataFrame? I cant find
>>>> any
>>>> percentile_approx function in Spark aggregation functions. For e.g. in
>>>> Hive
>>>> we have percentile_approx and we can use it in the following way
>>>>
>>>> hiveContext.sql("select percentile_approx("mycol",0.25) from myTable);
>>>>
>>>> I can see ntile function but not sure how it is gonna give results same
>>>> as
>>>> above query please guide.
>>>>
>>>>
>>>>
>>>> --
>>>> View this message in context:
>>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>>
>>>> -
>>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>>
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>


Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Umesh Kacha
I have a doubt, Michael: I tried to use callUDF in the following code and it
does not work.

sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))

The above code does not compile because callUdf() takes only two kinds of
arguments: the function name as a String and the Column class type. Please
guide.

On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha  wrote:

> thanks much Michael let me try.
>
> On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust 
> wrote:
>
>> This is confusing because I made a typo...
>>
>> callUDF("percentile_approx", col("mycol"), lit(0.25))
>>
>> The first argument is the name of the UDF, all other arguments need to be
>> columns that are passed in as arguments.  lit is just saying to make a
>> literal column that always has the value 0.25.
>>
>> On Fri, Oct 9, 2015 at 12:16 PM,  wrote:
>>
>>> Yes but I mean, this is rather curious. How is def lit(literal:Any) -->
>>> becomes a percentile function lit(25)
>>>
>>>
>>>
>>> Thanks for clarification
>>>
>>> Saif
>>>
>>>
>>>
>>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>>> *Sent:* Friday, October 09, 2015 4:10 PM
>>> *To:* Ellafi, Saif A.
>>> *Cc:* Michael Armbrust; user
>>>
>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>
>>>
>>>
>>> I found it in 1.3 documentation lit says something else not percent
>>>
>>>
>>>
>>> public static Column 
>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>  lit(Object literal)
>>>
>>> Creates a Column
>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>  of
>>> literal value.
>>>
>>> The passed in object is returned directly if it is already a Column
>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>>> If the object is a Scala Symbol, it is converted into a Column
>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>  also.
>>> Otherwise, a new Column
>>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>>  is
>>> created to represent the literal value.
>>>
>>>
>>>
>>> On Sat, Oct 10, 2015 at 12:39 AM,  wrote:
>>>
>>> Where can we find other available functions such as lit() ? I can’t find
>>> lit in the api.
>>>
>>>
>>>
>>> Thanks
>>>
>>>
>>>
>>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>>> *Sent:* Friday, October 09, 2015 4:04 PM
>>> *To:* unk1102
>>> *Cc:* user
>>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>>
>>>
>>>
>>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from
>>> dataframes.
>>>
>>>
>>>
>>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102  wrote:
>>>
>>> Hi how to calculate percentile of a column in a DataFrame? I cant find
>>> any
>>> percentile_approx function in Spark aggregation functions. For e.g. in
>>> Hive
>>> we have percentile_approx and we can use it in the following way
>>>
>>> hiveContext.sql("select percentile_approx("mycol",0.25) from myTable);
>>>
>>> I can see ntile function but not sure how it is gonna give results same
>>> as
>>> above query please guide.
>>>
>>>
>>>
>>> --
>>> View this message in context:
>>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>>
>>> -
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org
>>>
>>>
>>>
>>>
>>>
>>
>>
>


Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Umesh Kacha
thanks much Michael let me try.

On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust 
wrote:

> This is confusing because I made a typo...
>
> callUDF("percentile_approx", col("mycol"), lit(0.25))
>
> The first argument is the name of the UDF, all other arguments need to be
> columns that are passed in as arguments.  lit is just saying to make a
> literal column that always has the value 0.25.
>
> On Fri, Oct 9, 2015 at 12:16 PM,  wrote:
>
>> Yes but I mean, this is rather curious. How is def lit(literal:Any) -->
>> becomes a percentile function lit(25)
>>
>>
>>
>> Thanks for clarification
>>
>> Saif
>>
>>
>>
>> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
>> *Sent:* Friday, October 09, 2015 4:10 PM
>> *To:* Ellafi, Saif A.
>> *Cc:* Michael Armbrust; user
>>
>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>
>>
>>
>> I found it in 1.3 documentation lit says something else not percent
>>
>>
>>
>> public static Column 
>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>  lit(Object literal)
>>
>> Creates a Column
>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>  of
>> literal value.
>>
>> The passed in object is returned directly if it is already a Column
>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
>> If the object is a Scala Symbol, it is converted into a Column
>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>  also.
>> Otherwise, a new Column
>> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>>  is
>> created to represent the literal value.
>>
>>
>>
>> On Sat, Oct 10, 2015 at 12:39 AM,  wrote:
>>
>> Where can we find other available functions such as lit() ? I can’t find
>> lit in the api.
>>
>>
>>
>> Thanks
>>
>>
>>
>> *From:* Michael Armbrust [mailto:mich...@databricks.com]
>> *Sent:* Friday, October 09, 2015 4:04 PM
>> *To:* unk1102
>> *Cc:* user
>> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>>
>>
>>
>> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from
>> dataframes.
>>
>>
>>
>> On Fri, Oct 9, 2015 at 12:01 PM, unk1102  wrote:
>>
>> Hi how to calculate percentile of a column in a DataFrame? I cant find any
>> percentile_approx function in Spark aggregation functions. For e.g. in
>> Hive
>> we have percentile_approx and we can use it in the following way
>>
>> hiveContext.sql("select percentile_approx("mycol",0.25) from myTable);
>>
>> I can see ntile function but not sure how it is gonna give results same as
>> above query please guide.
>>
>>
>>
>> --
>> View this message in context:
>> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
>> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>>
>>
>>
>
>


Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Michael Armbrust
This is confusing because I made a typo...

callUDF("percentile_approx", col("mycol"), lit(0.25))

The first argument is the name of the UDF; all remaining arguments must be
Columns passed to it. lit is just saying to make a literal column that always
has the value 0.25.
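
To make the lit point concrete (df as in the earlier transcripts): lit(0.25)
is an ordinary constant Column, evaluated to the same value on every row, and
percentile_approx simply receives it as its percentage argument.

import org.apache.spark.sql.functions.lit

df.select(df("id"), lit(0.25).as("pct")).show()
// +---+----+
// | id| pct|
// +---+----+
// |id1|0.25|
// |id2|0.25|
// |id3|0.25|
// +---+----+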

On Fri, Oct 9, 2015 at 12:16 PM,  wrote:

> Yes but I mean, this is rather curious. How is def lit(literal:Any) -->
> becomes a percentile function lit(25)
>
>
>
> Thanks for clarification
>
> Saif
>
>
>
> *From:* Umesh Kacha [mailto:umesh.ka...@gmail.com]
> *Sent:* Friday, October 09, 2015 4:10 PM
> *To:* Ellafi, Saif A.
> *Cc:* Michael Armbrust; user
>
> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>
>
>
> I found it in 1.3 documentation lit says something else not percent
>
>
>
> public static Column 
> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>  lit(Object literal)
>
> Creates a Column
> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>  of
> literal value.
>
> The passed in object is returned directly if it is already a Column
> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>.
> If the object is a Scala Symbol, it is converted into a Column
> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>  also.
> Otherwise, a new Column
> <https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html>
>  is
> created to represent the literal value.
>
>
>
> On Sat, Oct 10, 2015 at 12:39 AM,  wrote:
>
> Where can we find other available functions such as lit() ? I can’t find
> lit in the api.
>
>
>
> Thanks
>
>
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* Friday, October 09, 2015 4:04 PM
> *To:* unk1102
> *Cc:* user
> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>
>
>
> You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from
> dataframes.
>
>
>
> On Fri, Oct 9, 2015 at 12:01 PM, unk1102  wrote:
>
> Hi how to calculate percentile of a column in a DataFrame? I cant find any
> percentile_approx function in Spark aggregation functions. For e.g. in Hive
> we have percentile_approx and we can use it in the following way
>
> hiveContext.sql("select percentile_approx("mycol",0.25) from myTable);
>
> I can see ntile function but not sure how it is gonna give results same as
> above query please guide.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>
>
>


RE: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Saif.A.Ellafi
Yes, but I mean, this is rather curious: how does def lit(literal: Any) become
a percentile argument, lit(0.25)?

Thanks for clarification
Saif

From: Umesh Kacha [mailto:umesh.ka...@gmail.com]
Sent: Friday, October 09, 2015 4:10 PM
To: Ellafi, Saif A.
Cc: Michael Armbrust; user
Subject: Re: How to calculate percentile of a column of DataFrame?

I found it in 1.3 documentation lit says something else not percent


public static Column lit(Object literal)

Creates a Column of literal value.

The passed in object is returned directly if it is already a Column. If the
object is a Scala Symbol, it is converted into a Column also. Otherwise, a new
Column is created to represent the literal value.
(Column: https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html)

On Sat, Oct 10, 2015 at 12:39 AM, <saif.a.ell...@wellsfargo.com> wrote:
Where can we find other available functions such as lit() ? I can’t find lit in 
the api.

Thanks

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, October 09, 2015 4:04 PM
To: unk1102
Cc: user
Subject: Re: How to calculate percentile of a column of DataFrame?

You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from dataframes.

On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <umesh.ka...@gmail.com> wrote:
Hi, how do I calculate the percentile of a column in a DataFrame? I can't find any
percentile_approx function among Spark's aggregation functions. In Hive, for example,
we have percentile_approx, and we can use it in the following way:

hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable")

I can see the ntile function, but I'm not sure how it would give the same results as
the query above. Please advise.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org




Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Umesh Kacha
I found it in the 1.3 documentation; lit says something else, not percentile:

public static Column lit(Object literal)

Creates a Column of literal value.

The passed in object is returned directly if it is already a Column. If the object is
a Scala Symbol, it is converted into a Column also. Otherwise, a new Column is created
to represent the literal value.

(Column: https://spark.apache.org/docs/1.3.1/api/java/org/apache/spark/sql/Column.html)

On Sat, Oct 10, 2015 at 12:39 AM,  wrote:

> Where can we find other available functions such as lit()? I can’t find
> lit in the API.
>
>
>
> Thanks
>
>
>
> *From:* Michael Armbrust [mailto:mich...@databricks.com]
> *Sent:* Friday, October 09, 2015 4:04 PM
> *To:* unk1102
> *Cc:* user
> *Subject:* Re: How to calculate percentile of a column of DataFrame?
>
>
>
> You can use callUDF("percentile_approx", col("mycol"), lit(0.25)) to call
> Hive UDFs from DataFrames.
>
>
>
> On Fri, Oct 9, 2015 at 12:01 PM, unk1102  wrote:
>
> Hi, how do I calculate the percentile of a column in a DataFrame? I can't find
> any percentile_approx function among Spark's aggregation functions. In Hive,
> for example, we have percentile_approx, and we can use it in the following way:
>
> hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable")
>
> I can see the ntile function, but I'm not sure how it would give the same
> results as the query above. Please advise.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>
>


RE: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Saif.A.Ellafi
Where can we find other available functions such as lit()? I can’t find lit in
the API.

Thanks

From: Michael Armbrust [mailto:mich...@databricks.com]
Sent: Friday, October 09, 2015 4:04 PM
To: unk1102
Cc: user
Subject: Re: How to calculate percentile of a column of DataFrame?

You can use callUDF("percentile_approx", col("mycol"), lit(0.25)) to call Hive UDFs from DataFrames.

On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <umesh.ka...@gmail.com> wrote:
Hi, how do I calculate the percentile of a column in a DataFrame? I can't find any
percentile_approx function among Spark's aggregation functions. In Hive, for example,
we have percentile_approx, and we can use it in the following way:

hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable")

I can see the ntile function, but I'm not sure how it would give the same results as
the query above. Please advise.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Michael Armbrust
You can use callUDF("percentile_approx", col("mycol"), lit(0.25)) to call
Hive UDFs from DataFrames.

On Fri, Oct 9, 2015 at 12:01 PM, unk1102  wrote:

> Hi, how do I calculate the percentile of a column in a DataFrame? I can't find
> any percentile_approx function among Spark's aggregation functions. In Hive,
> for example, we have percentile_approx, and we can use it in the following way:
>
> hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable")
>
> I can see the ntile function, but I'm not sure how it would give the same
> results as the query above. Please advise.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


How to calculate percentile of a column of DataFrame?

2015-10-09 Thread unk1102
Hi, how do I calculate the percentile of a column in a DataFrame? I can't find any
percentile_approx function among Spark's aggregation functions. In Hive, for example,
we have percentile_approx, and we can use it in the following way:

hiveContext.sql("select percentile_approx(mycol, 0.25) from myTable")

I can see the ntile function, but I'm not sure how it would give the same results as
the query above. Please advise.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-calculate-percentile-of-a-column-of-DataFrame-tp25000.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



RE: Percentile example

2015-02-17 Thread SiMaYunRui
Thanks, Kohler, that's a very interesting approach. I've never used Spark SQL and I'm
not sure whether my cluster is configured for it, but I will definitely give it a
try. 😃

From: c.koh...@elsevier.com
To: myl...@hotmail.com; user@spark.apache.org
Subject: Re: Percentile example
Date: Tue, 17 Feb 2015 17:41:53 +

The best approach I’ve found to calculate percentiles in Spark is to leverage
Spark SQL. If you use the Hive Query Language support, you can use the UDAFs for
percentiles (as of Spark 1.2).

Something like this (note: syntax not guaranteed to run, but it should give you the
gist of what you need to do):

JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaHiveContext hsc = new JavaHiveContext(sc);
// Get your data into a SchemaRDD and register the table

// Query it
String hql = "SELECT FIELD1, FIELD2, percentile(FIELD3, 0.05) AS ptile5 "
    + "FROM TABLE_NAME GROUP BY FIELD1, FIELD2";
JavaSchemaRDD result = hsc.hql(hql);
List<Row> grp = result.collect();

for (Row row : grp) {
  for (int z = 2; z < row.length(); z++) {
    // Do something with each result field
  }
}

Curt

From: SiMaYunRui
Date: Sunday, February 15, 2015 at 10:37 AM
To: "user@spark.apache.org"
Subject: Percentile example

hello,

I am a newbie to Spark and am trying to figure out how to compute a percentile over a
big data set. I googled this topic but did not find any very useful code examples or
explanations. It seems that I can use the sortByKey transformation to get my data set
in order, but I am not sure how to get the value of, say, the 66th percentile.

Should I use take() to pick out the value at the 66th percentile? I don't believe any
single machine can load my data set into memory, so there must be more efficient
approaches.

Can anyone shed some light on this problem?

Regards

RE: Percentile example

2015-02-17 Thread SiMaYunRui
Thanks, Imran, for the very detailed explanations and options. I think T-Digest is
what I want for now.

From: iras...@cloudera.com
Date: Tue, 17 Feb 2015 08:39:48 -0600
Subject: Re: Percentile example
To: myl...@hotmail.com
CC: user@spark.apache.org

(trying to repost to the list w/out URLs -- rejected as spam earlier)

Hi,

Using take() is not a good idea; as you have noted, it will pull a lot of data down
to the driver, so it's not scalable. Here are some more scalable alternatives:

1. Approximate solutions

1a. Sample the data. Just sample some of the data to the driver, sort that data in
memory, and take the 66th percentile of that sample.

1b. Make a histogram with pre-determined buckets. E.g., if you know your data ranges
from 0 to 1 and is uniform-ish, you could make buckets every 0.01. Then count how many
data points go into each bucket. Or, if you only care about relative error and you
have integers (often the case if your data is counts), then you can span the full
range of integers with a relatively small number of buckets. E.g., you only need 200
buckets for 5% error. See the Histogram class in twitter's Ostrich library.

The problem is, if you have no idea what the distribution of your data is, it's very
hard to come up with good buckets; you could have an arbitrary amount of data going
to one bucket, and thus tons of error.

1c. Use a TDigest, a compact & scalable data structure for approximating
distributions, which performs reasonably across a wide range of distributions. You
would make one TDigest for each partition (with mapPartitions), and then merge all of
the TDigests together. I wrote up a little more detail on this earlier; you can
search spark-user on nabble for "tdigest".

2. Exact solutions. There are also a few options here, but I'll give one that is a
variant of what you suggested. Start out by doing a sortByKey. Then figure out how
many records you have in each partition (with mapPartitions). Figure out which
partition the 66th percentile would be in. Then just read the one partition you want,
and go down to the Nth record in that partition.

To read the one partition you want, you can either (a) use mapPartitionsWithIndex,
and just ignore every partition that isn't the one you want, or (b) use
PartitionPruningRDD. PartitionPruningRDD will avoid launching empty tasks on the
other partitions, so it will be slightly more efficient, but it's also a developer
API, so perhaps not worth going to that level of detail.

Note that internally, sortByKey will sample your data to get an approximate
distribution, to figure out what data to put in each partition. However, you're still
getting an exact answer this way -- the approximation is only important for
distributing work among all executors. Even if the approximation is inaccurate,
you'll still correct for it; you will just have unequal partitions.

Imran

On Sun, Feb 15, 2015 at 9:37 AM, SiMaYunRui  wrote:



hello,

I am a newbie to Spark and am trying to figure out how to compute a percentile over a
big data set. I googled this topic but did not find any very useful code examples or
explanations. It seems that I can use the sortByKey transformation to get my data set
in order, but I am not sure how to get the value of, say, the 66th percentile.

Should I use take() to pick out the value at the 66th percentile? I don't believe any
single machine can load my data set into memory, so there must be more efficient
approaches.

Can anyone shed some light on this problem?

Regards

Re: Percentile example

2015-02-17 Thread Kohler, Curt E (ELS-STL)


The best approach I've found to calculate percentiles in Spark is to leverage
Spark SQL. If you use the Hive Query Language support, you can use the UDAFs for
percentiles (as of Spark 1.2).

Something like this (Note: syntax not guaranteed to run but should give you the 
gist of what you need to do):


JavaSparkContext sc = new JavaSparkContext(sparkConf);
JavaHiveContext hsc = new JavaHiveContext(sc);
// Get your data into a SchemaRDD and register the table

// Query it
String hql = "SELECT FIELD1, FIELD2, percentile(FIELD3, 0.05) AS ptile5 "
    + "FROM TABLE_NAME GROUP BY FIELD1, FIELD2";
JavaSchemaRDD result = hsc.hql(hql);
List<Row> grp = result.collect();

for (Row row : grp) {
  for (int z = 2; z < row.length(); z++) {
    // Do something with each result field
  }
}

Curt


From: SiMaYunRui <myl...@hotmail.com>
Date: Sunday, February 15, 2015 at 10:37 AM
To: "user@spark.apache.org" <user@spark.apache.org>
Subject: Percentile example
Subject: Percentile example

hello,

I am a newbie to Spark and am trying to figure out how to compute a percentile over a
big data set. I googled this topic but did not find any very useful code examples or
explanations. It seems that I can use the sortByKey transformation to get my data set
in order, but I am not sure how to get the value of, say, the 66th percentile.

Should I use take() to pick out the value at the 66th percentile? I don't believe any
single machine can load my data set into memory, so there must be more efficient
approaches.

Can anyone shed some light on this problem?

Regards



Re: Percentile example

2015-02-17 Thread Imran Rashid
(trying to repost to the list w/out URLs -- rejected as spam earlier)

Hi,

Using take() is not a good idea; as you have noted, it will pull a lot of
data down to the driver, so it's not scalable. Here are some more scalable
alternatives:

1. Approximate solutions

1a. Sample the data.  Just sample some of the data to the driver, sort that
data in memory, and take the 66th percentile of that sample.

1b. Make a histogram with pre-determined buckets. E.g., if you know your
data ranges from 0 to 1 and is uniform-ish, you could make buckets every
0.01. Then count how many data points go into each bucket. Or, if you only
care about relative error and you have integers (often the case if your
data is counts), then you can span the full range of integers with a
relatively small number of buckets. E.g., you only need 200 buckets for 5%
error. See the Histogram class in twitter's Ostrich library.

The problem is, if you have no idea what the distribution of your data is,
it's very hard to come up with good buckets; you could have an arbitrary
amount of data going to one bucket, and thus tons of error.

1c. Use a TDigest, a compact & scalable data structure for approximating
distributions, which performs reasonably across a wide range of
distributions. You would make one TDigest for each partition (with
mapPartitions), and then merge all of the TDigests together. I wrote up a
little more detail on this earlier; you can search spark-user on nabble
for "tdigest".

2. Exact solutions. There are also a few options here, but I'll give one
that is a variant of what you suggested. Start out by doing a sortByKey.
Then figure out how many records you have in each partition (with
mapPartitions). Figure out which partition the 66th percentile would be
in. Then just read the one partition you want, and go down to the Nth
record in that partition.

To read the one partition you want, you can either (a) use
mapPartitionsWithIndex, and just ignore every partition that isn't the one
you want, or (b) use PartitionPruningRDD. PartitionPruningRDD will avoid
launching empty tasks on the other partitions, so it will be slightly more
efficient, but it's also a developer API, so perhaps not worth going to that
level of detail.

Note that internally, sortByKey will sample your data to get an approximate
distribution, to figure out what data to put in each partition. However,
you're still getting an exact answer this way -- the approximation is only
important for distributing work among all executors. Even if the
approximation is inaccurate, you'll still correct for it; you will just
have unequal partitions.
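
For illustration, a minimal Scala sketch of this exact approach (assuming an
RDD[Double] named data and a target quantile q in (0, 1]; all names are
illustrative):

// 1. Sort, so each partition holds a contiguous range in order.
val sorted = data.map(x => (x, ())).sortByKey()
// 2. Count records per partition.
val counts = sorted
  .mapPartitions(it => Iterator(it.size.toLong), preservesPartitioning = true)
  .collect()
val total  = counts.sum
val target = math.min(total - 1, math.max(0L, math.ceil(q * total).toLong - 1))
// 3. Locate the partition holding the target index, and the offset within it.
val cum  = counts.scanLeft(0L)(_ + _)      // records before each partition
val part = cum.lastIndexWhere(_ <= target)
val off  = (target - cum(part)).toInt
// 4. Read just that one partition, down to the Nth record.
val Array(quantileValue) = sorted.mapPartitionsWithIndex { (i, it) =>
  if (i == part) it.map(_._1).slice(off, off + 1) else Iterator.empty
}.collect()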

Imran


> On Sun, Feb 15, 2015 at 9:37 AM, SiMaYunRui  wrote:
>
>> hello,
>>
>> I am a newbie to Spark and am trying to figure out how to compute a percentile
>> over a big data set. I googled this topic but did not find any very useful
>> code examples or explanations. It seems that I can use the sortByKey
>> transformation to get my data set in order, but I am not sure how to get the
>> value of, say, the 66th percentile.
>>
>> Should I use take() to pick out the value at the 66th percentile? I don't
>> believe any single machine can load my data set into memory, so there must be
>> more efficient approaches.
>>
>> Can anyone shed some light on this problem?
>>
>> *Regards*
>>
>>
>


Percentile example

2015-02-15 Thread SiMaYunRui
hello, 
I am a newbie to Spark and am trying to figure out how to compute a percentile over a
big data set. I googled this topic but did not find any very useful code examples or
explanations. It seems that I can use the sortByKey transformation to get my data set
in order, but I am not sure how to get the value of, say, the 66th percentile.

Should I use take() to pick out the value at the 66th percentile? I don't believe any
single machine can load my data set into memory, so there must be more efficient
approaches.
Can anyone shed some light on this problem? 
Regards
  

Re: Percentile Calculation

2015-01-28 Thread Kohler, Curt E (ELS-STL)
When I looked at this last fall, the only way that seemed to be available was to
transform my data into SchemaRDDs, register them as tables, and then use the Hive
processor to calculate them with its built-in percentile UDFs that were added in 1.2.
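
For illustration, that flow might look like this in Scala (a sketch assuming Spark
1.2 built with Hive support; the data source and names are hypothetical):

import org.apache.spark.sql.hive.HiveContext

val hc = new HiveContext(sc)
// Get your data into a SchemaRDD however is convenient, then register it.
val schemaRdd = hc.jsonFile("data.json")
schemaRdd.registerTempTable("mytable")
// Hive's percentile UDAFs are then available through the Hive query support.
val p95 = hc.sql("SELECT percentile_approx(mycol, 0.95) FROM mytable").collect()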


Curt


From: kundan kumar 
Sent: Wednesday, January 28, 2015 8:13 AM
To: user@spark.apache.org
Subject: Percentile Calculation

Is there any inbuilt function for calculating percentile over a dataset ?

I want to calculate the percentiles for each column in my data.


Regards,
Kundan


Percentile Calculation

2015-01-28 Thread kundan kumar
Is there any inbuilt function for calculating percentile over a dataset ?

I want to calculate the percentiles for each column in my data.


Regards,
Kundan


Re: status of spark analytics functions? over, rank, percentile, row_number, etc.

2015-01-12 Thread Kevin Burton
Great. I’d love to help out. Is there any documentation on what you’re
working on that I can take a look at?

My biggest issue is that I need some way to compute the position of an
entry under an ORDER BY… which I can do with the RANK operator.

What I essentially need is:

select source, indegree, rank() over (order by indegree desc) from foo
order by indegree desc

This would give me the position of the record in the whole index and the
table sorted by indegree desc.
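
For reference, a minimal sketch of what this looks like once window functions land
(Spark 1.4+ DataFrame API, per SPARK-1442; foo and the column names here are
illustrative):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, rank}

// rank() over (order by indegree desc), expressed in the DataFrame API.
val w = Window.orderBy(col("indegree").desc)
val ranked = foo
  .select(col("source"), col("indegree"), rank().over(w).as("rank"))
  .orderBy(col("indegree").desc)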

I was using RANK in Pig, but we’re ditching Hadoop/Pig in favor of Spark.

I assume you’re implementing something similar to this:

https://issues.apache.org/jira/browse/SPARK-1442

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

https://issues.apache.org/jira/browse/HIVE-4197

https://issues.apache.org/jira/browse/HIVE-896

On Sat, Jan 10, 2015 at 5:00 PM, Will Benton  wrote:

> Hi Kevin,
>
> I'm currently working on implementing windowing.  If you'd like to see
> something that's not covered by a JIRA, please file one!
>
>
> best,
> wb
>
> - Original Message -
> > From: "Kevin Burton" 
> > To: user@spark.apache.org
> > Sent: Saturday, January 10, 2015 12:12:38 PM
> > Subject: status of spark analytics functions? over, rank, percentile,
> row_number, etc.
> >
> > I’m curious what the status is of implementing Hive analytics functions in
> > Spark.
> >
> >
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
> >
> > Many of these seem missing.  I’m assuming they’re not implemented yet?
> >
> > Is there an ETA on them?
> >
> > or am I the first to bring this up? :-P
> >
> > --
> >
> > Founder/CEO Spinn3r.com
> > Location: *San Francisco, CA*
> > blog: http://burtonator.wordpress.com
> > … or check out my Google+ profile
> > <https://plus.google.com/102718274791889610666/posts>
> > <http://spinn3r.com>
> >
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile
<https://plus.google.com/102718274791889610666/posts>
<http://spinn3r.com>


Re: status of spark analytics functions? over, rank, percentile, row_number, etc.

2015-01-10 Thread Will Benton
Hi Kevin,

I'm currently working on implementing windowing.  If you'd like to see 
something that's not covered by a JIRA, please file one!


best,
wb

- Original Message -
> From: "Kevin Burton" 
> To: user@spark.apache.org
> Sent: Saturday, January 10, 2015 12:12:38 PM
> Subject: status of spark analytics functions? over, rank, percentile, 
> row_number, etc.
> 
> I’m curious what the status is of implementing Hive analytics functions in
> Spark.
> 
> https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics
> 
> Many of these seem missing.  I’m assuming they’re not implemented yet?
> 
> Is there an ETA on them?
> 
> or am I the first to bring this up? :-P
> 
> --
> 
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> <https://plus.google.com/102718274791889610666/posts>
> <http://spinn3r.com>
> 

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



status of spark analytics functions? over, rank, percentile, row_number, etc.

2015-01-10 Thread Kevin Burton
I’m curious what the status is of implementing Hive analytics functions in
Spark.

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics

Many of these seem missing.  I’m assuming they’re not implemented yet?

Is there an ETA on them?

or am I the first to bring this up? :-P

-- 

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile




Re: Percentile

2014-11-29 Thread Imran Rashid
Hi Franco,

As a fast approximate way to get probability distributions, you might be
interested in t-digests:

https://github.com/tdunning/t-digest

In one pass, you could make a t-digest for each variable, to get its
distribution.  And after that, you could make another pass to map each data
point to its percentile in the distribution.

to create the tdigests, you would do something like this:

import java.nio.ByteBuffer
import com.tdunning.math.stats.{ArrayDigest, TDigest}

val myDataRDD = ...

myDataRDD.mapPartitions{ itr =>
  val xDistribution = TDigest.createArrayDigest(32, 100)
  val yDistribution = TDigest.createArrayDigest(32, 100)
  ...
  itr.foreach{ data =>
    xDistribution.add(data.x)
    yDistribution.add(data.y)
    ...
  }

  Seq(
    "x" -> xDistribution,
    "y" -> yDistribution
  ).toIterator.map{ case (k, v) =>
    // t-digests are not directly serializable, so ship them as bytes
    val arr = new Array[Byte](v.byteSize)
    v.asBytes(ByteBuffer.wrap(arr))
    k -> arr
  }
}.reduceByKey{ case (t1Arr, t2Arr) =>
  // deserialize both digests, merge, and re-serialize
  val merged = ArrayDigest.fromBytes(ByteBuffer.wrap(t1Arr))
  merged.add(ArrayDigest.fromBytes(ByteBuffer.wrap(t2Arr)))
  val arr = new Array[Byte](merged.byteSize)
  merged.asBytes(ByteBuffer.wrap(arr))
  arr
}


(the complication there is just that tdigests are not directly
serializable, so I need to do the manual work of converting to and from an
array of bytes).
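
Reading percentiles back out afterwards is a matter of deserializing a merged digest
and calling the t-digest library's quantile and cdf methods; a hedged sketch,
assuming the pair RDD produced above is bound to a val digests and someX is a data
point of interest:

val byName  = digests.collectAsMap()
val xDigest = ArrayDigest.fromBytes(ByteBuffer.wrap(byName("x")))
val p66     = xDigest.quantile(0.66)   // value at the 66th percentile
val pct     = xDigest.cdf(someX)       // percentile (as a fraction) of a point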


On Thu, Nov 27, 2014 at 9:28 AM, Franco Barrientos <
franco.barrien...@exalitica.com> wrote:

> Hi folks!,
>
>
>
> Does anyone know how I can calculate, for each element of a variable in an RDD,
> its percentile? I tried to calculate it through Spark SQL with subqueries, but I
> think that is impossible in Spark SQL. Any ideas are welcome.
>
>
>
> Thanks in advance,
>
>
>
> *Franco Barrientos*
> Data Scientist
>
> Málaga #115, Of. 1003, Las Condes.
> Santiago, Chile.
> (+562)-29699649
> (+569)-76347893
>
> franco.barrien...@exalitica.com
>
> www.exalitica.com
>
> [image: http://exalitica.com/web/img/frim.png]
>
>
>


Re: Percentile

2014-11-27 Thread Akhil Das
Hi Franco,

The Hive percentile UDAF has been added in the master branch; you can have a look
at it. I think it would work like "select percentile(col_name, 1) from
sigmoid_logs".

Thanks
Best Regards

On Thu, Nov 27, 2014 at 8:58 PM, Franco Barrientos <
franco.barrien...@exalitica.com> wrote:

> Hi folks!,
>
>
>
> Does anyone know how I can calculate, for each element of a variable in an RDD,
> its percentile? I tried to calculate it through Spark SQL with subqueries, but I
> think that is impossible in Spark SQL. Any ideas are welcome.
>
>
>
> Thanks in advance,
>
>
>
> *Franco Barrientos*
> Data Scientist
>
> Málaga #115, Of. 1003, Las Condes.
> Santiago, Chile.
> (+562)-29699649
> (+569)-76347893
>
> franco.barrien...@exalitica.com
>
> www.exalitica.com
>
> [image: http://exalitica.com/web/img/frim.png]
>
>
>


Percentile

2014-11-27 Thread Franco Barrientos
Hi folks!,

 

Does anyone know how I can calculate, for each element of a variable in an RDD,
its percentile? I tried to calculate it through Spark SQL with subqueries, but I
think that is impossible in Spark SQL. Any ideas are welcome.

 

Thanks in advance,

 

Franco Barrientos
Data Scientist

Málaga #115, Of. 1003, Las Condes.
Santiago, Chile.
(+562)-29699649
(+569)-76347893

franco.barrien...@exalitica.com

www.exalitica.com



Re: [SQL] PERCENTILE is not working

2014-11-05 Thread Yin Huai
Hello Kevin,

https://issues.apache.org/jira/browse/SPARK-3891 will fix this bug.

Thanks,

Yin

On Wed, Nov 5, 2014 at 8:06 PM, Cheng, Hao  wrote:

> Which version are you using? I can reproduce that in the latest code, but
> with a different exception.
> I've filed a bug: https://issues.apache.org/jira/browse/SPARK-4263 -- can
> you also add some information there?
>
> Thanks,
> Cheng Hao
>
> -Original Message-
> From: Kevin Paul [mailto:kevinpaulap...@gmail.com]
> Sent: Thursday, November 6, 2014 7:09 AM
> To: user
> Subject: [SQL] PERCENTILE is not working
>
> Hi all, I encountered this error when executing the query:
>
> sqlContext.sql("select percentile(age, array(0, 0.5, 1)) from
> people").collect()
>
> java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer
> cannot be cast to [Ljava.lang.Object;
>
> at
> org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getListLength(StandardListObjectInspector.java:83)
>
> at
> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$ListConverter.convert(ObjectInspectorConverters.java:259)
>
> at
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ConversionHelper.convertIfNecessary(GenericUDFUtils.java:349)
>
> at
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge$GenericUDAFBridgeEvaluator.iterate(GenericUDAFBridge.java:170)
>
> at org.apache.spark.sql.hive.HiveUdafFunction.update(hiveUdfs.scala:342)
>
> at
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:135)
>
> at
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)
>
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
>
> at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
>
> at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>
> at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>
> at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>
> Thanks,
> Kevin Paul
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional
> commands, e-mail: user-h...@spark.apache.org
>
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


RE: [SQL] PERCENTILE is not working

2014-11-05 Thread Cheng, Hao
Which version are you using? I can reproduce that in the latest code, but with a
different exception.
I've filed a bug: https://issues.apache.org/jira/browse/SPARK-4263 -- can you
also add some information there?

Thanks,
Cheng Hao

-Original Message-
From: Kevin Paul [mailto:kevinpaulap...@gmail.com] 
Sent: Thursday, November 6, 2014 7:09 AM
To: user
Subject: [SQL] PERCENTILE is not working

Hi all, I encountered this error when executing the query:

sqlContext.sql("select percentile(age, array(0, 0.5, 1)) from people").collect()

java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer
cannot be cast to [Ljava.lang.Object;

at 
org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getListLength(StandardListObjectInspector.java:83)

at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$ListConverter.convert(ObjectInspectorConverters.java:259)

at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ConversionHelper.convertIfNecessary(GenericUDFUtils.java:349)

at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge$GenericUDAFBridgeEvaluator.iterate(GenericUDAFBridge.java:170)

at org.apache.spark.sql.hive.HiveUdafFunction.update(hiveUdfs.scala:342)

at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:135)

at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)

at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)

at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

Thanks,
Kevin Paul

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org For additional 
commands, e-mail: user-h...@spark.apache.org


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



[SQL] PERCENTILE is not working

2014-11-05 Thread Kevin Paul
Hi all, I encountered this error when executing the query:

sqlContext.sql("select percentile(age, array(0, 0.5, 1)) from people").collect()

java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer
cannot be cast to [Ljava.lang.Object;

at 
org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getListLength(StandardListObjectInspector.java:83)

at 
org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$ListConverter.convert(ObjectInspectorConverters.java:259)

at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ConversionHelper.convertIfNecessary(GenericUDFUtils.java:349)

at 
org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge$GenericUDAFBridgeEvaluator.iterate(GenericUDAFBridge.java:170)

at org.apache.spark.sql.hive.HiveUdafFunction.update(hiveUdfs.scala:342)

at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:135)

at 
org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$6.apply(Aggregate.scala:128)

at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)

at org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)

at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)

at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)

at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)

Thanks,
Kevin Paul

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark SQL Percentile UDAF

2014-10-09 Thread Anand Mohan Tumuluri
Filed https://issues.apache.org/jira/browse/SPARK-3891

Thanks,
Anand Mohan

On Thu, Oct 9, 2014 at 7:13 PM, Michael Armbrust 
wrote:

> Please file a JIRA: https://issues.apache.org/jira/browse/SPARK/
>
> On Thu, Oct 9, 2014 at 6:48 PM, Anand Mohan  wrote:
>
>> Hi,
>>
>> I just noticed the Percentile UDAF PR being merged into trunk and decided to
>> test it, so I pulled in today's trunk and tested the percentile queries.
>> They work marvelously. Thanks a lot for bringing this into Spark SQL.
>>
>> However, the Hive percentile UDAF also supports an array mode, wherein you can
>> give the list of percentiles that you want, and it returns an array of double
>> values, one for each requested percentile.
>> Such a query fails with the error below. However, a query with the individual
>> percentiles, like
>> percentile(turnaroundtime,0.25),percentile(turnaroundtime,0.5),percentile(turnaroundtime,0.75),
>> works (so this issue is not high priority, as there is this workaround for us).
>>
>> Thanks,
>> Anand Mohan
>>
>> 0: jdbc:hive2://dev-uuppala.sfohi.philips.com> select name,
>> percentile(turnaroundtime,array(0,0.25,0.5,0.75,1)) from exam group by name;
>>
>> Error: org.apache.spark.SparkException: Job aborted due to stage failure:
>> Task 1 in stage 25.0 failed 4 times, most recent failure: Lost task 1.3 in
>> stage 25.0 (TID 305, Dev-uuppala.sfohi.philips.com):
>> java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot
>> be cast to [Ljava.lang.Object;
>>
>> org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getListLength(StandardListObjectInspector.java:83)
>>
>> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$ListConverter.convert(ObjectInspectorConverters.java:259)
>>
>> org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ConversionHelper.convertIfNecessary(GenericUDFUtils.java:349)
>>
>> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge$GenericUDAFBridgeEvaluator.iterate(GenericUDAFBridge.java:170)
>>
>> org.apache.spark.sql.hive.HiveUdafFunction.update(hiveUdfs.scala:342)
>>
>> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)
>>
>> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
>> org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
>> org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
>>
>> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
>> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
>> org.apache.spark.scheduler.Task.run(Task.scala:56)
>>
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>>
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>>
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>> java.lang.Thread.run(Thread.java:745)
>> Driver stacktrace: (state=,code=0)
>>
>>
>>
>> --
>> View this message in context: Spark SQL Percentile UDAF
>> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Percentile-UDAF-tp16092.html>
>> Sent from the Apache Spark User List mailing list archive
>> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>>
>
>


Re: Spark SQL Percentile UDAF

2014-10-09 Thread Michael Armbrust
Please file a JIRA: https://issues.apache.org/jira/browse/SPARK/

On Thu, Oct 9, 2014 at 6:48 PM, Anand Mohan  wrote:

> Hi,
>
> I just noticed the Percentile UDAF PR being merged into trunk and decided to
> test it, so I pulled in today's trunk and tested the percentile queries.
> They work marvelously. Thanks a lot for bringing this into Spark SQL.
>
> However, the Hive percentile UDAF also supports an array mode, wherein you can
> give the list of percentiles that you want, and it returns an array of double
> values, one for each requested percentile.
> Such a query fails with the error below. However, a query with the individual
> percentiles, like
> percentile(turnaroundtime,0.25),percentile(turnaroundtime,0.5),percentile(turnaroundtime,0.75),
> works (so this issue is not high priority, as there is this workaround for us).
>
> Thanks,
> Anand Mohan
>
> 0: jdbc:hive2://dev-uuppala.sfohi.philips.com> select name,
> percentile(turnaroundtime,array(0,0.25,0.5,0.75,1)) from exam group by name;
>
> Error: org.apache.spark.SparkException: Job aborted due to stage failure:
> Task 1 in stage 25.0 failed 4 times, most recent failure: Lost task 1.3 in
> stage 25.0 (TID 305, Dev-uuppala.sfohi.philips.com):
> java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot
> be cast to [Ljava.lang.Object;
>
> org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getListLength(StandardListObjectInspector.java:83)
>
> org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$ListConverter.convert(ObjectInspectorConverters.java:259)
>
> org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ConversionHelper.convertIfNecessary(GenericUDFUtils.java:349)
>
> org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge$GenericUDAFBridgeEvaluator.iterate(GenericUDAFBridge.java:170)
>
> org.apache.spark.sql.hive.HiveUdafFunction.update(hiveUdfs.scala:342)
>
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)
>
> org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
> org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
> org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
>
> org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
> org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
> org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
> org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
> org.apache.spark.scheduler.Task.run(Task.scala:56)
>
> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)
>
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> java.lang.Thread.run(Thread.java:745)
> Driver stacktrace: (state=,code=0)
>
>
>
> --
> View this message in context: Spark SQL Percentile UDAF
> <http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Percentile-UDAF-tp16092.html>
> Sent from the Apache Spark User List mailing list archive
> <http://apache-spark-user-list.1001560.n3.nabble.com/> at Nabble.com.
>


Spark SQL Percentile UDAF

2014-10-09 Thread Anand Mohan
Hi,

I just noticed the Percentile UDAF PR being merged into trunk and decided to test
it, so I pulled in today's trunk and tested the percentile queries.
They work marvelously. Thanks a lot for bringing this into Spark SQL.

However, the Hive percentile UDAF also supports an array mode, wherein you can give
the list of percentiles that you want, and it returns an array of double values,
one for each requested percentile.
Such a query fails with the error below. However, a query with the individual
percentiles, like
percentile(turnaroundtime,0.25),percentile(turnaroundtime,0.5),percentile(turnaroundtime,0.75),
works (so this issue is not high priority, as there is this workaround for us).
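
For reference, the working per-percentile form of the failing query below, as a
sketch against a HiveContext (hiveContext here is an assumed handle):

// Ask for each percentile separately instead of the array form, which
// triggers the ClassCastException shown below.
hiveContext.sql("""
  SELECT name,
         percentile(turnaroundtime, 0.25),
         percentile(turnaroundtime, 0.5),
         percentile(turnaroundtime, 0.75)
  FROM exam
  GROUP BY name""").collect()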

Thanks,
Anand Mohan

0: jdbc:hive2://dev-uuppala.sfohi.philips.com> select name,
percentile(turnaroundtime,array(0,0.25,0.5,0.75,1)) from exam group by name;

Error: org.apache.spark.SparkException: Job aborted due to stage failure:
Task 1 in stage 25.0 failed 4 times, most recent failure: Lost task 1.3 in
stage 25.0 (TID 305, Dev-uuppala.sfohi.philips.com):
java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot
be cast to [Ljava.lang.Object;

org.apache.hadoop.hive.serde2.objectinspector.StandardListObjectInspector.getListLength(StandardListObjectInspector.java:83)

org.apache.hadoop.hive.serde2.objectinspector.ObjectInspectorConverters$ListConverter.convert(ObjectInspectorConverters.java:259)

org.apache.hadoop.hive.ql.udf.generic.GenericUDFUtils$ConversionHelper.convertIfNecessary(GenericUDFUtils.java:349)

org.apache.hadoop.hive.ql.udf.generic.GenericUDAFBridge$GenericUDAFBridgeEvaluator.iterate(GenericUDAFBridge.java:170)

org.apache.spark.sql.hive.HiveUdafFunction.update(hiveUdfs.scala:342)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:167)

org.apache.spark.sql.execution.Aggregate$$anonfun$execute$1$$anonfun$7.apply(Aggregate.scala:151)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)
org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:599)

org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
org.apache.spark.scheduler.Task.run(Task.scala:56)

org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:181)

java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)

java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:745)
Driver stacktrace: (state=,code=0)




--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-SQL-Percentile-UDAF-tp16092.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

Getting percentile from Spark Streaming?

2014-08-13 Thread bumble123
Hi,

I'm trying to figure out how to continuously update, say, the 95th percentile of
a set of data with Spark Streaming. I'm not sure how to order the dataset,
though, and while I can compute percentiles in regular Spark, I can't seem to
figure out how to carry that over to Spark Streaming. My data comes in as
key-value pairs separated by a comma, and I only want to use the values to
compute the percentiles. Can anyone help with this? Here's what I have for
finding percentiles in regular Spark:

val sorted = textFile.map(line => line.split(","))
  .map(kvp => (kvp(1), kvp(0)))
  .sortByKey(true)
val rank = 0.9 * sorted.count()
val flatten = sorted.take(sorted.count().toInt)
// if the rank is a whole number, average the two neighboring values
val percentile =
  if (rank == rank.toInt)
    (flatten(rank.toInt - 1)._1.toDouble + flatten(rank.toInt)._1.toDouble) / 2.0
  else
    flatten(rank.toInt)._1.toDouble
println(percentile)

Thanks in advance.
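
For illustration, one way to get a per-batch percentile is to drop down to plain RDD
operations inside foreachRDD; a minimal Scala sketch (assuming a DStream[String]
named stream of "key,value" lines; all names are illustrative):

stream.foreachRDD { rdd =>
  // sortBy range-partitions the values, so zipWithIndex yields global
  // ranks in sorted order.
  val values = rdd.map(_.split(",")(1).toDouble).sortBy(identity)
  val n = values.count()
  if (n > 0) {
    val idx = math.min(n - 1, math.max(0L, math.ceil(0.95 * n).toLong - 1))
    val p95 = values.zipWithIndex().filter(_._2 == idx).map(_._1).first()
    println("95th percentile of this batch: " + p95)
  }
}

Maintaining a continuously updated percentile across batches would additionally need
a mergeable summary (for example, the t-digest approach discussed elsewhere in this
archive).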



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Getting-percentile-from-Spark-Streaming-tp12040.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Implementing percentile through top Vs take

2014-07-31 Thread Bharath Ravi Kumar
Thanks. Turns out I needed the RDD sorted for another purpose, so keeping a
sorted pair rdd anyway made sense. And apologies for not reading the source
for top before asking the question (/*poor attempt to save time*/).


On Thu, Jul 31, 2014 at 12:34 AM, Sean Owen  wrote:

> No, it's definitely not done on the driver. It works as you say. Look
> at the source code for RDD.takeOrdered, which is what top calls.
>
>
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130
>
> On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar 
> wrote:
> > I'm looking to select the top n records (by rank) from a data set of a
> few
> > hundred GB's. My understanding is that JavaRDD.top(n, comparator) is
> > entirely a driver-side operation in that all records are sorted in the
> > driver's memory. I prefer an approach where the records are sorted on the
> > cluster and only the top ones sent to the driver. I'm hence leaning
> towards
> > creating a  JavaPairRDD on a key, then sorting the rdd by key and
> invoking
> > take(N). I'd like to know if rdd.top achieves the same result (while
> being
> > executed on the cluster) as take or if my assumption that it's a driver
> side
> > operation is correct.
> >
> > Thanks,
> > Bharath
>


Re: Implementing percentile through top Vs take

2014-07-30 Thread Sean Owen
No, it's definitely not done on the driver. It works as you say. Look
at the source code for RDD.takeOrdered, which is what top calls.

https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130
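
For illustration, both calls keep only a bounded number of elements per partition on
the executors and merge those small arrays on the driver; a minimal sketch (Record
and its rank field are hypothetical):

// top(n) is takeOrdered(n) with the ordering reversed; neither collects the RDD.
val byRank  = Ordering.by((r: Record) => r.rank)
val topTen  = rdd.top(10)(byRank)
val sameTen = rdd.takeOrdered(10)(byRank.reverse)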

On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar  wrote:
> I'm looking to select the top n records (by rank) from a data set of a few
> hundred GB's. My understanding is that JavaRDD.top(n, comparator) is
> entirely a driver-side operation in that all records are sorted in the
> driver's memory. I prefer an approach where the records are sorted on the
> cluster and only the top ones sent to the driver. I'm hence leaning towards
> creating a  JavaPairRDD on a key, then sorting the rdd by key and invoking
> take(N). I'd like to know if rdd.top achieves the same result (while being
> executed on the cluster) as take or if my assumption that it's a driver side
> operation is correct.
>
> Thanks,
> Bharath


Implementing percentile through top Vs take

2014-07-30 Thread Bharath Ravi Kumar
I'm looking to select the top n records (by rank) from a data set of a few
hundred GBs. My understanding is that JavaRDD.top(n, comparator) is entirely a
driver-side operation, in that all records are sorted in the driver's memory. I
would prefer an approach where the records are sorted on the cluster and only
the top ones are sent to the driver. I'm hence leaning towards creating a
JavaPairRDD on a key, then sorting the RDD by key and invoking take(N). I'd like
to know whether rdd.top achieves the same result as take while executing on the
cluster, or whether my assumption that it's a driver-side operation is correct.

Thanks,
Bharath


Re: Support for Percentile and Variance Aggregation functions in Spark with HiveContext

2014-07-25 Thread Michael Armbrust
Hmm, in general we try to support all the UDAFs, but this one must be using
a different base class that we don't have a wrapper for.  JIRA here:
https://issues.apache.org/jira/browse/SPARK-2693


On Fri, Jul 25, 2014 at 8:06 AM,  wrote:

>
> Hi all,
>
> I am using Spark 1.0.0 with CDH 5.1.0.
>
> I want to aggregate the data in a raw table using a simple query like below
>
> *SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4),
> year,month,day FROM  raw_data_table  GROUP BY year, month, day*
>
> MIN, MAX and AVG functions work fine for me, but with PERCENTILE, I get an
> error as shown below.
>
> Exception in thread "main" java.lang.RuntimeException: No handler for udf
> class org.apache.hadoop.hive.ql.udf.UDAFPercentile
> at scala.sys.package$.error(package.scala:27)
> at
> org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115)
> at
> org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113)
> at
> org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)
>
> I have read in the documentation that with HiveContext Spark SQL supports
> all the UDFs supported in Hive.
>
> I want to know if there is anything else I need to follow to use
> Percentile with Spark SQL? Or are there still any limitations in Spark SQL
> with respect to UDFs and UDAFs in the version I am using?
>
>
>
>
>
> Thanks and regards
>
> Vinay Kashyap
>


Support for Percentile and Variance Aggregation functions in Spark with HiveContext

2014-07-25 Thread vinay.kashyap

Hi all,

I am using Spark 1.0.0 with CDH 5.1.0.

I want to aggregate the data in a raw table using a simple query like below:

SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), year, month, day
FROM raw_data_table GROUP BY year, month, day

MIN, MAX and AVG functions work fine for me, but with PERCENTILE, I get an error as
shown below.

Exception in thread "main" java.lang.RuntimeException: No handler for udf class
org.apache.hadoop.hive.ql.udf.UDAFPercentile
    at scala.sys.package$.error(package.scala:27)
    at org.apache.spark.sql.hive.HiveFunctionRegistry$.lookupFunction(hiveUdfs.scala:69)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:115)
    at org.apache.spark.sql.catalyst.analysis.Analyzer$ResolveFunctions$$anonfun$apply$4$$anonfun$applyOrElse$3.applyOrElse(Analyzer.scala:113)
    at org.apache.spark.sql.catalyst.trees.TreeNode.transformDown(TreeNode.scala:165)

I have read in the documentation that with HiveContext Spark SQL supports all the
UDFs supported in Hive.

I want to know if there is anything else I need to follow to use Percentile with
Spark SQL? Or are there still any limitations in Spark SQL with respect to UDFs and
UDAFs in the version I am using?

Thanks and regards
Vinay Kashyap