Re: Using Percentile in Spark SQL

2019-11-11 Thread Jerry Vinokurov
Nov 11, 2019 at 10:34 AM Tzahi File wrote: Currently, I'm using the percentile approx function with Hive. I'm looking for a better way to run this function or another way to get the same result with Spark, but faster and not using gigantic instances. I'm trying to op

Re: Using Percentile in Spark SQL

2019-11-11 Thread Tzahi File
Currently, I'm using the percentile approx function with Hive. I'm looking for a better way to run this function or another way to get the same result with Spark, but faster and not using gigantic instances. I'm trying to optimize this job by changing the Spark configuration. If you have any

Re: Using Percentile in Spark SQL

2019-11-11 Thread Muthu Jayakumar
Hi, Currently, I'm using a huge Hive cluster (m5.24xl * 40 workers) to run a percentile function. I'm trying to improve this job by moving it to run with Spark SQL. Any suggestions on how to use a percentile function in Spa

Re: Using Percentile in Spark SQL

2019-11-11 Thread Patrick McCarthy
Currently, I'm using a huge Hive cluster (m5.24xl * 40 workers) to run a percentile function. I'm trying to improve this job by moving it to run with Spark SQL. Any suggestions on how to use a percentile function in Spark?

Re: Using Percentile in Spark SQL

2019-11-11 Thread Jerry Vinokurov
for this task? Because I bet that's what's slowing you down. On Mon, Nov 11, 2019 at 9:46 AM Tzahi File wrote: Hi, Currently, I'm using a huge Hive cluster (m5.24xl * 40 workers) to run a percentile function. I'm trying to improve this job by moving it to run with spa

Using Percentile in Spark SQL

2019-11-11 Thread Tzahi File
Hi, Currently, I'm using a huge Hive cluster (m5.24xl * 40 workers) to run a percentile function. I'm trying to improve this job by moving it to run with Spark SQL. Any suggestions on how to use a percentile function in Spark? Thanks, -- Tzahi File Data Engineer

Re: Percentile calculation in spark 1.6

2016-02-23 Thread Ted Yu
28805602/how-to-compute-percentiles-in-apache-spark On Feb 23, 2016, at 10:08 AM, Arunkumar Pillai <arunkumar1...@gmail.com> wrote: How to calculate percentile in Spark 1.6? -- Thanks and Regards Arun

Re: Percentile calculation in spark 1.6

2016-02-23 Thread Chandeep Singh
wrote: How to calculate percentile in Spark 1.6? -- Thanks and Regards Arun

Percentile calculation in spark 1.6

2016-02-23 Thread Arunkumar Pillai
How to calculate percentile in Spark 1.6? -- Thanks and Regards Arun

[Spark 1.5.1] percentile in spark

2016-02-08 Thread Arunkumar Pillai
Hi, I'm using a SQL query to find the percentile value. Are there any predefined functions for percentile calculation? -- Thanks and Regards Arun

Re: How to calculate percentile of a column of DataFrame?

2015-10-14 Thread Umesh Kacha
res0: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(,IntegerType,List()) scala> df.select($"id", callUDF("simpleUDF", $"

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote: I have a doubt, Michael. I tried to use callUDF in the following code; it does not work.

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
wrote: I have a doubt, Michael. I tried to use callUDF in the following code; it does not work. sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25)))

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
|id2| 41| |id3| 50| +---++ Which Spark release are you using? Can you

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
tack trace where you got the error? Cheers On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote: I have a doubt Michael

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
+---++ |id1| 26| |id2| 41| |id3| 50| +---+----+

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
value,25)| +---++ |id1| 26| |id2| 41| |id3| 50| +-

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
|id2| 41| |id3| 50| +---++ Which Spark release are you using? Can you pastebin the ful

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
v * v + cnst) res0: org.apache.spark.sql.UserDefinedFunction = UserDefinedFunction(,IntegerType,List()) scala> df.select($"id", callUDF("simple

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Umesh Kacha
", lit(25))).show() +---++ | id|'simpleUDF(value,25)| +---++ |id1|

Re: How to calculate percentile of a column of DataFrame?

2015-10-13 Thread Ted Yu
", lit(25))).show() +---++ | id|'simpleUDF(value,25)| +---++ |id1|

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
made a typo... callUDF("percentile_approx", col("mycol"), lit(0.25)) The first argument is the name of the UDF; all other arguments need to be columns that are passed in as arguments. lit is just

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Richard Eggert
s is confusing because I made a typo... callUDF("percentile_approx", col("mycol"), lit(0.25)) The first argument is the name of the UDF; all other arguments need to be colum

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
callUDF("percentile_approx", col("mycol"), lit(0.25)) The first argument is the name of the UDF; all other arguments need to be columns that are passed in as argu

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
.com> wrote: This is confusing because I made a typo... callUDF("percentile_approx", col("mycol"), lit(0.25)) The first argument is the

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
e of the UDF; all other arguments need to be columns that are passed in as arguments. lit is just saying to make a literal column that always has the value 0.25. On Fri, Oct 9, 2015 at 12:16 PM, <saif.a.ell...@wellsfargo.com> wrote:

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
try. On Sat, Oct 10, 2015 at 1:20 AM, Michael Armbrust <mich...@databricks.com> wrote: This is confusing because I made a typo... callUDF("percentile_approx"

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
This is confusing because I made a typo... callUDF("percentile_approx", col("mycol"), lit(0.25)) The first argument is the name of the UDF; all other arguments need to be columns that ar

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
kes only two arguments: a function name as a String and a Column. Please guide. On Sat, Oct 10, 2015 at 1:29 AM, Umesh Kacha <umesh.ka...@gmail.com> wrote: Thanks much, Michael, let me try.

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Umesh Kacha
k. sourceFrame.agg(callUdf("percentile_approx",col("myCol"),lit(0.25))) The above code does not compile because callUdf() takes only two arguments: a function name as a String an

Re: How to calculate percentile of a column of DataFrame?

2015-10-12 Thread Ted Yu
Can you pastebin the full stack trace where you got the error? Cheers On Fri, Oct 9, 2015 at 1:09 PM, Umesh Kacha <umesh.ka...@gmail.com> wrote: I h

Re: How to calculate percentile of a column of DataFrame?

2015-10-10 Thread Umesh Kacha
", col("mycol"), lit(0.25)) The first argument is the name of the UDF; all other arguments need to be columns that are passed in as arguments. lit is just saying to make a literal column that always has the value 0.25.

How to calculate percentile of a column of DataFrame?

2015-10-09 Thread unk1102
Hi, how do I calculate the percentile of a column in a DataFrame? I can't find any percentile_approx function among the Spark aggregation functions. For example, in Hive we have percentile_approx and we can use it in the following way: hiveContext.sql("select percentile_approx(mycol, 0.25) from myTab

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Michael Armbrust
You can use callUDF(col("mycol"), lit(0.25)) to call hive UDFs from dataframes. On Fri, Oct 9, 2015 at 12:01 PM, unk1102 <umesh.ka...@gmail.com> wrote: Hi how to calculate percentile of a column in a DataFrame? I cant find any percentile_approx function in Spark

RE: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Saif.A.Ellafi
Where can we find other available functions such as lit() ? I can’t find lit in the api. Thanks From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, October 09, 2015 4:04 PM To: unk1102 Cc: user Subject: Re: How to calculate percentile of a column of DataFrame? You can use

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Umesh Kacha
lable functions such as lit()? I can’t find lit in the api. Thanks From: Michael Armbrust [mailto:mich...@databricks.com] Sent: Friday, October 09, 2015 4:04 PM To: unk1102 Cc: user Subject: Re: How to calculate percentil

RE: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Saif.A.Ellafi
Yes but I mean, this is rather curious. How is def lit(literal:Any) --> becomes a percentile function lit(25) Thanks for clarification Saif From: Umesh Kacha [mailto:umesh.ka...@gmail.com] Sent: Friday, October 09, 2015 4:10 PM To: Ellafi, Saif A. Cc: Michael Armbrust; user Subject: Re:

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Michael Armbrust
value 0.25. On Fri, Oct 9, 2015 at 12:16 PM, <saif.a.ell...@wellsfargo.com> wrote: Yes but I mean, this is rather curious. How is def lit(literal:Any) --> becomes a percentile function lit(25) Thanks for clarification Saif

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Umesh Kacha
name of the UDF; all other arguments need to be columns that are passed in as arguments. lit is just saying to make a literal column that always has the value 0.25. On Fri, Oct 9, 2015 at 12:16 PM, <saif.a.ell...@wellsfargo.com> wrote: Yes but I mean, this is rat

Re: How to calculate percentile of a column of DataFrame?

2015-10-09 Thread Umesh Kacha
the value 0.25. On Fri, Oct 9, 2015 at 12:16 PM, <saif.a.ell...@wellsfargo.com> wrote: Yes but I mean, this is rather curious. How is def lit(literal:Any) --> becomes a percentile function lit(25)

RE: Percentile example

2015-02-17 Thread SiMaYunRui
Thanks Imran for very detailed explanations and options. I think for now T-Digest is what I want. From: iras...@cloudera.com Date: Tue, 17 Feb 2015 08:39:48 -0600 Subject: Re: Percentile example To: myl...@hotmail.com CC: user@spark.apache.org (trying to repost to the list w/out URLs

RE: Percentile example

2015-02-17 Thread SiMaYunRui
Thanks Kohler, that's a very interesting approach. I have never used Spark SQL and am not sure whether my cluster is configured well for it, but I will definitely give it a try. From: c.koh...@elsevier.com To: myl...@hotmail.com; user@spark.apache.org Subject: Re: Percentile example Date: Tue, 17 Feb 2015

Re: Percentile example

2015-02-17 Thread Imran Rashid
some of the data to the driver, sort that data in memory, and take the 66th percentile of that sample. 1b. Make a histogram with pre-determined buckets. E.g., if you know your data ranges from 0 to 1 and is uniform-ish, you could make buckets every 0.01. Then count how many data points go
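The histogram idea in option 1b can be sketched in plain Python. This is only an illustration under stated assumptions, not code from the thread: the function names (`bucket_counts`, `merge_counts`, `approx_percentile`) are hypothetical, the data range is assumed known in advance, and each "partition" is just a local list standing in for a piece of an RDD. The answer is only as precise as the bucket width.

```python
from typing import Iterable, List


def bucket_counts(values: Iterable[float], lo: float, hi: float, n_buckets: int) -> List[int]:
    """Count how many values fall into each of n_buckets equal-width buckets on [lo, hi]."""
    width = (hi - lo) / n_buckets
    counts = [0] * n_buckets
    for v in values:
        # Clamp so out-of-range values (and v == hi) land in the first or last bucket.
        idx = max(0, min(int((v - lo) / width), n_buckets - 1))
        counts[idx] += 1
    return counts


def merge_counts(a: List[int], b: List[int]) -> List[int]:
    """Combine two per-partition histograms by adding counts bucket by bucket."""
    return [x + y for x, y in zip(a, b)]


def approx_percentile(counts: List[int], lo: float, hi: float, p: float) -> float:
    """Return the upper edge of the first bucket whose cumulative count reaches p * total."""
    total = sum(counts)
    if total == 0:
        raise ValueError("cannot take a percentile of an empty histogram")
    width = (hi - lo) / len(counts)
    target = p * total
    running = 0
    for i, c in enumerate(counts):
        running += c
        if running >= target:
            return lo + (i + 1) * width
    return hi


# Two hypothetical partitions covering 0..999, with 100 buckets of width 10.
part_a = bucket_counts(range(0, 500), 0.0, 1000.0, 100)
part_b = bucket_counts(range(500, 1000), 0.0, 1000.0, 100)
median_estimate = approx_percentile(merge_counts(part_a, part_b), 0.0, 1000.0, 0.5)
```

Because the per-partition counts are merged by simple addition, only the small count arrays need to travel to the driver, never the raw data.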

Re: Percentile example

2015-02-17 Thread Kohler, Curt E (ELS-STL)
to do): JavaSparkContext sc = new JavaSparkContext(sparkConf); JavaHiveContext hsc = new JavaHiveContext(sc); //Get your Data into a SchemaRDD and register the Table // Query it String hql = "SELECT FIELD1, FIELD2, percentile(FIELD3, 0.05) AS ptile5 from TABLE-NAME GROUP BY FIELD1, FIELD2
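To make clear what a grouped percentile query of this shape computes, here is a small single-machine Python sketch. It assumes the exact percentile is defined by linear interpolation between the two closest ranks, which is my understanding of how Hive's exact percentile behaves but should be checked against the Hive documentation. The function names and sample rows are hypothetical.

```python
import math
from collections import defaultdict
from typing import Dict, Hashable, Iterable, List, Tuple


def exact_percentile(values: List[float], p: float) -> float:
    """Exact percentile with linear interpolation between closest ranks (assumed definition)."""
    if not values:
        raise ValueError("cannot take a percentile of an empty list")
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    ordered = sorted(values)
    position = p * (len(ordered) - 1)  # fractional index into the sorted data
    lower = math.floor(position)
    upper = math.ceil(position)
    if lower == upper:
        return float(ordered[lower])
    return (upper - position) * ordered[lower] + (position - lower) * ordered[upper]


def percentile_by_group(rows: Iterable[Tuple[Hashable, float]], p: float) -> Dict[Hashable, float]:
    """Group (key, value) rows by key and compute the percentile of each group."""
    groups: Dict[Hashable, List[float]] = defaultdict(list)
    for key, value in rows:
        groups[key].append(value)
    return {key: exact_percentile(vals, p) for key, vals in groups.items()}


# Hypothetical rows: the key plays the role of (FIELD1, FIELD2), the value of FIELD3.
rows = [(("a", "x"), 1.0), (("a", "x"), 3.0), (("b", "y"), 10.0)]
medians = percentile_by_group(rows, 0.5)
```

An exact percentile needs every value of a group in sorted order, which is one reason the exact function gets expensive on large groups and approximate variants exist.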

Percentile example

2015-02-15 Thread SiMaYunRui
Hello, I am a newbie to Spark and am trying to figure out how to compute a percentile over a big data set. I googled this topic but did not find any very useful code examples or explanations. It seems that I can use the sortByKey transformation to get my data set in order, but I'm not sure how I can

Percentile Calculation

2015-01-28 Thread kundan kumar
Is there any built-in function for calculating percentiles over a dataset? I want to calculate the percentiles for each column in my data. Regards, Kundan

Re: Percentile Calculation

2015-01-28 Thread Kohler, Curt E (ELS-STL)
When I looked at this last fall, the only way that seemed to be available was to transform my data into SchemaRDDs, register them as tables and then use the Hive processor to calculate them with its built in percentile UDFs that were added in 1.2. Curt From

status of spark analytics functions? over, rank, percentile, row_number, etc.

2015-01-10 Thread Kevin Burton
I’m curious what the status is of implementing Hive analytics functions in Spark. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Many of these seem to be missing. I’m assuming they’re not implemented yet? Is there an ETA on them? Or am I the first to bring this

Re: status of spark analytics functions? over, rank, percentile, row_number, etc.

2015-01-10 Thread Will Benton
Subject: status of spark analytics functions? over, rank, percentile, row_number, etc. I’m curious what the status is of implementing Hive analytics functions in Spark. https://cwiki.apache.org/confluence/display/Hive/LanguageManual+WindowingAndAnalytics Many of these seem to be missing. I’m

Re: Percentile

2014-11-29 Thread Imran Rashid
point to its percentile in the distribution. To create the tdigests, you would do something like this: val myDataRDD = ... myDataRDD.mapPartitions{itr => xDistribution = TDigest.createArrayDigest(32, 100) yDistribution = TDigest.createArrayDigest(32, 100) ... itr.foreach{ data

Percentile

2014-11-27 Thread Franco Barrientos
Hi folks! Does anyone know how I can calculate, for each element of a variable in an RDD, its percentile? I tried to calculate it through Spark SQL with subqueries, but I think that is impossible in Spark SQL. Any idea will be welcome. Thanks in advance, Franco Barrientos Data Scientist Málaga
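For readers wondering what "the percentile of each element" means concretely, one common definition is the percentage of values less than or equal to that element. The sketch below uses that definition as an assumption, on a plain Python list rather than an RDD; the function name `percentile_ranks` is hypothetical. On a cluster the sorted copy would have to be replaced by a distributed sort or an approximate summary.

```python
from bisect import bisect_right
from typing import List, Sequence


def percentile_ranks(values: Sequence[float]) -> List[float]:
    """For each value, return the percentage of values that are <= it."""
    ordered = sorted(values)
    n = len(ordered)
    # bisect_right gives the count of elements <= v in the sorted copy.
    return [100.0 * bisect_right(ordered, v) / n for v in values]


ranks = percentile_ranks([30, 10, 20, 40])
```

The result keeps the input order, so each rank lines up with the element it describes; tied values receive the same rank.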

[SQL] PERCENTILE is not working

2014-11-05 Thread Kevin Paul
Hi all, I encounter this error when executing the query sqlContext.sql("select percentile(age, array(0, 0.5, 1)) from people").collect() java.lang.ClassCastException: scala.collection.mutable.ArrayBuffer cannot be cast to [Ljava.lang.Object; at org.apache.hadoop.hive.serde2

RE: [SQL] PERCENTILE is not working

2014-11-05 Thread Cheng, Hao
...@gmail.com] Sent: Thursday, November 6, 2014 7:09 AM To: user Subject: [SQL] PERCENTILE is not working Hi all, I encounter this error when executing the query sqlContext.sql("select percentile(age, array(0, 0.5, 1)) from people").collect() java.lang.ClassCastException

Re: [SQL] PERCENTILE is not working

2014-11-05 Thread Yin Huai
://issues.apache.org/jira/browse/SPARK-4263, can you also add some information there? Thanks, Cheng Hao -Original Message- From: Kevin Paul [mailto:kevinpaulap...@gmail.com] Sent: Thursday, November 6, 2014 7:09 AM To: user Subject: [SQL] PERCENTILE is not working Hi all, I encounter

Re: Spark SQL Percentile UDAF

2014-10-09 Thread Michael Armbrust
the Percentile UDAF PR being merged into trunk and decided to test it. So pulled in today's trunk and tested the percentile queries. They work marvelously, Thanks a lot for bringing this into Spark SQL. However Hive percentile UDAF also supports an array mode where in you can give the list

Re: Spark SQL Percentile UDAF

2014-10-09 Thread Anand Mohan Tumuluri
On Thu, Oct 9, 2014 at 6:48 PM, Anand Mohan chinn...@gmail.com wrote: Hi, I just noticed the Percentile UDAF PR being merged into trunk and decided to test it. So pulled in today's trunk and tested the percentile queries

Getting percentile from Spark Streaming?

2014-08-13 Thread bumble123
Hi, I'm trying to figure out how to constantly update, say, the 95th percentile of a set of data through Spark Streaming. I'm not sure how to order the dataset though, and while I can find percentiles in regular Spark, I can't seem to figure out how to get that to transfer over to Spark Streaming

Implementing percentile through top Vs take

2014-07-30 Thread Bharath Ravi Kumar
I'm looking to select the top n records (by rank) from a data set of a few hundred GB's. My understanding is that JavaRDD.top(n, comparator) is entirely a driver-side operation in that all records are sorted in the driver's memory. I prefer an approach where the records are sorted on the cluster

Re: Implementing percentile through top Vs take

2014-07-30 Thread Sean Owen
No, it's definitely not done on the driver. It works as you say. Look at the source code for RDD.takeOrdered, which is what top calls. https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/rdd/RDD.scala#L1130 On Wed, Jul 30, 2014 at 7:07 PM, Bharath Ravi Kumar
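The general pattern behind a distributed top-n can be sketched in plain Python: each partition keeps only its own n largest records, and only those small candidate lists are merged. This is a rough illustration of the idea, not Spark's actual implementation; the function name `top_n` and the lists standing in for partitions are hypothetical.

```python
import heapq
from typing import Iterable, List


def top_n(partitions: Iterable[Iterable[int]], n: int) -> List[int]:
    """Return the n largest values across all partitions, in descending order."""
    # Step 1 (would run on the executors): keep at most n candidates per partition.
    per_partition = [heapq.nlargest(n, part) for part in partitions]
    # Step 2 (the merge): pick the overall top n from the small candidate lists.
    return heapq.nlargest(n, (x for chunk in per_partition for x in chunk))


largest = top_n([[5, 1, 9], [7, 3], [8, 2, 6]], 3)
```

With this shape, the merge step never sees more than n records per partition, so memory use depends on n and the partition count rather than on the size of the data set.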

Support for Percentile and Variance Aggregation functions in Spark with HiveContext

2014-07-25 Thread vinay . kashyap
Hi all, I am using Spark 1.0.0 with CDH 5.1.0. I want to aggregate the data in a raw table using a simple query like below SELECT MIN(field1), MAX(field2), AVG(field3), PERCENTILE(field4), year,month,day FROM  raw_data_table  GROUP BY year, month, day MIN, MAX and AVG functions work fine