Re: Replacing groupBykey() with reduceByKey()

2018-08-08 Thread Biplob Biswas
Hi Santhosh, My name is not Bipin, it's Biplob, as is clear from my signature. Regarding your question, I have no clue what your map operation is doing on the grouped data, so I can only suggest you do: dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x: (x[0],x)).reduceByKey

Re: Replacing groupBykey() with reduceByKey()

2018-08-06 Thread Bathi CCDB
reduceByKey() in this context ? Santhosh On Mon, Aug 6, 2018 at 1:20 AM, Biplob Biswas wrote: > Hi Santhosh, > > If you are not performing any aggregation, then I don't think you can > replace your groupbykey with a reducebykey, and as I see you are only > grouping and taking 2 value

Re: Replacing groupBykey() with reduceByKey()

2018-08-06 Thread Biplob Biswas
Hi Santhosh, If you are not performing any aggregation, then I don't think you can replace your groupbykey with a reducebykey, and as I see you are only grouping and taking 2 values of the result, thus I believe you can't just replace your groupbykey with that. Thanks & Regards Biplob Bi

Replacing groupBykey() with reduceByKey()

2018-08-03 Thread Bathi CCDB
I am trying to replace groupByKey() with reduceByKey(). I am a pyspark and python newbie and I am having a hard time figuring out the lambda function for the reduceByKey() operation. Here is the code dd = hive_context.read.orc(orcfile_dir).rdd.map(lambda x: (x[0],x)).groupByKey(25).take(2) Here

Re: Iterative rdd union + reduceByKey operations on small dataset leads to "No space left on device" error on account of lot of shuffle spill.

2018-07-27 Thread Dinesh Dharme
Yeah, you are right. I ran the experiments locally not on YARN. On Fri, Jul 27, 2018 at 11:54 PM, Vadim Semenov wrote: > `spark.worker.cleanup.enabled=true` doesn't work for YARN. > On Fri, Jul 27, 2018 at 8:52 AM dineshdharme > wrote: > > > > I am trying to do few

Re: Iterative rdd union + reduceByKey operations on small dataset leads to "No space left on device" error on account of lot of shuffle spill.

2018-07-27 Thread Vadim Semenov
`spark.worker.cleanup.enabled=true` doesn't work for YARN. On Fri, Jul 27, 2018 at 8:52 AM dineshdharme wrote: > > I am trying to do few (union + reduceByKey) operations on a hiearchical > dataset in a iterative fashion in rdd. The first few loops run fine but on > the subs

Iterative rdd union + reduceByKey operations on small dataset leads to "No space left on device" error on account of lot of shuffle spill.

2018-07-27 Thread dineshdharme
I am trying to do a few (union + reduceByKey) operations on a hierarchical dataset in an iterative fashion in rdd. The first few loops run fine but on the subsequent loops, the operations end up using the whole scratch space provided to it. I have set the spark scratch directory, i.e

Re: Structured Stream equivalent of reduceByKey

2017-11-06 Thread Michael Armbrust
up that has changed in the batch. >> >> On Wed, Oct 25, 2017 at 5:52 PM, Piyush Mukati <piyush.muk...@gmail.com> >> wrote: >> >>> Hi, >>> we are migrating some jobs from Dstream to Structured Stream. >>> >>> Currently to handle agg

Re: Structured Stream equivalent of reduceByKey

2017-10-26 Thread Piyush Mukati
h Mukati <piyush.muk...@gmail.com> > wrote: > >> Hi, >> we are migrating some jobs from Dstream to Structured Stream. >> >> Currently to handle aggregations we call map and reducebyKey on each RDD >> like >> rdd.map(event => (event._1, event)

Re: Structured Stream equivalent of reduceByKey

2017-10-26 Thread Michael Armbrust
Oct 25, 2017 at 5:52 PM, Piyush Mukati <piyush.muk...@gmail.com> wrote: > Hi, > we are migrating some jobs from Dstream to Structured Stream. > > Currently to handle aggregations we call map and reducebyKey on each RDD > like > rdd.map(event => (event._1, event))

Structured Stream equivalent of reduceByKey

2017-10-25 Thread Piyush Mukati
Hi, we are migrating some jobs from DStream to Structured Streaming. Currently, to handle aggregations we call map and reduceByKey on each RDD, like rdd.map(event => (event._1, event)).reduceByKey((a, b) => merge(a, b)). The final output of each RDD is merged to the sink with support for aggre
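A minimal sketch (not from the thread) of one way to express this pattern in Structured Streaming, assuming the merge is a simple sum over hypothetical (key, count) records; a custom Aggregator can stand in for sum() when the merge logic is more involved. The socket source and port are placeholders.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder.appName("StructuredReduceByKey").getOrCreate()
import spark.implicits._

// Placeholder source; any streaming source works the same way.
val lines = spark.readStream
  .format("socket").option("host", "localhost").option("port", 9999)
  .load().as[String]

val totals = lines
  .map { line => val Array(k, c) = line.split(","); (k, c.toLong) }
  .toDF("key", "count")
  .groupBy("key")                      // plays the role of reduceByKey's keying
  .agg(sum("count").as("total"))       // plays the role of the merge function

val query = totals.writeStream
  .outputMode("update")                // emit only keys updated in each trigger
  .format("console")
  .start()

query.awaitTermination()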

Re: reducebykey

2017-04-07 Thread Ankur Srivastava
Hi Stephen, If you use aggregate functions or reduceGroups on KeyValueGroupedDataset it behaves like reduceByKey on an RDD. Only if you use flatMapGroups and mapGroups does it behave like groupByKey on an RDD, and if you read the API documentation it warns about using the API. Hope this helps. Thanks Ankur
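A minimal sketch of what Ankur describes, using made-up (word, count) pairs: groupByKey on a Dataset followed by reduceGroups plays the role of reduceByKey on an RDD, whereas mapGroups/flatMapGroups behave like groupByKey.

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder.appName("ReduceGroupsSketch").master("local[*]").getOrCreate()
import spark.implicits._

val ds = Seq(("a", 1L), ("b", 2L), ("a", 3L)).toDS()

// Equivalent of rdd.reduceByKey(_ + _) in the typed Dataset API.
val reduced = ds
  .groupByKey(_._1)
  .reduceGroups((x, y) => (x._1, x._2 + y._2))   // Dataset[(key, reducedValue)]
  .map { case (key, (_, total)) => (key, total) }

reduced.show()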

reducebykey

2017-04-07 Thread Stephen Fletcher
Are there plans to add reduceByKey to DataFrames? Since switching over to Spark 2 I find myself increasingly dissatisfied with the idea of converting DataFrames to RDDs to do procedural programming on grouped data (both from an ease-of-programming stance and a performance stance). So I've been using

[Spark Core]: flatMap/reduceByKey seems to be quite slow with Long keys on some distributions

2017-04-01 Thread Richard Tsai
is split into two stages, flatMap() and count(). When counting Tuples, flatMap() takes about 6s and count() takes about 2s, while when counting Longs, flatMap() takes 18s and count() takes 10s. I haven't looked into Spark's implementation of flatMap/reduceByKey, but I guess Spark has some

groupByKey vs reduceByKey

2016-12-09 Thread Appu K
Hi, I read somewhere that groupByKey() on an RDD disables map-side aggregation, as the aggregation function (appending to a list) does not save any space. However, from my understanding, using something like reduceByKey (or combineByKey + a combiner function) we could reduce the data shuffled

Access broadcast variable from within function passed to reduceByKey

2016-11-15 Thread coolgar
e-from-within-function-passed-to-reduceByKey-tp28082.html Sent from the Apache Spark User List mailing list archive at Nabble.com.
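The question body is truncated above, so here is only a hedged sketch with a made-up priority table, assuming an existing SparkContext sc: the broadcast variable is read through .value inside the closure passed to reduceByKey, just as in any other closure shipped to executors.

// Made-up lookup table shipped once per executor via broadcast.
val priority = sc.broadcast(Map("gold" -> 3, "silver" -> 2, "bronze" -> 1))

val tiers = sc.parallelize(Seq(("user1", "silver"), ("user1", "gold"), ("user2", "bronze")))

// Keep the highest-priority value per key; the broadcast is accessed with
// .value inside the reduce function itself.
val best = tiers.reduceByKey { (a, b) =>
  if (priority.value.getOrElse(a, 0) >= priority.value.getOrElse(b, 0)) a else b
}

best.collect().foreach(println)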

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread dev loper
(Spark 2.0 , on AWS EMR yarn cluster) >> >>> listens to Campaigns based on live stock feeds and the batch duration >> is 5 >> >>> seconds. The applications uses Kafka DirectStream and based on the >> feed >> >>> source there are three streams.

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread ayan guha
d the batch duration > is 5 > >>> seconds. The applications uses Kafka DirectStream and based on the feed > >>> source there are three streams. As given in the code snippet I am > doing a > >>> union of three streams and I am trying to remove the dupli

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread Cody Koeninger
luster) >>> listens to Campaigns based on live stock feeds and the batch duration is 5 >>> seconds. The applications uses Kafka DirectStream and based on the feed >>> source there are three streams. As given in the code snippet I am doing a >>> union of three streams

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread dev loper
three streams and I am trying to remove the duplicate campaigns >> received using reduceByKey based on the customer and campaignId. I could >> see lot of duplicate email being send out for the same key in the same >> batch.I was expecting reduceByKey to remove the duplicate campaigns

Re: Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread Cody Koeninger
ree streams and I am trying to remove the duplicate campaigns > received using reduceByKey based on the customer and campaignId. I could > see lot of duplicate email being send out for the same key in the same > batch.I was expecting reduceByKey to remove the duplicate campaigns in a > bat

Spark Streaming- ReduceByKey not removing Duplicates for the same key in a Batch

2016-11-12 Thread dev loper
snippet I am doing a union of three streams and I am trying to remove the duplicate campaigns received using reduceByKey based on the customer and campaignId. I could see a lot of duplicate emails being sent out for the same key in the same batch. I was expecting reduceByKey to remove the duplicate

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-29 Thread Marius Soutier
it works similarly as reducebykey. > > On 29 Aug 2016 20:57, "Marius Soutier" <mps@gmail.com > <mailto:mps@gmail.com>> wrote: > In DataFrames (and thus in 1.5 in general) this is not possible, correct? > >> On 11.08.2016, at 05:42, Holden Karau &

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-29 Thread ayan guha
If you are confused because of the names of the two APIs: I think the DF API name groupBy came from SQL, but it works similarly to reduceByKey. On 29 Aug 2016 20:57, "Marius Soutier" <mps@gmail.com> wrote: > In DataFrames (and thus in 1.5 in general) this is not possible, correct
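A small sketch of the point ayan is making, with made-up column names: groupBy followed by agg on a DataFrame is planned as a partial (map-side) aggregation, a shuffle, and a final aggregation, which is the same shuffle behaviour reduceByKey gives on an RDD.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.sum

val spark = SparkSession.builder.appName("GroupByAsReduceByKey").master("local[*]").getOrCreate()
import spark.implicits._

val df = Seq(("a", 1), ("b", 2), ("a", 3)).toDF("key", "value")

val totals = df.groupBy("key").agg(sum("value").as("total"))

// The physical plan typically shows a partial aggregate before the Exchange
// and a final aggregate after it, i.e. map-side combining is not lost.
totals.explain()
totals.show()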

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-29 Thread Marius Soutier
hips partial aggregations instead). > > I wonder whether the DataFrame API optimizes the code doing something > similar to what RDD.reduceByKey does. > > I am using Spark 1.6.2. > > Regards, > Luis > > > > -- > View this message in context: > http://ap

Re: Is there a reduceByKey functionality in DataFrame API?

2016-08-10 Thread Holden Karau
ds, > Luis > > > > -- > View this message in context: http://apache-spark-user-list. > 1001560.n3.nabble.com/Is-there-a-reduceByKey-functionality-in-DataFrame- > API-tp27508.html > Sent from the Apache Spark User List mailing list archive at Nabble.com. > >

Is there a reduceByKey functionality in DataFrame API?

2016-08-10 Thread luismattor
t.1001560.n3.nabble.com/Is-there-a-reduceByKey-functionality-in-DataFrame-API-tp27508.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-23 Thread Nirav Patel
So it was indeed my merge function. I created a new result object for every merge and it's working now. Thanks On Wed, Jun 22, 2016 at 3:46 PM, Nirav Patel <npa...@xactlycorp.com> wrote: > PS. In my reduceByKey operation I have two mutable object. What I do is > merge mutable2 i
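A sketch of the fix being described, with a made-up mutable class standing in for Nirav's MyObj and an existing SparkContext sc assumed: instead of mutating one of the two incoming objects, the reduce function allocates and returns a fresh result object.

class Acc(var sum: Long, var count: Long) extends Serializable

val data = sc.parallelize(Seq(
  ("k1", new Acc(1, 1)), ("k1", new Acc(2, 1)), ("k2", new Acc(5, 1))))

// Pattern reported as problematic in this thread:
//   (a, b) => { a.sum += b.sum; a.count += b.count; a }
// Pattern reported as working: build a new object on every merge.
val merged = data.reduceByKey((a, b) => new Acc(a.sum + b.sum, a.count + b.count))

merged.mapValues(a => (a.sum, a.count)).collect().foreach(println)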

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Nirav Patel
PS. In my reduceByKey operation I have two mutable objects. What I do is merge mutable2 into mutable1 and return mutable1. I read that it works for aggregateByKey so I thought it would work for reduceByKey as well. I might be wrong here. Can someone verify if this will work or be unpredictable

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Nirav Patel
incorrect result, did you observe any error (on > workers) ? > > Cheers > > On Tue, Jun 21, 2016 at 10:42 PM, Nirav Patel <npa...@xactlycorp.com> > wrote: > >> I have an RDD[String, MyObj] which is a result of Join + Map operation. >> It has no partiti

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Ted Yu
For the run which returned incorrect result, did you observe any error (on workers) ? Cheers On Tue, Jun 21, 2016 at 10:42 PM, Nirav Patel <npa...@xactlycorp.com> wrote: > I have an RDD[String, MyObj] which is a result of Join + Map operation. It > has no partitioner info. I run

Re: Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-22 Thread Takeshi Yamamuro
Hi, Could you check the issue also occurs in v1.6.1 and v2.0? // maropu On Wed, Jun 22, 2016 at 2:42 PM, Nirav Patel <npa...@xactlycorp.com> wrote: > I have an RDD[String, MyObj] which is a result of Join + Map operation. It > has no partitioner info. I run reduceByKey without

Spark 1.5.2 - Different results from reduceByKey over multiple iterations

2016-06-21 Thread Nirav Patel
I have an RDD[String, MyObj] which is a result of Join + Map operation. It has no partitioner info. I run reduceByKey without passing any Partitioner or partition counts. I observed that the output aggregation result for a given key is incorrect sometimes, like 1 out of 5 times. It looks like reduce

Re: Dataset - reduceByKey

2016-06-07 Thread Jacek Laskowski
t; I am looking at the option of moving RDD based operations to Dataset based > operations. We are calling 'reduceByKey' on some pair RDDs we have. What > would the equivalent be in the Dataset interface - I do not see a simple > reduceByKey replacement. > &

Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
On Tue, Jun 7, 2016 at 2:50 PM, Richard Marscher <rmarsc...@localytics.com> wrote: > There certainly are some gaps between the richness of the RDD API and the > Dataset API. I'm also migrating from RDD to Dataset and ran into > reduceByKey and join scenarios. > > In the spark

Re: Dataset - reduceByKey

2016-06-07 Thread Richard Marscher
There certainly are some gaps between the richness of the RDD API and the Dataset API. I'm also migrating from RDD to Dataset and ran into reduceByKey and join scenarios. In the spark-dev list, one person was discussing reduceByKey being sub-optimal at the moment and it spawned this JIRA https

Re: Dataset - reduceByKey

2016-06-07 Thread Takeshi Yamamuro
un 7, 2016 at 2:32 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> > wrote: > >> Hello. >> >> I am looking at the option of moving RDD based operations to Dataset >> based operations. We are calling 'reduceByKey' on some pair RDDs we have. >> What would the

Re: Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
at 2:32 PM, Bryan Jeffrey <bryan.jeff...@gmail.com> wrote: > Hello. > > I am looking at the option of moving RDD based operations to Dataset based > operations. We are calling 'reduceByKey' on some pair RDDs we have. What > would the equivalent be in the Dataset interface -

Dataset - reduceByKey

2016-06-07 Thread Bryan Jeffrey
Hello. I am looking at the option of moving RDD based operations to Dataset based operations. We are calling 'reduceByKey' on some pair RDDs we have. What would the equivalent be in the Dataset interface - I do not see a simple reduceByKey replacement. Regards, Bryan Jeffrey

How to change Spark DataFrame groupby("col1",..,"coln") into reduceByKey()?

2016-05-22 Thread unk1102
by keys are empty. How do I avoid empty group by keys in DataFrame? Does DataFrame avoid empty group by key? I have around 8 keys on which I do group by. sourceFrame.select("blabla").groupby("col1","col2","col3",..."col8").agg("bla bla");

Re: Executor memory requirement for reduceByKey

2016-05-17 Thread Raghavendra Pandey
Even though it does not sound intuitive, reduceByKey expects all values for a particular key in a partition to be loaded into memory. So once you increase the partitions you can run the jobs.
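A small sketch of the advice in this thread, assuming an existing SparkContext sc and made-up data: raising the number of shuffle partitions for the reduce keeps each partition's share of the 75 million keys small enough to process.

val pairs = sc.parallelize(0 until 1000000).map(i => (i % 75000, 1L))

// Either pass the partition count straight to reduceByKey...
val reducedA = pairs.reduceByKey(_ + _, 2000)

// ...or repartition beforehand if later stages should reuse the same layout.
val reducedB = pairs.repartition(2000).reduceByKey(_ + _)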

Re: Executor memory requirement for reduceByKey

2016-05-13 Thread Sung Hwan Chung
...@cs.stanford.edu> wrote: >> >>> Hello, >>> >>> I'm using Spark version 1.6.0 and have trouble with memory when trying >>> to do reducebykey on a dataset with as many as 75 million keys. I.e. I get >>> the following exception when I run the

Re: Executor memory requirement for reduceByKey

2016-05-13 Thread Sung Hwan Chung
, Sung Hwan Chung < > coded...@cs.stanford.edu> wrote: > >> Hello, >> >> I'm using Spark version 1.6.0 and have trouble with memory when trying to >> do reducebykey on a dataset with as many as 75 million keys. I.e. I get the >> following exception when I r

Re: Executor memory requirement for reduceByKey

2016-05-13 Thread Ted Yu
trying to > do reducebykey on a dataset with as many as 75 million keys. I.e. I get the > following exception when I run the task. > > There are 20 workers in the cluster. It is running under the standalone > mode with 12 GB assigned per executor and 4 cores per worker. The > spark.memo

Executor memory requirement for reduceByKey

2016-05-13 Thread Sung Hwan Chung
Hello, I'm using Spark version 1.6.0 and have trouble with memory when trying to do reducebykey on a dataset with as many as 75 million keys. I.e. I get the following exception when I run the task. There are 20 workers in the cluster. It is running under the standalone mode with 12 GB assigned

Re: reduceByKey as Action or Transformation

2016-04-25 Thread Sumedh Wale
On Monday 25 April 2016 11:28 PM, Weiping Qu wrote: Dear Ted, You are right. ReduceByKey is transformation. My fault. I would rephrase my question using following code snippet. object ScalaApp {   def main(args: Array

Re: reduceByKey as Action or Transformation

2016-04-25 Thread Weiping Qu
Dear Ted, You are right. ReduceByKey is a transformation. My fault. I would rephrase my question using the following code snippet. object ScalaApp { def main(args: Array[String]): Unit = { val conf = new SparkConf().setAppName("ScalaApp").setMaster("local") val sc = n

Re: reduceByKey as Action or Transformation

2016-04-25 Thread Ted Yu
ds > on whether they are lazily executed or not. > As far as I saw from my codes, the reduceByKey will be executed without > any operations in the Action category. > Please correct me if I am wrong. > > Thanks, > Regards, > Weiping > > On 25.04.2016 17:20, Chadha Pooja

Re: reduceByKey as Action or Transformation

2016-04-25 Thread Weiping Qu
Thanks. I read that from the specification. I thought the way people distinguish actions and transformations depends on whether they are lazily executed or not. As far as I saw from my code, reduceByKey will be executed without any operations in the Action category. Please correct me if I

reduceByKey as Action or Transformation

2016-04-25 Thread Weiping Qu
Hi, I'd just like to verify whether reduceByKey is a transformation or an action. As written in the RDD paper, the Spark flow will only be triggered once actions are reached. I tried it and saw that my flow is executed once there is a reduceByKey, while it is categorized as a transformation
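A minimal illustration of the question being asked, assuming an existing SparkContext sc: reduceByKey on its own only builds a new RDD in the lineage; a job is submitted only when an action such as count() or collect() is reached.

val pairs = sc.parallelize(Seq((1, 10), (1, 20), (2, 30)))

val sums = pairs.reduceByKey(_ + _)   // transformation: no job is submitted here

println(sums.toDebugString)           // still just lineage, nothing has executed

val n = sums.count()                  // action: this call triggers the job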

What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Tom Seddon
in my job that send data to the driver - just a pull of data from S3, a map and reduceByKey and then conversion to dataframe and saveAsTable action that puts the results back on S3. I've found a few references to reduceByKey and spark.driver.maxResultSize having some importance, but cannot fathom how

Re: What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Zhan Zhang
ke any sense to me, as there are no actions in my job that send data to the driver - just a pull of data from S3, a map and reduceByKey and then conversion to dataframe and saveAsTable action that puts the results back on S3. I've found a few references to reduceByKey and spark.driver.maxResult

Re: What is the relationship between reduceByKey and spark.driver.maxResultSize?

2015-12-11 Thread Eugen Cepoi
o actions in my job that > send data to the driver - just a pull of data from S3, a map and > reduceByKey and then conversion to dataframe and saveAsTable action that > puts the results back on S3. > > I've found a few references to reduceByKey and spark.driver.maxResultSize >

Re: Data in one partition after reduceByKey

2015-11-25 Thread Ruslan Dautkhanov
65), (320,66), (340,67), > (360,68), (380,69), (400,70), (420,71), (440,72), (460,73), (480,74), (500... > > scala> val rddNum = sc.parallelize(nums) > rddNum: org.apache.spark.rdd.RDD[(Int, Int)] = ParallelCollectionRDD[0] at > parallelize at :23 > > scala> val reducedNum

Re: Data in one partition after reduceByKey

2015-11-23 Thread Patrick McGloin
)] = ShuffledRDD[1] at reduceByKey at :25 scala> reducedNum.mapPartitions(iter => Array(iter.size).iterator, true).collect.toList res2: List[Int] = List(50, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0) To distribute my data more evenly across the partitions I created my own custom Partitioner
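A sketch of the approach Patrick describes, with made-up key values and an existing SparkContext sc assumed: when every key shares a factor with the partition count, HashPartitioner sends them all to one partition, so a custom Partitioner is handed to reduceByKey instead.

import org.apache.spark.Partitioner

class SpreadPartitioner(parts: Int) extends Partitioner {
  override def numPartitions: Int = parts
  override def getPartition(key: Any): Int = {
    // Divide away the common factor before taking the modulo, so consecutive
    // multiples of 20 land in different partitions.
    val k = key.asInstanceOf[Int]
    (k / 20) % parts
  }
}

val nums = sc.parallelize((1 to 500).map(i => ((i % 50) * 20, i)))   // keys 0, 20, ..., 980
val reduced = nums.reduceByKey(new SpreadPartitioner(20), _ + _)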

Data in one partition after reduceByKey

2015-11-20 Thread Patrick McGloin
zes: RDD[Int] = rdd.mapPartitions(iter => Array(iter.size).iterator, true)log.info(s"rdd ==> [${sizes.collect.toList}]") My question is why does my data end up in one partition after the reduceByKey? After the filter it can be seen that the data is evenly distributed, but the

Re: Incorrect results with reduceByKey

2015-11-18 Thread tovbinm
Deep copying the data solved the issue: data.map(r => {val t = SpecificData.get().deepCopy(r.getSchema, r); (t.id, List(t)) }).reduceByKey(_ ++ _) (noted here: https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/SparkContext.scala#L1003) Thanks Igor Ber

Incorrect results with reduceByKey

2015-11-17 Thread tovbinm
Howdy, We've noticed a strange behavior with Avro serialized data and reduceByKey RDD functionality. Please see below: // We're reading a bunch of Avro serialized data val data: RDD[T] = sparkContext.hadoopFile(path, classOf[AvroInputFormat[T]], classOf[AvroWrapper[T]], classOf[NullWritable

Re: job hangs when using pipe() with reduceByKey()

2015-11-01 Thread Gylfi
Hi. What is slow exactly? In code-base 1: When you run the persist() + count() you stored the result in RAM. Then the map + reduceByKey is done on in-memory data. In the latter case (all-in-one-line) you are doing both steps at the same time. So you are saying that if you sum up the time

Re: job hangs when using pipe() with reduceByKey()

2015-11-01 Thread hotdog
Yes, the first code takes only 30 mins, but with the second method I waited for 5 hours and it only finished 10%. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/job-hangs-when-using-pipe-with-reduceByKey-tp25242p25249.html Sent from the Apache Spark User List mailing

job hangs when using pipe() with reduceByKey()

2015-10-31 Thread hotdog
I meet a situation: When I use val a = rdd.pipe("./my_cpp_program").persist() a.count() // just use it to persist a val b = a.map(s => (s, 1)).reduceByKey().count() it's so fast, but when I use val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByK

Re:Re: job hangs when using pipe() with reduceByKey()

2015-10-31 Thread 李森栋
pipe("./my_cpp_program").persist() a.count() // just use it to persist a val b = a.map(s => (s, 1)).reduceByKey().count() it 's so fast but when I use val b = rdd.pipe("./my_cpp_program").map(s => (s, 1)).reduceByKey().count() it is so slow and there are many

Re: job hangs when using pipe() with reduceByKey()

2015-10-31 Thread Ted Yu
Which Spark release are you using ? Which OS ? Thanks On Sat, Oct 31, 2015 at 5:18 AM, hotdog <lisend...@163.com> wrote: > I meet a situation: > When I use > val a = rdd.pipe("./my_cpp_program").persist() > a.count() // just use it to persist a > val b = a.map(s

Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread swetha
Hi, We currently use reduceByKey to reduce by a particular metric name in our Streaming/Batch job. It seems to be doing a lot of shuffles and it has an impact on performance. Does using a custom partitioner before calling reduceByKey improve performance? Thanks, Swetha -- View this message

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread Tathagata Das
ultimately improves performance. On Tue, Oct 27, 2015 at 1:20 PM, swetha <swethakasire...@gmail.com> wrote: > Hi, > > We currently use reduceByKey to reduce by a particular metric name in our > Streaming/Batch job. It seems to be doing a lot of shuffles and it has > impact

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread swetha kasireddy
So, Wouldn't using a customPartitioner on the rdd upon which the groupByKey or reduceByKey is performed avoid shuffles and improve performance? My code does groupByAndSort and reduceByKey on different datasets as shown below. Would using a custom partitioner on those datasets before using

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread Tathagata Das
do that operation with a reduceByKey? 2. If not, use more partitions. That would cause less data in each partition, so less spilling. 3. You can control the amount of memory allocated for shuffles by changing the configuration spark.shuffle.memoryFraction. A higher fraction would cause less spilling

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread Tathagata Das
e same. > > > Also, does using a customPartitioner for a reduceByKey improve performance? > > > def getGrpdAndSrtdRecs(rdd: RDD[(String, (Long, String))]): RDD[(String, > List[(Long, String)])] = > { val grpdRecs = rdd.groupByKey(); val srtdRecs = > grpdRecs.mapValues[(List[(Long

Re: Does using Custom Partitioner before calling reduceByKey improve performance?

2015-10-27 Thread swetha kasireddy
. Also, does using a customPartitioner for a reduceByKey improve performance? def getGrpdAndSrtdRecs(rdd: RDD[(String, (Long, String))]): RDD[(String, List[(Long, String)])] = { val grpdRecs = rdd.groupByKey(); val srtdRecs = grpdRecs.mapValues[(List[(Long, String)])](iter => iter.toList.sortBy(_

Re: Configuring Spark for reduceByKey on on massive data sets

2015-10-12 Thread hotdog
Hi Daniel, did you solve your problem? I met the same problem when running reduceByKey over massive data on YARN. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Configuring-Spark-for-reduceByKey-on-on-massive-data-sets-tp5966p25023.html Sent from

Re: "Too many open files" exception on reduceByKey

2015-10-11 Thread Tian Zhang
It turns out that Mesos can override the OS ulimit -n setting, so we have increased the Mesos slave ulimit -n setting. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p25019.html Sent from the Apache Spark

Re: "Too many open files" exception on reduceByKey

2015-10-09 Thread tian zhang
un(Thread.java:745) -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p24985.html Sent from the Apache Spark User List mailing list archive at Nabble.com. -

Re: "Too many open files" exception on reduceByKey

2015-10-08 Thread Tian Zhang
-many-open-files-exception-on-reduceByKey-tp2462p24985.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

Re: "Too many open files" exception on reduceByKey

2015-10-08 Thread DB Tsai
va.lang.Thread.run(Thread.java:745) > > > > > > -- > View this message in context: > http://apache-spark-user-list.1001560.n3.nabble.com/Too-many-open-files-exception-on-reduceByKey-tp2462p24985.html > Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: How to make Group By/reduceByKey more efficient?

2015-09-24 Thread Adrian Tanase
All the *ByKey aggregations perform an efficient shuffle and preserve partitioning on the output. If all you need is to call reduceByKey, then don’t bother with groupBy. You should use groupBy if you really need all the datapoints from a key for a very custom operation. From the docs: Note

reduceByKey inside updateStateByKey in Spark Streaming???

2015-09-24 Thread swetha
Hi, How to use reduceByKey inside updateStateByKey? Suppose I have a bunch of keys for which I need to do sum and average inside the updateStateByKey by joining with old state. How do I accomplish that? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560
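One common way to do what is being asked, as a sketch rather than a quote from the replies: reduce each batch to a (sum, count) pair per key with reduceByKey, fold that into the running state in updateStateByKey, and derive the average from the state. events is a hypothetical DStream[(String, Double)].

import org.apache.spark.streaming.dstream.DStream

def runningSumAndAvg(events: DStream[(String, Double)]): DStream[(String, (Double, Long, Double))] = {
  // Per-batch partial aggregate: (sum, count) for every key in the batch.
  val perBatch = events
    .mapValues(v => (v, 1L))
    .reduceByKey { case ((s1, c1), (s2, c2)) => (s1 + s2, c1 + c2) }

  // Merge the batch aggregate into the previously stored (sum, count).
  val state = perBatch.updateStateByKey[(Double, Long)] {
    (batch: Seq[(Double, Long)], old: Option[(Double, Long)]) =>
      val (bs, bc) = batch.foldLeft((0.0, 0L)) { case ((s, c), (s2, c2)) => (s + s2, c + c2) }
      val (os, oc) = old.getOrElse((0.0, 0L))
      Some((os + bs, oc + bc))
  }

  // Expose sum, count and the derived average.
  state.mapValues { case (sum, count) => (sum, count, if (count == 0) 0.0 else sum / count) }
}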

Re: reduceByKey inside updateStateByKey in Spark Streaming???

2015-09-24 Thread Adrian Tanase
case? Sent from my iPhone > On 24 Sep 2015, at 19:47, swetha <swethakasire...@gmail.com> wrote: > > Hi, > > How to use reduceByKey inside updateStateByKey? Suppose I have a bunch of > keys for which I need to do sum and average inside the updateStateByKey by > joini

How to make Group By/reduceByKey more efficient?

2015-09-22 Thread swetha
Hi, How to make Group By more efficient? Is it recommended to use a custom partitioner and then do a Group By? And can we use a custom partitioner and then use a reduceByKey for optimization? Thanks, Swetha -- View this message in context: http://apache-spark-user-list.1001560.n3

reduceByKey not working on JavaPairDStream

2015-08-26 Thread Deepesh Maheshwari
Hi, I have applied mapToPair and then a reduceByKey on a DStream to obtain a JavaPairDStream<String, Map<String, Object>>. I have to apply a flatMapToPair and reduceByKey on the DStream obtained above. But I do not see any logs from the reduceByKey operation. Can anyone explain why this is happening

Re: reduceByKey not working on JavaPairDStream

2015-08-26 Thread Sean Owen
I don't see that you invoke any action in this code. It won't do anything unless you tell it to perform an action that requires the transformations. On Wed, Aug 26, 2015 at 7:05 AM, Deepesh Maheshwari deepesh.maheshwar...@gmail.com wrote: Hi, I have applied mapToPair and then a reduceByKey
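What Sean is pointing at, in miniature and with a made-up socket source: DStream transformations, reduceByKey included, are lazy, so without an output operation such as print() or foreachRDD() no batch ever runs and no logs from the reduce function appear.

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf().setAppName("DStreamOutputOp").setMaster("local[2]")
val ssc = new StreamingContext(conf, Seconds(5))

val lines = ssc.socketTextStream("localhost", 9999)

val counts = lines.flatMap(_.split(" ")).map((_, 1)).reduceByKey(_ + _)

counts.print()          // the output operation that actually makes batches execute

ssc.start()
ssc.awaitTermination()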

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-24 Thread satish chandra j
HI All, Please find fix info for users who are following the mail chain of this issue and the respective solution below: *reduceByKey: Non working snippet* import org.apache.spark.Context import org.apache.spark.Context._ import org.apache.spark.SparkConf val conf = new SparkConf() val sc = new

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-22 Thread satish chandra j
UseCompressedOops is set; assuming yes res0: Array[(Int, Int)] = Array((0,3), (1,50), (2,40)) Yong -- Date: Fri, 21 Aug 2015 19:24:09 +0530 Subject: Re: Transformation not happening for reduceByKey or GroupByKey From: jsatishchan...@gmail.com To: abhis

RE: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread java8964
: Transformation not happening for reduceByKey or GroupByKey From: zjf...@gmail.com To: jsatishchan...@gmail.com CC: robin.e...@xense.co.uk; user@spark.apache.org Hi Satish, I don't see where spark support -i, so suspect it is provided by DSE. In that case, it might be bug of DSE. On Fri, Aug 21, 2015

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread satish chandra j
know too much about the original problem though. Yong -- Date: Fri, 21 Aug 2015 18:19:49 +0800 Subject: Re: Transformation not happening for reduceByKey or GroupByKey From: zjf...@gmail.com To: jsatishchan...@gmail.com CC: robin.e...@xense.co.uk; user

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread Abhishek R. Singh
), (0,2),(1,20),(1,30),(2,40)) I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function on Values for each key Code: RDD.reduceByKey((x,y) = x+y) RDD.take(3) Result in console: RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey at console:73

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread satish chandra j
for each key Code: RDD.reduceByKey((x,y) = x+y) RDD.take(3) Result in console: RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey at console:73 res:Array[(Int,Int)] = Array() Command as mentioned dse spark --master local --jars postgresql-9.4-1201.jar -i

RE: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread java8964
not happening for reduceByKey or GroupByKey From: jsatishchan...@gmail.com To: abhis...@tetrationanalytics.com CC: user@spark.apache.org HI Abhishek, I have even tried that but rdd2 is empty Regards,Satish On Fri, Aug 21, 2015 at 6:47 PM, Abhishek R. Singh abhis...@tetrationanalytics.com wrote: You had

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread satish chandra j
),(2,40)) I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function on Values for each key Code: RDD.reduceByKey((x,y) = x+y) RDD.take(3) Result in console: RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey at console:73 res:Array[(Int,Int)] = Array

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread satish chandra j
),(Int)] = Array((0,1), (0,2),(1,20),(1,30),(2,40)) I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function on Values for each key Code: RDD.reduceByKey((x,y) = x+y) RDD.take(3) Result in console: RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread Jeff Zhang
((0,3),(1,50),(2,40)) just a sum function on Values for each key Code: RDD.reduceByKey((x,y) = x+y) RDD.take(3) Result in console: RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey at console:73 res:Array[(Int,Int)] = Array() Command as mentioned dse spark --master

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-21 Thread satish chandra j
), (0,2),(1,20),(1,30),(2,40)) I am expecting output as Array((0,3),(1,50),(2,40)) just a sum function on Values for each key Code: RDD.reduceByKey((x,y) = x+y) RDD.take(3) Result in console: RDD: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey at console:73 res:Array[(Int

Transformation not happening for reduceByKey or GroupByKey

2015-08-20 Thread satish chandra j
: org.apache.spark.rdd.RDD[(Int,Int)]= ShuffledRDD[1] at reduceByKey at console:73 res:Array[(Int,Int)] = Array() Command as mentioned dse spark --master local --jars postgresql-9.4-1201.jar -i ScriptFile Please let me know what is missing in my code, as my resultant Array is empty Regards, Satish

Re: Transformation not happening for reduceByKey or GroupByKey

2015-08-20 Thread satish chandra j
)]= ShuffledRDD[1] at reduceByKey at console:73 res:Array[(Int,Int)] = Array() Command as mentioned dse spark --master local --jars postgresql-9.4-1201.jar -i ScriptFile Please let me know what is missing in my code, as my resultant Array is empty Regards, Satish

Re: Apache Spark : Custom function for reduceByKey - missing arguments for method

2015-07-10 Thread Richard Marscher
, V) = V. I am really stuck here. Please help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Custom-function-for-reduceByKey-missing-arguments-for-method-tp23756.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Apache Spark : Custom function for reduceByKey - missing arguments for method

2015-07-09 Thread ameyamm
really stuck here. Please help. -- View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Apache-Spark-Custom-function-for-reduceByKey-missing-arguments-for-method-tp23756.html Sent from the Apache Spark User List mailing list archive at Nabble.com

Re: run reduceByKey on huge data in spark

2015-06-30 Thread barge.nilesh
-spark-user-list.1001560.n3.nabble.com/run-reduceByKey-on-huge-data-in-spark-tp23546p23555.html Sent from the Apache Spark User List mailing list archive at Nabble.com.

run reduceByKey on huge data in spark

2015-06-30 Thread hotdog
I'm running reduceByKey in Spark. My program is the simplest example of Spark: val counts = textFile.flatMap(line => line.split(" ")).repartition(2).map(word => (word, 1)).reduceByKey(_ + _, 1) counts.saveAsTextFile("hdfs://...") but it always runs out

Re: run reduceByKey on huge data in spark

2015-06-30 Thread lisendong
mode ? Cheers On Tue, Jun 30, 2015 at 10:03 AM, hotdog lisend...@163.com mailto:lisend...@163.com wrote: I'm running reduceByKey in spark. My program is the simplest example of spark: val counts = textFile.flatMap(line = line.split( )).repartition(2). .map(word

Re: run reduceByKey on huge data in spark

2015-06-30 Thread Ted Yu
Which Spark release are you using ? Are you running in standalone mode ? Cheers On Tue, Jun 30, 2015 at 10:03 AM, hotdog lisend...@163.com wrote: I'm running reduceByKey in spark. My program is the simplest example of spark: val counts = textFile.flatMap(line = line.split( )).repartition

reduceByKey - add values to a list

2015-06-25 Thread Kannappan Sirchabesan
), (USA, Colorado) Output RDD: (USA, [California, Colorado]), (UK, Yorkshire) Is it possible to use reduceByKey or foldByKey to achieve this, instead of groupByKey? Something equivalent to a cons operator from LISP, so that I could just say reduceByKey(lambda x,y: (cons x y)). Maybe it is more
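Two sketches of the transformation being asked about, assuming an existing SparkContext sc: the first stays with reduceByKey by wrapping each value in a one-element list, the second uses aggregateByKey, whose per-partition function is the cons-like step.

val pairs = sc.parallelize(Seq(("USA", "California"), ("UK", "Yorkshire"), ("USA", "Colorado")))

// Option 1: wrap each value in a list, then concatenate lists in reduceByKey.
val viaReduce = pairs.mapValues(v => List(v)).reduceByKey(_ ++ _)

// Option 2: aggregateByKey starts from an empty list and prepends each value
// (the "cons"), then merges partial lists across partitions.
val viaAggregate = pairs.aggregateByKey(List.empty[String])(
  (acc, v) => v :: acc,
  (a, b) => a ::: b
)

viaAggregate.collect().foreach(println)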
