Re: Iterative rdd union + reduceByKey operations on small dataset leads to "No space left on device" error on account of a lot of shuffle spill.

2018-07-27 Thread Dinesh Dharme
Yeah, you are right. I ran the experiments locally, not on YARN. On Fri, Jul 27, 2018 at 11:54 PM, Vadim Semenov wrote: > `spark.worker.cleanup.enabled=true` doesn't work for YARN. > On Fri, Jul 27, 2018 at 8:52 AM dineshdharme > wrote: > > > > I am trying to do a few (union + reduceByKey)

Re: Iterative rdd union + reduceByKey operations on small dataset leads to "No space left on device" error on account of a lot of shuffle spill.

2018-07-27 Thread Vadim Semenov
`spark.worker.cleanup.enabled=true` doesn't work for YARN. On Fri, Jul 27, 2018 at 8:52 AM dineshdharme wrote: > > I am trying to do a few (union + reduceByKey) operations on a hierarchical > dataset in an iterative fashion in rdd. The first few loops run fine but on > the subsequent loops, the
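For context, a rough sketch of where the relevant cleanup settings live; the directory paths are illustrative, not from the thread. The spark.worker.cleanup.* keys are honored only by Standalone workers, while on YARN the executor scratch space comes from the NodeManager's local dirs and spark.local.dir / SPARK_LOCAL_DIRS are overridden:

    # spark-defaults.conf (Standalone mode only)
    spark.worker.cleanup.enabled   true
    spark.worker.cleanup.interval  1800

    # yarn-site.xml property (shown here as key = value for brevity):
    # where shuffle/spill files actually land when running on YARN
    yarn.nodemanager.local-dirs = /mnt/disk1/yarn/local,/mnt/disk2/yarn/local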

Iterative rdd union + reduceByKey operations on small dataset leads to "No space left on device" error on account of a lot of shuffle spill.

2018-07-27 Thread dineshdharme
I am trying to do a few (union + reduceByKey) operations on a hierarchical dataset in an iterative fashion in rdd. The first few loops run fine but on the subsequent loops, the operations end up using the whole scratch space provided to them. I have set the spark scratch directory, i.e.
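A minimal sketch of the loop being described, with hypothetical RDD names and an illustrative checkpoint path; periodically checkpointing the accumulated RDD is one common way to keep the lineage, and with it the shuffle spill left on disk, from growing on every iteration:

    import org.apache.spark.{SparkConf, SparkContext}

    val sc = new SparkContext(new SparkConf().setAppName("iterative-union-sketch"))
    sc.setCheckpointDir("/path/to/checkpoints")               // illustrative path

    // hypothetical starting level of the hierarchy: (key, value) pairs
    var acc = sc.parallelize(Seq(("a", 1L), ("b", 2L)))

    for (i <- 1 to 10) {
      val nextLevel = sc.parallelize(Seq(("a", 1L), ("c", 3L)))  // stand-in for the next level
      acc = acc.union(nextLevel).reduceByKey(_ + _)
      if (i % 3 == 0) {        // every few iterations, cut the lineage
        acc.checkpoint()
        acc.count()            // action to force the checkpoint to materialize
      }
    }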

Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-20 Thread Everett Anderson
PM, Everett Anderson <ever...@nuna.com> wrote: > Hi! > > On Thu, Mar 16, 2017 at 5:20 PM, Burak Yavuz <brk...@gmail.com> wrote: > >> Hi Everett, >> >> IIRC we added unionAll in Spark 2.0 which is the same implementation as >> rdd union. The union

Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Everett Anderson
Hi! On Thu, Mar 16, 2017 at 5:20 PM, Burak Yavuz <brk...@gmail.com> wrote: > Hi Everett, > > IIRC we added unionAll in Spark 2.0 which is the same implementation as > rdd union. The union in DataFrames with Spark 2.0 does deduplication, and > that's why you should be seei

Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Burak Yavuz
Hi Everett, IIRC we added unionAll in Spark 2.0 which is the same implementation as rdd union. The union in DataFrames with Spark 2.0 does deduplication, and that's why you should be seeing the slowdown. Best, Burak On Thu, Mar 16, 2017 at 4:14 PM, Everett Anderson <ever...@nuna.com.inva
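For reference, a minimal sketch of the two operators being compared in this thread; ds1 and ds2 are hypothetical Datasets with the same schema:

    // Dataset/DataFrame route discussed above
    val combinedDS  = ds1.union(ds2)

    // RDD route it is being benchmarked against
    val combinedRDD = ds1.rdd.union(ds2.rdd)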

Re: Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Everett Anderson
Looks like the Dataset version of union may also fail with the following on larger data sets, which again seems like it might be drawing everything into the driver for some reason -- 7/03/16 22:28:21 WARN TaskSetManager: Lost task 1.0 in stage 91.0 (TID 5760,

Spark 2.0.2 Dataset union() slowness vs RDD union?

2017-03-16 Thread Everett Anderson
Hi, We're using Dataset union() in Spark 2.0.2 to concatenate a bunch of tables together and save as Parquet to S3, but it seems to take a long time. We're using the S3A FileSystem implementation under the covers, too, if that helps. Watching the Spark UI, the executors all eventually stop
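A rough sketch of the kind of job being described, with hypothetical table names and an illustrative s3a:// destination:

    import org.apache.spark.sql.{Dataset, Row, SparkSession}

    val spark = SparkSession.builder().appName("union-to-parquet-sketch").getOrCreate()

    // hypothetical per-table Datasets sharing one schema
    val tables: Seq[Dataset[Row]] = Seq("t1", "t2", "t3").map(spark.table)
    val combined = tables.reduce(_ union _)   // left fold of Dataset.union

    combined.write
      .mode("overwrite")
      .parquet("s3a://example-bucket/output/combined")   // illustrative bucket/path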

RDD union

2015-04-09 Thread Debasish Das
Hi, I have some code that creates ~80 RDDs, and then sc.union is applied to combine all 80 into one for the next step (to run topByKey, for example)... While creating the 80 RDDs takes 3 mins per RDD, doing a union over them takes 3 hrs (I am validating these numbers)... Is there any checkpoint
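A minimal sketch of the two ways to combine that many RDDs (the rdds sequence is hypothetical); SparkContext.union builds a single flat UnionRDD instead of a deep chain of pairwise unions, which keeps the lineage shallow:

    // rdds: Seq[RDD[(String, Double)]] built by the earlier steps (hypothetical)

    // chained pairwise unions: lineage depth grows with the number of RDDs
    val chained = rdds.reduce(_ union _)

    // one flat union over all of them
    val flat = sc.union(rdds)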

RE: Java RDD Union

2014-12-06 Thread Ron Ayoub
Date: Fri, 5 Dec 2014 14:58:37 -0600 Subject: Re: Java RDD Union To: ronalday...@live.com; user@spark.apache.org foreach also creates a new RDD, and does not modify an existing RDD. However, in practice, nothing stops you from fiddling with the Java objects inside an RDD when you get

Re: Java RDD Union

2014-12-06 Thread Sean Owen
I guess a major problem with this is that you lose fault tolerance. You have no way of recreating the local state of the mutable RDD if a partition is lost. Why would you need thousands of RDDs for kmeans? It's a few per iteration. An RDD is more bookkeeping than data structure, itself. They

RE: Java RDD Union

2014-12-06 Thread Ron Ayoub
. But anyway, that is the very thing Spark is advertised for. From: so...@cloudera.com Date: Sat, 6 Dec 2014 06:39:10 -0600 Subject: Re: Java RDD Union To: ronalday...@live.com CC: user@spark.apache.org I guess a major problem with this is that you lose fault tolerance. You have no way

Java RDD Union

2014-12-05 Thread Ron Ayoub
I'm a bit confused regarding the expected behavior of unions. I'm running on 8 cores. I have an RDD that is used to collect cluster associations (cluster id, content id, distance) for internal clusters as well as leaf clusters, since I'm doing hierarchical k-means and need all distances for sorting

Re: Java RDD Union

2014-12-05 Thread Sean Owen
No, RDDs are immutable. union() creates a new RDD, and does not modify an existing RDD. Maybe this obviates the question. I'm not sure what you mean about releasing from memory. If you want to repartition the unioned RDD, you repartition the result of union(), not anything else. On Fri, Dec 5,
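In code, that amounts to something like the following (names are hypothetical):

    val combined = rdd1.union(rdd2)            // a new RDD; rdd1 and rdd2 are untouched
    val balanced = combined.repartition(64)    // repartition the result of union(), not the inputs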

Re: Java RDD Union

2014-12-05 Thread Sameer Farooqui
Hi Ron, Out of curiosity, why do you think that union is modifying an existing RDD in place? In general, all transformations, including union, will create new RDDs, not modify old RDDs in place. Here's a quick test: scala> val firstRDD = sc.parallelize(1 to 5) firstRDD:
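The quoted shell session is cut off above; a hedged reconstruction of that kind of quick test (the printed types are what spark-shell would normally show, console line numbers approximate):

    scala> val firstRDD = sc.parallelize(1 to 5)
    firstRDD: org.apache.spark.rdd.RDD[Int] = ParallelCollectionRDD[0] at parallelize at <console>:12

    scala> val unionRDD = firstRDD.union(sc.parallelize(6 to 10))
    unionRDD: org.apache.spark.rdd.RDD[Int] = UnionRDD[2] at union at <console>:14

    scala> firstRDD.collect()
    res0: Array[Int] = Array(1, 2, 3, 4, 5)    // the original RDD is unchanged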

Re: Java RDD Union

2014-12-05 Thread Sean Owen
but there are referents, and somehow this will no longer work when clustering. Thanks for comments. From: so...@cloudera.com Date: Fri, 5 Dec 2014 14:22:38 -0600 Subject: Re: Java RDD Union To: ronalday...@live.com CC: user@spark.apache.org No, RDDs are immutable. union() creates a new

Question about RDD Union and SubtractByKey

2014-11-10 Thread Darin McBeath
I have the following code where I'm using RDD 'union' and 'subtractByKey' to create a new baseline RDD. All of my RDDs are key pairs with the 'key' a String and the 'value' a String (an XML document). // **// Merge the daily deletes/updates
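A minimal sketch of that merge pattern with hypothetical RDD names: drop the keys that have a delete or update record, then union the fresh records back in to form the new baseline:

    // baseline, dailyUpdates, dailyDeletes: RDD[(String, String)]  (key -> xml document)
    val newBaseline = baseline
      .subtractByKey(dailyDeletes)    // remove deleted keys
      .subtractByKey(dailyUpdates)    // remove stale versions of updated keys
      .union(dailyUpdates)            // add the new/updated documents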

RDD union of a window in Dstream

2014-05-21 Thread Laeeq Ahmed
Hi, I want to do a union of all RDDs in each window of a DStream. I found DStream.union and haven't seen anything like DStream.windowRDDUnion. Is there any way around it? I want to find the mean and SD of all values which come under each sliding window, for which I need to union all the RDDs in each

Re: RDD union of a window in Dstream

2014-05-21 Thread Tobias Pfeiffer
Hi, On Wed, May 21, 2014 at 9:42 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote: I want to do a union of all RDDs in each window of a DStream. A window *is* a union of all RDDs in the respective time interval. The documentation says a DStream is represented as a sequence of RDDs. However, data from a
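As the reply points out, window() already yields the union of the RDDs in each window, so the per-window mean and SD can be computed directly on the windowed DStream. A rough sketch, assuming a hypothetical DStream[Double] named values and illustrative window durations:

    import org.apache.spark.SparkContext._       // Double RDD implicits on older Spark versions
    import org.apache.spark.streaming.Seconds

    // values: DStream[Double] (hypothetical input stream)
    val windowed = values.window(Seconds(30), Seconds(10))

    windowed.foreachRDD { rdd =>
      val stats = rdd.stats()   // StatCounter over the whole window: count, mean, stdev, ...
      println(s"window: count=${stats.count} mean=${stats.mean} stdev=${stats.stdev}")
    }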