Yeah, you are right. I ran the experiments locally not on YARN.
On Fri, Jul 27, 2018 at 11:54 PM, Vadim Semenov wrote:
`spark.worker.cleanup.enabled=true` doesn't work for YARN.
On Fri, Jul 27, 2018 at 8:52 AM dineshdharme wrote:
I am trying to do a few (union + reduceByKey) operations on a hierarchical
dataset in an iterative fashion on RDDs. The first few loops run fine, but on
the subsequent loops the operations end up using the whole scratch space
provided to them.
I have set the spark scratch directory, i.e.
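A common cause of this symptom is that each loop iteration extends the RDD lineage, so shuffle files from every earlier iteration stay referenced in the scratch directory. One usual remedy is to checkpoint periodically so the lineage is cut and old shuffle data can be cleaned up. A sketch, not the poster's actual code (`step`, `nIters`, and the checkpoint path are assumed):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch: cut the lineage every few iterations so shuffle files from
// earlier loops are no longer pinned in the scratch directory.
def iterate(sc: SparkContext,
            initial: RDD[(String, Long)],
            step: RDD[(String, Long)] => RDD[(String, Long)],
            nIters: Int): RDD[(String, Long)] = {
  sc.setCheckpointDir("/tmp/spark-checkpoints") // any reliable dir; assumption
  var acc = initial
  for (i <- 1 to nIters) {
    acc = step(acc).reduceByKey(_ + _)
    if (i % 5 == 0) {     // every 5th iteration: truncate the lineage
      acc.checkpoint()    // mark for checkpointing
      acc.count()         // an action forces the checkpoint to materialize
    }
  }
  acc
}
```

The checkpoint interval (here every 5 iterations) is a tuning choice: more frequent checkpoints cost extra I/O but keep the graph and scratch usage smaller.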
Hi!
On Thu, Mar 16, 2017 at 5:20 PM, Burak Yavuz <brk...@gmail.com> wrote:
Hi Everett,
IIRC we added unionAll in Spark 2.0, which is the same implementation as rdd
union. The union in DataFrames with Spark 2.0 does deduplication, and
that's why you should be seeing the slowdown.
Best,
Burak
On Thu, Mar 16, 2017 at 4:14 PM, Everett Anderson <ever...@nuna.com.inva
Looks like the Dataset version of union may also fail with the following on
larger data sets, which again seems like it might be drawing everything
into the driver for some reason --
7/03/16 22:28:21 WARN TaskSetManager: Lost task 1.0 in stage 91.0 (TID
5760,
Hi,
We're using Dataset union() in Spark 2.0.2 to concatenate a bunch of tables
together and save as Parquet to S3, but it seems to take a long time. We're
using the S3A FileSystem implementation under the covers, too, if that
helps.
Watching the Spark UI, the executors all eventually stop
Hi,
I have some code that creates ~80 RDDs, and then sc.union is applied to
combine all 80 into one for the next step (to run topByKey, for example)...
While creating the 80 RDDs takes 3 mins per RDD, doing a union over them takes 3
hrs (I am validating these numbers)...
Is there any checkpoint
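One thing worth noting for a case like this: chaining ~80 pairwise `union` calls builds a nested RDD graph about 80 levels deep, whereas the single `SparkContext.union(Seq(...))` call produces one flat UnionRDD over all inputs, which is much cheaper to plan and schedule. A hedged sketch (the element type is an assumption):

```scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

// Sketch: one flat union over all inputs instead of ~80 chained unions.
def unionAllFlat(sc: SparkContext,
                 rdds: Seq[RDD[(String, Double)]]): RDD[(String, Double)] =
  sc.union(rdds) // single UnionRDD holding every input's partitions

// The slow pattern this replaces (one nesting level per call):
// val combined = rdds.reduce(_ union _)
```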
Date: Fri, 5 Dec 2014 14:58:37 -0600
Subject: Re: Java RDD Union
To: ronalday...@live.com; user@spark.apache.org
foreach also creates a new RDD, and does not modify an existing RDD.
However, in practice, nothing stops you from fiddling with the Java
objects inside an RDD when you get
I guess a major problem with this is that you lose fault tolerance.
You have no way of recreating the local state of the mutable RDD if a
partition is lost.
Why would you need thousands of RDDs for kmeans? It's a few per iteration.
An RDD is more bookkeeping than data structure, itself. They
But anyway, that is the very thing Spark is advertised for.
From: so...@cloudera.com
Date: Sat, 6 Dec 2014 06:39:10 -0600
Subject: Re: Java RDD Union
To: ronalday...@live.com
CC: user@spark.apache.org
I'm a bit confused regarding expected behavior of unions. I'm running on 8
cores. I have an RDD that is used to collect cluster associations (cluster id,
content id, distance) for internal clusters as well as leaf clusters since I'm
doing hierarchical k-means and need all distances for sorting
No, RDDs are immutable. union() creates a new RDD, and does not modify
an existing RDD. Maybe this obviates the question. I'm not sure what
you mean about releasing from memory. If you want to repartition the
unioned RDD, you repartition the result of union(), not anything else.
On Fri, Dec 5,
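The immutability contract described above can be seen with plain Scala collections standing in for RDDs (an analogy only, not Spark code): combining builds a new value and leaves both inputs untouched, and any repartitioning would then apply to that new value.

```scala
// Plain-Scala stand-in for RDD semantics: `++` plays the role of union().
// It returns a new collection; neither input is modified.
val first   = Vector(1, 2, 3)
val second  = Vector(4, 5)
val unioned = first ++ second            // analogous to first.union(second)

assert(first == Vector(1, 2, 3))         // input unchanged
assert(unioned == Vector(1, 2, 3, 4, 5)) // new combined value
```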
Hi Ron,
Out of curiosity, why do you think that union is modifying an existing RDD
in place? In general all transformations, including union, will create new
RDDs, not modify old RDDs in place.
Here's a quick test:
scala> val firstRDD = sc.parallelize(1 to 5)
firstRDD:
but there are referents, and somehow this will no longer work
when clustering.
Thanks for comments.
From: so...@cloudera.com
Date: Fri, 5 Dec 2014 14:22:38 -0600
Subject: Re: Java RDD Union
To: ronalday...@live.com
CC: user@spark.apache.org
I have the following code where I'm using RDD 'union' and 'subtractByKey' to
create a new baseline RDD. All of my RDDs are key pairs with the 'key' a
String and the 'value' a String (xml document).
// Merge the daily deletes/updates
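For reference, the merge that a subtractByKey-then-union pattern implements has simple semantics: drop from the baseline every key present in the update set, then append the updates. A plain-Scala sketch of just those semantics (Maps standing in for the (String, String) pair RDDs; the data is made up):

```scala
// subtractByKey: keep baseline entries whose key does NOT appear in updates.
// union: append the updates; the two sides have disjoint keys by construction.
val baseline = Map("doc1" -> "<xml v1/>", "doc2" -> "<xml v1/>")
val updates  = Map("doc2" -> "<xml v2/>", "doc3" -> "<xml v1/>")

val newBaseline = (baseline -- updates.keySet) ++ updates
// doc1 kept as-is, doc2 replaced by the update, doc3 newly added
```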
Hi,
I want to do a union of all RDDs in each window of a DStream. I found DStream.union
and haven't seen anything like DStream.windowRDDUnion.
Is there any way around it?
I want to find the mean and SD of all values which come under each sliding window,
for which I need to union all the RDDs in each
Hi,
On Wed, May 21, 2014 at 9:42 PM, Laeeq Ahmed laeeqsp...@yahoo.com wrote:
I want to do union of all RDDs in each window of DStream.
A window *is* a union of all RDDs in the respective time interval.
The documentation says a DStream is represented as a sequence of
RDDs. However, data from a
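Since each windowed RDD already unions the interval's RDDs, the per-window mean and SD can be computed directly on the windowed stream. A sketch under assumed names and durations (30s window sliding every 10s):

```scala
import org.apache.spark.rdd.RDD
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.dstream.DStream

// (mean, stdev) of the values in one windowed RDD, via a StatCounter pass.
def meanAndSd(rdd: RDD[Double]): (Double, Double) = {
  val s = rdd.stats() // count, mean, stdev in one pass over the window
  (s.mean, s.stdev)
}

// Sketch: one (mean, stdev) pair emitted per window.
def windowStats(values: DStream[Double]): DStream[(Double, Double)] =
  values.window(Seconds(30), Seconds(10)) // each RDD = union of the interval's RDDs
        .transform(rdd => rdd.sparkContext.parallelize(Seq(meanAndSd(rdd))))
```

Note that `StatCounter.stdev` is the population standard deviation; use `sampleStdev` if the sample version is wanted.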