Fwd: the life cycle of ShuffleDependency

2023-12-27 Thread Yang Chen
Hi, I'm learning Spark and wondering when shuffle data gets deleted. I found
the ContextCleaner class, which cleans up shuffle data when the corresponding
ShuffleDependency is GC-ed. From the source code, it looks as if the
ShuffleDependency is GC-ed only when the active job finishes, but I'm not
sure. Could you explain the life cycle of a ShuffleDependency?
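
For what it's worth, as far as I understand, ContextCleaner's cleanup is
driven by weak references rather than by job boundaries: it keeps only a
WeakReference to each ShuffleDependency, and once user code drops its last
strong reference and the object is collected, the reference is enqueued and
the cleaner removes the shuffle data. A minimal plain-Scala sketch of that
pattern (not Spark's actual code; ShuffleState and the id are made up):

```scala
import java.lang.ref.{ReferenceQueue, WeakReference}

// Stand-in for a ShuffleDependency; the cleaner only needs an id to act on.
class ShuffleState(val shuffleId: Int)

val queue = new ReferenceQueue[ShuffleState]

var dep = new ShuffleState(42)           // strong reference held by user code
val ref = new WeakReference(dep, queue)  // cleaner holds only a weak reference

dep = null     // last strong reference dropped -> object becomes collectable
ref.enqueue()  // simulate the GC clearing and enqueueing the reference

// A cleaner thread would poll the queue and delete the shuffle files here.
val cleaned = queue.poll() != null
println(s"shuffle cleaned: $cleaned")
```

The manual `ref.enqueue()` call just makes the sketch deterministic; in the
real cleaner the JVM garbage collector enqueues the reference.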


Re: Combining Many RDDs

2015-03-27 Thread Yang Chen
Hi Kelvin,

Thank you, that works for me. I wrote my own joins that produce Scala
collections instead of using rdd.join.
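
In case it helps anyone else, here is a minimal sketch of that idea with
plain Scala collections (the Item type, the data, and the external map are
all made up): index the shared external dataset once, and have compute return
an ordinary collection instead of an RDD.

```scala
case class Item(key: Int, value: String)

// Shared external dataset, indexed by key once up front (in Spark this
// could be a broadcast variable).
val external: Map[Int, Seq[String]] =
  Map(1 -> Seq("a", "b"), 2 -> Seq("c"))

// "compute" joins one item against the external data and returns a plain
// Scala collection rather than an RDD.
def compute(item: Item): Seq[(String, String)] =
  external.getOrElse(item.key, Seq.empty).map(ext => (item.value, ext))

val data = Seq(Item(1, "x"), Item(2, "y"), Item(3, "z"))
val result = data.flatMap(compute)  // one flat collection, no per-item RDDs
println(result)
```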

Regards,
Yang

On Thu, Mar 26, 2015 at 5:51 PM, Kelvin Chu <2dot7kel...@gmail.com> wrote:

> Hi, I used union() before and yes, it can be slow sometimes. I _guess_ your
> variable 'data' is a Scala collection and compute() returns an RDD. Right?
> If so, I tried the approach below to operate on a single RDD during the
> whole computation (yes, I also saw that too many RDDs hurt performance).
>
> Change compute() to return a Scala collection instead of an RDD:
>
> val result = sc.parallelize(data)  // Create and partition the 0.5M items
>                                    // in a single RDD.
>   .flatMap(compute(_))             // Still one RDD, with each item already
>                                    // joined with the external data.
>
> Hope this helps.
>
> Kelvin
>
> On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen  wrote:
>
>> Hi Mark,
>>
>> That's true, but neither one lets me combine the RDDs efficiently, so I
>> have to avoid unions.
>>
>> Thanks,
>> Yang
>>
>> On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra 
>> wrote:
>>
>>> RDD#union is not the same thing as SparkContext#union
>>>
>>> On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen  wrote:
>>>
>>>> Hi Noorul,
>>>>
>>>> Thank you for your suggestion. I tried that, but ran out of memory. I
>>>> did some searching and found suggestions that we should avoid
>>>> rdd.union (
>>>> http://stackoverflow.com/questions/28343181/memory-efficient-way-of-union-a-sequence-of-rdds-from-files-in-apache-spark
>>>> ).
>>>> I will try to come up with another approach.
>>>>
>>>> Thank you,
>>>> Yang
>>>>
>>>> On Thu, Mar 26, 2015 at 1:13 PM, Noorul Islam K M 
>>>> wrote:
>>>>
>>>>> sparkx  writes:
>>>>>
>>>>> > Hi,
>>>>> >
>>>>> > I have a Spark job and a dataset of 0.5 million items. Each item
>>>>> > performs some sort of computation (joining a shared external
>>>>> > dataset, if that matters) and produces an RDD containing 20-500
>>>>> > result items. Now I would like to combine all these RDDs and run
>>>>> > the next job. What I have found is that the computation itself is
>>>>> > quite fast, but combining the RDDs takes much longer.
>>>>> >
>>>>> > val result = data    // 0.5M data items
>>>>> >   .map(compute(_))   // Produces an RDD - fast
>>>>> >   .reduce(_ ++ _)    // Combining RDDs - slow
>>>>> >
>>>>> > I have also tried collecting the results from compute(_) and using
>>>>> > a flatMap, but that is also slow.
>>>>> >
>>>>> > Is there an efficient way to do this? I'm thinking about writing
>>>>> > the result to HDFS and reading it back from disk for the next job,
>>>>> > but I'm not sure whether that's a preferred approach in Spark.
>>>>> >
>>>>>
>>>>> Are you looking for SparkContext.union() [1] ?
>>>>>
>>>>> It did not perform well with the Spark Cassandra connector in my
>>>>> case, so I am not sure whether it will help you.
>>>>>
>>>>> Thanks and Regards
>>>>> Noorul
>>>>>
>>>>> [1]
>>>>> http://spark.apache.org/docs/1.3.0/api/scala/index.html#org.apache.spark.SparkContext
>>>>>
>>>>
>>>>
>>>>
>>>> --
>>>> Yang Chen
>>>> Dept. of CISE, University of Florida
>>>> Mail: y...@yang-cs.com
>>>> Web: www.cise.ufl.edu/~yang
>>>>
>>>
>>>
>>
>>
>> --
>> Yang Chen
>> Dept. of CISE, University of Florida
>> Mail: y...@yang-cs.com
>> Web: www.cise.ufl.edu/~yang
>>
>
>


-- 
Yang Chen
Dept. of CISE, University of Florida
Mail: y...@yang-cs.com
Web: www.cise.ufl.edu/~yang
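
For reference, the two shapes discussed in this thread can be sketched with
plain Scala collections (compute and the data here are made up). With RDDs,
as far as I can tell, the pairwise reduce stacks one union on top of another,
producing a very deep lineage for ~0.5M items, while the single flatMap keeps
everything in one RDD:

```scala
def compute(x: Int): Seq[Int] = Seq(x, x * 10)  // hypothetical per-item work

val data = Seq(1, 2, 3)

// Shape 1: one collection (one RDD, in Spark) per item, combined pairwise.
// Slow with RDDs because each ++ adds another union over the previous ones.
val combined = data.map(compute).reduce(_ ++ _)

// Shape 2: a single flatMap over the whole dataset. Stays one RDD in Spark.
val flat = data.flatMap(compute)

println(combined == flat)  // same elements, very different execution plans
```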

