Hi, I'm learning Spark and wondering when shuffle data gets deleted. I found the
ContextCleaner class, which cleans up shuffle data when the corresponding shuffle
dependency is GC-ed. Based on the source code, the shuffle dependency is GC-ed only
when the active job finishes, but I'm not sure. Could you explain the life cycle of
= sc.parallelize(data)  // Create and partition the 0.5M items in a single RDD.
  .flatMap(compute(_))  // You still have only one RDD, with each item joined with external data already.
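For context, here is a runnable sketch of the single-RDD pattern in the snippet above. The names `data` and `compute` come from the snippet; their definitions here are placeholders, not the original poster's actual code:

```scala
import org.apache.spark.{SparkConf, SparkContext}

object SingleRddSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("single-rdd-sketch").setMaster("local[*]"))

    // Stand-in for the 0.5M input items mentioned in the thread.
    val data = 1 to 500000

    // Placeholder for the real "join each item with external data" step.
    def compute(x: Int): Seq[(Int, Long)] = Seq((x, x.toLong * 2))

    val result = sc.parallelize(data) // create and partition the items in one RDD
      .flatMap(compute(_))            // still a single RDD; no union needed

    println(result.count())
    sc.stop()
  }
}
```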
Hope this helps.
Kelvin
On Thu, Mar 26, 2015 at 2:37 PM, Yang Chen y...@yang-cs.com wrote:
Hi Mark,
That's true, but neither way lets me combine the RDDs, so I have to avoid
unions.
Thanks,
Yang
On Thu, Mar 26, 2015 at 5:31 PM, Mark Hamstra m...@clearstorydata.com
wrote:
RDD#union is not the same thing as SparkContext#union
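For anyone following the thread: `RDD#union` combines exactly two RDDs, so chaining it over many RDDs nests a `UnionRDD` per call and deepens the lineage, while `SparkContext#union` takes a whole sequence and produces one flat `UnionRDD`. A minimal sketch of the difference (variable names are illustrative):

```scala
import org.apache.spark.{SparkConf, SparkContext}

object UnionSketch {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(
      new SparkConf().setAppName("union-sketch").setMaster("local[*]"))

    val rdds = (1 to 100).map(i => sc.parallelize(Seq(i)))

    // RDD#union is pairwise: reducing over 100 RDDs builds a chain of
    // nested UnionRDDs, one per call.
    val chained = rdds.reduce(_ union _)

    // SparkContext#union builds a single flat UnionRDD over all inputs.
    val flat = sc.union(rdds)

    println(s"chained count = ${chained.count()}, flat count = ${flat.count()}")
    sc.stop()
  }
}
```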
On Thu, Mar 26, 2015 at 2:27 PM, Yang Chen y...@yang