Re: About StorageLevel

2014-06-26 Thread Andrew Or
Hi Kang, You raise a good point. Spark does not automatically cache all your RDDs. Why? Simply because the application may create many RDDs, and not all of them will be reused. After all, there is only so much memory available to each executor, and caching an RDD adds some overhead, especially ...
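
A minimal sketch (not from the thread itself) of what explicit caching looks like, assuming a spark shell where sc is already defined; the variable name doubled is illustrative:

    import org.apache.spark.storage.StorageLevel

    // Mark the RDD for caching; nothing is materialized until an action runs.
    val doubled = sc.parallelize(1 to 1000000).map(_ * 2)
    doubled.persist(StorageLevel.MEMORY_ONLY)   // equivalent to doubled.cache()

    doubled.count()   // first action computes the partitions and stores them in executor memory
    doubled.count()   // second action reads the cached blocks instead of recomputing the lineage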

RE: About StorageLevel

2014-06-26 Thread Liu, Raymond
Sent: Friday, June 27, 2014 10:08 AM To: user Subject: Re: About StorageLevel Thank you Andrew, that's very helpful. I still have some doubts about a simple trial: I opened a spark shell in local mode and typed in val r = sc.parallelize(0 to 50) and val r2 = r.keyBy(x => x).groupByKey(10), and then I invoked ...
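
For reference, the trial written out as it was presumably typed in the shell (the '>' of '=>' appears to have been eaten by the archive); the trailing count is an assumption based on the later mention of count jobs:

    val r  = sc.parallelize(0 to 50)
    val r2 = r.keyBy(x => x).groupByKey(10)   // groupByKey(10) forces a shuffle into 10 partitions
    r2.count()                                // an action such as count actually runs the shuffle stage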

RE: About StorageLevel

2014-06-26 Thread tomsheep...@gmail.com
stage, it behaves as if persist(StorageLevel.DISK_ONLY) were called implicitly? Regards, Kang Liu From: Liu, Raymond Date: 2014-06-27 11:02 To: user@spark.apache.org Subject: RE: About StorageLevel I think there is a shuffle stage involved. And the future count job will depend on the first ...
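
A hedged sketch of the behaviour under discussion, assuming a local spark shell; the variable name grouped and the explicit persist comparison are illustrative, not from the thread:

    val grouped = sc.parallelize(0 to 50).keyBy(x => x).groupByKey(10)

    grouped.count()   // job 1: runs the shuffle; map output is written to local disk as a side effect
    grouped.count()   // job 2: the map-side stage can be skipped because its shuffle files still exist

    // This reuse of shuffle files is why it can look as if persist(StorageLevel.DISK_ONLY)
    // had been called implicitly. An explicit persist, by contrast, stores the grouped
    // partitions themselves via the block manager:
    import org.apache.spark.storage.StorageLevel
    grouped.persist(StorageLevel.DISK_ONLY)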