Hi,

This is on Spark version 1.1.0.
I did a simple test of the MEMORY_AND_DISK storage level:

    import org.apache.spark.storage.StorageLevel
    var file = sc.textFile("file:///path/to/file.txt").persist(StorageLevel.MEMORY_AND_DISK)
    file.count()

The file is 1.5 GB and there is only 1 worker. I requested 1 GB of worker memory per node:

    ID:              app-20141120193912-0002
    Name:            Spark shell
    Cores:           64
    Memory per Node: 1024.0 MB
    Submitted Time:  2014/11/20 19:39:12
    User:            root
    State:           RUNNING
    Duration:        6.0 min

After running the count, the Storage page of the web UI indicates the entire file was saved to disk:

    RDD Name:          file:///path/to/file.txt
    Storage Level:     Disk Serialized 1x Replicated
    Cached Partitions: 46
    Fraction Cached:   100%
    Size in Memory:    0.0 B
    Size in Tachyon:   0.0 B
    Size on Disk:      1476.5 MB

1. Shouldn't some of the partitions have been kept in memory?

2. If I run with MEMORY_ONLY instead, some partitions are cached in memory, but according to the executors page there is still space left (220.6 MB used out of 530.3 MB) that was not filled. Each partition is about 73 MB.

    RDD Name:          file:///path/to/file.txt
    Storage Level:     Memory Deserialized 1x Replicated
    Cached Partitions: 3
    Fraction Cached:   7%
    Size in Memory:    220.6 MB
    Size in Tachyon:   0.0 B
    Size on Disk:      0.0 B

    Executor ID:    0
    Address:        foo.co:48660
    RDD Blocks:     3
    Memory Used:    220.6 MB / 530.3 MB
    Disk Used:      0.0 B
    Active Tasks:   0
    Failed Tasks:   0
    Complete Tasks: 46
    Total Tasks:    46
    Task Time:      14.2 m
    Input:          1457.4 MB
    Shuffle Read:   0.0 B
    Shuffle Write:  0.0 B

    14/11/20 19:53:22 INFO BlockManagerInfo: Added rdd_1_22 in memory on foo.co:48660 (size: 73.6 MB, free: 309.6 MB)
    14/11/20 19:53:22 INFO TaskSetManager: Finished task 22.0 in stage 0.0 (TID 22) in 29833 ms on foo.co (43/46)
    14/11/20 19:53:24 INFO TaskSetManager: Finished task 33.0 in stage 0.0 (TID 33) in 31502 ms on foo.co (44/46)
    14/11/20 19:53:24 INFO TaskSetManager: Finished task 24.0 in stage 0.0 (TID 24) in 31651 ms on foo.co (45/46)
    14/11/20 19:53:24 INFO TaskSetManager: Finished task 14.0 in stage 0.0 (TID 14) in 31782 ms on foo.co (46/46)
    14/11/20 19:53:24 INFO DAGScheduler: Stage 0 (count at <console>:16) finished in 31.818 s
    14/11/20 19:53:24 INFO TaskSchedulerImpl: Removed TaskSet 0.0, whose tasks have all completed, from pool
    14/11/20 19:53:24 INFO SparkContext: Job finished: count at <console>:16, took 31.926585742 s
    res0: Long = 10000000

Is this correct?

3. I can't work out the math behind the 530.3 MB shown as available to the executor. I expected 1024 MB * spark.storage.memoryFraction (0.6) = 614.4 MB, which doesn't match. (A back-of-envelope sketch of what I tried is appended below.)

Thanks!
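P.S. Here is the arithmetic I tried for question 3, as a rough sketch only. It assumes the static storage-memory model in 1.1 with the default spark.storage.memoryFraction (0.6); the spark.storage.safetyFraction (0.9) term is my reading of the defaults and may be wrong:

    // Back-of-envelope only -- paste into a Scala REPL or run as a script.
    // Assumes Spark 1.1's static storage-memory accounting and default fractions.
    object StorageMemoryEstimate {
      def main(args: Array[String]): Unit = {
        val heapMb         = 1024.0 // worker memory granted to the executor
        val memoryFraction = 0.6    // spark.storage.memoryFraction (default)
        val safetyFraction = 0.9    // spark.storage.safetyFraction (default, my assumption)

        println(f"${heapMb * memoryFraction}%.1f MB")                  // 614.4 MB -- what I expected
        println(f"${heapMb * memoryFraction * safetyFraction}%.1f MB") // 553.0 MB -- still not 530.3 MB
        // Neither number matches the 530.3 MB on the executors page, so I am presumably
        // missing a step (JVM overhead? Runtime.getRuntime.maxMemory being less than -Xmx?).
      }
    }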