Does anyone have any clue what's going on? Why would caching with 2g of memory be much faster than with 15g?
Thanks very much!

On Fri, Oct 16, 2015 at 2:02 PM, Jia Zhan <zhanjia...@gmail.com> wrote:

> Hi all,
>
> I am running Spark locally on one node and trying to sweep the memory size
> for performance tuning. The machine has 8 CPUs and 16G of main memory, and
> the dataset on my local disk is about 10GB. I have several quick questions
> and appreciate any comments.
>
> 1. Spark performs in-memory computing, but without RDD.cache(), will
> anything be cached in memory at all? My guess is that, without RDD.cache(),
> only a small amount of data will be stored in the OS buffer cache, and
> every iteration of the computation will still need to fetch most of the
> data from disk. Is that right?
>
> 2. To evaluate how caching helps with iterative computation, I wrote the
> simple program shown below, which basically consists of one
> saveAsTextFile() and three reduce() actions/stages. I set
> "spark.driver.memory" to "15g" and left everything else at the defaults.
> Then I ran three experiments.
>
>     val conf = new SparkConf().setAppName("wordCount")
>     val sc = new SparkContext(conf)
>
>     val input = sc.textFile("/InputFiles")
>
>     val words = input.flatMap(line => line.split(" "))
>                      .map(word => (word, 1))
>                      .reduceByKey(_ + _)
>                      .saveAsTextFile("/OutputFiles")
>
>     val ITERATIONS = 3
>     for (i <- 1 to ITERATIONS) {
>       val totallength = input.filter(line => line.contains("the"))
>                              .map(s => s.length)
>                              .reduce((a, b) => a + b)
>     }
>
> (I) The first run: no caching at all. The application finishes in ~12
> minutes (2.6min + 3.3min + 3.2min + 3.3min).
>
> (II) In the second run, I modified the code so that the input is cached:
>
>     val input = sc.textFile("/InputFiles").cache()
>
> The application finishes in ~11 minutes (5.4min + 1.9min + 1.9min +
> 2.0min)! The Storage page in the web UI shows that 48% of the dataset is
> cached, which makes sense given the large Java object overhead and that
> spark.storage.memoryFraction is 0.6 by default.
>
> (III) In the third run, with the same program as the second one, I changed
> "spark.driver.memory" to "2g". The application finishes in just 3.6
> minutes (3.0min + 9s + 9s + 9s)!! And the UI shows that only 6% of the
> data is cached.
>
> From the results we can see that the reduce stages finish in seconds. How
> could that happen with only 6% cached? Can anyone explain?
>
> I am new to Spark and would appreciate any help with this. Thanks!
>
> Jia
>
> --
> Jia Zhan
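For anyone who wants to reproduce this, below is a minimal, self-contained sketch of the iterative part of the experiment (the input path, the "local[8]" master setting, and the object name are placeholders, not from the original program). It uses persist() with an explicit storage level and times each action directly, so the per-iteration numbers can be checked without the web UI:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    object CacheSweep {
      def main(args: Array[String]): Unit = {
        // Note: spark.driver.memory cannot be set here programmatically;
        // by this point the driver JVM has already started, so it has to
        // be passed at launch time (e.g. spark-submit --driver-memory 15g).
        val conf = new SparkConf().setAppName("cacheSweep").setMaster("local[8]")
        val sc = new SparkContext(conf)

        // Placeholder path. persist(MEMORY_ONLY) is equivalent to cache(),
        // but states the storage level explicitly.
        val input = sc.textFile("/InputFiles").persist(StorageLevel.MEMORY_ONLY)

        for (i <- 1 to 3) {
          val start = System.nanoTime()
          val totalLength = input.filter(_.contains("the"))
                                 .map(_.length.toLong)
                                 .reduce(_ + _)
          val secs = (System.nanoTime() - start) / 1e9
          println(f"iteration $i: totalLength=$totalLength, took $secs%.1f s")
        }

        sc.stop()
      }
    }

One thing worth keeping in mind when interpreting the numbers: in local mode the driver and the executor share a single JVM, so spark.driver.memory effectively bounds the block-manager cache as well, which would explain why changing it from 15g to 2g changes the cached fraction shown in the UI.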