Please see the inline comments. Thanks, Jerry
From: Darren Hoo [mailto:darren....@gmail.com]
Sent: Wednesday, March 18, 2015 9:30 PM
To: Shao, Saisai
Cc: user@spark.apache.org; Akhil Das
Subject: Re: [spark-streaming] can shuffle write to disk be disabled?

On Wed, Mar 18, 2015 at 8:31 PM, Shao, Saisai <saisai.s...@intel.com> wrote:

> > From the log you pasted I think this (-rw-r--r-- 1 root root 80K Mar 18
> > 16:54 shuffle_47_519_0.data) is not shuffle spilled data, but the final
> > shuffle result.
>
> Why is the shuffle result written to disk?

This is Spark's internal mechanism. As I said, do you think the shuffle is
the bottleneck that makes your job run slowly?

> I am quite new to Spark, so I am just making wild guesses. What further
> information should I provide that could help find the real bottleneck?

You can monitor the system metrics as well as the JVM; the information in
the web UI is also very useful. Maybe you should identify the cause first.
Besides, from the log it looks like there is not enough memory to cache the
data; maybe you should increase the executor memory size.

> Running two executors, the memory usage is quite low:
>
>   executor 0   8.6 MB / 4.1 GB
>   executor 1   23.9 MB / 4.1 GB
>   <driver>     0.0 B / 529.9 MB
>
> Submitted with args: --executor-memory 8G --num-executors 2 --driver-memory 1G
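For context, the submit arguments quoted in the thread correspond to a spark-submit invocation roughly like the following. This is a sketch only: the master URL, main class, and application jar are hypothetical placeholders, not taken from the thread; only the three memory/executor flags appear in the original message.

```shell
# Sketch of the submission described above. Placeholders: the master,
# class, and jar names are illustrative; only the last three flags
# come from the thread.
spark-submit \
  --master yarn \
  --class com.example.StreamingApp \
  --executor-memory 8G \
  --num-executors 2 \
  --driver-memory 1G \
  streaming-app.jar
```

One thing worth noting about the numbers in the thread: the "4.1 GB" shown in the web UI is the storage (cache) memory pool, not the full 8G executor heap. In Spark 1.x this pool defaults to a fraction of the heap (spark.storage.memoryFraction, default 0.6, further reduced by a safety fraction), which is why it is well below the configured 8G.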