Thanks, Shao

On Wed, Mar 18, 2015 at 3:34 PM, Shao, Saisai <saisai.s...@intel.com> wrote:
> Yeah, as I said, your job processing time is much larger than the sliding
> window, and streaming jobs are executed one by one in sequence, so the next
> job will wait until the first job is finished, and the total latency will
> accumulate.
>
> I think you need to identify the bottleneck of your job first. If the
> shuffle is slow, you could enlarge the shuffle fraction of memory to reduce
> the spill, but ultimately the shuffle data will be written to disk; this
> cannot be disabled, unless you mount your spark.tmp.dir on a ramdisk.

I have increased spark.shuffle.memoryFraction to 0.8, which I can confirm
from the Spark UI's environment variables. But spilling happens even from the
start, when latency is still below the slide window (I changed it to 10
seconds). The shuffle data written to disk is a real snowball effect; it
slows everything down eventually.

I noticed that the files spilled to disk are all very small in size but huge
in number:

total 344K
drwxr-xr-x  2 root root 4.0K Mar 18 16:55 .
drwxr-xr-x 66 root root 4.0K Mar 18 16:39 ..
-rw-r--r--  1 root root  80K Mar 18 16:54 shuffle_47_519_0.data
-rw-r--r--  1 root root  75K Mar 18 16:54 shuffle_48_419_0.data
-rw-r--r--  1 root root  36K Mar 18 16:54 shuffle_48_518_0.data
-rw-r--r--  1 root root  69K Mar 18 16:55 shuffle_49_319_0.data
-rw-r--r--  1 root root  330 Mar 18 16:55 shuffle_49_418_0.data
-rw-r--r--  1 root root  65K Mar 18 16:55 shuffle_49_517_0.data

MemoryStore says:

15/03/18 17:59:43 WARN MemoryStore: Failed to reserve initial memory threshold of 1024.0 KB for computing block rdd_1338_2 in memory.
15/03/18 17:59:43 WARN MemoryStore: Not enough space to cache rdd_1338_2 in memory! (computed 512.0 B so far)
15/03/18 17:59:43 INFO MemoryStore: Memory use = 529.0 MB (blocks) + 0.0 B (scratch space shared across 0 thread(s)) = 529.0 MB. Storage limit = 529.9 MB.

Not enough space even for 512 bytes??
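As a quick way to quantify the "small files, huge numbers" symptom, here is a
sketch of mine (not from the thread) that counts the spilled shuffle files and
their total size under a Spark local directory. The temp directory and dummy
files below exist only so the snippet runs self-contained; in practice you
would point `dir` at your real spark.local.dir (by default somewhere under
/tmp).

```shell
# Sketch: count shuffle spill files and sum their sizes.
# The mktemp dir and dd'd dummy files stand in for a real spark.local.dir.
dir=$(mktemp -d)
dd if=/dev/zero of="$dir/shuffle_47_519_0.data" bs=1024 count=80 2>/dev/null
dd if=/dev/zero of="$dir/shuffle_49_418_0.data" bs=1 count=330 2>/dev/null

count=$(find "$dir" -name 'shuffle_*.data' | wc -l | tr -d ' ')
bytes=$(find "$dir" -name 'shuffle_*.data' -exec cat {} + | wc -c | tr -d ' ')
echo "shuffle files: $count, total bytes: $bytes"
# prints: shuffle files: 2, total bytes: 82250
```

Watching `count` grow across batches while the per-file sizes stay tiny is a
cheap way to confirm the snowball effect described above.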
The executors still have plenty of free memory:

0         slave1:40778   0    0.0 B / 529.9 MB    0.0 B  16  0  15047  15063  2.17 h  0.0 B     402.3 MB  768.0 B
1         slave2:50452   0    0.0 B / 529.9 MB    0.0 B  16  0  14447  14463  2.17 h  0.0 B     388.8 MB  1248.0 B
1         lvs02:47325    116  27.6 MB / 529.9 MB  0.0 B  8   0  58169  58177  3.16 h  893.5 MB  624.0 B   1189.9 MB
<driver>  lvs02:47041    0    0.0 B / 529.9 MB    0.0 B  0   0  0      0      0 ms    0.0 B     0.0 B     0.0 B

> Besides, if CPU or network is the bottleneck, you might need to add more
> resources to your cluster.

3 dedicated servers, each with a 16-core CPU, 16 GB of memory, and gigabit
networking. CPU load is quite low (about 1-3 in top), and network usage is far
from saturated. I don't even do any useful complex calculations in this small
simple app yet.
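For what it's worth, one thing I may still try for the many-tiny-files
symptom (my own idea, not something confirmed in this thread): Spark 1.x's
hash-based shuffle can be asked to consolidate its per-map output files, and
the local directory can be pointed at a tmpfs mount as Saisai suggested. A
sketch of the relevant conf/spark-defaults.conf lines, assuming Spark 1.x
property names (note it is spark.local.dir; "spark.tmp.dir" above looks like
a typo) and a hypothetical /mnt/ramdisk tmpfs mount:

```
# conf/spark-defaults.conf -- sketch, Spark 1.x property names
spark.shuffle.consolidateFiles  true          # merge per-map hash-shuffle output files
spark.shuffle.memoryFraction    0.8           # already set; shown for completeness
spark.local.dir                 /mnt/ramdisk  # hypothetical tmpfs mount point
```

Consolidation would not reduce the amount of data spilled, but it should cut
the sheer number of small files the listing above shows.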