Hello, I have an issue where my Spark code is using too much memory in the final step (a count for testing purposes; it will write the result to a db once it works). I'm really not sure how to break down that last step so it uses less RAM.
So basically, my data is log lines, and each log line has a session id. I want to group by session to reconstruct the events of each session for BI purposes. My steps are:

- Load the log lines.
- Map each log line to a (K, V) pair keyed by session id.
- groupByKey.
- Do a final map over each group to rebuild the session.
- Count to trigger everything.

That did not work at all: I let it run for 35 minutes and all it was doing was disk reads/writes, every CPU was blocked on IO wait, and I had 1% free memory. So I thought I could help by reading my log lines in chunks of 1,200,000 lines and doing a groupByKey on each subset; once everything was done, I would combine all the RDDs with "+" and do a final groupByKey pass. The result is the same: heavy disk swapping, 1% memory left, and all the CPUs stuck in IO wait. The chunked version looks like:

- Load a subset.
- Map each log line to a (K, V) pair keyed by session id.
- groupByKey.
- Add all the subset RDDs together.
- Do a final groupByKey.
- Count.

I can post the code if it would help, but there's a lot of code confusing the issue that's used to extract the logs from mongodb with a flatMap; a simplified sketch of both pipelines is at the end of this message.

This is the memory usage of each process. It's a problem because the machine only has 12 GB of RAM:

   VIRT     RES   SHR  S  %CPU  TIME+    COMMAND
3378712  2.646g   700  D   0.3  0:21.30  python
3377568  2.566g   700  D   0.0  0:20.80  python
3374984  2.485g   700  D   0.0  0:20.29  python
3375588  2.449g   700  D   0.3  0:20.62  python
3495560  206908  3920  S   1.3  0:45.36  java

If I look at "free", it's the same story: there's no memory left to reclaim from buffers/cache, and the job is already deep into swap:

                       total      used     free  shared  buffers  cached
Mem:                12305524  12159320   146204      20     1072   29036
-/+ buffers/cache:            12129212   176312
Swap:                5857276   3885296  1971980

In the screenshot below, you can see the step it's stuck at. The substeps come in groups of 4 because I break each chunk down into blocks of 4.

<http://apache-spark-user-list.1001560.n3.nabble.com/file/n10134/issue.png>
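Here is a simplified sketch of what the two pipelines look like in PySpark. Everything in it is a placeholder standing in for the real code (the textFile path, to_pair, rebuild_session, and load_chunks are made up for illustration; the real job pulls the lines out of mongodb with a flatMap):

from pyspark import SparkContext

sc = SparkContext(appName="session-rebuild")

# Placeholder load: the real job flatMaps the log lines out of mongodb.
lines = sc.textFile("hdfs:///logs/loglines")

def to_pair(line):
    # Placeholder parse: key each log line by its session id.
    session_id = line.split(",")[0]
    return (session_id, line)

def rebuild_session(session_id, events):
    # Placeholder rebuild: the real job reconstructs the ordered
    # event sequence of the session for BI.
    return (session_id, list(events))

# First version: one groupByKey over the whole dataset.
sessions = (lines
            .map(to_pair)                        # (session_id, log_line)
            .groupByKey()                        # all lines of one session together
            .map(lambda kv: rebuild_session(*kv)))
print(sessions.count())                          # the step that never finishes

And the chunked version:

# Chunked version: groupByKey each 1,200,000-line subset first,
# then union the grouped RDDs with "+" and re-group.
grouped_chunks = []
for chunk in load_chunks():                      # placeholder chunk loader
    grouped_chunks.append(chunk.map(to_pair).groupByKey())

combined = grouped_chunks[0]
for g in grouped_chunks[1:]:
    combined = combined + g                      # "+" is union for RDDs

final = combined.groupByKey()                    # regroups the per-chunk groups
print(final.count())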