Hello,

I have an issue where my Spark code uses too much memory in the final
step (a count for testing purposes; it will write the result to a DB once
it works). I'm really not sure how I can break down that last step to use
less RAM.

Basically, my data is log lines, and each log line has a session ID. I
want to group by session to reconstruct the events of each session for BI
purposes.

So my steps are (a minimal sketch follows the list):

- Load the log lines
- Map each log line to a (K, V) pair
- groupByKey
- Do a final map on the grouped values to rebuild each session
- Do a count to trigger everything
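
Here is a minimal PySpark sketch of that pipeline, just to make the shape of
the job concrete. The input path, the parse_line() helper, and the idea that
the second field is a timestamp are assumptions on my part; the real code
pulls the logs out of MongoDB instead.

from pyspark import SparkContext

sc = SparkContext(appName="session-rebuild")

def parse_line(line):
    # hypothetical parser: first field is the session id, the rest is the event
    fields = line.split("\t")
    return (fields[0], fields)

def rebuild_session(kv):
    session_id, events = kv
    # assume the second field is a timestamp and order the events by it
    return (session_id, sorted(events, key=lambda e: e[1]))

lines = sc.textFile("hdfs:///logs/*.log")    # load the log lines
pairs = lines.map(parse_line)                # one (K, V) per log line
grouped = pairs.groupByKey()                 # all events of a session together
sessions = grouped.map(rebuild_session)      # rebuild each session
print(sessions.count())                      # count to trigger the job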

That did not work at all. I let it run for 35 minutes, and all it was doing
was disk reads/writes: every CPU was blocked on I/O wait and I had 1% free
memory.

So I thought I could help by reading my log lines in chunks of 1,200,000
lines and THEN doing a groupByKey on each subset. After everything was
done, I would just combine all the RDDs with "+" and do a final groupByKey
pass. The result is still the same: heavy disk swapping, 1% memory left, and
all the CPUs stuck in I/O wait.

It looks like this (sketch after the list):
- Load a subset
- Map each log line to a (K, V) pair
- groupByKey
- Add all the subset RDDs together
- Do a final groupByKey
- Do a count
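
For reference, here is roughly what that variant looks like in PySpark,
reusing the same hypothetical parse_line() as above; "+" on RDDs is just
union(). The chunks variable is a stand-in for however the subsets are
actually read.

chunk_rdds = []
for chunk in chunks:                          # each chunk is ~1,200,000 log lines
    grouped = sc.parallelize(chunk).map(parse_line).groupByKey()
    chunk_rdds.append(grouped)

combined = chunk_rdds[0]
for rdd in chunk_rdds[1:]:
    combined = combined + rdd                 # "+" is RDD.union in PySpark

# note: after the per-chunk groupByKey the values are already iterables,
# so this final groupByKey groups iterables of events, not single events
final = combined.groupByKey()
print(final.count())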

I can post the code if it would help, but there is a lot of code that would
confuse the issue: it's used to extract the logs from MongoDB with a flatMap.


This is the memory usage of each process. It's a problem because I only have
12 GB of RAM on that machine:
     VIRT    RES    SHR S  %CPU     TIME+ COMMAND
 3378712 2.646g    700 D   0.3   0:21.30 python
 3377568 2.566g    700 D   0.0   0:20.80 python
 3374984 2.485g    700 D   0.0   0:20.29 python
 3375588 2.449g    700 D   0.3   0:20.62 python
 3495560 206908   3920 S   1.3   0:45.36 java

If I look at memory and swap with "free", it's the same thing: there is no
memory left, and almost nothing in buffers/cache that could be reclaimed.

              total       used       free     shared    buffers     cached
Mem:      12305524   12159320     146204         20       1072      29036
-/+ buffers/cache:   12129212     176312
Swap:      5857276    3885296    1971980


In the screenshot below, you can see the step it's stuck at. The substeps
come in groups of 4 because I break each sub-chunk into blocks of 4.
<http://apache-spark-user-list.1001560.n3.nabble.com/file/n10134/issue.png> 



