Re: loads of memory still GC overhead limit exceeded

2015-02-20 Thread Xiangrui Meng
Hi Antony, Is it easy for you to try Spark 1.3.0 or master? The ALS performance should be improved in 1.3.0. -Xiangrui

On Fri, Feb 20, 2015 at 1:32 PM, Antony Mayi antonym...@yahoo.com.invalid wrote:
> Hi Ilya, thanks for your insight, this was the right clue. I had default parallelism already

Re: loads of memory still GC overhead limit exceeded

2015-02-20 Thread Ilya Ganelin
No problem, Antony. MLlib is tricky! I'd love to chat with you about your use case - sounds like we're working on similar problems/scales.

On Fri, Feb 20, 2015 at 1:55 PM Xiangrui Meng men...@gmail.com wrote:
> Hi Antony, Is it easy for you to try Spark 1.3.0 or master? The ALS performance

Re: loads of memory still GC overhead limit exceeded

2015-02-20 Thread Antony Mayi
Hi Ilya, thanks for your insight, this was the right clue. I had default parallelism already set, but it was quite low (hundreds), and moreover the number of partitions of the input RDD was low as well, so the chunks were really too big. Increased parallelism and repartitioning seems to be
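
A minimal PySpark sketch of the fix described above, assuming Spark 1.2-era MLlib APIs; the input path, the parsing, and the partition count of 2048 are illustrative placeholders, not values from the thread:

    from pyspark import SparkConf, SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    # Raise default parallelism so the shuffles inside ALS keep many small
    # partitions instead of a few huge ones (2048 is illustrative).
    conf = SparkConf().set("spark.default.parallelism", "2048")
    sc = SparkContext(conf=conf, appName="als-repartitioned")

    ratings = (sc.textFile("hdfs:///path/to/ratings")  # hypothetical input path
                 .map(lambda line: line.split(","))
                 .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2])))
                 .repartition(2048))  # shrink per-partition chunks before training

    model = ALS.trainImplicit(ratings, rank=100, iterations=15)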

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Sean Owen
This should result in 4 executors, not 25. They should be able to execute 4*4 = 16 tasks simultaneously. You have them grab 4*32 = 128GB of RAM, not 1TB. It still feels like this shouldn't be running out of memory, not by a long shot. But I'm just pointing out potential differences between
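
The arithmetic behind those numbers, assuming one 4-core executor per box:

    # 4 boxes with spark.executor.cores=4 => 4 executors of 4 cores each
    executors = 4
    cores_per_executor = 4
    heap_gb, overhead_gb = 28, 4  # spark.executor.memory + memoryOverhead

    print(executors * cores_per_executor)       # 16 tasks running at once
    print(executors * (heap_gb + overhead_gb))  # 128 GB of RAM grabbed, not 1 TB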

loads of memory still GC overhead limit exceeded

2015-02-19 Thread Antony Mayi
Hi, I have 4 very powerful boxes (256GB RAM, 32 cores each). I am running Spark 1.2.0 in yarn-client mode with the following layout:

spark.executor.cores=4
spark.executor.memory=28G
spark.yarn.executor.memoryOverhead=4096

I am submitting a bigger ALS trainImplicit task (rank=100, iters=15) on a
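
A hedged reconstruction of this setup in PySpark; the script name, input path, and rating parsing are placeholders, and only the conf values and ALS parameters come from the message:

    # Submitted in yarn-client mode, e.g.:
    #   spark-submit --master yarn-client \
    #     --conf spark.executor.cores=4 \
    #     --conf spark.executor.memory=28G \
    #     --conf spark.yarn.executor.memoryOverhead=4096 \
    #     als_train.py  # hypothetical script name
    from pyspark import SparkContext
    from pyspark.mllib.recommendation import ALS, Rating

    sc = SparkContext(appName="als-implicit")
    ratings = (sc.textFile("hdfs:///path/to/ratings")  # hypothetical input path
                 .map(lambda line: line.split(","))
                 .map(lambda p: Rating(int(p[0]), int(p[1]), float(p[2]))))
    model = ALS.trainImplicit(ratings, rank=100, iterations=15)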

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Antony Mayi
Based on the Spark UI I am running 25 executors for sure; why would you expect four? I submit the task with --num-executors 25 and I get 6-7 executors running per host (using more, smaller executors gives me better cluster utilization when running parallel Spark sessions (which is not the case
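
For reference, on YARN the --num-executors flag corresponds to the spark.executor.instances property, so the layout described here could equivalently be set in code; a sketch using the values from this thread:

    from pyspark import SparkConf, SparkContext

    conf = (SparkConf()
            .set("spark.executor.instances", "25")  # same as --num-executors 25
            .set("spark.executor.cores", "4")
            .set("spark.executor.memory", "28g")
            .set("spark.yarn.executor.memoryOverhead", "4096"))
    sc = SparkContext(conf=conf, appName="als-implicit")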

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Sean Owen
Oh, OK, you are saying you are requesting 25 executors and getting them; got it. You can consider making fewer, bigger executors to pool rather than split up your memory, but at some point it becomes counter-productive. 32GB is a fine executor size. So you have ~8GB available per task, which seems
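
The per-task figure works out as follows, assuming a 32GB executor (28GB heap plus 4GB overhead) running 4 concurrent tasks:

    executor_gb = 28 + 4  # spark.executor.memory + memoryOverhead
    concurrent_tasks = 4  # one task per executor core
    print(executor_gb / float(concurrent_tasks))  # ~8 GB available per task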

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Antony Mayi
It is from within the ALS.trainImplicit() call. BTW, the exception varies between this GC overhead limit exceeded and Java heap space (which I guess is just a different outcome of the same problem). I just tried another run and here are the logs (filtered) - note I tried this run with
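
Both messages are java.lang.OutOfMemoryError variants from the same heap pressure; one way to watch the collector thrash before either is thrown (not something suggested in the thread) is to enable GC logging on the executors with standard JVM flags:

    from pyspark import SparkConf, SparkContext

    # -verbose:gc and friends are plain JVM options passed through Spark.
    conf = SparkConf().set(
        "spark.executor.extraJavaOptions",
        "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")
    sc = SparkContext(conf=conf, appName="als-gc-logging")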

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Antony Mayi
Now with spark.shuffle.io.preferDirectBufs reverted (to true) I am again getting GC overhead limit exceeded:

=== spark stdout ===
15/02/19 12:08:08 WARN scheduler.TaskSetManager: Lost task 7.0 in stage 18.0 (TID 5329, 192.168.1.93): java.lang.OutOfMemoryError: GC overhead limit exceeded
        at
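
For reference, the property being toggled here defaults to true, meaning the Netty shuffle transport prefers direct (off-heap) buffers; a minimal sketch of setting it programmatically:

    from pyspark import SparkConf, SparkContext

    # false makes the Netty shuffle prefer heap buffers over direct ones.
    conf = SparkConf().set("spark.shuffle.io.preferDirectBufs", "false")
    sc = SparkContext(conf=conf, appName="als-heap-buffers")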

Re: loads of memory still GC overhead limit exceeded

2015-02-19 Thread Ilya Ganelin
Hi Antony - you are seeing a problem that I ran into. The underlying issue is your default parallelism setting. What's happening is that within ALS, certain RDD operations end up changing the number of partitions of your data. For example, if you start with an RDD of 300 partitions, unless
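
A small sketch of the knob in question, assuming PySpark (the numbers are illustrative): spark.default.parallelism is what shuffle operations fall back to for their partition count, which is why pinning it high keeps the chunks inside ALS small.

    from pyspark import SparkConf, SparkContext

    # With default parallelism pinned, shuffle outputs keep 300 partitions.
    conf = SparkConf().set("spark.default.parallelism", "300")
    sc = SparkContext(conf=conf, appName="partition-check")

    rdd = sc.parallelize(range(100000), 300)
    print(rdd.getNumPartitions())      # 300
    grouped = rdd.map(lambda x: (x % 10, x)).groupByKey()  # a shuffle stage
    print(grouped.getNumPartitions())  # 300, taken from spark.default.parallelism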