Hi Antony, Is it easy for you to try Spark 1.3.0 or master? The ALS
performance should be improved in 1.3.0. -Xiangrui
On Fri, Feb 20, 2015 at 1:32 PM, Antony Mayi
antonym...@yahoo.com.invalid wrote:
Hi Ilya,
thanks for your insight, this was the right clue. I had default parallelism
already
No problem, Antony. ML lib is tricky! I'd love to chat with you about your
use case - sounds like we're working on similar problems/scales.
On Fri, Feb 20, 2015 at 1:55 PM Xiangrui Meng men...@gmail.com wrote:
Hi Antony, Is it easy for you to try Spark 1.3.0 or master? The ALS
performance
Hi Ilya,
thanks for your insight, this was the right clue. I had default parallelism
already set but it was quite low (hundreds) and moreover the number of
partitions of the input RDD was low as well so the chunks were really too big.
Increased parallelism and repartitioning seems to be
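The repartitioning fix described above can be sketched as simple arithmetic: pick a partition count from the dataset size and a target chunk size, so no single chunk is "really too big". The numbers below are hypothetical, not from the thread:

```python
def partitions_for(total_ratings, target_per_partition=1_000_000):
    """Ceiling division: enough partitions that no chunk exceeds the target.

    target_per_partition is a hypothetical illustrative value; the right
    target depends on record size and available task memory.
    """
    return max(1, -(-total_ratings // target_per_partition))

# hypothetical: 3 billion ratings at ~1M ratings per partition
print(partitions_for(3_000_000_000))  # 3000
```

The resulting count would then be passed to something like `rdd.repartition(n)` before training, per the advice in this thread.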
This should result in 4 executors, not 25. They should be able to
execute 4*4 = 16 tasks simultaneously. You have them grab 4*32 = 128GB
of RAM, not 1TB.
It still feels like this shouldn't be running out of memory, though, not by a
long shot. But just pointing out potential differences between
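The sizing arithmetic in the reply above can be checked with a quick back-of-envelope script. The numbers come from the thread's config; the one-executor-per-node assumption is the reply's, not a Spark guarantee:

```python
# Sizing from the thread: 4 nodes, spark.executor.cores=4,
# spark.executor.memory=28G, spark.yarn.executor.memoryOverhead=4096 (MB).
nodes = 4
executor_cores = 4
executor_heap_gb = 28
overhead_gb = 4  # 4096 MB

executors = nodes  # the reply assumes one executor per node
concurrent_tasks = executors * executor_cores
total_ram_gb = executors * (executor_heap_gb + overhead_gb)

print(executors, concurrent_tasks, total_ram_gb)  # 4 16 128
```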
Hi,
I have 4 very powerful boxes (256GB RAM, 32 cores each). I am running Spark
1.2.0 in yarn-client mode with the following layout:
spark.executor.cores=4
spark.executor.memory=28G
spark.yarn.executor.memoryOverhead=4096
I am submitting bigger ALS trainImplicit task (rank=100, iters=15) on a
Based on the Spark UI I am definitely running 25 executors. Why would you expect
four? I submit the task with --num-executors 25 and get 6-7 executors running
per host (using more, smaller executors gives me better cluster utilization
when running parallel Spark sessions (which is not the case
Oh OK you are saying you are requesting 25 executors and getting them,
got it. You can consider making fewer, bigger executors to pool rather
than split up your memory, but at some point it becomes
counter-productive. 32GB is a fine executor size.
So you have ~8GB available per task which seems
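The "~8GB available per task" figure follows from dividing one executor's total YARN allocation by its task slots. This is a rough upper bound; the JVM heap actually usable by a task is less:

```python
executor_heap_gb = 28   # spark.executor.memory
overhead_gb = 4         # spark.yarn.executor.memoryOverhead (4096 MB)
cores_per_executor = 4  # spark.executor.cores == concurrent task slots

per_task_gb = (executor_heap_gb + overhead_gb) / cores_per_executor
print(per_task_gb)  # 8.0
```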
It is from within the ALS.trainImplicit() call. BTW the exception varies
between this "GC overhead limit exceeded" and "Java heap space" (which I guess
is just a different outcome of the same problem).
Just tried another run and here are the logs (filtered) - note I tried this run
with
now, with spark.shuffle.io.preferDirectBufs reverted (to true), I am again
getting GC overhead limit exceeded:
=== spark stdout ===
15/02/19 12:08:08 WARN scheduler.TaskSetManager: Lost task 7.0 in stage 18.0
(TID 5329, 192.168.1.93): java.lang.OutOfMemoryError: GC overhead limit
exceeded at
Hi Antony - you are seeing a problem that I ran into. The underlying issue
is your default parallelism setting. What's happening is that within ALS
certain RDD operations end up changing the number of partitions of your
data. For example, if you start with an RDD of 300 partitions, unless
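One concrete mitigation discussed in this thread is raising the default parallelism so shuffles produce more, smaller partitions, e.g. in spark-defaults.conf (the value below is illustrative, not a recommendation from the thread; as noted above, "hundreds" proved too low at this scale):

```
spark.default.parallelism   2048
```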