Hi,

For our use cases we are looking at 20M x 1M matrices, which are in a similar range to the ones outlined in the paper discussed here:
http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html

Does the exponential runtime growth in Spark ALS outlined in that post still exist in recommendation.ALS?

I am running a Spark cluster of 10 nodes with around 1 TB of total memory and 80 cores. With rank = 50, the memory requirement for ALS should be 20M x 50 doubles on every worker, which is around 8 GB. Even if both factor matrices are cached in memory I should be bounded by roughly 9 GB, yet even with 32 GB per worker I see GC errors. I am debugging the scalability and memory requirements of the algorithm further, but any insights would be very helpful.

There are also two other issues:

1. If GC errors are hit, that worker JVM goes down and I have to restart it manually. Is this expected?

2. When I try to use all 80 cores on the cluster I get a java.io.FileNotFoundException on /tmp/. Is there some OS limit on how many cores can simultaneously access /tmp from one process?

Thanks.
Deb
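P.S. To make the 8 GB figure concrete, this is the back-of-the-envelope sizing I am working from. It is only a rough sketch (nothing Spark-specific, just arithmetic): it counts dense double-precision factor matrices and ignores JVM object overhead, serialized copies, and the intermediate blocks ALS shuffles between workers, which is where I suspect the real blow-up happens.

    // Rough factor-matrix sizing for a 20M x 1M rating matrix at rank 50.
    object FactorSizing extends App {
      val users = 20000000L      // 20M user rows
      val items = 1000000L       // 1M item columns
      val rank  = 50L
      val bytesPerDouble = 8L

      def gib(bytes: Long): Double = bytes / math.pow(1024, 3)

      println(f"user factors: ${gib(users * rank * bytesPerDouble)}%.2f GB")  // ~7.5 GB
      println(f"item factors: ${gib(items * rank * bytesPerDouble)}%.2f GB")  // ~0.4 GB
    }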
On Sun, Mar 16, 2014 at 2:20 PM, Sean Owen <so...@cloudera.com> wrote:

> Good point -- there's been another optimization for ALS in HEAD (https://github.com/apache/spark/pull/131), but yes, the better place to pick up just the essential changes since 0.9.0, including the previous one, is the 0.9 branch.
>
> --
> Sean Owen | Director, Data Science | London
>
>
> On Sun, Mar 16, 2014 at 2:18 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>
>> Sean - was this merged into the 0.9 branch as well? It seems so based on the message from rxin. If so it might make sense to try out the head of branch-0.9 as well, unless there are *also* other changes relevant to this in master.
>>
>> - Patrick
>>
>> On Sun, Mar 16, 2014 at 12:24 PM, Sean Owen <so...@cloudera.com> wrote:
>> > You should simply use a snapshot built from HEAD of github.com/apache/spark if you can. The key change is in MLlib, and with any luck you can just replace that bit. See the PR I referenced.
>> >
>> > Sure, with enough memory you can get it to run even with the memory issue, but it could be hundreds of GB at your scale. Not sure I take the point about the JVM; you can give it 64 GB of heap and executors can use that much, sure.
>> >
>> > You could reduce the number of features a lot to work around it too, or reduce the input size. (If anyone saw my blog post about StackOverflow and ALS -- that's why I snuck in a relatively paltry 40 features and pruned questions with <4 tags :) )
>> >
>> > I don't think jblas has anything to do with it per se, and the allocation fails in Java code, not native code. This should be exactly what that PR I mentioned fixes.
>> >
>> > --
>> > Sean Owen | Director, Data Science | London
>> >
>> >
>> > On Sun, Mar 16, 2014 at 11:48 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>> >>
>> >> Thanks Sean...let me get the latest code...do you know which PR it was?
>> >>
>> >> But will the executors run fine with, say, 32 gb or 64 gb of memory? Doesn't the JVM show issues when the max memory goes beyond a certain limit...
>> >>
>> >> Also, the failure is due to GC limits from jblas...and I was thinking that jblas is going to call native malloc, right? Maybe 64 gb is not a big deal then...I will try increasing to 32 and then 64...
>> >> java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)
>> >>   org.jblas.DoubleMatrix.<init>(DoubleMatrix.java:323)
>> >>   org.jblas.DoubleMatrix.zeros(DoubleMatrix.java:471)
>> >>   org.jblas.DoubleMatrix.zeros(DoubleMatrix.java:476)
>> >>   com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$17.apply(ALSQR.scala:366)
>> >>   com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$17.apply(ALSQR.scala:366)
>> >>   scala.Array$.fill(Array.scala:267)
>> >>   com.verizon.bigdata.mllib.recommendation.ALSQR.updateBlock(ALSQR.scala:366)
>> >>   com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$com$verizon$bigdata$mllib$recommendation$ALSQR$$updateFeatures$2.apply(ALSQR.scala:346)
>> >>   com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$com$verizon$bigdata$mllib$recommendation$ALSQR$$updateFeatures$2.apply(ALSQR.scala:345)
>> >>   org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:32)
>> >>   org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:32)
>> >>   scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> >>   org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:149)
>> >>   org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:147)
>> >>   scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> >>   scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> >>   org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:147)
>> >>   org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> >>   org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> >>   org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:32)
>> >>   org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> >>   org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> >>   org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:32)
>> >>   org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> >>   org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> >>   org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>> >>   org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> >>   org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> >>   org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>> >>   org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>> >>   org.apache.spark.scheduler.Task.run(Task.scala:53)
>> >>   org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>> >>
>> >>
>> >> On Sun, Mar 16, 2014 at 11:42 AM, Sean Owen <so...@cloudera.com> wrote:
>> >>>
>> >>> Are you using HEAD or 0.9.0? I know there was a memory issue fixed a few weeks ago that made ALS need a lot more memory than it should.
>> >>>
>> >>> https://github.com/apache/incubator-spark/pull/629
>> >>>
>> >>> Try the latest code.
>> >>>
>> >>> --
>> >>> Sean Owen | Director, Data Science | London
>> >>>
>> >>>
>> >>> On Sun, Mar 16, 2014 at 11:40 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I gave my spark job 16 gb of memory and it is running on 8 executors.
>> >>>>
>> >>>> The job needs more memory due to ALS requirements (20M x 1M matrix).
>> >>>>
>> >>>> On each node I do have 96 gb of memory and I am using 16 gb out of it. I want to increase the memory but I am not sure what the right way to do that is...
>> >>>>
>> >>>> On 8 executors, if I give 96 gb each it might be an issue due to GC...
>> >>>>
>> >>>> Ideally on 8 nodes, I would run with 48 executors and each executor would get 16 gb of memory...total 48 JVMs...
>> >>>> Is it possible to increase executors per node?
>> >>>>
>> >>>> Thanks.
>> >>>> Deb
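For reference, this is roughly how I plan to bump the per-executor memory when I retry. It is only a minimal sketch against the SparkConf API; the 32g value and the local-dir path are placeholders I made up for illustration, not what the cluster currently uses:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: raise executor heap and move shuffle/spill files off /tmp.
    val conf = new SparkConf()
      .setAppName("ALS-20Mx1M")
      .set("spark.executor.memory", "32g")        // heap per executor JVM
      .set("spark.local.dir", "/data/spark-tmp")  // placeholder path, to avoid filling /tmp

    val sc = new SparkContext(conf)

My understanding is that under the standalone manager the number of executor JVMs per machine is controlled by SPARK_WORKER_INSTANCES in conf/spark-env.sh (with SPARK_WORKER_MEMORY split across them), but I would appreciate confirmation that this is the right knob.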