Hi,

For our use cases we are looking at 20M x 1M matrices, which are in a similar range to the ones outlined in the paper discussed here:
http://sandeeptata.blogspot.com/2012/12/sparkler-large-scale-matrix.html

Does the exponential runtime growth in Spark ALS outlined in that post still exist in recommendation.ALS?

I am running a Spark cluster of 10 nodes with around 1 TB of total memory and 80 cores. With rank = 50, the memory requirement for ALS should be 20M x 50 doubles on every worker, which is around 8 GB. Even if both factor matrices are cached in memory I should be bounded by roughly 9 GB, yet even with 32 GB per worker I see GC errors. I am debugging the scalability and memory requirements of the algorithm further, but any insights would be very helpful.

There are also two other issues:

1. If GC errors are hit, that worker JVM goes down and I have to restart it manually. Is this expected?

2. When I try to use all 80 cores on the cluster I get a java.io.FileNotFoundException on /tmp/. Is there some OS limit on how many cores can simultaneously access /tmp from one process?

Thanks.
Deb
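P.S. To make the 8 GB figure concrete, this is the back-of-the-envelope sizing I am working from. It is only a rough sketch (nothing Spark-specific, just arithmetic): it counts dense double-precision factor matrices and ignores JVM object overhead, serialized copies, and the intermediate blocks ALS shuffles between workers, which is where I suspect the real blow-up happens.

    // Rough factor-matrix sizing for a 20M x 1M rating matrix at rank 50.
    object FactorSizing extends App {
      val users = 20000000L      // 20M user rows
      val items = 1000000L       // 1M item columns
      val rank  = 50L
      val bytesPerDouble = 8L

      def gib(bytes: Long): Double = bytes / math.pow(1024, 3)

      println(f"user factors: ${gib(users * rank * bytesPerDouble)}%.2f GB")  // ~7.5 GB
      println(f"item factors: ${gib(items * rank * bytesPerDouble)}%.2f GB")  // ~0.4 GB
    }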
On Sun, Mar 16, 2014 at 2:20 PM, Sean Owen <so...@cloudera.com> wrote:

> Good point -- there's been another optimization for ALS in HEAD (https://github.com/apache/spark/pull/131), but yes, the better place to pick up just the essential changes since 0.9.0, including the previous one, is the 0.9 branch.
>
> --
> Sean Owen | Director, Data Science | London
>
>
> On Sun, Mar 16, 2014 at 2:18 PM, Patrick Wendell <pwend...@gmail.com> wrote:
>
>> Sean - was this merged into the 0.9 branch as well? It seems so based on the message from rxin. If so it might make sense to try out the head of branch-0.9 as well, unless there are *also* other changes relevant to this in master.
>>
>> - Patrick
>>
>> On Sun, Mar 16, 2014 at 12:24 PM, Sean Owen <so...@cloudera.com> wrote:
>> > You should simply use a snapshot built from HEAD of github.com/apache/spark if you can. The key change is in MLlib, and with any luck you can just replace that bit. See the PR I referenced.
>> >
>> > Sure, with enough memory you can get it to run even with the memory issue, but it could be hundreds of GB at your scale. Not sure I take the point about the JVM; you can give it 64 GB of heap and executors can use that much, sure.
>> >
>> > You could reduce the number of features a lot to work around it too, or reduce the input size. (If anyone saw my blog post about StackOverflow and ALS -- that's why I snuck in a relatively paltry 40 features and pruned questions with <4 tags :) )
>> >
>> > I don't think jblas has anything to do with it per se, and the allocation fails in Java code, not native code. This should be exactly what that PR I mentioned fixes.
>> >
>> > --
>> > Sean Owen | Director, Data Science | London
>> >
>> >
>> > On Sun, Mar 16, 2014 at 11:48 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>> >>
>> >> Thanks Sean...let me get the latest code...do you know which PR it was?
>> >>
>> >> But will the executors run fine with, say, 32 gb or 64 gb of memory? Doesn't the JVM show issues when the max memory goes beyond a certain limit...
>> >>
>> >> Also, the failure is due to GC limits from jblas...and I was thinking that jblas is going to call native malloc, right? Maybe 64 gb is not a big deal then...I will try increasing to 32 and then 64...
>> >> java.lang.OutOfMemoryError (java.lang.OutOfMemoryError: GC overhead limit exceeded)
>> >>   org.jblas.DoubleMatrix.<init>(DoubleMatrix.java:323)
>> >>   org.jblas.DoubleMatrix.zeros(DoubleMatrix.java:471)
>> >>   org.jblas.DoubleMatrix.zeros(DoubleMatrix.java:476)
>> >>   com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$17.apply(ALSQR.scala:366)
>> >>   com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$17.apply(ALSQR.scala:366)
>> >>   scala.Array$.fill(Array.scala:267)
>> >>   com.verizon.bigdata.mllib.recommendation.ALSQR.updateBlock(ALSQR.scala:366)
>> >>   com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$com$verizon$bigdata$mllib$recommendation$ALSQR$$updateFeatures$2.apply(ALSQR.scala:346)
>> >>   com.verizon.bigdata.mllib.recommendation.ALSQR$$anonfun$com$verizon$bigdata$mllib$recommendation$ALSQR$$updateFeatures$2.apply(ALSQR.scala:345)
>> >>   org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:32)
>> >>   org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:32)
>> >>   scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
>> >>   org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:149)
>> >>   org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:147)
>> >>   scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>> >>   scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>> >>   org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:147)
>> >>   org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> >>   org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> >>   org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:32)
>> >>   org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> >>   org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> >>   org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:32)
>> >>   org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> >>   org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> >>   org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
>> >>   org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:242)
>> >>   org.apache.spark.rdd.RDD.iterator(RDD.scala:233)
>> >>   org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
>> >>   org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
>> >>   org.apache.spark.scheduler.Task.run(Task.scala:53)
>> >>   org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:213)
>> >>
>> >>
>> >> On Sun, Mar 16, 2014 at 11:42 AM, Sean Owen <so...@cloudera.com> wrote:
>> >>>
>> >>> Are you using HEAD or 0.9.0? I know there was a memory issue fixed a few weeks ago that made ALS need a lot more memory than it should.
>> >>>
>> >>> https://github.com/apache/incubator-spark/pull/629
>> >>>
>> >>> Try the latest code.
>> >>>
>> >>> --
>> >>> Sean Owen | Director, Data Science | London
>> >>>
>> >>>
>> >>> On Sun, Mar 16, 2014 at 11:40 AM, Debasish Das <debasish.da...@gmail.com> wrote:
>> >>>>
>> >>>> Hi,
>> >>>>
>> >>>> I gave my spark job 16 gb of memory and it is running on 8 executors.
>> >>>>
>> >>>> The job needs more memory due to ALS requirements (20M x 1M matrix).
>> >>>>
>> >>>> On each node I do have 96 gb of memory and I am using 16 gb out of it. I want to increase the memory but I am not sure what the right way to do that is...
>> >>>>
>> >>>> On 8 executors, if I give 96 gb each it might be an issue due to GC...
>> >>>>
>> >>>> Ideally on 8 nodes, I would run with 48 executors and each executor would get 16 gb of memory...total 48 JVMs...
>> >>>> Is it possible to increase executors per node?
>> >>>>
>> >>>> Thanks.
>> >>>> Deb
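For reference, this is roughly how I plan to bump the per-executor memory when I retry. It is only a minimal sketch against the SparkConf API; the 32g value and the local-dir path are placeholders I made up for illustration, not what the cluster currently uses:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch only: raise executor heap and move shuffle/spill files off /tmp.
    val conf = new SparkConf()
      .setAppName("ALS-20Mx1M")
      .set("spark.executor.memory", "32g")        // heap per executor JVM
      .set("spark.local.dir", "/data/spark-tmp")  // placeholder path, to avoid filling /tmp

    val sc = new SparkContext(conf)

My understanding is that under the standalone manager the number of executor JVMs per machine is controlled by SPARK_WORKER_INSTANCES in conf/spark-env.sh (with SPARK_WORKER_MEMORY split across them), but I would appreciate confirmation that this is the right knob.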