Re: ALS array index out of bound with 50 factors
Hi Xiangrui, With 4 ALS iterations it runs fine... If I run 10 I am failing... I believe I have to cut the lineage chain and call checkpoint... Trying to follow the other email chain on checkpointing... Thanks. Deb On Sun, Apr 6, 2014 at 9:08 PM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, Are you using the master branch or a particular commit? Do you have negative or out-of-integer-range user or product ids? There is an issue with ALS' partitioning (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not sure whether that is the reason. Could you try to see whether you can reproduce the error on a public data set, e.g., movielens? Thanks! Best, Xiangrui On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I deployed apache/spark master today and recently there were many ALS related checkins and enhancements.. I am running ALS with explicit feedback and I remember most enhancements were related to implicit feedback... With 25 factors my runs were successful but with 50 factors I am getting array index out of bound... Note that I was hitting gc errors before with an older version of spark but it seems like the sparse matrix partitioning scheme has changed now...data caching looks much balanced now...earlier one node was becoming bottleneck...Although I ran with 64g memory per node... There are around 3M products, 25M users... Anyone noticed this bug or something similar ?
14/04/05 23:03:15 WARN TaskSetManager: Loss was due to java.lang.ArrayIndexOutOfBoundsException
java.lang.ArrayIndexOutOfBoundsException: 81029
    at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1$$anonfun$apply$mcVI$sp$1.apply$mcVI$sp(ALS.scala:450)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:446)
    at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
    at org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:445)
    at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:416)
    at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:415)
    at org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
    at org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
    at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:149)
    at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:147)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:147)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
    at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
    at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
    at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
    at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
    at org.apache.spark.scheduler.Task.run(Task.scala:52)
    at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:43)
    at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
    at java.security.AccessController.doPrivileged(Native Method)
    at javax.security.auth.Subject.doAs(Subject.java:396)
    at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
    at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:42)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
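[Editor's aside: Xiangrui's question about negative or out-of-range ids, and the linked SPARK-1281 partitioning issue, come down to a common pitfall: using `id % numPartitions` directly as an array index. A minimal pure-Scala sketch (names are illustrative; Spark has a similar helper, `Utils.nonNegativeMod`):]

```scala
// Illustration: the JVM's % operator returns a negative result for a
// negative left operand, so a naive "id % numPartitions" used as an
// array/block index goes out of bounds when ids are negative or have
// overflowed Int range.
def naiveMod(id: Int, numPartitions: Int): Int = id % numPartitions

// Safe variant: shift negative remainders back into [0, numPartitions).
def nonNegativeMod(id: Int, numPartitions: Int): Int = {
  val raw = id % numPartitions
  if (raw < 0) raw + numPartitions else raw
}

println(naiveMod(-7, 4))       // -3: unusable as an array index
println(nonNegativeMod(-7, 4)) // 1
```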
Re: ALS array index out of bound with 50 factors
On the partitioning / id keys. If we would look at hash partitioning, how feasible will it be to just allow the user and item ids to be strings? A lot of the time these ids are strings anyway (UUIDs and so on), and it's really painful to translate between String and Int the whole time. Are there any obvious blockers to this? I am a bit rusty on the ALS code but from a quick scan I think this may work. Performance may be an issue with large String keys... Any major issues/objections to this thinking? I may be able to find time to take a stab at this if there is demand. On Mon, Apr 7, 2014 at 6:08 AM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, Are you using the master branch or a particular commit? Do you have negative or out-of-integer-range user or product ids? There is an issue with ALS' partitioning (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not sure whether that is the reason. Could you try to see whether you can reproduce the error on a public data set, e.g., movielens? Thanks! Best, Xiangrui On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I deployed apache/spark master today and recently there were many ALS related checkins and enhancements.. I am running ALS with explicit feedback and I remember most enhancements were related to implicit feedback... With 25 factors my runs were successful but with 50 factors I am getting array index out of bound... Note that I was hitting gc errors before with an older version of spark but it seems like the sparse matrix partitioning scheme has changed now...data caching looks much balanced now...earlier one node was becoming bottleneck...Although I ran with 64g memory per node... There are around 3M products, 25M users... Anyone noticed this bug or something similar ?
Re: ALS array index out of bound with 50 factors
Nick, I already have this code which calls dictionary generation and then maps string etc to ints... I think the core algorithm should stay in ints... if you like I can add this code in MFUtils.scala... that's the convention I followed similar to MLUtils.scala... actually these functions should be even made part of MLUtils.scala... Only thing is that the join should be an option which makes it application dependent... sometimes people would like to do map side joins if their dictionaries are small... in my case user dictionary has 25M rows and product dictionary has 3M rows... so join optimization did not help... Thanks. Deb On Mon, Apr 7, 2014 at 6:57 AM, Nick Pentreath nick.pentre...@gmail.com wrote: On the partitioning / id keys. If we would look at hash partitioning, how feasible will it be to just allow the user and item ids to be strings? A lot of the time these ids are strings anyway (UUIDs and so on), and it's really painful to translate between String and Int the whole time. Are there any obvious blockers to this? I am a bit rusty on the ALS code but from a quick scan I think this may work. Performance may be an issue with large String keys... Any major issues/objections to this thinking? I may be able to find time to take a stab at this if there is demand. On Mon, Apr 7, 2014 at 6:08 AM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, Are you using the master branch or a particular commit? Do you have negative or out-of-integer-range user or product ids? There is an issue with ALS' partitioning (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not sure whether that is the reason. Could you try to see whether you can reproduce the error on a public data set, e.g., movielens? Thanks! Best, Xiangrui On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I deployed apache/spark master today and recently there were many ALS related checkins and enhancements.. 
I am running ALS with explicit feedback and I remember most enhancements were related to implicit feedback... With 25 factors my runs were successful but with 50 factors I am getting array index out of bound... Note that I was hitting gc errors before with an older version of spark but it seems like the sparse matrix partitioning scheme has changed now...data caching looks much balanced now...earlier one node was becoming bottleneck...Although I ran with 64g memory per node... There are around 3M products, 25M users... Anyone noticed this bug or something similar ?
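[Editor's aside: the dictionary approach Deb describes, mapping arbitrary String ids to dense Int indices before ALS and back afterwards, can be sketched on local collections; on real data the same shape would use `RDD.distinct().zipWithIndex()`. All names here are illustrative:]

```scala
// Minimal local sketch of a String -> Int id dictionary: assign each
// distinct String id a dense Int index, run the core algorithm on
// Ints, and keep the dictionary to map results back to String ids.
val rawUserIds = Seq("uuid-a", "uuid-b", "uuid-a", "uuid-c")

// distinct preserves first-occurrence order, so indices are stable.
val userDict: Map[String, Int] =
  rawUserIds.distinct.zipWithIndex.map { case (id, idx) => (id, idx.toInt) }.toMap

// Reverse dictionary for translating factors back after training.
val reverse: Map[Int, String] = userDict.map(_.swap)

val asInts = rawUserIds.map(userDict)
println(asInts)                // List(0, 1, 0, 2)
println(reverse(asInts.head))  // uuid-a
```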
Re: tachyon dependency
Tachyon is Java 6 compatible from version 0.4. Besides putting input/output data in Tachyon ( http://tachyon-project.org/Running-Spark-on-Tachyon.html ), Spark applications can also persist data into Tachyon ( https://github.com/apache/spark/blob/master/docs/scala-programming-guide.md ). On Mon, Apr 7, 2014 at 7:42 AM, Koert Kuipers ko...@tresata.com wrote: i noticed there is a dependency on tachyon in spark core 1.0.0-SNAPSHOT. how does that work? i believe tachyon is written in java 7, yet spark claims to be java 6 compatible. -- Haoyuan Li Algorithms, Machines, People Lab, EECS, UC Berkeley http://www.cs.berkeley.edu/~haoyuan/
Contributing to Spark
Hi, How do I contribute to Spark and its associated projects? Appreciate the help... Thanks Mukesh
Re: Contributing to Spark
This is a good place to start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Sujeet On Mon, Apr 7, 2014 at 9:20 AM, Mukesh G muk...@gmail.com wrote: Hi, How do I contribute to Spark and its associated projects? Appreciate the help... Thanks Mukesh
Re: ALS array index out of bound with 50 factors
I am using master... No negative indexes... If I run with 4 iterations it runs fine and I can generate factors... With 10 iterations run fails with array index out of bound... 25m users and 3m products are within int limits. Does it help if I can point the logs for both the runs to you ? I will debug it further today... On Apr 7, 2014 9:54 AM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, This thread is for the out-of-bound error you described. I don't think the number of iterations has any effect here. My questions were: 1) Are you using the master branch or a particular commit? 2) Do you have negative or out-of-integer-range user or product ids? Try to print out the max/min value of user/product ids. Best, Xiangrui On Sun, Apr 6, 2014 at 11:01 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Xiangrui, With 4 ALS iterations it runs fine... If I run 10 I am failing... I believe I have to cut the lineage chain and call checkpoint... Trying to follow the other email chain on checkpointing... Thanks. Deb On Sun, Apr 6, 2014 at 9:08 PM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, Are you using the master branch or a particular commit? Do you have negative or out-of-integer-range user or product ids? There is an issue with ALS' partitioning (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not sure whether that is the reason. Could you try to see whether you can reproduce the error on a public data set, e.g., movielens? Thanks! Best, Xiangrui On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I deployed apache/spark master today and recently there were many ALS related checkins and enhancements.. I am running ALS with explicit feedback and I remember most enhancements were related to implicit feedback... With 25 factors my runs were successful but with 50 factors I am getting array index out of bound... 
Note that I was hitting gc errors before with an older version of spark but it seems like the sparse matrix partitioning scheme has changed now...data caching looks much balanced now...earlier one node was becoming bottleneck...Although I ran with 64g memory per node... There are around 3M products, 25M users... Anyone noticed this bug or something similar ?
Flaky streaming tests
Hi all, The InputStreamsSuite seems to have some serious flakiness issues -- I've seen the file input stream fail many times and now I'm seeing some actor input stream test failures ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull) on what I think is an unrelated change. Does anyone know anything about these? Should we just remove some of these tests since they seem to be constantly failing? -Kay
Re: Flaky streaming tests
I met this issue when Jenkins seemed to be very busy On Monday, April 7, 2014, Kay Ousterhout k...@eecs.berkeley.edu wrote: Hi all, The InputStreamsSuite seems to have some serious flakiness issues -- I've seen the file input stream fail many times and now I'm seeing some actor input stream test failures ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull ) on what I think is an unrelated change. Does anyone know anything about these? Should we just remove some of these tests since they seem to be constantly failing? -Kay
Re: Flaky streaming tests
TD - do you know what is going on here? I looked into this a bit and at least a few of these use Thread.sleep() and assume the sleep will be exact, which is wrong. We should disable all the tests that do and probably they should be re-written to virtualize time. - Patrick On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout k...@eecs.berkeley.edu wrote: Hi all, The InputStreamsSuite seems to have some serious flakiness issues -- I've seen the file input stream fail many times and now I'm seeing some actor input stream test failures ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull ) on what I think is an unrelated change. Does anyone know anything about these? Should we just remove some of these tests since they seem to be constantly failing? -Kay
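[Editor's aside: "virtualizing time" as Patrick suggests means code under test reads time from an injectable clock rather than the wall clock. A hypothetical pure-Scala sketch of the pattern (Spark's streaming module has its own manual-clock utility; the names below are illustrative only):]

```scala
// Code under test asks a Clock for the time; production uses the
// system clock, tests advance a ManualClock deterministically instead
// of calling Thread.sleep and hoping the scheduler cooperates.
trait Clock { def nowMillis: Long }

class SystemClock extends Clock {
  def nowMillis: Long = System.currentTimeMillis()
}

class ManualClock(start: Long = 0L) extends Clock {
  private var time = start
  def nowMillis: Long = time
  def advance(millis: Long): Unit = { time += millis }
}

// Toy batch trigger that fires once per interval, driven by the clock.
class BatchTrigger(clock: Clock, intervalMs: Long) {
  private var lastFire = clock.nowMillis
  def poll(): Boolean = {
    if (clock.nowMillis - lastFire >= intervalMs) { lastFire = clock.nowMillis; true }
    else false
  }
}

val clock = new ManualClock()
val trigger = new BatchTrigger(clock, intervalMs = 100)
println(trigger.poll())  // false: no virtual time has passed
clock.advance(100)
println(trigger.poll())  // true: exactly one interval elapsed
```

The test is exact and instantaneous, which is precisely what a Thread.sleep-based test can never guarantee on a loaded Jenkins machine.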
Re: Flaky streaming tests
There is a JIRA for one of the flakey tests here: https://issues.apache.org/jira/browse/SPARK-1409 On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell pwend...@gmail.com wrote: TD - do you know what is going on here? I looked into this a bit and at least a few of these use Thread.sleep() and assume the sleep will be exact, which is wrong. We should disable all the tests that do and probably they should be re-written to virtualize time. - Patrick On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout k...@eecs.berkeley.edu wrote: Hi all, The InputStreamsSuite seems to have some serious flakiness issues -- I've seen the file input stream fail many times and now I'm seeing some actor input stream test failures ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull ) on what I think is an unrelated change. Does anyone know anything about these? Should we just remove some of these tests since they seem to be constantly failing? -Kay
Re: Flaky streaming tests
Yes, I will take a look at those tests ASAP. TD On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell pwend...@gmail.com wrote: TD - do you know what is going on here? I looked into this a bit and at least a few of these use Thread.sleep() and assume the sleep will be exact, which is wrong. We should disable all the tests that do and probably they should be re-written to virtualize time. - Patrick On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout k...@eecs.berkeley.edu wrote: Hi all, The InputStreamsSuite seems to have some serious flakiness issues -- I've seen the file input stream fail many times and now I'm seeing some actor input stream test failures ( https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull ) on what I think is an unrelated change. Does anyone know anything about these? Should we just remove some of these tests since they seem to be constantly failing? -Kay
Re: ALS array index out of bound with 50 factors
Hi Deb, It would be helpful if you can attach the logs. It is strange to see that you can make 4 iterations but not 10. Xiangrui On Mon, Apr 7, 2014 at 10:36 AM, Debasish Das debasish.da...@gmail.com wrote: I am using master... No negative indexes... If I run with 4 iterations it runs fine and I can generate factors... With 10 iterations run fails with array index out of bound... 25m users and 3m products are within int limits. Does it help if I can point the logs for both the runs to you ? I will debug it further today... On Apr 7, 2014 9:54 AM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, This thread is for the out-of-bound error you described. I don't think the number of iterations has any effect here. My questions were: 1) Are you using the master branch or a particular commit? 2) Do you have negative or out-of-integer-range user or product ids? Try to print out the max/min value of user/product ids. Best, Xiangrui On Sun, Apr 6, 2014 at 11:01 PM, Debasish Das debasish.da...@gmail.com wrote: Hi Xiangrui, With 4 ALS iterations it runs fine... If I run 10 I am failing... I believe I have to cut the lineage chain and call checkpoint... Trying to follow the other email chain on checkpointing... Thanks. Deb On Sun, Apr 6, 2014 at 9:08 PM, Xiangrui Meng men...@gmail.com wrote: Hi Deb, Are you using the master branch or a particular commit? Do you have negative or out-of-integer-range user or product ids? There is an issue with ALS' partitioning (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not sure whether that is the reason. Could you try to see whether you can reproduce the error on a public data set, e.g., movielens? Thanks! Best, Xiangrui On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das debasish.da...@gmail.com wrote: Hi, I deployed apache/spark master today and recently there were many ALS related checkins and enhancements.. I am running ALS with explicit feedback and I remember most enhancements were related to implicit feedback... 
With 25 factors my runs were successful but with 50 factors I am getting array index out of bound... Note that I was hitting gc errors before with an older version of spark but it seems like the sparse matrix partitioning scheme has changed now...data caching looks much balanced now...earlier one node was becoming bottleneck...Although I ran with 64g memory per node... There are around 3M products, 25M users... Anyone noticed this bug or something similar ?
Spark Streaming and Flume Avro RPC Servers
Hi, From my testing of Spark Streaming with Flume, it seems that there's only one of the Spark worker nodes that runs a Flume Avro RPC server to receive messages at any given time, as opposed to every Spark worker running an Avro RPC server to receive messages. Is this the case? Our use-case would benefit from balancing the load across Workers because of our volume of messages. We would be using a load balancer in front of the Spark workers running the Avro RPC servers, essentially round-robinning the messages across all of them. If this is something that is currently not supported, I'd be interested in contributing to the code to make it happen. - Christophe
Re: Spark Streaming and Flume Avro RPC Servers
You can configure your sinks to write to one or more Avro sources in a load-balanced configuration. https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors mfe On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp christo...@christophe.cc wrote: Hi, From my testing of Spark Streaming with Flume, it seems that there's only one of the Spark worker nodes that runs a Flume Avro RPC server to receive messages at any given time, as opposed to every Spark worker running an Avro RPC server to receive messages. Is this the case? Our use-case would benefit from balancing the load across Workers because of our volume of messages. We would be using a load balancer in front of the Spark workers running the Avro RPC servers, essentially round-robinning the messages across all of them. If this is something that is currently not supported, I'd be interested in contributing to the code to make it happen. - Christophe -- Michael Ernest Sr. Solutions Consultant West Coast
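[Editor's aside: the sink-processor approach Mike links can be sketched as a Flume agent configuration fragment. Agent, sink, and host names below are hypothetical, and the fragment omits the channel/source wiring a complete agent config would also need:]

```properties
# Hypothetical Flume agent "a1": a load-balancing sink group that
# round-robins events across two Avro sinks, each pointing at a
# different host running an Avro source (e.g. a Spark worker).
a1.sinkgroups = g1
a1.sinkgroups.g1.sinks = k1 k2
a1.sinkgroups.g1.processor.type = load_balance
a1.sinkgroups.g1.processor.selector = round_robin

a1.sinks.k1.type = avro
a1.sinks.k1.hostname = spark-worker-1
a1.sinks.k1.port = 4545

a1.sinks.k2.type = avro
a1.sinks.k2.hostname = spark-worker-2
a1.sinks.k2.port = 4545
```

This load-balances on the Flume side; it does not by itself make every Spark worker run a receiver, which is the separate FlumeUtils limitation discussed below in the thread.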
Re: Spark Streaming and Flume Avro RPC Servers
I don't see why not. If one were doing something similar with straight Flume, you'd start an agent on each node you care to receive Avro/RPC events. In the absence of clearer insight to your use case, I'm puzzling just a little why it's necessary for each Worker to be its own receiver, but there's no real objection or concern to fuel the puzzlement, just curiosity. On Mon, Apr 7, 2014 at 4:16 PM, Christophe Clapp christo...@christophe.cc wrote: Could it be as simple as just changing FlumeUtils to accept a list of host/port number pairs to start the RPC servers on? On 4/7/14, 12:58 PM, Christophe Clapp wrote: Based on the source code here: https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala It looks like in its current version, FlumeUtils does not support starting an Avro RPC server on more than one worker. - Christophe On 4/7/14, 12:23 PM, Michael Ernest wrote: You can configure your sinks to write to one or more Avro sources in a load-balanced configuration. https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors mfe On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp christo...@christophe.cc wrote: Hi, From my testing of Spark Streaming with Flume, it seems that there's only one of the Spark worker nodes that runs a Flume Avro RPC server to receive messages at any given time, as opposed to every Spark worker running an Avro RPC server to receive messages. Is this the case? Our use-case would benefit from balancing the load across Workers because of our volume of messages. We would be using a load balancer in front of the Spark workers running the Avro RPC servers, essentially round-robinning the messages across all of them. If this is something that is currently not supported, I'd be interested in contributing to the code to make it happen. - Christophe -- Michael Ernest Sr. Solutions Consultant West Coast
Re: Spark Streaming and Flume Avro RPC Servers
Cool. I'll look at making the code change in FlumeUtils and generating a pull request. As far as the use case, the volume of messages we have is currently about 30 MB per second, which may grow to over what a 1 Gbit network adapter can handle. - Christophe On Apr 7, 2014 1:51 PM, Michael Ernest mfern...@cloudera.com wrote: I don't see why not. If one were doing something similar with straight Flume, you'd start an agent on each node you care to receive Avro/RPC events. In the absence of clearer insight to your use case, I'm puzzling just a little why it's necessary for each Worker to be its own receiver, but there's no real objection or concern to fuel the puzzlement, just curiosity. On Mon, Apr 7, 2014 at 4:16 PM, Christophe Clapp christo...@christophe.cc wrote: Could it be as simple as just changing FlumeUtils to accept a list of host/port number pairs to start the RPC servers on? On 4/7/14, 12:58 PM, Christophe Clapp wrote: Based on the source code here: https://github.com/apache/spark/blob/master/external/flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala It looks like in its current version, FlumeUtils does not support starting an Avro RPC server on more than one worker. - Christophe On 4/7/14, 12:23 PM, Michael Ernest wrote: You can configure your sinks to write to one or more Avro sources in a load-balanced configuration. https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors mfe On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp christo...@christophe.cc wrote: Hi, From my testing of Spark Streaming with Flume, it seems that there's only one of the Spark worker nodes that runs a Flume Avro RPC server to receive messages at any given time, as opposed to every Spark worker running an Avro RPC server to receive messages. Is this the case? Our use-case would benefit from balancing the load across Workers because of our volume of messages. 
We would be using a load balancer in front of the Spark workers running the Avro RPC servers, essentially round-robinning the messages across all of them. If this is something that is currently not supported, I'd be interested in contributing to the code to make it happen. - Christophe -- Michael Ernest Sr. Solutions Consultant West Coast
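Until FlumeUtils itself accepts a list of endpoints, the multi-receiver idea can be approximated by creating one stream per endpoint and unioning them. This is only a sketch: the hostnames are hypothetical, and it assumes each receiver ends up scheduled on the worker whose hostname it binds.

```scala
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.flume.FlumeUtils

object MultiReceiverSketch {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("flume-multi-receiver")
    val ssc = new StreamingContext(conf, Seconds(2))

    // Hypothetical host/port pairs; each createStream starts one Avro RPC
    // receiver, and Flume (or an external load balancer) can round-robin
    // events across them.
    val endpoints = Seq(("worker-1", 41414), ("worker-2", 41414), ("worker-3", 41414))
    val streams = endpoints.map { case (host, port) =>
      FlumeUtils.createStream(ssc, host, port)
    }

    // Merge the per-receiver streams into a single DStream for processing.
    val events = ssc.union(streams)
    events.count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}
```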
Re: MLLib - Thoughts about refactoring Updater for LBFGS?
Hi guys, The latest PR uses Breeze's L-BFGS implementation, which was introduced by Xiangrui's sparse input format work in SPARK-1212. https://github.com/apache/spark/pull/353 Now it works with the new sparse framework! Any feedback would be greatly appreciated. Thanks. Sincerely, DB Tsai --- My Blog: https://www.dbtsai.com LinkedIn: https://www.linkedin.com/in/dbtsai

On Thu, Apr 3, 2014 at 5:02 PM, DB Tsai dbt...@alpinenow.com wrote: -- Forwarded message -- From: David Hall d...@cs.berkeley.edu Date: Sat, Mar 15, 2014 at 10:02 AM Subject: Re: MLLib - Thoughts about refactoring Updater for LBFGS? To: DB Tsai dbt...@alpinenow.com

On Fri, Mar 7, 2014 at 10:56 PM, DB Tsai dbt...@alpinenow.com wrote: Hi David, Please let me know the version of Breeze in which LBFGS can be serialized and CachedDiffFunction is built into LBFGS once you finish. I'll update the PR to Spark to use the Breeze implementation instead of the RISO implementation.

The current master (0.7-SNAPSHOT) has these problems fixed.

Thanks. Sincerely, DB Tsai Machine Learning Engineer Alpine Data Labs -- Web: http://alpinenow.com/

On Thu, Mar 6, 2014 at 4:26 PM, David Hall d...@cs.berkeley.edu wrote: On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai dbt...@alpinenow.com wrote: Hi David, I can converge to the same result with your Breeze LBFGS and the Fortran implementation now. Probably I made some mistakes when I tried Breeze before; I apologize for claiming it was not stable. See the test case in BreezeLBFGSSuite.scala https://github.com/AlpineNow/spark/tree/dbtsai-breezeLBFGS This is training multinomial logistic regression against the iris dataset, and both optimizers can train the models with 98% training accuracy.

Great to hear! There were some bugs in LBFGS about 6 months ago, so depending on the last time you tried it, it might indeed have been bugged.
There are two issues with using Breeze in Spark. 1) When the gradientSum and lossSum are computed distributively in a custom DiffFunction that is passed into your optimizer, Spark complains that the LBFGS class is not serializable. In BreezeLBFGS.scala, I have to convert the RDD to an array to make it work locally. It should be easy to fix by just having LBFGS implement Serializable.

I'm not sure why Spark should be serializing LBFGS? Shouldn't it live on the controller node? Or is this a per-node thing? But no problem to make it serializable.

2) Breeze computes redundant gradient and loss. See the following logs from both the Fortran and Breeze implementations.

Err, yeah. I should probably have LBFGS do this automatically, but there's a CachedDiffFunction that gets rid of the redundant calculations. -- David

Thanks.

Fortran:
Iteration -1: loss 1.3862943611198926, diff 1.0
Iteration 0: loss 1.5846343143210866, diff 0.14307193024217352
Iteration 1: loss 1.1242501524477688, diff 0.29053004039012126
Iteration 2: loss 1.0930151243303563, diff 0.027782962952189336
Iteration 3: loss 1.054036932835569, diff 0.03566113127440601
Iteration 4: loss 0.9907956302751622, diff 0.0507649459571
Iteration 5: loss 0.9184205380342829, diff 0.07304737423337761
Iteration 6: loss 0.8259870936519937, diff 0.10064381175132982
Iteration 7: loss 0.6327447552109574, diff 0.23395293458364716
Iteration 8: loss 0.5534101162436359, diff 0.1253815427665277
Iteration 9: loss 0.4045020086612566, diff 0.26907321376758075
Iteration 10: loss 0.3078824990823728, diff 0.23885980452569627

Breeze:
Iteration -1: loss 1.3862943611198926, diff 1.0
Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS clinit WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeSystemBLAS
Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS clinit WARNING: Failed to load implementation from: com.github.fommil.netlib.NativeRefBLAS
Iteration 0: loss 1.3862943611198926, diff 0.0
Iteration 1: loss 1.5846343143210866, diff 0.14307193024217352
Iteration 2: loss 1.1242501524477688, diff 0.29053004039012126
Iteration 3: loss 1.1242501524477688, diff 0.0
Iteration 4: loss 1.1242501524477688, diff 0.0
Iteration 5: loss 1.0930151243303563, diff 0.027782962952189336
Iteration 6: loss 1.0930151243303563, diff 0.0
Iteration 7: loss 1.0930151243303563, diff 0.0
Iteration 8: loss 1.054036932835569, diff 0.03566113127440601
Iteration 9: loss 1.054036932835569, diff 0.0
Iteration 10: loss 1.054036932835569, diff 0.0
Iteration 11: loss 0.9907956302751622, diff 0.0507649459571
Iteration 12: loss 0.9907956302751622, diff 0.0
Iteration 13: loss 0.9907956302751622, diff 0.0
Iteration 14: loss 0.9184205380342829, diff 0.07304737423337761
Iteration 15: loss 0.9184205380342829, diff 0.0
Iteration 16: loss 0.9184205380342829, diff 0.0
Iteration 17: loss
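David's CachedDiffFunction suggestion can be sketched against Breeze's optimize API (assuming Breeze 0.7-SNAPSHOT or later on the classpath). The quadratic objective below is made up purely for illustration:

```scala
import breeze.linalg.DenseVector
import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS}

object CachedLbfgsSketch {
  def main(args: Array[String]): Unit = {
    // Made-up convex objective: f(x) = ||x - 3||^2, with gradient 2(x - 3).
    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
        val r = x - 3.0
        (r dot r, r * 2.0)
      }
    }

    val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 7)

    // CachedDiffFunction memoizes the most recent (loss, gradient) evaluation,
    // so the optimizer does not recompute them at a point it has already seen,
    // avoiding the duplicated iterations visible in the Breeze log above.
    val xOpt = lbfgs.minimize(new CachedDiffFunction(f), DenseVector.zeros[Double](3))
    println(xOpt) // should converge to a vector close to 3.0 in every component
  }
}
```

Wrapping the distributed DiffFunction this way keeps the redundant passes over the RDD from happening, which matters far more when each evaluation is a cluster job rather than a local vector operation.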
Re: Contributing to Spark
Hi Sujeet, Thanks. I went through the website and it looks great. Is there a list of items that I can choose from for contribution? Thanks Mukesh On Mon, Apr 7, 2014 at 10:14 PM, Sujeet Varakhedi svarakh...@gopivotal.com wrote: This is a good place to start: https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark Sujeet On Mon, Apr 7, 2014 at 9:20 AM, Mukesh G muk...@gmail.com wrote: Hi, How do I contribute to Spark and its associated projects? Appreciate the help... Thanks Mukesh
Re: Contributing to Spark
I’d suggest looking for the issues labeled “Starter” on JIRA. You can find them here: https://issues.apache.org/jira/browse/SPARK-1438?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened) Matei On Apr 7, 2014, at 9:45 PM, Mukesh G muk...@gmail.com wrote: Is there a list of items that I can choose from, for contribution?