Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Debasish Das
Hi Xiangrui,

With 4 ALS iterations it runs fine... If I run 10 I am failing... I believe I
have to cut the lineage chain and call checkpoint... Trying to follow the
other email chain on checkpointing...

Thanks.
Deb
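
For reference, cutting a long lineage with checkpointing looks roughly like
this (a sketch of the generic RDD pattern only, not of ALS internals; the
checkpoint directory and the per-iteration transformation are placeholders):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD

    val sc = new SparkContext(new SparkConf().setAppName("lineage-demo"))
    sc.setCheckpointDir("hdfs:///tmp/spark-checkpoints") // placeholder path

    var state: RDD[(Int, Double)] = sc.parallelize(0 until 1000).map(i => (i, 1.0))
    for (iter <- 1 to 10) {
      state = state.mapValues(_ * 0.9) // stand-in for one ALS iteration
      if (iter % 3 == 0) {
        state.checkpoint() // truncates the lineage once materialized
        state.count()      // force materialization so the checkpoint is written
      }
    }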


On Sun, Apr 6, 2014 at 9:08 PM, Xiangrui Meng men...@gmail.com wrote:

 Hi Deb,

 Are you using the master branch or a particular commit? Do you have
 negative or out-of-integer-range user or product ids? There is an
 issue with ALS' partitioning
 (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not
 sure whether that is the reason. Could you try to see whether you can
 reproduce the error on a public data set, e.g., movielens? Thanks!

 Best,
 Xiangrui

 On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  I deployed apache/spark master today and recently there were many ALS
  related checkins and enhancements..
 
  I am running ALS with explicit feedback and I remember most enhancements
  were related to implicit feedback...
 
  With 25 factors my runs were successful but with 50 factors I am getting
  array index out of bound...
 
  Note that I was hitting gc errors before with an older version of spark but
  it seems like the sparse matrix partitioning scheme has changed now... data
  caching looks much balanced now... earlier one node was becoming
  bottleneck... Although I ran with 64g memory per node...
 
  There are around 3M products, 25M users...
 
  Anyone noticed this bug or something similar ?
 
  14/04/05 23:03:15 WARN TaskSetManager: Loss was due to
  java.lang.ArrayIndexOutOfBoundsException
  java.lang.ArrayIndexOutOfBoundsException: 81029
  at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1$$anonfun$apply$mcVI$sp$1.apply$mcVI$sp(ALS.scala:450)
  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
  at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateBlock$1.apply$mcVI$sp(ALS.scala:446)
  at scala.collection.immutable.Range.foreach$mVc$sp(Range.scala:141)
  at org.apache.spark.mllib.recommendation.ALS.org$apache$spark$mllib$recommendation$ALS$$updateBlock(ALS.scala:445)
  at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:416)
  at org.apache.spark.mllib.recommendation.ALS$$anonfun$org$apache$spark$mllib$recommendation$ALS$$updateFeatures$2.apply(ALS.scala:415)
  at org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
  at org.apache.spark.rdd.MappedValuesRDD$$anonfun$compute$1.apply(MappedValuesRDD.scala:31)
  at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
  at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:149)
  at org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$4.apply(CoGroupedRDD.scala:147)
  at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  at org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:147)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
  at org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
  at org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
  at org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
  at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:229)
  at org.apache.spark.rdd.RDD.iterator(RDD.scala:220)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:161)
  at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:102)
  at org.apache.spark.scheduler.Task.run(Task.scala:52)
  at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:211)
  at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:43)
  at org.apache.spark.deploy.SparkHadoopUtil$$anon$1.run(SparkHadoopUtil.scala:42)
  at java.security.AccessController.doPrivileged(Native Method)
  at javax.security.auth.Subject.doAs(Subject.java:396)
  at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1408)
  at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:42)
  at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:176)
  at ...

Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Nick Pentreath
On the partitioning / id keys: if we were to look at hash partitioning, how
feasible would it be to just allow the user and item ids to be strings? A
lot of the time these ids are strings anyway (UUIDs and so on), and it's
really painful to translate between String and Int the whole time.

Are there any obvious blockers to this? I am a bit rusty on the ALS code,
but from a quick scan I think this may work. Performance may be an issue
with large String keys... Any major issues/objections to this thinking?

I may be able to find time to take a stab at this if there is demand.


On Mon, Apr 7, 2014 at 6:08 AM, Xiangrui Meng men...@gmail.com wrote:

 Hi Deb,

 Are you using the master branch or a particular commit? Do you have
 negative or out-of-integer-range user or product ids? There is an
 issue with ALS' partitioning
 (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not
 sure whether that is the reason. Could you try to see whether you can
 reproduce the error on a public data set, e.g., movielens? Thanks!

 Best,
 Xiangrui

 On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi,
 
  I deployed apache/spark master today and recently there were many ALS
  related checkins and enhancements..
 
  I am running ALS with explicit feedback and I remember most enhancements
  were related to implicit feedback...
 
  With 25 factors my runs were successful but with 50 factors I am getting
  array index out of bound...
 
   Note that I was hitting gc errors before with an older version of spark but
   it seems like the sparse matrix partitioning scheme has changed now... data
   caching looks much balanced now... earlier one node was becoming
   bottleneck... Although I ran with 64g memory per node...
 
  There are around 3M products, 25M users...
 
  Anyone noticed this bug or something similar ?
 
   14/04/05 23:03:15 WARN TaskSetManager: Loss was due to
   java.lang.ArrayIndexOutOfBoundsException
   java.lang.ArrayIndexOutOfBoundsException: 81029
   [stack trace elided; identical to the trace quoted earlier in this thread]

Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Debasish Das
Nick,

I already have this code, which does the dictionary generation and then maps
strings etc. to ints... I think the core algorithm should stay in ints... If
you like I can add this code in MFUtils.scala... that's the convention I
followed, similar to MLUtils.scala... Actually these functions should even be
made part of MLUtils.scala...

Only thing is that the join should be an option, which makes it
application-dependent... Sometimes people would like to do map-side joins if
their dictionaries are small... In my case the user dictionary has 25M rows
and the product dictionary has 3M rows... so the join optimization did not
help...

Thanks.
Deb
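
For reference, the general shape of that dictionary mapping (a sketch only:
MFUtils.scala is not in the tree, so names here are illustrative, and it
assumes the RDD.zipWithIndex recently added on master):

    import org.apache.spark.SparkContext._
    import org.apache.spark.rdd.RDD
    import org.apache.spark.mllib.recommendation.Rating

    // raw: (userString, productString, rating)
    def toIntRatings(raw: RDD[(String, String, Double)]): RDD[Rating] = {
      // Assign a dense Long index to every distinct string key.
      val userDict: RDD[(String, Long)] = raw.map(_._1).distinct().zipWithIndex()
      val prodDict: RDD[(String, Long)] = raw.map(_._2).distinct().zipWithIndex()

      // Join the raw ratings through both dictionaries; toInt is safe here
      // since 25M users and 3M products fit comfortably in Int.
      raw.map { case (u, p, r) => (u, (p, r)) }
        .join(userDict)                                  // (u, ((p, r), uId))
        .map { case (_, ((p, r), uId)) => (p, (uId, r)) }
        .join(prodDict)                                  // (p, ((uId, r), pId))
        .map { case (_, ((uId, r), pId)) => Rating(uId.toInt, pId.toInt, r) }
    }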



On Mon, Apr 7, 2014 at 6:57 AM, Nick Pentreath nick.pentre...@gmail.com wrote:

 On the partitioning / id keys: if we were to look at hash partitioning, how
 feasible would it be to just allow the user and item ids to be strings? A
 lot of the time these ids are strings anyway (UUIDs and so on), and it's
 really painful to translate between String and Int the whole time.

 Are there any obvious blockers to this? I am a bit rusty on the ALS code,
 but from a quick scan I think this may work. Performance may be an issue
 with large String keys... Any major issues/objections to this thinking?

 I may be able to find time to take a stab at this if there is demand.


 On Mon, Apr 7, 2014 at 6:08 AM, Xiangrui Meng men...@gmail.com wrote:

  Hi Deb,
 
  Are you using the master branch or a particular commit? Do you have
  negative or out-of-integer-range user or product ids? There is an
  issue with ALS' partitioning
  (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not
  sure whether that is the reason. Could you try to see whether you can
  reproduce the error on a public data set, e.g., movielens? Thanks!
 
  Best,
  Xiangrui
 
  On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das debasish.da...@gmail.com
  wrote:
   Hi,
  
   I deployed apache/spark master today and recently there were many ALS
   related checkins and enhancements..
  
    I am running ALS with explicit feedback and I remember most enhancements
    were related to implicit feedback...

    With 25 factors my runs were successful but with 50 factors I am getting
    array index out of bound...

    Note that I was hitting gc errors before with an older version of spark but
    it seems like the sparse matrix partitioning scheme has changed now... data
    caching looks much balanced now... earlier one node was becoming
    bottleneck... Although I ran with 64g memory per node...
  
   There are around 3M products, 25M users...
  
   Anyone noticed this bug or something similar ?
  
    14/04/05 23:03:15 WARN TaskSetManager: Loss was due to
    java.lang.ArrayIndexOutOfBoundsException
    java.lang.ArrayIndexOutOfBoundsException: 81029
    [stack trace elided; identical to the trace quoted earlier in this thread]

Re: tachyon dependency

2014-04-07 Thread Haoyuan Li
Tachyon is Java 6 compatible from version 0.4. Besides putting input/output
data in Tachyon ( http://tachyon-project.org/Running-Spark-on-Tachyon.html ),
Spark applications can also persist data into Tachyon (
https://github.com/apache/spark/blob/master/docs/scala-programming-guide.md
).
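
For reference, the persist path looks roughly like this (a sketch assuming
the 1.0-era OFF_HEAP storage level and the spark.tachyonStore.url setting
described in that guide; the master address is a placeholder):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.storage.StorageLevel

    val conf = new SparkConf()
      .setAppName("tachyon-persist-demo")
      .set("spark.tachyonStore.url", "tachyon://master:19998") // placeholder
    val sc = new SparkContext(conf)

    // With the setting above, OFF_HEAP blocks are stored in Tachyon rather
    // than in the executor JVMs.
    val data = sc.parallelize(1 to 1000000)
    data.persist(StorageLevel.OFF_HEAP)
    println(data.count())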


On Mon, Apr 7, 2014 at 7:42 AM, Koert Kuipers ko...@tresata.com wrote:

 I noticed there is a dependency on tachyon in spark core 1.0.0-SNAPSHOT.
 How does that work? I believe tachyon is written in java 7, yet spark
 claims to be java 6 compatible.




-- 
Haoyuan Li
Algorithms, Machines, People Lab, EECS, UC Berkeley
http://www.cs.berkeley.edu/~haoyuan/


Contributing to Spark

2014-04-07 Thread Mukesh G
Hi,

   How do I contribute to Spark and its associated projects?

Appreciate the help...

Thanks

Mukesh


Re: Contributing to Spark

2014-04-07 Thread Sujeet Varakhedi
This is a good place to start:
https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

Sujeet


On Mon, Apr 7, 2014 at 9:20 AM, Mukesh G muk...@gmail.com wrote:

 Hi,

    How do I contribute to Spark and its associated projects?

 Appreciate the help...

 Thanks

 Mukesh



Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Debasish Das
I am using master...

No negative indexes...

If I run with 4 iterations it runs fine and I can generate factors...

With 10 iterations run fails with array index out of bound...

25m users and 3m products are within int limits

Does it help if I can point the logs for both the runs to you ?

I will debug it further today...
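
For reference, the id-range check Xiangrui asks for below is only a couple of
lines (a sketch, assuming the ratings are already parsed into mllib's Rating):

    import org.apache.spark.mllib.recommendation.Rating
    import org.apache.spark.rdd.RDD

    // ratings: RDD[Rating] -- print the extremes of both id spaces to rule
    // out negative or out-of-range values.
    def printIdRanges(ratings: RDD[Rating]): Unit = {
      val users = ratings.map(_.user)
      val products = ratings.map(_.product)
      println("user ids: " + users.reduce(math.min) + " .. " + users.reduce(math.max))
      println("product ids: " + products.reduce(math.min) + " .. " + products.reduce(math.max))
    }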
 On Apr 7, 2014 9:54 AM, Xiangrui Meng men...@gmail.com wrote:

 Hi Deb,

 This thread is for the out-of-bound error you described. I don't think
 the number of iterations has any effect here. My questions were:

 1) Are you using the master branch or a particular commit?

 2) Do you have negative or out-of-integer-range user or product ids?
 Try to print out the max/min value of user/product ids.

 Best,
 Xiangrui

 On Sun, Apr 6, 2014 at 11:01 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi Xiangrui,
 
   With 4 ALS iterations it runs fine... If I run 10 I am failing... I believe I
   have to cut the lineage chain and call checkpoint... Trying to follow the
   other email chain on checkpointing...
 
  Thanks.
  Deb
 
 
  On Sun, Apr 6, 2014 at 9:08 PM, Xiangrui Meng men...@gmail.com wrote:
 
  Hi Deb,
 
  Are you using the master branch or a particular commit? Do you have
  negative or out-of-integer-range user or product ids? There is an
  issue with ALS' partitioning
  (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not
  sure whether that is the reason. Could you try to see whether you can
  reproduce the error on a public data set, e.g., movielens? Thanks!
 
  Best,
  Xiangrui
 
  On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das debasish.da...@gmail.com
 
  wrote:
   Hi,
  
   I deployed apache/spark master today and recently there were many ALS
   related checkins and enhancements..
  
    I am running ALS with explicit feedback and I remember most enhancements
    were related to implicit feedback...

    With 25 factors my runs were successful but with 50 factors I am getting
    array index out of bound...

    Note that I was hitting gc errors before with an older version of spark but
    it seems like the sparse matrix partitioning scheme has changed now... data
    caching looks much balanced now... earlier one node was becoming
    bottleneck... Although I ran with 64g memory per node...
  
   There are around 3M products, 25M users...
  
   Anyone noticed this bug or something similar ?
  
    14/04/05 23:03:15 WARN TaskSetManager: Loss was due to
    java.lang.ArrayIndexOutOfBoundsException
    java.lang.ArrayIndexOutOfBoundsException: 81029
    [stack trace elided; identical to the trace quoted earlier in this thread]

Flaky streaming tests

2014-04-07 Thread Kay Ousterhout
Hi all,

The InputStreamsSuite seems to have some serious flakiness issues -- I've
seen the file input stream fail many times and now I'm seeing some actor
input stream test failures (
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull)
on what I think is an unrelated change.  Does anyone know anything about
these?  Should we just remove some of these tests since they seem to be
constantly failing?

-Kay


Re: Flaky streaming tests

2014-04-07 Thread Nan Zhu
I've seen this issue when Jenkins is very busy.

On Monday, April 7, 2014, Kay Ousterhout k...@eecs.berkeley.edu wrote:

 Hi all,

 The InputStreamsSuite seems to have some serious flakiness issues -- I've
 seen the file input stream fail many times and now I'm seeing some actor
 input stream test failures (

 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
 )
 on what I think is an unrelated change.  Does anyone know anything about
 these?  Should we just remove some of these tests since they seem to be
 constantly failing?

 -Kay



Re: Flaky streaming tests

2014-04-07 Thread Patrick Wendell
TD - do you know what is going on here?

I looked into this a bit, and at least a few of these tests use
Thread.sleep() and assume the sleep will be exact, which is wrong. We
should disable all the tests that do, and they should probably be re-written
to virtualize time.

- Patrick
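
For reference, a sketch of that direction, assuming the ManualClock that the
streaming test harness already supports via the spark.streaming.clock setting
(advancing the clock goes through internal test APIs, so this only shows the
wiring):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}

    // Swap the wall clock for a manual one so batches fire only when the
    // test advances time, instead of racing a real Thread.sleep().
    val conf = new SparkConf()
      .setMaster("local[2]")
      .setAppName("manual-clock-test")
      .set("spark.streaming.clock", "org.apache.spark.streaming.util.ManualClock")
    val ssc = new StreamingContext(conf, Seconds(1))
    // ... set up input streams and expected outputs here, then advance the
    // manual clock batch by batch and assert on the results, rather than
    // sleeping and hoping the timing works out.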


On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout k...@eecs.berkeley.edu wrote:

 Hi all,

 The InputStreamsSuite seems to have some serious flakiness issues -- I've
 seen the file input stream fail many times and now I'm seeing some actor
 input stream test failures (

 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
 )
 on what I think is an unrelated change.  Does anyone know anything about
 these?  Should we just remove some of these tests since they seem to be
 constantly failing?

 -Kay



Re: Flaky streaming tests

2014-04-07 Thread Michael Armbrust
There is a JIRA for one of the flaky tests here:
https://issues.apache.org/jira/browse/SPARK-1409


On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell pwend...@gmail.com wrote:

 TD - do you know what is going on here?

  I looked into this a bit, and at least a few of these tests use
  Thread.sleep() and assume the sleep will be exact, which is wrong. We
  should disable all the tests that do, and they should probably be re-written
  to virtualize time.

 - Patrick


 On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout k...@eecs.berkeley.edu
 wrote:

  Hi all,
 
  The InputStreamsSuite seems to have some serious flakiness issues -- I've
  seen the file input stream fail many times and now I'm seeing some actor
  input stream test failures (
 
 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
  )
  on what I think is an unrelated change.  Does anyone know anything about
  these?  Should we just remove some of these tests since they seem to be
  constantly failing?
 
  -Kay
 



Re: Flaky streaming tests

2014-04-07 Thread Tathagata Das
Yes, I will take a look at those tests ASAP.

TD



On Mon, Apr 7, 2014 at 11:32 AM, Patrick Wendell pwend...@gmail.com wrote:

 TD - do you know what is going on here?

  I looked into this a bit, and at least a few of these tests use
  Thread.sleep() and assume the sleep will be exact, which is wrong. We
  should disable all the tests that do, and they should probably be re-written
  to virtualize time.

 - Patrick


 On Mon, Apr 7, 2014 at 10:52 AM, Kay Ousterhout k...@eecs.berkeley.edu
 wrote:

  Hi all,
 
  The InputStreamsSuite seems to have some serious flakiness issues -- I've
  seen the file input stream fail many times and now I'm seeing some actor
  input stream test failures (
 
 
 https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/13846/consoleFull
  )
  on what I think is an unrelated change.  Does anyone know anything about
  these?  Should we just remove some of these tests since they seem to be
  constantly failing?
 
  -Kay
 



Re: ALS array index out of bound with 50 factors

2014-04-07 Thread Xiangrui Meng
Hi Deb,

It would be helpful if you could attach the logs. It is strange to see
that you can make 4 iterations but not 10.

Xiangrui

On Mon, Apr 7, 2014 at 10:36 AM, Debasish Das debasish.da...@gmail.com wrote:
 I am using master...

 No negative indexes...

 If I run with 4 iterations it runs fine and I can generate factors...

 With 10 iterations run fails with array index out of bound...

 25m users and 3m products are within int limits

 Does it help if I can point the logs for both the runs to you ?

 I will debug it further today...
  On Apr 7, 2014 9:54 AM, Xiangrui Meng men...@gmail.com wrote:

 Hi Deb,

 This thread is for the out-of-bound error you described. I don't think
 the number of iterations has any effect here. My questions were:

 1) Are you using the master branch or a particular commit?

 2) Do you have negative or out-of-integer-range user or product ids?
 Try to print out the max/min value of user/product ids.

 Best,
 Xiangrui

 On Sun, Apr 6, 2014 at 11:01 PM, Debasish Das debasish.da...@gmail.com
 wrote:
  Hi Xiangrui,
 
   With 4 ALS iterations it runs fine... If I run 10 I am failing... I believe I
   have to cut the lineage chain and call checkpoint... Trying to follow the
   other email chain on checkpointing...
 
  Thanks.
  Deb
 
 
  On Sun, Apr 6, 2014 at 9:08 PM, Xiangrui Meng men...@gmail.com wrote:
 
  Hi Deb,
 
  Are you using the master branch or a particular commit? Do you have
  negative or out-of-integer-range user or product ids? There is an
  issue with ALS' partitioning
  (https://spark-project.atlassian.net/browse/SPARK-1281), but I'm not
  sure whether that is the reason. Could you try to see whether you can
  reproduce the error on a public data set, e.g., movielens? Thanks!
 
  Best,
  Xiangrui
 
  On Sat, Apr 5, 2014 at 10:53 PM, Debasish Das debasish.da...@gmail.com
 
  wrote:
   Hi,
  
   I deployed apache/spark master today and recently there were many ALS
   related checkins and enhancements..
  
    I am running ALS with explicit feedback and I remember most enhancements
    were related to implicit feedback...

    With 25 factors my runs were successful but with 50 factors I am getting
    array index out of bound...

    Note that I was hitting gc errors before with an older version of spark but
    it seems like the sparse matrix partitioning scheme has changed now... data
    caching looks much balanced now... earlier one node was becoming
    bottleneck... Although I ran with 64g memory per node...
  
   There are around 3M products, 25M users...
  
   Anyone noticed this bug or something similar ?
  
    14/04/05 23:03:15 WARN TaskSetManager: Loss was due to
    java.lang.ArrayIndexOutOfBoundsException
    java.lang.ArrayIndexOutOfBoundsException: 81029
    [stack trace elided; identical to the trace quoted earlier in this thread]

Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Christophe Clapp

Hi,

From my testing of Spark Streaming with Flume, it seems that there's 
only one of the Spark worker nodes that runs a Flume Avro RPC server to 
receive messages at any given time, as opposed to every Spark worker 
running an Avro RPC server to receive messages. Is this the case? Our 
use-case would benefit from balancing the load across Workers because of 
our volume of messages. We would be using a load balancer in front of 
the Spark workers running the Avro RPC servers, essentially 
round-robinning the messages across all of them.


If this is something that is currently not supported, I'd be interested 
in contributing to the code to make it happen.


- Christophe


Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Michael Ernest
You can configure your sinks to write to one or more Avro sources in a
load-balanced configuration.

https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors

mfe


On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp
christo...@christophe.cc wrote:

 Hi,

 From my testing of Spark Streaming with Flume, it seems that there's only
 one of the Spark worker nodes that runs a Flume Avro RPC server to receive
 messages at any given time, as opposed to every Spark worker running an
 Avro RPC server to receive messages. Is this the case? Our use-case would
 benefit from balancing the load across Workers because of our volume of
 messages. We would be using a load balancer in front of the Spark workers
 running the Avro RPC servers, essentially round-robinning the messages
 across all of them.

 If this is something that is currently not supported, I'd be interested in
 contributing to the code to make it happen.

 - Christophe




-- 
Michael Ernest
Sr. Solutions Consultant
West Coast


Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Michael Ernest
I don't see why not. If one were doing something similar with straight
Flume, you'd start an agent on each node where you want to receive Avro/RPC
events. In the absence of clearer insight into your use case, I'm puzzling
just a little over why it's necessary for each Worker to be its own receiver,
but there's no real objection or concern to fuel the puzzlement, just
curiosity.


On Mon, Apr 7, 2014 at 4:16 PM, Christophe Clapp
christo...@christophe.cc wrote:

 Could it be as simple as just changing FlumeUtils to accept a list of
 host/port number pairs to start the RPC servers on?



 On 4/7/14, 12:58 PM, Christophe Clapp wrote:

 Based on the source code here:
 https://github.com/apache/spark/blob/master/external/
 flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala

 It looks like in its current version, FlumeUtils does not support
 starting an Avro RPC server on more than one worker.

 - Christophe

 On 4/7/14, 12:23 PM, Michael Ernest wrote:

 You can configure your sinks to write to one or more Avro sources in a
 load-balanced configuration.

 https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors

 mfe


 On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp
  christo...@christophe.cc wrote:

  Hi,

  From my testing of Spark Streaming with Flume, it seems that there's
 only
 one of the Spark worker nodes that runs a Flume Avro RPC server to
 receive
 messages at any given time, as opposed to every Spark worker running an
 Avro RPC server to receive messages. Is this the case? Our use-case
 would
 benefit from balancing the load across Workers because of our volume of
 messages. We would be using a load balancer in front of the Spark
 workers
 running the Avro RPC servers, essentially round-robinning the messages
 across all of them.

 If this is something that is currently not supported, I'd be interested
 in
 contributing to the code to make it happen.

 - Christophe








-- 
Michael Ernest
Sr. Solutions Consultant
West Coast


Re: Spark Streaming and Flume Avro RPC Servers

2014-04-07 Thread Christophe Clapp
Cool. I'll look at making the code change in FlumeUtils and generating a
pull request.

As far as the use case: the volume of messages we have is currently about
30 MB per second, which may grow beyond what a 1 Gbit network adapter can
handle.

- Christophe
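
For what it's worth, the consuming side would be a union over several
receivers; a sketch under the current API, with placeholder hosts and ports
(each createStream call starts its own receiver, though where Spark places
each receiver is up to the scheduler):

    import org.apache.spark.SparkConf
    import org.apache.spark.streaming.{Seconds, StreamingContext}
    import org.apache.spark.streaming.flume.FlumeUtils

    val conf = new SparkConf().setAppName("flume-multi-receiver")
    val ssc = new StreamingContext(conf, Seconds(2))

    // One Avro RPC endpoint per (host, port); hosts/ports are placeholders.
    val endpoints = Seq(("worker1", 41414), ("worker2", 41414), ("worker3", 41414))
    val streams = endpoints.map { case (host, port) =>
      FlumeUtils.createStream(ssc, host, port)
    }
    val events = ssc.union(streams)
    events.count().print()

    ssc.start()
    ssc.awaitTermination()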
On Apr 7, 2014 1:51 PM, Michael Ernest mfern...@cloudera.com wrote:

 I don't see why not. If one were doing something similar with straight
 Flume, you'd start an agent on each node you care to receive Avro/RPC
 events. In the absence of clearer insight to your use case, I'm puzzling
 just a little why it's necessary for each Worker to be its own receiver,
 but there's no real objection or concern to fuel the puzzlement, just
 curiosity.


 On Mon, Apr 7, 2014 at 4:16 PM, Christophe Clapp
  christo...@christophe.cc wrote:

  Could it be as simple as just changing FlumeUtils to accept a list of
  host/port number pairs to start the RPC servers on?
 
 
 
  On 4/7/14, 12:58 PM, Christophe Clapp wrote:
 
  Based on the source code here:
  https://github.com/apache/spark/blob/master/external/
  flume/src/main/scala/org/apache/spark/streaming/flume/FlumeUtils.scala
 
  It looks like in its current version, FlumeUtils does not support
  starting an Avro RPC server on more than one worker.
 
  - Christophe
 
  On 4/7/14, 12:23 PM, Michael Ernest wrote:
 
  You can configure your sinks to write to one or more Avro sources in a
  load-balanced configuration.
 
  https://flume.apache.org/FlumeUserGuide.html#flume-sink-processors
 
  mfe
 
 
  On Mon, Apr 7, 2014 at 3:19 PM, Christophe Clapp
   christo...@christophe.cc wrote:
 
    [quoted message elided; identical to Christophe's original message quoted
    earlier in this thread]
 
 
 
 
 
 


 --
 Michael Ernest
 Sr. Solutions Consultant
 West Coast



Re: MLLib - Thoughts about refactoring Updater for LBFGS?

2014-04-07 Thread DB Tsai
Hi guys,

The latest PR uses Breeze's L-BFGS implementation, which was introduced by
Xiangrui's sparse input format work in SPARK-1212.

https://github.com/apache/spark/pull/353

Now, it works with the new sparse framework!

Any feedback would be greatly appreciated.

Thanks.

Sincerely,

DB Tsai
---
My Blog: https://www.dbtsai.com
LinkedIn: https://www.linkedin.com/in/dbtsai
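
For anyone following along, the Breeze pieces involved look roughly like this
(a sketch against Breeze 0.7 with a toy objective, not the Spark PR itself):

    import breeze.linalg.DenseVector
    import breeze.optimize.{CachedDiffFunction, DiffFunction, LBFGS}

    // Toy objective f(x) = sum_i (x_i - 3)^2, with gradient 2 * (x - 3).
    val f = new DiffFunction[DenseVector[Double]] {
      def calculate(x: DenseVector[Double]): (Double, DenseVector[Double]) = {
        val d = x - 3.0
        (d dot d, d * 2.0)
      }
    }

    val lbfgs = new LBFGS[DenseVector[Double]](maxIter = 100, m = 10, tolerance = 1e-6)
    // CachedDiffFunction memoizes the last (value, gradient) evaluation,
    // which avoids the redundant computations discussed in this thread.
    val x0 = DenseVector.zeros[Double](5)
    val xOpt = lbfgs.minimize(new CachedDiffFunction(f), x0)
    println(xOpt) // approx. DenseVector(3.0, 3.0, 3.0, 3.0, 3.0)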


On Thu, Apr 3, 2014 at 5:02 PM, DB Tsai dbt...@alpinenow.com wrote:
 -- Forwarded message --
 From: David Hall d...@cs.berkeley.edu
 Date: Sat, Mar 15, 2014 at 10:02 AM
 Subject: Re: MLLib - Thoughts about refactoring Updater for LBFGS?
 To: DB Tsai dbt...@alpinenow.com


 On Fri, Mar 7, 2014 at 10:56 PM, DB Tsai dbt...@alpinenow.com wrote:

 Hi David,

 Please let me know the version of Breeze that LBFGS can be serialized,
 and CachedDiffFunction is built-in in LBFGS once you finish. I'll
 update the PR to Spark from using RISO implementation to Breeze
 implementation.


 The current master (0.7-SNAPSHOT) has these problems fixed.



 Thanks.

 Sincerely,

 DB Tsai
 Machine Learning Engineer
 Alpine Data Labs
 --
 Web: http://alpinenow.com/


 On Thu, Mar 6, 2014 at 4:26 PM, David Hall d...@cs.berkeley.edu wrote:
  On Thu, Mar 6, 2014 at 4:21 PM, DB Tsai dbt...@alpinenow.com wrote:
 
  Hi David,
 
  I can converge to the same result with your breeze LBFGS and Fortran
  implementations now. Probably, I made some mistakes when I tried
  breeze before. I apologize that I claimed it's not stable.
 
  See the test case in BreezeLBFGSSuite.scala
  https://github.com/AlpineNow/spark/tree/dbtsai-breezeLBFGS
 
  This is training multinomial logistic regression against iris dataset,
  and both optimizers can train the models with 98% training accuracy.
 
 
  great to hear! There were some bugs in LBFGS about 6 months ago, so
  depending on the last time you tried it, it might indeed have been
  bugged.
 
 
 
  There are two issues to use Breeze in Spark,
 
  1) When the gradientSum and lossSum are computed distributively in
  custom defined DiffFunction which will be passed into your optimizer,
  Spark will complain LBFGS class is not serializable. In
  BreezeLBFGS.scala, I've to convert RDD to array to make it work
  locally. It should be easy to fix by just having LBFGS to implement
  Serializable.
 
 
  I'm not sure why Spark should be serializing LBFGS? Shouldn't it live on
  the controller node? Or is this a per-node thing?
 
  But no problem to make it serializable.
 
 
 
  2) Breeze computes redundant gradient and loss. See the following log
  from both Fortran and Breeze implementations.
 
 
  Err, yeah. I should probably have LBFGS do this automatically, but
  there's
  a CachedDiffFunction that gets rid of the redundant calculations.
 
  -- David
 
 
 
  Thanks.
 
  Fortran:
  Iteration -1: loss 1.3862943611198926, diff 1.0
  Iteration 0: loss 1.5846343143210866, diff 0.14307193024217352
  Iteration 1: loss 1.1242501524477688, diff 0.29053004039012126
  Iteration 2: loss 1.0930151243303563, diff 0.027782962952189336
  Iteration 3: loss 1.054036932835569, diff 0.03566113127440601
  Iteration 4: loss 0.9907956302751622, diff 0.0507649459571
  Iteration 5: loss 0.9184205380342829, diff 0.07304737423337761
  Iteration 6: loss 0.8259870936519937, diff 0.10064381175132982
  Iteration 7: loss 0.6327447552109574, diff 0.23395293458364716
  Iteration 8: loss 0.5534101162436359, diff 0.1253815427665277
  Iteration 9: loss 0.4045020086612566, diff 0.26907321376758075
  Iteration 10: loss 0.3078824990823728, diff 0.23885980452569627
 
  Breeze:
  Iteration -1: loss 1.3862943611198926, diff 1.0
  Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS clinit
  WARNING: Failed to load implementation from:
  com.github.fommil.netlib.NativeSystemBLAS
  Mar 6, 2014 3:59:11 PM com.github.fommil.netlib.BLAS clinit
  WARNING: Failed to load implementation from:
  com.github.fommil.netlib.NativeRefBLAS
  Iteration 0: loss 1.3862943611198926, diff 0.0
  Iteration 1: loss 1.5846343143210866, diff 0.14307193024217352
  Iteration 2: loss 1.1242501524477688, diff 0.29053004039012126
  Iteration 3: loss 1.1242501524477688, diff 0.0
  Iteration 4: loss 1.1242501524477688, diff 0.0
  Iteration 5: loss 1.0930151243303563, diff 0.027782962952189336
  Iteration 6: loss 1.0930151243303563, diff 0.0
  Iteration 7: loss 1.0930151243303563, diff 0.0
  Iteration 8: loss 1.054036932835569, diff 0.03566113127440601
  Iteration 9: loss 1.054036932835569, diff 0.0
  Iteration 10: loss 1.054036932835569, diff 0.0
  Iteration 11: loss 0.9907956302751622, diff 0.0507649459571
  Iteration 12: loss 0.9907956302751622, diff 0.0
  Iteration 13: loss 0.9907956302751622, diff 0.0
  Iteration 14: loss 0.9184205380342829, diff 0.07304737423337761
  Iteration 15: loss 0.9184205380342829, diff 0.0
  Iteration 16: loss 0.9184205380342829, diff 0.0
  Iteration 17: loss 

Re: Contributing to Spark

2014-04-07 Thread Mukesh G
Hi Sujeet,

Thanks. I went through the website and it looks great. Is there a list of
items that I can choose from, for contribution?

Thanks

Mukesh


On Mon, Apr 7, 2014 at 10:14 PM, Sujeet Varakhedi
svarakh...@gopivotal.com wrote:

 This is a good place to start:
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark

 Sujeet


 On Mon, Apr 7, 2014 at 9:20 AM, Mukesh G muk...@gmail.com wrote:

  Hi,
 
  How do I contribute to Spark and its associated projects?
 
  Appreciate the help...
 
  Thanks
 
  Mukesh
 



Re: Contributing to Spark

2014-04-07 Thread Matei Zaharia
I’d suggest looking for the issues labeled “Starter” on JIRA. You can find them 
here: 
https://issues.apache.org/jira/browse/SPARK-1438?jql=project%20%3D%20SPARK%20AND%20labels%20%3D%20Starter%20AND%20status%20in%20(Open%2C%20%22In%20Progress%22%2C%20Reopened)

Matei

On Apr 7, 2014, at 9:45 PM, Mukesh G muk...@gmail.com wrote:

 Hi Sujeet,
 
 Thanks. I went through the website and it looks great. Is there a list of
 items that I can choose from, for contribution?
 
 Thanks
 
 Mukesh
 
 
 On Mon, Apr 7, 2014 at 10:14 PM, Sujeet Varakhedi
 svarakh...@gopivotal.com wrote:
 
 This is a good place to start:
 https://cwiki.apache.org/confluence/display/SPARK/Contributing+to+Spark
 
 Sujeet
 
 
 On Mon, Apr 7, 2014 at 9:20 AM, Mukesh G muk...@gmail.com wrote:
 
 Hi,
 
   How do I contribute to Spark and its associated projects?
 
 Appreciate the help...
 
 Thanks
 
 Mukesh