[jira] [Comment Edited] (SPARK-3803) ArrayIndexOutOfBoundsException found in executing computePrincipalComponents

2014-10-07 Thread Masaru Dobashi (JIRA)

[ https://issues.apache.org/jira/browse/SPARK-3803?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14161982#comment-14161982 ]

Masaru Dobashi edited comment on SPARK-3803 at 10/7/14 3:14 PM:


Thank you for your comments.
I agree with the idea of throwing an exception.

Exiting with an appropriate exception and message seems kinder to users of MLlib:
it helps them recognize which part of their application they need to fix.

How about using sys.error() to throw a RuntimeException, in the same way that
empty rows are handled?
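
For illustration, here is a minimal sketch of such a guard (hypothetical code, not
the actual RowMatrix implementation; the column-count limit shown assumes the
Gramian is stored as a packed upper-triangular Array[Double], so n * (n + 1) / 2
must fit in an Int):

{code}
// Hypothetical pre-check before building the Gramian matrix in
// RowMatrix.computePrincipalComponents (sketch only).
val n = numCols().toInt
// Fail fast with a clear message instead of an ArrayIndexOutOfBoundsException
// deep inside dspr when the matrix has too many columns.
if (n > 65535) {
  sys.error(s"Cannot compute principal components for a matrix with $n columns: " +
    "the Gramian matrix would be too large. Reduce the dimensionality first, " +
    "for example by using a smaller numFeatures in HashingTF.")
}
{code}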


> ArrayIndexOutOfBoundsException found in executing computePrincipalComponents
>
> Key: SPARK-3803
> URL: https://issues.apache.org/jira/browse/SPARK-3803
> Project: Spark
> Issue Type: Bug
> Components: MLlib
> Affects Versions: 1.1.0
> Reporter: Masaru Dobashi
>
> When I executed the computePrincipalComponents method of RowMatrix, I got a
> java.lang.ArrayIndexOutOfBoundsException.
> {code}
> 14/10/05 20:16:31 INFO DAGScheduler: Failed to run reduce at RDDFunctions.scala:111
> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 31.0 failed 1 times, most recent failure: Lost task 0.0 in stage 31.0 (TID 611, localhost): java.lang.ArrayIndexOutOfBoundsException: 4878161
>         org.apache.spark.mllib.linalg.distributed.RowMatrix$.org$apache$spark$mllib$linalg$distributed$RowMatrix$$dspr(RowMatrix.scala:460)
>         org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:114)
>         org.apache.spark.mllib.linalg.distributed.RowMatrix$$anonfun$3.apply(RowMatrix.scala:113)
>         scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>         scala.collection.TraversableOnce$$anonfun$foldLeft$1.apply(TraversableOnce.scala:144)
>         scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         scala.collection.TraversableOnce$class.foldLeft(TraversableOnce.scala:144)
>         scala.collection.AbstractIterator.foldLeft(Iterator.scala:1157)
>         scala.collection.TraversableOnce$class.aggregate(TraversableOnce.scala:201)
>         scala.collection.AbstractIterator.aggregate(Iterator.scala:1157)
>         org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
>         org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$4.apply(RDDFunctions.scala:99)
>         org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
>         org.apache.spark.mllib.rdd.RDDFunctions$$anonfun$5.apply(RDDFunctions.scala:100)
>         org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>         org.apache.spark.rdd.RDD$$anonfun$13.apply(RDD.scala:596)
>         org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
>         org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
>         org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
>         org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68)
>         org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
>         org.apache.spark.scheduler.Task.run(Task.scala:54)
>         org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
>         java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
>         java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
>         java.lang.Thread.run(Thread.java:745)
> {code}
> The RowMatrix instance was generated from the result of TF-IDF as follows.
> {code}
> scala> import org.apache.spark.mllib.feature.{HashingTF, IDF}
> scala> import org.apache.spark.mllib.linalg.Vector
> scala> import org.apache.spark.mllib.linalg.distributed.RowMatrix
> scala> import org.apache.spark.rdd.RDD
> scala> // texts is an RDD[Seq[String]] of tokenized documents, prepared beforehand
> scala> val hashingTF = new HashingTF()
> scala> val tf = hashingTF.transform(texts)
> scala> tf.cache()
> scala> val idf = new IDF().fit(tf)
> scala> val tfidf: RDD[Vector] = idf.transform(tf)
> scala> val mat = new RowMatrix(tfidf)
> scala> val pc = mat.computePrincipalComponents(2)
> {code}
> I think this was because I created HashingTF instance with defa
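
As a workaround sketch (assuming the cause is that the HashingTF instance was created
with the default dimensionality of 2^20 features, which makes the Gramian matrix far
too large for PCA; the smaller numFeatures value below is only an example):

{code}
scala> import org.apache.spark.mllib.feature.HashingTF
scala> // Use a much smaller feature space so that the resulting RowMatrix has few
scala> // enough columns for the dense Gramian computed by computePrincipalComponents.
scala> val hashingTF = new HashingTF(numFeatures = 10000)
scala> val tf = hashingTF.transform(texts)
{code}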
