Hi,

I want to contribute to the MLlib library, but I can't get the tests
working. I've found three ways of running the tests on the command line.
I just want to execute the MLlib tests.

1. via the dev/run-tests script
    This script executes all tests and takes several hours to finish.
Some tests failed, but I can't tell which ones. Should this really take
that long? Can I specify to run only the MLlib tests?
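
For now I can only think of capturing the full output and grepping it
for ScalaTest's failure marker afterwards (plain shell, nothing
Spark-specific, so I assume this is not the intended workflow):

./dev/run-tests 2>&1 | tee run-tests.log
grep '\*\*\* FAILED \*\*\*' run-tests.log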

2. directly via maven
I did the following, as described in the docs [0].

export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package
mvn -Pyarn -Phadoop-2.3 -Phive test

This also doesn't work.
Why do I have to package Spark before running the tests?
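
I would have expected Maven's standard -pl/--projects flag to restrict
the run to a single module (assuming the MLlib module is simply called
mllib in the build), e.g.:

mvn -Pyarn -Phadoop-2.3 -Phive -pl mllib test

Is that the supported way, or does it fail because of unresolved
inter-module dependencies?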

3. via sbt
I tried the following. I freshly cloned Spark and checked out the tag
v1.1.0-rc4.

sbt/sbt "project mllib" test

and got the following exception in several of the cluster suites.

[info] - task size should be small in both training and prediction *** FAILED ***
[info]   org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
[info]   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
[info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
[info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
[info]   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
[info]   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
[info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
[info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
[info]   at scala.Option.foreach(Option.scala:236)
[info]   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)

Summary:

[error] Failed: Total 223, Failed 12, Errors 0, Passed 211
[error] Failed tests:
[error]         org.apache.spark.mllib.clustering.KMeansClusterSuite
[error]         org.apache.spark.mllib.classification.LogisticRegressionClusterSuite
[error]         org.apache.spark.mllib.optimization.GradientDescentClusterSuite
[error]         org.apache.spark.mllib.classification.SVMClusterSuite
[error]         org.apache.spark.mllib.linalg.distributed.RowMatrixClusterSuite
[error]         org.apache.spark.mllib.regression.LinearRegressionClusterSuite
[error]         org.apache.spark.mllib.classification.NaiveBayesClusterSuite
[error]         org.apache.spark.mllib.regression.LassoClusterSuite
[error]         org.apache.spark.mllib.regression.RidgeRegressionClusterSuite
[error]         org.apache.spark.mllib.optimization.LBFGSClusterSuite
[error] (mllib/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 661 s, completed 28.10.2014 17:13:10
sbt/sbt "project mllib" test  761,74s user 22,86s system 109% cpu 11:59,57 total
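
My guess is that these *ClusterSuite tests spin up a local cluster and
therefore need the assembly jar, so maybe the assembly has to be built
first (just a guess on my part):

sbt/sbt assembly
sbt/sbt "project mllib" test

Can someone confirm whether the assembly is a prerequisite for the
cluster suites?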

I tried several slightly different approaches, but I can't get the tests
to pass. I also observed that the tests run __very__ slowly in some
configurations: the CPU is nearly idle and the RAM usage is low.

Am I doing something fundamentally wrong? After many hours of trial and
error I'm stuck, and the long build and test durations make it difficult
to investigate. Hopefully someone can give me a hint.
Which is the right way to flexibly run the tests of the individual
subprojects?
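
Ideally I could also run a single suite. I assume sbt's test-only task
can do this (picking KMeansSuite as an arbitrary example):

sbt/sbt "project mllib" "test-only org.apache.spark.mllib.clustering.KMeansSuite"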

Thanks,
Niklas


[0] https://spark.apache.org/docs/latest/building-with-maven.html
