Hi,
I want to contribute to MLlib, but I can't get its tests working. I've
found three ways of running the tests on the command line; I only want
to execute the MLlib tests.
1. via the dev/run-tests script
This script executes all the tests and takes several hours to finish.
Some tests failed, but I can't tell which ones. Should it really take
that long, and can I restrict it to the MLlib tests only?
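Next time I will capture the output so that I can at least grep for the
failing suites afterwards (plain shell, nothing Spark-specific):

./dev/run-tests 2>&1 | tee run-tests.log
grep "FAILED" run-tests.log    # ScalaTest marks failing suites with *** FAILED ***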
2. directly via Maven
I did the following, as described in the docs [0]:
export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package
mvn -Pyarn -Phadoop-2.3 -Phive test
This also doesn't work.
Why do I have to package Spark before running the tests?
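If I read the Maven docs correctly, the standard -pl (build only the
listed module) and -am (also build its dependencies) flags should let me
restrict both steps to MLlib; note that this is my own assumption, not
something from the Spark docs:

# assumes the MLlib module id matches its mllib/ directory
mvn -Pyarn -Phadoop-2.3 -Phive -DskipTests -pl mllib -am clean package
mvn -Pyarn -Phadoop-2.3 -Phive -pl mllib test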
3. via sbt
I tried the following: I freshly cloned Spark, checked out the tag
v1.1.0-rc4, and ran
sbt/sbt "project mllib" test
and got the following exception in several cluster tests:
[info] - task size should be small in both training and prediction *** FAILED ***
[info] org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
[info] at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
[info] at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
[info] at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
[info] at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
[info] at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
[info] at scala.Option.foreach(Option.scala:236)
[info] at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)
Summary:
[error] Failed: Total 223, Failed 12, Errors 0, Passed 211
[error] Failed tests:
[error] org.apache.spark.mllib.clustering.KMeansClusterSuite
[error] org.apache.spark.mllib.classification.LogisticRegressionClusterSuite
[error] org.apache.spark.mllib.optimization.GradientDescentClusterSuite
[error] org.apache.spark.mllib.classification.SVMClusterSuite
[error] org.apache.spark.mllib.linalg.distributed.RowMatrixClusterSuite
[error] org.apache.spark.mllib.regression.LinearRegressionClusterSuite
[error] org.apache.spark.mllib.classification.NaiveBayesClusterSuite
[error] org.apache.spark.mllib.regression.LassoClusterSuite
[error] org.apache.spark.mllib.regression.RidgeRegressionClusterSuite
[error] org.apache.spark.mllib.optimization.LBFGSClusterSuite
[error] (mllib/test:test) sbt.TestsFailedException: Tests unsuccessful
[error] Total time: 661 s, completed 28.10.2014 17:13:10
sbt/sbt "project mllib" test  761,74s user 22,86s system 109% cpu 11:59,57 total
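To narrow this down, I assume sbt's standard test-only task also works
in Spark's multi-project build, e.g. to run just one of the failing
suites:

# hypothetical invocation: run a single suite instead of the whole module
sbt/sbt "mllib/test-only org.apache.spark.mllib.clustering.KMeansClusterSuite"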
I tried several slightly different variants, but I can't get the tests
to pass. I also observed that the tests run __very__ slowly in some
configurations: the CPU nearly idles and RAM usage stays low.
Am I doing something fundamentally wrong? After many hours of trial and
error I'm stuck, and the long build and test durations make it difficult
to investigate. Hopefully someone can give me a hint.
Which is the right way to flexibly run the tests of the individual
sub-projects?
Thanks,
Niklas
[0] https://spark.apache.org/docs/latest/building-with-maven.html