Hi, I want to contribute to the MLlib library, but I can't get the tests working. I've found three ways of running the tests on the command line, and I just want to execute the MLlib tests.
1. Via the dev/run-tests script

This script executes all tests and takes several hours to finish. Some tests failed, but I can't tell which ones. Should this really take that long? Can I tell the script to run only the MLlib tests?

2. Directly via Maven

I did the following, as described in the docs [0]:

  export MAVEN_OPTS="-Xmx2g -XX:MaxPermSize=512M -XX:ReservedCodeCacheSize=512m"
  mvn -Pyarn -Phadoop-2.3 -DskipTests -Phive clean package
  mvn -Pyarn -Phadoop-2.3 -Phive test

This doesn't work either. Why do I have to package Spark before running the tests?

3. Via sbt

I freshly cloned Spark, checked out the tag v1.1.0-rc4, and ran:

  sbt/sbt "project mllib" test

Several cluster tests fail with the following exception:

  [info] - task size should be small in both training and prediction *** FAILED ***
  [info]   org.apache.spark.SparkException: Job aborted due to stage failure: Master removed our application: FAILED
  [info]   at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1185)
  [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1174)
  [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1173)
  [info]   at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
  [info]   at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
  [info]   at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1173)
  [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
  [info]   at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:688)
  [info]   at scala.Option.foreach(Option.scala:236)
  [info]   at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:688)

Summary:

  [error] Failed: Total 223, Failed 12, Errors 0, Passed 211
  [error] Failed tests:
  [error]   org.apache.spark.mllib.clustering.KMeansClusterSuite
  [error]   org.apache.spark.mllib.classification.LogisticRegressionClusterSuite
  [error]   org.apache.spark.mllib.optimization.GradientDescentClusterSuite
  [error]   org.apache.spark.mllib.classification.SVMClusterSuite
  [error]   org.apache.spark.mllib.linalg.distributed.RowMatrixClusterSuite
  [error]   org.apache.spark.mllib.regression.LinearRegressionClusterSuite
  [error]   org.apache.spark.mllib.classification.NaiveBayesClusterSuite
  [error]   org.apache.spark.mllib.regression.LassoClusterSuite
  [error]   org.apache.spark.mllib.regression.RidgeRegressionClusterSuite
  [error]   org.apache.spark.mllib.optimization.LBFGSClusterSuite
  [error] (mllib/test:test) sbt.TestsFailedException: Tests unsuccessful
  [error] Total time: 661 s, completed 28.10.2014 17:13:10

  sbt/sbt "project mllib" test  761,74s user 22,86s system 109% cpu 11:59,57 total

I have tried several slightly different variants, but I can't get the tests to pass. In some configurations the tests run __very__ slowly: the CPU nearly idles and RAM usage is low. Am I doing something fundamentally wrong? After many hours of trial and error I'm stuck, and the long build and test durations make it difficult to investigate. Hopefully someone can give me a hint: which is the right way to flexibly run the tests of the individual sub-projects? (Two variants I would try next are sketched below.)
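For Maven, here is the module-scoped invocation I would try next. The -pl/-am flags are standard Maven reactor options; wildcardSuites comes from the scalatest-maven-plugin, and I'm assuming, without having verified it, that Spark's pom passes it through. The suite name is only an example:

  # build only the mllib module (plus the modules it depends on) and run its tests
  mvn -Pyarn -Phadoop-2.3 -Phive -pl mllib -am test

  # same, but restricted to a single ScalaTest suite
  # (assumes the scalatest-maven-plugin honours -DwildcardSuites here)
  mvn -Pyarn -Phadoop-2.3 -Phive -pl mllib -am -DwildcardSuites=org.apache.spark.mllib.clustering.KMeansSuite test

Is that the intended workflow, or does it still require the clean package step first?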
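Similarly, for sbt I would expect ScalaTest's test-only task (sbt 0.13 syntax) to restrict the run to a single suite; again, KMeansSuite is only an example:

  # run a single suite inside the mllib sub-project
  sbt/sbt "project mllib" "test-only org.apache.spark.mllib.clustering.KMeansSuite"

If that is supposed to work, the failing cluster suites above would at least be easy to leave out while debugging.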
Thanks,
Niklas

[0] https://spark.apache.org/docs/latest/building-with-maven.html