That is mostly the YARN overhead. You're starting up a container for the AM and the executors, at least. That still sounds pretty slow, but the defaults aren't tuned for fast startup. A couple of rough sketches at the end of this message expand on this and on your second question.

On May 11, 2015 7:00 PM, "Su She" <suhsheka...@gmail.com> wrote:
> Got it to work on the cluster by changing the master to yarn-cluster
> instead of local! I do have a couple of follow-up questions...
>
> This is the example I was trying to run:
> https://github.com/holdenk/learning-spark-examples/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala
>
> 1) The example still takes about 1 min 15 seconds to run (my cluster
> has 3 m3.large nodes). This seems really long for building a model
> from data that is about 10 lines long. Is this normal?
>
> 2) Any guesses as to why it was able to run on the cluster, but not
> locally?
>
> Thanks for the help!
>
>
> On Mon, Apr 27, 2015 at 11:48 AM, Su She <suhsheka...@gmail.com> wrote:
> > Hello Xiangrui,
> >
> > I am using this spark-submit command (as I do for all other jobs):
> >
> > /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/bin/spark-submit
> > --class MLlib --master local[2] --jars $(echo
> > /home/ec2-user/sparkApps/learning-spark/lib/*.jar | tr ' ' ',')
> > /home/ec2-user/sparkApps/learning-spark/target/simple-project-1.1.jar
> >
> > Thank you for the help!
> >
> > Best,
> >
> > Su
> >
> >
> > On Mon, Apr 27, 2015 at 9:58 AM, Xiangrui Meng <men...@gmail.com> wrote:
> >> How did you run the example app? Did you use spark-submit? -Xiangrui
> >>
> >> On Thu, Apr 23, 2015 at 2:27 PM, Su She <suhsheka...@gmail.com> wrote:
> >>> Sorry, accidentally sent the last email before finishing.
> >>>
> >>> I had asked this question before, but wanted to ask again, as I think
> >>> it is now related to my pom file or project setup. I really appreciate the help!
> >>>
> >>> I have been trying on and off for the past month to run this MLlib example:
> >>> https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala
> >>>
> >>> I am able to build the project successfully. When I run it, it prints:
> >>>
> >>> features in spam: 8
> >>> features in ham: 7
> >>>
> >>> and then freezes. According to the UI, the description of the job is
> >>> "count at DataValidators.scala:38", which corresponds to this line in
> >>> the code:
> >>>
> >>> val model = lrLearner.run(trainingData)
> >>>
> >>> I've tried just about everything I can think of: changed numFeatures
> >>> from 1 to 10,000, set executor memory to 1g, set up a new cluster. At
> >>> this point I think I might be missing dependencies, as that has
> >>> usually been the problem in the other Spark apps I have tried to run.
> >>> This is my pom file, which I have used for other successful Spark apps.
> >>> Please let me know if you think I need any additional dependencies,
> >>> whether there are incompatibility issues, or whether there is a better
> >>> pom.xml to use. Thank you!
> >>>
> >>> Cluster information:
> >>>
> >>> Spark version: 1.2.0-SNAPSHOT (on my older cluster it is 1.2.0)
> >>> Java version: 1.7.0_25
> >>> Scala version: 2.10.4
> >>> Hadoop version: 2.5.0-cdh5.3.3 (older cluster was 5.3.0)
> >>>
> >>> <project xmlns="http://maven.apache.org/POM/4.0.0"
> >>>          xmlns:xsi="http://w3.org/2001/XMLSchema-instance"
> >>>          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
> >>>          http://maven.apache.org/maven-v4_0_0.xsd">
> >>>   <groupId>edu.berkely</groupId>
> >>>   <artifactId>simple-project</artifactId>
> >>>   <modelVersion>4.0.0</modelVersion>
> >>>   <name>Simple Project</name>
> >>>   <packaging>jar</packaging>
> >>>   <version>1.0</version>
> >>>
> >>>   <repositories>
> >>>     <repository>
> >>>       <id>cloudera</id>
> >>>       <url>http://repository.cloudera.com/artifactory/cloudera-repos/</url>
> >>>     </repository>
> >>>     <repository>
> >>>       <id>scala-tools.org</id>
> >>>       <name>Scala-tools Maven2 Repository</name>
> >>>       <url>http://scala-tools.org/repo-releases</url>
> >>>     </repository>
> >>>   </repositories>
> >>>
> >>>   <pluginRepositories>
> >>>     <pluginRepository>
> >>>       <id>scala-tools.org</id>
> >>>       <name>Scala-tools Maven2 Repository</name>
> >>>       <url>http://scala-tools.org/repo-releases</url>
> >>>     </pluginRepository>
> >>>   </pluginRepositories>
> >>>
> >>>   <build>
> >>>     <plugins>
> >>>       <plugin>
> >>>         <groupId>org.scala-tools</groupId>
> >>>         <artifactId>maven-scala-plugin</artifactId>
> >>>         <executions>
> >>>           <execution>
> >>>             <id>compile</id>
> >>>             <goals>
> >>>               <goal>compile</goal>
> >>>             </goals>
> >>>             <phase>compile</phase>
> >>>           </execution>
> >>>           <execution>
> >>>             <id>test-compile</id>
> >>>             <goals>
> >>>               <goal>testCompile</goal>
> >>>             </goals>
> >>>             <phase>test-compile</phase>
> >>>           </execution>
> >>>           <execution>
> >>>             <phase>process-resources</phase>
> >>>             <goals>
> >>>               <goal>compile</goal>
> >>>             </goals>
> >>>           </execution>
> >>>         </executions>
> >>>       </plugin>
> >>>       <plugin>
> >>>         <artifactId>maven-compiler-plugin</artifactId>
> >>>         <configuration>
> >>>           <source>1.7</source>
> >>>           <target>1.7</target>
> >>>         </configuration>
> >>>       </plugin>
> >>>     </plugins>
> >>>   </build>
> >>>
> >>>   <dependencies>
> >>>     <dependency> <!-- Spark dependency -->
> >>>       <groupId>org.apache.spark</groupId>
> >>>       <artifactId>spark-core_2.10</artifactId>
> >>>       <version>1.2.0-cdh5.3.0</version>
> >>>     </dependency>
> >>>
> >>>     <dependency>
> >>>       <groupId>org.apache.hadoop</groupId>
> >>>       <artifactId>hadoop-client</artifactId>
> >>>       <version>2.5.0-mr1-cdh5.3.0</version>
> >>>     </dependency>
> >>>
> >>>     <dependency>
> >>>       <groupId>org.scala-lang</groupId>
> >>>       <artifactId>scala-library</artifactId>
> >>>       <version>2.10.4</version>
> >>>     </dependency>
> >>>
> >>>     <dependency>
> >>>       <groupId>org.scala-lang</groupId>
> >>>       <artifactId>scala-compiler</artifactId>
> >>>       <version>2.10.4</version>
> >>>     </dependency>
> >>>
> >>>     <dependency>
> >>>       <groupId>com.101tec</groupId>
> >>>       <artifactId>zkclient</artifactId>
> >>>       <version>0.3</version>
> >>>     </dependency>
> >>>
> >>>     <dependency>
> >>>       <groupId>com.yammer.metrics</groupId>
> >>>       <artifactId>metrics-core</artifactId>
> >>>       <version>2.2.0</version>
> >>>     </dependency>
> >>>
> >>>     <dependency>
> >>>       <groupId>org.apache.hadoop</groupId>
> >>>       <artifactId>hadoop-yarn-server-web-proxy</artifactId>
> >>>       <version>2.5.0</version>
> >>>     </dependency>
> >>>
> >>>     <dependency>
> >>>       <groupId>org.apache.thrift</groupId>
> >>>       <artifactId>libthrift</artifactId>
> >>>       <version>0.9.2</version>
> >>>     </dependency>
> >>>
> >>>     <dependency>
> >>>       <groupId>com.google.guava</groupId>
> >>>       <artifactId>guava</artifactId>
> >>>       <version>18.0</version>
> >>>     </dependency>
> >>>
> >>>     <dependency>
> >>>       <groupId>junit</groupId>
> >>>       <artifactId>junit</artifactId>
> >>>       <version>3.8.1</version>
> >>>       <scope>test</scope>
> >>>     </dependency>
> >>>
> >>>     <dependency>
> >>>       <groupId>org.apache.spark</groupId>
> >>>       <artifactId>spark-mllib_2.10</artifactId>
> >>>       <version>1.2.0</version>
> >>>     </dependency>
> >>>
> >>>     <dependency>
> >>>       <groupId>org.scalanlp</groupId>
> >>>       <artifactId>breeze-math_2.10</artifactId>
> >>>       <version>0.4</version>
> >>>     </dependency>
> >>>
> >>>     <dependency>
> >>>       <groupId>com.googlecode.netlib-java</groupId>
> >>>       <artifactId>netlib-java</artifactId>
> >>>       <version>1.0</version>
> >>>     </dependency>
> >>>
> >>>     <dependency>
> >>>       <groupId>org.jblas</groupId>
> >>>       <artifactId>jblas</artifactId>
> >>>       <version>1.2.3</version>
> >>>     </dependency>
> >>>
> >>>   </dependencies>
> >>>
> >>> </project>
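
On question 1: a quick way to separate the YARN/container startup cost from the training itself is to time just the run() call inside the job. On a ten-line dataset the fit should account for a tiny fraction of the 1 min 15 s. (The "count at DataValidators.scala:38" stage mentioned earlier in the thread is, as far as I can tell, just MLlib checking that every label is 0 or 1 before the SGD iterations start.) Below is a self-contained sketch in the spirit of the book example, not the book's exact code; the object name, toy corpus, and numFeatures value are made up for illustration:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

object TimedSpamHam {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("TimedSpamHam"))

    // Toy spam/ham corpus standing in for the book's spam.txt / ham.txt files.
    val spam = sc.parallelize(Seq("cheap pills now", "win a free prize today"))
    val ham  = sc.parallelize(Seq("meeting moved to noon", "see the attached report"))

    // Hash each message's words into a fixed-size term-frequency vector.
    val tf = new HashingTF(numFeatures = 10000)
    val positive = spam.map(m => LabeledPoint(1.0, tf.transform(m.split(" "))))
    val negative = ham.map(m => LabeledPoint(0.0, tf.transform(m.split(" "))))
    val trainingData = positive.union(negative).cache()

    // Time only the model fit, so container/executor startup is excluded.
    val start = System.nanoTime()
    val model = new LogisticRegressionWithSGD().run(trainingData)
    val elapsedMs = (System.nanoTime() - start) / 1e6
    println(s"model fit took $elapsedMs ms")
    println("prediction for 'free prize': " + model.predict(tf.transform("free prize".split(" "))))

    sc.stop()
  }
}

If the printed fit time is small while the overall job still takes over a minute, the difference is the scheduling and container-startup overhead described at the top of this message.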
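
On question 2, it is hard to say without the driver logs, but one cheap thing to rule out is a master hard-coded in the application's SparkConf: properties set directly on the SparkConf take precedence over spark-submit flags, so a stray setMaster() can make a job behave differently under --master local[2] than under yarn-cluster. A minimal sketch (the object name and the trivial job are made up) of an app that leaves the master entirely to spark-submit:

import org.apache.spark.{SparkConf, SparkContext}

object SubmitFriendlyApp {
  def main(args: Array[String]): Unit = {
    // No setMaster() here: the --master flag passed to spark-submit
    // (local[2], yarn-cluster, ...) decides where the job runs.
    val conf = new SparkConf().setAppName("SubmitFriendlyApp")
    val sc = new SparkContext(conf)

    // Trivial job just to prove the context came up.
    println(sc.parallelize(1 to 100).count())

    sc.stop()
  }
}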