Got it to work on the cluster by changing the master to yarn-cluster instead of local! I do have a couple of follow-up questions...
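For reference, the working invocation was essentially the command from my earlier message below, with only the master changed (assuming everything else stayed the same):

```shell
# Same spark-submit as before (paths from my earlier message), with
# --master changed from local[2] to yarn-cluster:
/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/bin/spark-submit \
  --class MLlib \
  --master yarn-cluster \
  --jars $(echo /home/ec2-user/sparkApps/learning-spark/lib/*.jar | tr ' ' ',') \
  /home/ec2-user/sparkApps/learning-spark/target/simple-project-1.1.jar
```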
This is the example I was trying to run:
https://github.com/holdenk/learning-spark-examples/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala

1) The example still takes about 1 minute 15 seconds to run (my cluster has 3 m3.large nodes). This seems really long for building a model from data that is about 10 lines long. Is this normal?

2) Any guesses as to why it was able to run on the cluster, but not locally?

Thanks for the help!

On Mon, Apr 27, 2015 at 11:48 AM, Su She <suhsheka...@gmail.com> wrote:
> Hello Xiangrui,
>
> I am using this spark-submit command (as I do for all other jobs):
>
> /opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/bin/spark-submit \
>   --class MLlib --master local[2] \
>   --jars $(echo /home/ec2-user/sparkApps/learning-spark/lib/*.jar | tr ' ' ',') \
>   /home/ec2-user/sparkApps/learning-spark/target/simple-project-1.1.jar
>
> Thank you for the help!
>
> Best,
>
> Su
>
>
> On Mon, Apr 27, 2015 at 9:58 AM, Xiangrui Meng <men...@gmail.com> wrote:
>> How did you run the example app? Did you use spark-submit? -Xiangrui
>>
>> On Thu, Apr 23, 2015 at 2:27 PM, Su She <suhsheka...@gmail.com> wrote:
>>> Sorry, I accidentally sent the last email before finishing.
>>>
>>> I had asked this question before, but wanted to ask again as I think
>>> it is now related to my pom file or project setup. I really appreciate
>>> the help!
>>>
>>> I have been trying on and off for the past month to run this MLlib
>>> example:
>>> https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala
>>>
>>> I am able to build the project successfully. When I run it, it returns:
>>>
>>> features in spam: 8
>>> features in ham: 7
>>>
>>> and then freezes. According to the UI, the description of the job is
>>> "count at DataValidators.scala:38".
>>> This corresponds to this line in the code:
>>>
>>> val model = lrLearner.run(trainingData)
>>>
>>> I've tried just about everything I can think of: changed numFeatures
>>> from 1 to 10,000, set executor memory to 1g, set up a new cluster. At
>>> this point I think I might have missed dependencies, as that has
>>> usually been the problem in the other Spark apps I have tried to run.
>>> This is my pom file, which I have used for other successful Spark apps.
>>> Please let me know if you think I need any additional dependencies,
>>> whether there are incompatibility issues, or if there is a better
>>> pom.xml to use. Thank you!
>>>
>>> Cluster information:
>>>
>>> Spark version: 1.2.0-SNAPSHOT (on my older cluster it is 1.2.0)
>>> Java version: 1.7.0_25
>>> Scala version: 2.10.4
>>> Hadoop version: 2.5.0-cdh5.3.3 (older cluster was 5.3.0)
>>>
>>> <project xmlns="http://maven.apache.org/POM/4.0.0"
>>>          xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
>>>          xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
>>>                              http://maven.apache.org/maven-v4_0_0.xsd">
>>>   <modelVersion>4.0.0</modelVersion>
>>>   <groupId>edu.berkely</groupId>
>>>   <artifactId>simple-project</artifactId>
>>>   <name>Simple Project</name>
>>>   <packaging>jar</packaging>
>>>   <version>1.0</version>
>>>
>>>   <repositories>
>>>     <repository>
>>>       <id>cloudera</id>
>>>       <url>http://repository.cloudera.com/artifactory/cloudera-repos/</url>
>>>     </repository>
>>>     <repository>
>>>       <id>scala-tools.org</id>
>>>       <name>Scala-tools Maven2 Repository</name>
>>>       <url>http://scala-tools.org/repo-releases</url>
>>>     </repository>
>>>   </repositories>
>>>
>>>   <pluginRepositories>
>>>     <pluginRepository>
>>>       <id>scala-tools.org</id>
>>>       <name>Scala-tools Maven2 Repository</name>
>>>       <url>http://scala-tools.org/repo-releases</url>
>>>     </pluginRepository>
>>>   </pluginRepositories>
>>>
>>>   <build>
>>>     <plugins>
>>>       <plugin>
>>>         <groupId>org.scala-tools</groupId>
>>>         <artifactId>maven-scala-plugin</artifactId>
>>>         <executions>
>>>           <execution>
>>>             <id>compile</id>
>>>             <goals>
>>>               <goal>compile</goal>
>>>             </goals>
>>>             <phase>compile</phase>
>>>           </execution>
>>>           <execution>
>>>             <id>test-compile</id>
>>>             <goals>
>>>               <goal>testCompile</goal>
>>>             </goals>
>>>             <phase>test-compile</phase>
>>>           </execution>
>>>           <execution>
>>>             <phase>process-resources</phase>
>>>             <goals>
>>>               <goal>compile</goal>
>>>             </goals>
>>>           </execution>
>>>         </executions>
>>>       </plugin>
>>>       <plugin>
>>>         <artifactId>maven-compiler-plugin</artifactId>
>>>         <configuration>
>>>           <source>1.7</source>
>>>           <target>1.7</target>
>>>         </configuration>
>>>       </plugin>
>>>     </plugins>
>>>   </build>
>>>
>>>   <dependencies>
>>>     <dependency> <!-- Spark dependency -->
>>>       <groupId>org.apache.spark</groupId>
>>>       <artifactId>spark-core_2.10</artifactId>
>>>       <version>1.2.0-cdh5.3.0</version>
>>>     </dependency>
>>>     <dependency>
>>>       <groupId>org.apache.hadoop</groupId>
>>>       <artifactId>hadoop-client</artifactId>
>>>       <version>2.5.0-mr1-cdh5.3.0</version>
>>>     </dependency>
>>>     <dependency>
>>>       <groupId>org.scala-lang</groupId>
>>>       <artifactId>scala-library</artifactId>
>>>       <version>2.10.4</version>
>>>     </dependency>
>>>     <dependency>
>>>       <groupId>org.scala-lang</groupId>
>>>       <artifactId>scala-compiler</artifactId>
>>>       <version>2.10.4</version>
>>>     </dependency>
>>>     <dependency>
>>>       <groupId>com.101tec</groupId>
>>>       <artifactId>zkclient</artifactId>
>>>       <version>0.3</version>
>>>     </dependency>
>>>     <dependency>
>>>       <groupId>com.yammer.metrics</groupId>
>>>       <artifactId>metrics-core</artifactId>
>>>       <version>2.2.0</version>
>>>     </dependency>
>>>     <dependency>
>>>       <groupId>org.apache.hadoop</groupId>
>>>       <artifactId>hadoop-yarn-server-web-proxy</artifactId>
>>>       <version>2.5.0</version>
>>>     </dependency>
>>>     <dependency>
>>>       <groupId>org.apache.thrift</groupId>
>>>       <artifactId>libthrift</artifactId>
>>>       <version>0.9.2</version>
>>>     </dependency>
>>>     <dependency>
>>>       <groupId>com.google.guava</groupId>
>>>       <artifactId>guava</artifactId>
>>>       <version>18.0</version>
>>>     </dependency>
>>>     <dependency>
>>>       <groupId>junit</groupId>
>>>       <artifactId>junit</artifactId>
>>>       <version>3.8.1</version>
>>>       <scope>test</scope>
>>>     </dependency>
>>>     <dependency>
>>>       <groupId>org.apache.spark</groupId>
>>>       <artifactId>spark-mllib_2.10</artifactId>
>>>       <version>1.2.0</version>
>>>     </dependency>
>>>     <dependency>
>>>       <groupId>org.scalanlp</groupId>
>>>       <artifactId>breeze-math_2.10</artifactId>
>>>       <version>0.4</version>
>>>     </dependency>
>>>     <dependency>
>>>       <groupId>com.googlecode.netlib-java</groupId>
>>>       <artifactId>netlib-java</artifactId>
>>>       <version>1.0</version>
>>>     </dependency>
>>>     <dependency>
>>>       <groupId>org.jblas</groupId>
>>>       <artifactId>jblas</artifactId>
>>>       <version>1.2.3</version>
>>>     </dependency>
>>>   </dependencies>
>>> </project>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> For additional commands, e-mail: user-h...@spark.apache.org