Re: Getting error running MLlib example with new cluster

2015-05-11 Thread Su She
Got it to work on the cluster by changing the master to yarn-cluster
instead of local! I do have a couple follow up questions...

This is the example I was trying to run:
https://github.com/holdenk/learning-spark-examples/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala

1) The example still takes about 1 minute 15 seconds to run (my cluster
has 3 m3.large nodes). This seems really long for building a model
based on data that is about 10 lines long. Is this normal?

2) Any guesses as to why it was able to run in the cluster, but not locally?

Thanks for the help!
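[For anyone hitting the same hang: the fix described above amounts to changing only
the --master flag in the spark-submit command quoted elsewhere in this thread. A
sketch of the working invocation, using the same paths as the original command:]

```shell
# Same invocation as before, but submitting to YARN instead of a local master.
# In yarn-cluster mode the driver runs inside the YARN ApplicationMaster.
/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/bin/spark-submit \
  --class MLlib \
  --master yarn-cluster \
  --jars $(echo /home/ec2-user/sparkApps/learning-spark/lib/*.jar | tr ' ' ',') \
  /home/ec2-user/sparkApps/learning-spark/target/simple-project-1.1.jar
```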


On Mon, Apr 27, 2015 at 11:48 AM, Su She suhsheka...@gmail.com wrote:
 [quoted earlier messages and pom.xml snipped; the original messages appear later in this archive]

Re: Getting error running MLlib example with new cluster

2015-05-11 Thread Sean Owen
That is mostly the YARN overhead. You're starting up a container for the AM
and executors, at least. That still sounds pretty slow, but the defaults
aren't tuned for fast startup.
On May 11, 2015 7:00 PM, Su She suhsheka...@gmail.com wrote:

 [quoted message snipped; duplicated at the top of this thread]

Re: Getting error running MLlib example with new cluster

2015-04-27 Thread Xiangrui Meng
How did you run the example app? Did you use spark-submit? -Xiangrui

On Thu, Apr 23, 2015 at 2:27 PM, Su She suhsheka...@gmail.com wrote:
 [quoted Apr 23 message and pom.xml snipped; the full message appears later in this archive]

Re: Getting error running MLlib example with new cluster

2015-04-27 Thread Su She
Hello Xiangrui,

I am using this spark-submit command (as I do for all other jobs):

/opt/cloudera/parcels/CDH-5.3.0-1.cdh5.3.0.p0.30/lib/spark/bin/spark-submit
--class MLlib --master local[2] --jars $(echo
/home/ec2-user/sparkApps/learning-spark/lib/*.jar | tr ' ' ',')
/home/ec2-user/sparkApps/learning-spark/target/simple-project-1.1.jar

Thank you for the help!

Best,

Su


On Mon, Apr 27, 2015 at 9:58 AM, Xiangrui Meng men...@gmail.com wrote:
 How did you run the example app? Did you use spark-submit? -Xiangrui

 On Thu, Apr 23, 2015 at 2:27 PM, Su She suhsheka...@gmail.com wrote:
 [quoted Apr 23 message and pom.xml snipped; the full message appears later in this archive]

Getting error running MLlib example with new cluster

2015-04-23 Thread Su She
I had asked this question before, but wanted to ask again as I think
it is related to my pom file or project setup.

I have been trying on/off for the past month to try to run this MLlib example:

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Getting error running MLlib example with new cluster

2015-04-23 Thread Su She
Sorry, accidentally sent the last email before finishing.

I had asked this question before, but wanted to ask again as I think
it is now related to my pom file or project setup. I really appreciate the help!

I have been trying on and off for the past month to run this MLlib example:
https://github.com/databricks/learning-spark/blob/master/src/main/scala/com/oreilly/learningsparkexamples/scala/MLlib.scala

I am able to build the project successfully. When I run it, it returns:

features in spam: 8
features in ham: 7

and then freezes. According to the UI, the job's description is
"count at DataValidators.scala:38", which corresponds to this line in
the code:

val model = lrLearner.run(trainingData)
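[For context, a heavily abbreviated sketch of what the linked MLlib.scala example
does up to that line; this is reconstructed from the example, so variable names
like `spam` and `ham` are approximate:]

```scala
import org.apache.spark.mllib.classification.LogisticRegressionWithSGD
import org.apache.spark.mllib.feature.HashingTF
import org.apache.spark.mllib.regression.LabeledPoint

// Featurize each message as a hashed term-frequency vector,
// label spam as 1.0 and ham as 0.0, then train with SGD.
val tf = new HashingTF(numFeatures = 10000)
val spamFeatures = spam.map(email => tf.transform(email.split(" ")))
val hamFeatures  = ham.map(email => tf.transform(email.split(" ")))

val positiveExamples = spamFeatures.map(f => LabeledPoint(1.0, f))
val negativeExamples = hamFeatures.map(f => LabeledPoint(0.0, f))
val trainingData = positiveExamples.union(negativeExamples)
trainingData.cache() // SGD is iterative, so cache the training RDD

val lrLearner = new LogisticRegressionWithSGD()
val model = lrLearner.run(trainingData)
```

[The count inside DataValidators is likely the first action that forces the
featurization above to actually execute, which would explain why the job appears
to hang at that line rather than at an earlier one.]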

I've tried just about everything I can think of: changed numFeatures
from 1 to 10,000, set executor memory to 1g, and set up a new cluster.
At this point I think I might be missing dependencies, as that has
usually been the problem in other Spark apps I have tried to run. Below
is my pom file, which I have used for other successful Spark apps.
Please let me know if you think I need any additional dependencies,
if there are incompatibility issues, or if there is a better pom.xml to use.
Thank you!

Cluster information:

Spark version: 1.2.0-SNAPSHOT (on my older cluster it is 1.2.0)
Java version: 1.7.0_25
Scala version: 2.10.4
Hadoop version: 2.5.0-cdh5.3.3 (older cluster was 5.3.0)



<project xmlns="http://maven.apache.org/POM/4.0.0"
         xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
         xsi:schemaLocation="http://maven.apache.org/POM/4.0.0
         http://maven.apache.org/maven-v4_0_0.xsd">
    <modelVersion>4.0.0</modelVersion>
    <groupId>edu.berkely</groupId>
    <artifactId>simple-project</artifactId>
    <name>Simple Project</name>
    <packaging>jar</packaging>
    <version>1.0</version>

    <repositories>
        <repository>
            <id>cloudera</id>
            <url>http://repository.cloudera.com/artifactory/cloudera-repos/</url>
        </repository>
        <repository>
            <id>scala-tools.org</id>
            <name>Scala-tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </repository>
    </repositories>

    <pluginRepositories>
        <pluginRepository>
            <id>scala-tools.org</id>
            <name>Scala-tools Maven2 Repository</name>
            <url>http://scala-tools.org/repo-releases</url>
        </pluginRepository>
    </pluginRepositories>

    <build>
        <plugins>
            <plugin>
                <groupId>org.scala-tools</groupId>
                <artifactId>maven-scala-plugin</artifactId>
                <executions>
                    <execution>
                        <id>compile</id>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                        <phase>compile</phase>
                    </execution>
                    <execution>
                        <id>test-compile</id>
                        <goals>
                            <goal>testCompile</goal>
                        </goals>
                        <phase>test-compile</phase>
                    </execution>
                    <execution>
                        <phase>process-resources</phase>
                        <goals>
                            <goal>compile</goal>
                        </goals>
                    </execution>
                </executions>
            </plugin>
            <plugin>
                <artifactId>maven-compiler-plugin</artifactId>
                <configuration>
                    <source>1.7</source>
                    <target>1.7</target>
                </configuration>
            </plugin>
        </plugins>
    </build>

    <dependencies>
        <dependency> <!-- Spark dependency -->
            <groupId>org.apache.spark</groupId>
            <artifactId>spark-core_2.10</artifactId>
            <version>1.2.0-cdh5.3.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>
            <artifactId>hadoop-client</artifactId>
            <version>2.5.0-mr1-cdh5.3.0</version>
        </dependency>

        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-library</artifactId>
            <version>2.10.4</version>
        </dependency>

        <dependency>
            <groupId>org.scala-lang</groupId>
            <artifactId>scala-compiler</artifactId>
            <version>2.10.4</version>
        </dependency>

        <dependency>
            <groupId>com.101tec</groupId>
            <artifactId>zkclient</artifactId>
            <version>0.3</version>
        </dependency>

        <dependency>
            <groupId>com.yammer.metrics</groupId>
            <artifactId>metrics-core</artifactId>
            <version>2.2.0</version>
        </dependency>

        <dependency>
            <groupId>org.apache.hadoop</groupId>