Standalone Spark, How to find (driver's ) final status for an application

2019-09-25 Thread Nilkanth Patel
I am setting up *Spark 2.2.0 in standalone mode* (
https://spark.apache.org/docs/latest/spark-standalone.html) and submitting
spark jobs programmatically using

 SparkLauncher sparkAppLauncher = new SparkLauncher(userNameMap)
     .setMaster(sparkMaster)
     .setAppName(appName)
     ...;   // ... other setters omitted
 SparkAppHandle sparkAppHandle = sparkAppLauncher.startApplication();
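
 As an aside, the SparkAppHandle returned by startApplication() also reports
 application state through a listener; whether it reflects the driver failure
 in this setup would need checking, but a minimal sketch (reusing the
 sparkAppLauncher above, nothing else assumed) looks like:

 import org.apache.spark.launcher.SparkAppHandle;

 // Watch the handle until it reaches a final state (FINISHED, FAILED, KILLED, LOST)
 SparkAppHandle handle = sparkAppLauncher.startApplication(new SparkAppHandle.Listener() {
     @Override
     public void stateChanged(SparkAppHandle h) {
         if (h.getState().isFinal()) {
             System.out.println("App " + h.getAppId() + " ended in state " + h.getState());
         }
     }
     @Override
     public void infoChanged(SparkAppHandle h) { /* app id becomes available here */ }
 });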

 I also have a *Java client program that polls job status for the jobs
submitted programmatically*, for which I am using the following REST endpoint:
 curl http://192.168.1.139:8080/json/
which provides a JSON response like the following,

{
  "url" : "spark://192.168.1.139:7077",
  "workers" : [ {
      "id" : "x", "host" : "x", "port" : x, "webuiaddress" : "x",
      "cores" : x, "coresused" : x, "coresfree" : x,
      "memory" : xx, "memoryused" : xx, "memoryfree" : xx,
      "state" : "x", "lastheartbeat" : x
    }, { ... } ],
  "cores" : x,
  "coresused" : x,
  "memory" : x,
  "memoryused" : x,
  "activeapps" : [ ],
  "completedapps" : [ {
      "starttime" : x, "id" : "app-xx-", "name" : "abc", "user" : "xx",
      "memoryperslave" : x, "submitdate" : "x",
      "state" : "FINISHED OR RUNNING", "duration" : x
    }, { ... } ],
  "activedrivers" : [ ],
  "status" : "x"
}


In the above response, I have observed that the state for *completedapps is
always FINISHED even if the application fails*, while on the UI
(http://master:8080) the associated driver shows a FAILED state
(screenshots of the Master UI omitted here).

Referring to the above example, my Java client currently gets the status
FINISHED for the application (app-20190925115750-0003), even though it failed
(encountered an exception) and the associated driver shows a "FAILED" state.
*I intend to report the final status in this case as FAILED.*
It seems that if I could correlate an application ID (app-20190925115750-0003)
with its driver ID (driver-20190925115748-0003), I could report a "FAILED"
(final) status, but I could not find any correlation between them
(appID --> driverID).

*Looking forward to your suggestions for resolving this, or any possible
approaches to achieve it.* I have also come across some hidden REST APIs like
http://xx.xx.xx.xx:6066/v1/submissions/status/driver-20190925115748-0003, which
seem to return only limited information in the response.
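
For example, querying that endpoint returns roughly a response of this shape
(a sketch; the exact fields may differ, and the values shown are placeholders
patterned on the IDs above):

 curl http://xx.xx.xx.xx:6066/v1/submissions/status/driver-20190925115748-0003

 {
   "action" : "SubmissionStatusResponse",
   "driverState" : "FAILED",
   "serverSparkVersion" : "2.2.0",
   "submissionId" : "driver-20190925115748-0003",
   "success" : true,
   "workerHostPort" : "xx.xx.xx.xx:xxxx",
   "workerId" : "worker-xx"
 }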


Thanks in advance.
Nilkanth.


Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Mu Kong
Thanks for your prompt responses!

@Steve

I actually put my keytabs on all the nodes already, and I used them to
kinit on each server.

But how can I make Spark use my keytab and principal when I start the
cluster or submit the job? Or is there a way to let Spark use the ticket
cache on each node?

I tried --keytab and --principal when I submit the job and still got the same
error. I guess those options are for YARN only.

On Fri, Jun 23, 2017 at 18:50 Steve Loughran <ste...@hortonworks.com> wrote:

> On 23 Jun 2017, at 10:22, Saisai Shao <sai.sai.s...@gmail.com> wrote:
>
> Spark running with standalone cluster manager currently doesn't support
> accessing security Hadoop. Basically the problem is that standalone mode
> Spark doesn't have the facility to distribute delegation tokens.
>
> Currently only Spark on YARN or local mode supports security Hadoop.
>
> Thanks
> Jerry
>
>
> There's possibly an ugly workaround where you ssh in to every node and log
> in directly to your KDC using a keytab you pushed out... that would eliminate
> the need for anything related to Hadoop tokens. After all, that's
> essentially what Spark-on-YARN does when you give it a keytab.
>
>
> see also:
> https://www.gitbook.com/book/steveloughran/kerberos_and_hadoop/details
>
> On Fri, Jun 23, 2017 at 5:10 PM, Mu Kong <kong.mu@gmail.com> wrote:
>
>> Hi, all!
>>
>> I was trying to read from a Kerberosed hadoop cluster from a standalone
>> spark cluster.
>> Right now, I encountered some authentication issues with Kerberos:
>>
>>
>> java.io.IOException: Failed on local exception: java.io.IOException: 
>> org.apache.hadoop.security.AccessControlException: Client cannot 
>> authenticate via:[TOKEN, KERBEROS]; Host Details : local host is: 
>> ""; destination host is: XXX;
>>
>>
>>
>> I checked with klist, and principle/realm is correct.
>> I also used hdfs command line to poke HDFS from all the nodes, and it
>> worked.
>> And if I submit job using local(client) mode, the job worked fine.
>>
>> I tried to put everything from hadoop/conf to spark/conf and hive/conf to
>> spark/conf.
>> Also tried edit spark/conf/spark-env.sh to add
>> SPARK_SUBMIT_OPTS/SPARK_MASTER_OPTS/SPARK_SLAVE_OPTS/HADOOP_CONF_DIR/HIVE_CONF_DIR,
>> and tried to export them in .bashrc as well.
>>
>> However, I'm still experiencing the same exception.
>>
>> Then I read some concerning posts about problems with
>> kerberosed hadoop, some post like the following one:
>> http://blog.stratio.com/spark-kerberos-safe-story/
>> , which indicates that we can not access to kerberosed hdfs using
>> standalone spark cluster.
>>
>> I'm using spark 2.1.1, is it still the case that we can't access
>> kerberosed hdfs with 2.1.1?
>>
>> Thanks!
>>
>>
>> Best regards,
>> Mu
>>
>>
>


Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Steve Loughran

On 23 Jun 2017, at 10:22, Saisai Shao 
<sai.sai.s...@gmail.com<mailto:sai.sai.s...@gmail.com>> wrote:

Spark running with standalone cluster manager currently doesn't support 
accessing security Hadoop. Basically the problem is that standalone mode Spark 
doesn't have the facility to distribute delegation tokens.

Currently only Spark on YARN or local mode supports security Hadoop.

Thanks
Jerry


There's possibly an ugly workaround where you ssh in to every node and log in 
directly to your KDC using a keytab you pushed out... that would eliminate the 
need for anything related to Hadoop tokens. After all, that's essentially what 
Spark-on-YARN does when you give it a keytab.


see also:  
https://www.gitbook.com/book/steveloughran/kerberos_and_hadoop/details
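
A rough sketch of that workaround, purely illustrative (the host list, keytab 
path and principal below are placeholders):

  for host in $(cat conf/slaves); do
    scp /path/to/app.keytab "$host":/etc/security/keytabs/app.keytab
    ssh "$host" 'kinit -kt /etc/security/keytabs/app.keytab app@EXAMPLE.COM'
  done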

On Fri, Jun 23, 2017 at 5:10 PM, Mu Kong 
<kong.mu@gmail.com<mailto:kong.mu@gmail.com>> wrote:
Hi, all!

I was trying to read from a Kerberosed hadoop cluster from a standalone spark 
cluster.
Right now, I encountered some authentication issues with Kerberos:



java.io.IOException: Failed on local exception: java.io.IOException: 
org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
via:[TOKEN, KERBEROS]; Host Details : local host is: ""; 
destination host is: XXX;


I checked with klist, and principle/realm is correct.
I also used hdfs command line to poke HDFS from all the nodes, and it worked.
And if I submit job using local(client) mode, the job worked fine.

I tried to put everything from hadoop/conf to spark/conf and hive/conf to 
spark/conf.
Also tried edit spark/conf/spark-env.sh to add 
SPARK_SUBMIT_OPTS/SPARK_MASTER_OPTS/SPARK_SLAVE_OPTS/HADOOP_CONF_DIR/HIVE_CONF_DIR,
 and tried to export them in .bashrc as well.

However, I'm still experiencing the same exception.

Then I read some concerning posts about problems with kerberosed hadoop, some 
post like the following one:
http://blog.stratio.com/spark-kerberos-safe-story/
, which indicates that we can not access to kerberosed hdfs using standalone 
spark cluster.

I'm using spark 2.1.1, is it still the case that we can't access kerberosed 
hdfs with 2.1.1?

Thanks!


Best regards,
Mu





Re: Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Saisai Shao
Spark running with the standalone cluster manager currently doesn't support
accessing secure (Kerberized) Hadoop. Basically, the problem is that
standalone-mode Spark doesn't have the facility to distribute delegation tokens.

Currently only Spark on YARN or local mode supports secure Hadoop.

Thanks
Jerry

On Fri, Jun 23, 2017 at 5:10 PM, Mu Kong <kong.mu@gmail.com> wrote:

> Hi, all!
>
> I was trying to read from a Kerberosed hadoop cluster from a standalone
> spark cluster.
> Right now, I encountered some authentication issues with Kerberos:
>
>
> java.io.IOException: Failed on local exception: java.io.IOException: 
> org.apache.hadoop.security.AccessControlException: Client cannot authenticate 
> via:[TOKEN, KERBEROS]; Host Details : local host is: ""; 
> destination host is: XXX;
>
>
>
> I checked with klist, and principle/realm is correct.
> I also used hdfs command line to poke HDFS from all the nodes, and it
> worked.
> And if I submit job using local(client) mode, the job worked fine.
>
> I tried to put everything from hadoop/conf to spark/conf and hive/conf to
> spark/conf.
> Also tried edit spark/conf/spark-env.sh to add SPARK_SUBMIT_OPTS/SPARK_
> MASTER_OPTS/SPARK_SLAVE_OPTS/HADOOP_CONF_DIR/HIVE_CONF_DIR, and tried to
> export them in .bashrc as well.
>
> However, I'm still experiencing the same exception.
>
> Then I read some concerning posts about problems with
> kerberosed hadoop, some post like the following one:
> http://blog.stratio.com/spark-kerberos-safe-story/
> , which indicates that we can not access to kerberosed hdfs using
> standalone spark cluster.
>
> I'm using spark 2.1.1, is it still the case that we can't access
> kerberosed hdfs with 2.1.1?
>
> Thanks!
>
>
> Best regards,
> Mu
>
>


Question about standalone Spark cluster reading from Kerberosed hadoop

2017-06-23 Thread Mu Kong
Hi, all!

I was trying to read from a Kerberosed hadoop cluster from a standalone
spark cluster.
Right now, I encountered some authentication issues with Kerberos:


java.io.IOException: Failed on local exception: java.io.IOException:
org.apache.hadoop.security.AccessControlException: Client cannot
authenticate via:[TOKEN, KERBEROS]; Host Details : local host is:
""; destination host is: XXX;



I checked with klist, and the principal/realm is correct.
I also used the hdfs command line to poke HDFS from all the nodes, and it
worked.
And if I submit the job in local (client) mode, the job works fine.

I tried copying everything from hadoop/conf and hive/conf into spark/conf.
I also tried editing spark/conf/spark-env.sh to add
SPARK_SUBMIT_OPTS/SPARK_MASTER_OPTS/SPARK_SLAVE_OPTS/HADOOP_CONF_DIR/HIVE_CONF_DIR,
and tried to export them in .bashrc as well.

However, I'm still experiencing the same exception.

Then I read some posts about problems with Kerberized Hadoop, such as the
following one:
http://blog.stratio.com/spark-kerberos-safe-story/
which indicates that we cannot access Kerberized HDFS from a standalone
Spark cluster.

I'm using Spark 2.1.1; is it still the case that we can't access Kerberized
HDFS with 2.1.1?

Thanks!


Best regards,
Mu


Native libraries using only one core in standalone spark cluster

2016-09-26 Thread guangweiyu
Hi,

I'm trying to run a Spark job that uses multiple CPU cores per executor.
Specifically, it runs the gemm matrix-multiply routine from each partition on
a large matrix that cannot be distributed.

For test purposes, I have a machine with 8 cores running standalone Spark. I
started a Spark context, setting "spark.task.cpus" to "8"; then I generated
an RDD with only 1 partition so there will be one executor using all cores.

The job is coded in Java, with the JNI wrapper provided by fommil (netlib-java)
and the underlying BLAS implementation from OpenBLAS. The machine I'm running
on is a local desktop with an Intel(R) Core(TM) i7-4770K CPU @ 3.50GHz
(8 cores).
When I run the test using local Spark as "local[8]", I can see the routine
completes in about 200ms, and CPU utilization is near 100% for all cores.
This is nearly identical performance to running the same code straight from
Java without Spark.

When I run the test attached to the standalone Spark cluster by setting the
master as "spark://:7077", the same code takes about 12 seconds, and monitoring
the CPU shows that only one thread is used at a time. This is also very close
to the performance I get if I run the routine in Java with only one core.

I do not see any warning about a failure to load the native library, and if I
collect a map of System.getenv(), I see that all the environment variables
seem to be correct (OPENBLAS_NUM_THREADS=8, LD_LIBRARY_PATH includes the
wrapper, etc.).

I also tried replacing OpenBLAS with MKL, with MKL_NUM_THREADS=8 and
MKL_DYNAMIC=false, but I got exactly the same behaviour: local Spark seems to
use all cores, but standalone Spark would not.

I have tried a lot of different settings on the native library's side, but it
seems weird that local Spark is fine while standalone Spark is not.
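
For reference, the equivalent settings expressed as a spark-submit sketch
(the master URL, class and jar names are placeholders; spark.executorEnv.* is
the standard way to push environment variables to the executors):

  spark-submit \
    --master spark://master-host:7077 \
    --class com.example.GemmTest \
    --conf spark.task.cpus=8 \
    --conf spark.executorEnv.OPENBLAS_NUM_THREADS=8 \
    --conf spark.executorEnv.LD_LIBRARY_PATH=/opt/openblas/lib \
    gemm-test.jar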

Any help is greatly appreciated!

Guang






Re: Building standalone spark application via sbt

2016-07-20 Thread Sachin Mittal
I got the error during run time. It was for mongo-spark-connector class
files.
My build.sbt is like this

name := "Test Advice Project"
version := "1.0"
scalaVersion := "2.10.6"
libraryDependencies ++= Seq(
"org.mongodb.spark" %% "mongo-spark-connector" % "1.0.0",
"org.apache.spark" %% "spark-core" % "1.6.1" % "provided",
"org.apache.spark" %% "spark-sql" % "1.6.1" % "provided"
)

assemblyMergeStrategy in assembly := {
   case PathList("META-INF", xs @ _*) => MergeStrategy.discard
   case x => MergeStrategy.first
}

and I create the fat jar using sbt assembly.
I have also added addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.3")
in ~/.sbt/0.13/plugins/plugins.sbt for sbt assembly to work.

I think that when you mark a dependency as "provided", its jars are not
included in the fat jar.

I am using spark 1.6
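
For reference, the build-and-submit flow is roughly the following (a sketch;
the assembly jar name and main class are placeholders):

  sbt assembly
  spark-submit --class com.example.Main \
    target/scala-2.10/test-advice-project-assembly-1.0.jar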

Thanks
Sachin



On Wed, Jul 20, 2016 at 11:23 PM, Marco Mistroni <mmistr...@gmail.com>
wrote:

>  that will work but ideally you should not include any of the
> spark-releated jars as they are provided to you by the spark environment
> whenever you launch your app via spark-submit (this will prevent unexpected
> errors e.g. when you kick off your app using a different version of spark
> where some of the classes has been renamd or movedaround -tbh i don't think
> this is a case that happen often)
>
> Btw did you get the NoClassDefFoundException at compile time or run
> time?if at run time, what is your Spark Version  and what is the spark
> libraries version you used in your sbt?
> are you using a Spark version pre 1.4?
>
> kr
>  marco
>
>
>
>
>
>
> On Wed, Jul 20, 2016 at 6:13 PM, Sachin Mittal <sjmit...@gmail.com> wrote:
>
>> NoClassDefFound error was for spark classes like say SparkConext.
>> When running a standalone spark application I was not passing external
>> jars using --jars option.
>>
>> However I have fixed this by making a fat jar using sbt assembly plugin.
>>
>> Now all the dependencies are included in that jar and I use that jar in
>> spark-submit
>>
>> Thanks
>> Sachin
>>
>>
>> On Wed, Jul 20, 2016 at 9:42 PM, Marco Mistroni <mmistr...@gmail.com>
>> wrote:
>>
>>> Hello Sachin
>>>   pls paste the NoClassDefFound Exception so we can see what's failing,
>>> aslo please advise how are you running your Spark App
>>> For an extremely simple case, let's assume  you have your
>>> MyFirstSparkApp packaged in your   myFirstSparkApp.jar
>>> Then all you need to do would be to kick off
>>>
>>> spark-submit --class MyFirstSparkApp   myFirstSparkApp.jar
>>>
>>> if you have any external dependencies (not spark , let's assume you are
>>> using common-utils.jar) then you should be able to kick it off via
>>>
>>> spark-submit --class MyFirstSparkApp --jars common-utiils.jar
>>> myFirstSparkApp.jar
>>>
>>> I paste below the build.sbt i am using for my SparkExamples apps, hope
>>> this helps.
>>> kr
>>>  marco
>>>
>>> name := "SparkExamples"
>>>
>>> version := "1.0"
>>>
>>> scalaVersion := "2.10.5"
>>>
>>>
>>> // Add a single dependency
>>> libraryDependencies += "junit" % "junit" % "4.8" % "test"
>>> libraryDependencies += "org.mockito" % "mockito-core" % "1.9.5"
>>> libraryDependencies ++= Seq("org.slf4j" % "slf4j-api" % "1.7.5",
>>> "org.slf4j" % "slf4j-simple" % "1.7.5",
>>> "org.clapper" %% "grizzled-slf4j" % "1.0.2")
>>> libraryDependencies += "org.powermock" %
>>> "powermock-mockito-release-full" % "1.5.4" % "test"
>>> libraryDependencies += "org.apache.spark" %% "spark-core"   % "1.6.1" %
>>> "provided"
>>> libraryDependencies += "org.apache.spark" %% "spark-streaming"   %
>>> "1.6.1" % "provided"
>>> libraryDependencies += "org.apache.spark" %% "spark-mllib"   % "1.6.1"
>>> % "provided"
>>> libraryDependencies += "org.apache.spark" %% "spark-streaming-flume"   %
>>> "1.3.0"  % "provided"
>>> resolvers 

Re: Building standalone spark application via sbt

2016-07-20 Thread Marco Mistroni
 that will work, but ideally you should not include any of the
spark-related jars, as they are provided to you by the Spark environment
whenever you launch your app via spark-submit (this will prevent unexpected
errors, e.g. when you kick off your app using a different version of Spark
where some of the classes have been renamed or moved around - tbh I don't
think this is a case that happens often).

Btw, did you get the NoClassDefFoundException at compile time or run time? If
at run time, what is your Spark version and what is the Spark libraries
version you used in your sbt?
Are you using a Spark version pre 1.4?

kr
 marco






On Wed, Jul 20, 2016 at 6:13 PM, Sachin Mittal <sjmit...@gmail.com> wrote:

> NoClassDefFound error was for spark classes like say SparkConext.
> When running a standalone spark application I was not passing external
> jars using --jars option.
>
> However I have fixed this by making a fat jar using sbt assembly plugin.
>
> Now all the dependencies are included in that jar and I use that jar in
> spark-submit
>
> Thanks
> Sachin
>
>
> On Wed, Jul 20, 2016 at 9:42 PM, Marco Mistroni <mmistr...@gmail.com>
> wrote:
>
>> Hello Sachin
>>   pls paste the NoClassDefFound Exception so we can see what's failing,
>> aslo please advise how are you running your Spark App
>> For an extremely simple case, let's assume  you have your
>> MyFirstSparkApp packaged in your   myFirstSparkApp.jar
>> Then all you need to do would be to kick off
>>
>> spark-submit --class MyFirstSparkApp   myFirstSparkApp.jar
>>
>> if you have any external dependencies (not spark , let's assume you are
>> using common-utils.jar) then you should be able to kick it off via
>>
>> spark-submit --class MyFirstSparkApp --jars common-utiils.jar
>> myFirstSparkApp.jar
>>
>> I paste below the build.sbt i am using for my SparkExamples apps, hope
>> this helps.
>> kr
>>  marco
>>
>> name := "SparkExamples"
>>
>> version := "1.0"
>>
>> scalaVersion := "2.10.5"
>>
>>
>> // Add a single dependency
>> libraryDependencies += "junit" % "junit" % "4.8" % "test"
>> libraryDependencies += "org.mockito" % "mockito-core" % "1.9.5"
>> libraryDependencies ++= Seq("org.slf4j" % "slf4j-api" % "1.7.5",
>> "org.slf4j" % "slf4j-simple" % "1.7.5",
>> "org.clapper" %% "grizzled-slf4j" % "1.0.2")
>> libraryDependencies += "org.powermock" % "powermock-mockito-release-full"
>> % "1.5.4" % "test"
>> libraryDependencies += "org.apache.spark" %% "spark-core"   % "1.6.1" %
>> "provided"
>> libraryDependencies += "org.apache.spark" %% "spark-streaming"   %
>> "1.6.1" % "provided"
>> libraryDependencies += "org.apache.spark" %% "spark-mllib"   % "1.6.1"  %
>> "provided"
>> libraryDependencies += "org.apache.spark" %% "spark-streaming-flume"   %
>> "1.3.0"  % "provided"
>> resolvers += "softprops-maven" at "
>> http://dl.bintray.com/content/softprops/maven;
>>
>>
>>
>>
>>
>>
>>
>>
>> On Wed, Jul 20, 2016 at 3:39 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> you need an uber jar file.
>>>
>>> Have you actually followed the dependencies and project sub-directory
>>> build?
>>>
>>> check this.
>>>
>>>
>>> http://stackoverflow.com/questions/28459333/how-to-build-an-uber-jar-fat-jar-using-sbt-within-intellij-idea
>>>
>>> under three answers the top one.
>>>
>>> I started reading the official SBT tutorial
>>> <http://www.scala-sbt.org/0.13/tutorial/>.  .
>>>
>>> HTH
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>>> any loss, damage or destruction of data or any other property which may
&g

Re: Building standalone spark application via sbt

2016-07-20 Thread Sachin Mittal
NoClassDefFound error was for spark classes like say SparkConext.
When running a standalone spark application I was not passing external jars
using --jars option.

However I have fixed this by making a fat jar using sbt assembly plugin.

Now all the dependencies are included in that jar and I use that jar in
spark-submit

Thanks
Sachin


On Wed, Jul 20, 2016 at 9:42 PM, Marco Mistroni <mmistr...@gmail.com> wrote:

> Hello Sachin
>   pls paste the NoClassDefFound Exception so we can see what's failing,
> aslo please advise how are you running your Spark App
> For an extremely simple case, let's assume  you have your  MyFirstSparkApp
> packaged in your   myFirstSparkApp.jar
> Then all you need to do would be to kick off
>
> spark-submit --class MyFirstSparkApp   myFirstSparkApp.jar
>
> if you have any external dependencies (not spark , let's assume you are
> using common-utils.jar) then you should be able to kick it off via
>
> spark-submit --class MyFirstSparkApp --jars common-utiils.jar
> myFirstSparkApp.jar
>
> I paste below the build.sbt i am using for my SparkExamples apps, hope
> this helps.
> kr
>  marco
>
> name := "SparkExamples"
>
> version := "1.0"
>
> scalaVersion := "2.10.5"
>
>
> // Add a single dependency
> libraryDependencies += "junit" % "junit" % "4.8" % "test"
> libraryDependencies += "org.mockito" % "mockito-core" % "1.9.5"
> libraryDependencies ++= Seq("org.slf4j" % "slf4j-api" % "1.7.5",
> "org.slf4j" % "slf4j-simple" % "1.7.5",
> "org.clapper" %% "grizzled-slf4j" % "1.0.2")
> libraryDependencies += "org.powermock" % "powermock-mockito-release-full"
> % "1.5.4" % "test"
> libraryDependencies += "org.apache.spark" %% "spark-core"   % "1.6.1" %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming"   % "1.6.1"
> % "provided"
> libraryDependencies += "org.apache.spark" %% "spark-mllib"   % "1.6.1"  %
> "provided"
> libraryDependencies += "org.apache.spark" %% "spark-streaming-flume"   %
> "1.3.0"  % "provided"
> resolvers += "softprops-maven" at "
> http://dl.bintray.com/content/softprops/maven;
>
>
>
>
>
>
>
>
> On Wed, Jul 20, 2016 at 3:39 PM, Mich Talebzadeh <
> mich.talebza...@gmail.com> wrote:
>
>> you need an uber jar file.
>>
>> Have you actually followed the dependencies and project sub-directory
>> build?
>>
>> check this.
>>
>>
>> http://stackoverflow.com/questions/28459333/how-to-build-an-uber-jar-fat-jar-using-sbt-within-intellij-idea
>>
>> under three answers the top one.
>>
>> I started reading the official SBT tutorial
>> <http://www.scala-sbt.org/0.13/tutorial/>.  .
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>> On 20 July 2016 at 09:54, Sachin Mittal <sjmit...@gmail.com> wrote:
>>
>>> Hi,
>>> I am following the example under
>>> https://spark.apache.org/docs/latest/quick-start.html
>>> For standalone scala application.
>>>
>>> I added all my dependencies via build.sbt (one dependency is under lib
>>> folder).
>>>
>>> When I run sbt package I see the jar created under
>>> target/scala-2.10/
>>>
>>> So compile seems to be working fine. However when I inspect that jar, it
>>> only contains my scala class.
>>> Unlike in java application we build a standalone jar, which contains all
>>> the dependencies inside that jar, here all the dependencies are missing.
>>>
>>> So as expected when I run the application via spark-submit I get th

Re: Building standalone spark application via sbt

2016-07-20 Thread Marco Mistroni
Hello Sachin
  Please paste the NoClassDefFound exception so we can see what's failing;
also, please advise how you are running your Spark app.
For an extremely simple case, let's assume you have your MyFirstSparkApp
packaged in your myFirstSparkApp.jar.
Then all you need to do would be to kick off

spark-submit --class MyFirstSparkApp myFirstSparkApp.jar

If you have any external dependencies (not Spark; let's assume you are
using common-utils.jar), then you should be able to kick it off via

spark-submit --class MyFirstSparkApp --jars common-utils.jar
myFirstSparkApp.jar

I paste below the build.sbt i am using for my SparkExamples apps, hope this
helps.
kr
 marco

name := "SparkExamples"

version := "1.0"

scalaVersion := "2.10.5"


// Add a single dependency
libraryDependencies += "junit" % "junit" % "4.8" % "test"
libraryDependencies += "org.mockito" % "mockito-core" % "1.9.5"
libraryDependencies ++= Seq("org.slf4j" % "slf4j-api" % "1.7.5",
"org.slf4j" % "slf4j-simple" % "1.7.5",
"org.clapper" %% "grizzled-slf4j" % "1.0.2")
libraryDependencies += "org.powermock" % "powermock-mockito-release-full" %
"1.5.4" % "test"
libraryDependencies += "org.apache.spark" %% "spark-core"   % "1.6.1" %
"provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming"   % "1.6.1"
% "provided"
libraryDependencies += "org.apache.spark" %% "spark-mllib"   % "1.6.1"  %
"provided"
libraryDependencies += "org.apache.spark" %% "spark-streaming-flume"   %
"1.3.0"  % "provided"
resolvers += "softprops-maven" at "
http://dl.bintray.com/content/softprops/maven;








On Wed, Jul 20, 2016 at 3:39 PM, Mich Talebzadeh 
wrote:

> you need an uber jar file.
>
> Have you actually followed the dependencies and project sub-directory
> build?
>
> check this.
>
>
> http://stackoverflow.com/questions/28459333/how-to-build-an-uber-jar-fat-jar-using-sbt-within-intellij-idea
>
> under three answers the top one.
>
> I started reading the official SBT tutorial
> .  .
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
> On 20 July 2016 at 09:54, Sachin Mittal  wrote:
>
>> Hi,
>> I am following the example under
>> https://spark.apache.org/docs/latest/quick-start.html
>> For standalone scala application.
>>
>> I added all my dependencies via build.sbt (one dependency is under lib
>> folder).
>>
>> When I run sbt package I see the jar created under
>> target/scala-2.10/
>>
>> So compile seems to be working fine. However when I inspect that jar, it
>> only contains my scala class.
>> Unlike in java application we build a standalone jar, which contains all
>> the dependencies inside that jar, here all the dependencies are missing.
>>
>> So as expected when I run the application via spark-submit I get the
>> NoClassDefFoundError.
>>
>> Here is my build.sbt
>>
>> name := "Test Advice Project"
>> version := "1.0"
>> scalaVersion := "2.10.6"
>> libraryDependencies ++= Seq(
>> "org.apache.spark" %% "spark-core" % "1.6.1",
>> "org.apache.spark" %% "spark-sql" % "1.6.1"
>> )
>>
>> Can anyone please guide me to as what is going wrong and why sbt package
>> is not including all the dependencies jar classes in the new jar.
>>
>> Thanks
>> Sachin
>>
>>
>> On Tue, Jul 19, 2016 at 8:23 PM, Andrew Ehrlich 
>> wrote:
>>
>>> Yes, spark-core will depend on Hadoop and several other jars.  Here’s
>>> the list of dependencies:
>>> https://github.com/apache/spark/blob/master/core/pom.xml#L35
>>>
>>> Whether you need spark-sql depends on whether you will use the DataFrame
>>> API. Without spark-sql, you will just have the RDD API.
>>>
>>> On Jul 19, 2016, at 7:09 AM, Sachin Mittal  wrote:
>>>
>>>
>>> Hi,
>>> Can someone please guide me what all jars I need to place in my lib
>>> folder of the project to build a standalone scala application via sbt.
>>>
>>> Note I need to provide static dependencies and I cannot download the
>>> jars using libraryDependencies.
>>> So I need to provide all the jars upfront.
>>>
>>> So far I found that we need:
>>> spark-core_.jar
>>>
>>> Do we also need
>>> spark-sql_.jar
>>> and
>>> hadoop-core-.jar
>>>
>>> Is there any jar from spark side I may be missing? What I found that
>>> spark-core needs hadoop-core classes and if I don't add them then sbt was
>>> giving me this error:
>>> 

Re: Building standalone spark application via sbt

2016-07-20 Thread Mich Talebzadeh
you need an uber jar file.

Have you actually followed the dependencies and project sub-directory build?

check this.

http://stackoverflow.com/questions/28459333/how-to-build-an-uber-jar-fat-jar-using-sbt-within-intellij-idea

Of the three answers there, see the top one.

I started reading the official SBT tutorial
<http://www.scala-sbt.org/0.13/tutorial/>.

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com


*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.



On 20 July 2016 at 09:54, Sachin Mittal  wrote:

> Hi,
> I am following the example under
> https://spark.apache.org/docs/latest/quick-start.html
> For standalone scala application.
>
> I added all my dependencies via build.sbt (one dependency is under lib
> folder).
>
> When I run sbt package I see the jar created under
> target/scala-2.10/
>
> So compile seems to be working fine. However when I inspect that jar, it
> only contains my scala class.
> Unlike in java application we build a standalone jar, which contains all
> the dependencies inside that jar, here all the dependencies are missing.
>
> So as expected when I run the application via spark-submit I get the
> NoClassDefFoundError.
>
> Here is my build.sbt
>
> name := "Test Advice Project"
> version := "1.0"
> scalaVersion := "2.10.6"
> libraryDependencies ++= Seq(
> "org.apache.spark" %% "spark-core" % "1.6.1",
> "org.apache.spark" %% "spark-sql" % "1.6.1"
> )
>
> Can anyone please guide me to as what is going wrong and why sbt package
> is not including all the dependencies jar classes in the new jar.
>
> Thanks
> Sachin
>
>
> On Tue, Jul 19, 2016 at 8:23 PM, Andrew Ehrlich 
> wrote:
>
>> Yes, spark-core will depend on Hadoop and several other jars.  Here’s the
>> list of dependencies:
>> https://github.com/apache/spark/blob/master/core/pom.xml#L35
>>
>> Whether you need spark-sql depends on whether you will use the DataFrame
>> API. Without spark-sql, you will just have the RDD API.
>>
>> On Jul 19, 2016, at 7:09 AM, Sachin Mittal  wrote:
>>
>>
>> Hi,
>> Can someone please guide me what all jars I need to place in my lib
>> folder of the project to build a standalone scala application via sbt.
>>
>> Note I need to provide static dependencies and I cannot download the jars
>> using libraryDependencies.
>> So I need to provide all the jars upfront.
>>
>> So far I found that we need:
>> spark-core_.jar
>>
>> Do we also need
>> spark-sql_.jar
>> and
>> hadoop-core-.jar
>>
>> Is there any jar from spark side I may be missing? What I found that
>> spark-core needs hadoop-core classes and if I don't add them then sbt was
>> giving me this error:
>> [error] bad symbolic reference. A signature in SparkContext.class refers
>> to term hadoop
>> [error] in package org.apache which is not available.
>>
>> So I was just confused on library dependency part when building an
>> application via sbt. Any inputs here would be helpful.
>>
>> Thanks
>> Sachin
>>
>>
>>
>>
>>
>


Re: Building standalone spark application via sbt

2016-07-20 Thread Sachin Mittal
Hi,
I am following the example under
https://spark.apache.org/docs/latest/quick-start.html
For standalone scala application.

I added all my dependencies via build.sbt (one dependency is under lib
folder).

When I run sbt package I see the jar created under
target/scala-2.10/

So compilation seems to be working fine. However, when I inspect that jar, it
only contains my Scala class.
Unlike a Java application, where we build a standalone jar that contains all
the dependencies, here all the dependencies are missing.

So as expected when I run the application via spark-submit I get the
NoClassDefFoundError.

Here is my build.sbt

name := "Test Advice Project"
version := "1.0"
scalaVersion := "2.10.6"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "1.6.1",
"org.apache.spark" %% "spark-sql" % "1.6.1"
)

Can anyone please guide me as to what is going wrong and why sbt package is
not including all the dependency jar classes in the new jar.

Thanks
Sachin


On Tue, Jul 19, 2016 at 8:23 PM, Andrew Ehrlich  wrote:

> Yes, spark-core will depend on Hadoop and several other jars.  Here’s the
> list of dependencies:
> https://github.com/apache/spark/blob/master/core/pom.xml#L35
>
> Whether you need spark-sql depends on whether you will use the DataFrame
> API. Without spark-sql, you will just have the RDD API.
>
> On Jul 19, 2016, at 7:09 AM, Sachin Mittal  wrote:
>
>
> Hi,
> Can someone please guide me what all jars I need to place in my lib folder
> of the project to build a standalone scala application via sbt.
>
> Note I need to provide static dependencies and I cannot download the jars
> using libraryDependencies.
> So I need to provide all the jars upfront.
>
> So far I found that we need:
> spark-core_.jar
>
> Do we also need
> spark-sql_.jar
> and
> hadoop-core-.jar
>
> Is there any jar from spark side I may be missing? What I found that
> spark-core needs hadoop-core classes and if I don't add them then sbt was
> giving me this error:
> [error] bad symbolic reference. A signature in SparkContext.class refers
> to term hadoop
> [error] in package org.apache which is not available.
>
> So I was just confused on library dependency part when building an
> application via sbt. Any inputs here would be helpful.
>
> Thanks
> Sachin
>
>
>
>
>


Re: Building standalone spark application via sbt

2016-07-19 Thread Andrew Ehrlich
Yes, spark-core will depend on Hadoop and several other jars.  Here’s the list 
of dependencies: https://github.com/apache/spark/blob/master/core/pom.xml#L35 


Whether you need spark-sql depends on whether you will use the DataFrame API. 
Without spark-sql, you will just have the RDD API.

> On Jul 19, 2016, at 7:09 AM, Sachin Mittal  wrote:
> 
> 
> Hi,
> Can someone please guide me what all jars I need to place in my lib folder of 
> the project to build a standalone scala application via sbt.
> 
> Note I need to provide static dependencies and I cannot download the jars 
> using libraryDependencies.
> So I need to provide all the jars upfront.
> 
> So far I found that we need:
> spark-core_.jar
> 
> Do we also need
> spark-sql_.jar
> and
> hadoop-core-.jar
> 
> Is there any jar from spark side I may be missing? What I found that 
> spark-core needs hadoop-core classes and if I don't add them then sbt was 
> giving me this error:
> [error] bad symbolic reference. A signature in SparkContext.class refers to 
> term hadoop
> [error] in package org.apache which is not available.
> 
> So I was just confused on library dependency part when building an 
> application via sbt. Any inputs here would be helpful.
> 
> Thanks
> Sachin
> 
> 
> 



Building standalone spark application via sbt

2016-07-19 Thread Sachin Mittal
Hi,
Can someone please guide me on which jars I need to place in the lib folder
of the project to build a standalone Scala application via sbt.

Note I need to provide static dependencies and I cannot download the jars
using libraryDependencies.
So I need to provide all the jars upfront.

So far I found that we need:
spark-core_.jar

Do we also need
spark-sql_.jar
and
hadoop-core-.jar

Is there any jar from the Spark side I may be missing? What I found is that
spark-core needs hadoop-core classes, and if I don't add them then sbt gives
me this error:
[error] bad symbolic reference. A signature in SparkContext.class refers to
term hadoop
[error] in package org.apache which is not available.

So I was just confused about the library-dependency part when building an
application via sbt. Any inputs here would be helpful.

Thanks
Sachin


Re: [ Standalone Spark Cluster ] - Track node status

2016-06-09 Thread Mich Talebzadeh
Hi Rutuja,

I am not certain whether such a tool exists or not; however, opening a JIRA
may be beneficial and would not do any harm.

In the meantime you may look for a workaround. My understanding is that your
need is for monitoring the health of the cluster?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 9 June 2016 at 19:45, Rutuja Kulkarni <rutuja.kulkarn...@gmail.com>
wrote:

>
>
> Thanks again Mich!
> If there does not exist any interface like REST API or CLI for this, I
> would like to open a JIRA on exposing such a REST interface in SPARK which
> would list all the worker nodes.
> Please let me know if this seems to be the right thing to do for the
> community.
>
>
> Regards,
> Rutuja Kulkarni
>
>
> On Wed, Jun 8, 2016 at 5:36 PM, Mich Talebzadeh <mich.talebza...@gmail.com
> > wrote:
>
>> The other way is to log in to the individual nodes and do
>>
>>  jps
>>
>> 24819 Worker
>>
>> And you Processes identified as worker
>>
>> Also you can use jmonitor to see what they are doing resource wise
>>
>> You can of course write a small shell script to see if Worker(s) are up
>> and running in every node and alert if they are down?
>>
>> HTH
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 9 June 2016 at 01:27, Rutuja Kulkarni <rutuja.kulkarn...@gmail.com>
>> wrote:
>>
>>> Thank you for the quick response.
>>> So the workers section would list all the running worker nodes in the
>>> standalone Spark cluster?
>>> I was also wondering if this is the only way to retrieve worker nodes or
>>> is there something like a Web API or CLI I could use?
>>> Thanks.
>>>
>>> Regards,
>>> Rutuja
>>>
>>> On Wed, Jun 8, 2016 at 4:02 PM, Mich Talebzadeh <
>>> mich.talebza...@gmail.com> wrote:
>>>
>>>> check port 8080 on the node that you started start-master.sh
>>>>
>>>>
>>>>
>>>> [image: Inline images 2]
>>>>
>>>> HTH
>>>>
>>>>
>>>> Dr Mich Talebzadeh
>>>>
>>>>
>>>>
>>>> LinkedIn * 
>>>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>>
>>>>
>>>>
>>>> http://talebzadehmich.wordpress.com
>>>>
>>>>
>>>>
>>>> On 8 June 2016 at 23:56, Rutuja Kulkarni <rutuja.kulkarn...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hello!
>>>>>
>>>>> I'm trying to setup a standalone spark cluster and wondering how to
>>>>> track status of all of it's nodes. I wonder if something like Yarn REST 
>>>>> API
>>>>> or HDFS CLI exists in Spark world that can provide status of nodes on such
>>>>> a cluster. Any pointers would be greatly appreciated.
>>>>>
>>>>> --
>>>>> *Regards,*
>>>>> *Rutuja Kulkarni*
>>>>>
>>>>>
>>>>>
>>>>
>>>
>>>
>>> --
>>> *Regards,*
>>> *Rutuja Kulkarni*
>>>
>>>
>>>
>>
>
>
> --
> *Regards,*
> *Rutuja Kulkarni*
>
>
>


Re: [ Standalone Spark Cluster ] - Track node status

2016-06-09 Thread Rutuja Kulkarni
Thanks again Mich!
If there does not exist any interface like REST API or CLI for this, I
would like to open a JIRA on exposing such a REST interface in SPARK which
would list all the worker nodes.
Please let me know if this seems to be the right thing to do for the
community.


Regards,
Rutuja Kulkarni


On Wed, Jun 8, 2016 at 5:36 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> The other way is to log in to the individual nodes and do
>
>  jps
>
> 24819 Worker
>
> And you Processes identified as worker
>
> Also you can use jmonitor to see what they are doing resource wise
>
> You can of course write a small shell script to see if Worker(s) are up
> and running in every node and alert if they are down?
>
> HTH
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 9 June 2016 at 01:27, Rutuja Kulkarni <rutuja.kulkarn...@gmail.com>
> wrote:
>
>> Thank you for the quick response.
>> So the workers section would list all the running worker nodes in the
>> standalone Spark cluster?
>> I was also wondering if this is the only way to retrieve worker nodes or
>> is there something like a Web API or CLI I could use?
>> Thanks.
>>
>> Regards,
>> Rutuja
>>
>> On Wed, Jun 8, 2016 at 4:02 PM, Mich Talebzadeh <
>> mich.talebza...@gmail.com> wrote:
>>
>>> check port 8080 on the node that you started start-master.sh
>>>
>>>
>>>
>>> [image: Inline images 2]
>>>
>>> HTH
>>>
>>>
>>> Dr Mich Talebzadeh
>>>
>>>
>>>
>>> LinkedIn * 
>>> https://www.linkedin.com/profile/view?id=AAEAAAAWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>>
>>>
>>>
>>> http://talebzadehmich.wordpress.com
>>>
>>>
>>>
>>> On 8 June 2016 at 23:56, Rutuja Kulkarni <rutuja.kulkarn...@gmail.com>
>>> wrote:
>>>
>>>> Hello!
>>>>
>>>> I'm trying to setup a standalone spark cluster and wondering how to
>>>> track status of all of it's nodes. I wonder if something like Yarn REST API
>>>> or HDFS CLI exists in Spark world that can provide status of nodes on such
>>>> a cluster. Any pointers would be greatly appreciated.
>>>>
>>>> --
>>>> *Regards,*
>>>> *Rutuja Kulkarni*
>>>>
>>>>
>>>>
>>>
>>
>>
>> --
>> *Regards,*
>> *Rutuja Kulkarni*
>>
>>
>>
>


-- 
*Regards,*
*Rutuja Kulkarni*


Re: [ Standalone Spark Cluster ] - Track node status

2016-06-08 Thread Mich Talebzadeh
The other way is to log in to the individual nodes and do

 jps

24819 Worker

and you will see the processes identified as Worker.

Also, you can use jmonitor to see what they are doing resource-wise.

You can of course write a small shell script to see if the Worker(s) are up
and running on every node, and alert if they are down.
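
Something along these lines, for example (a rough sketch; the host list file
and the alerting command are placeholders):

 for host in $(cat conf/slaves); do
   if ! ssh "$host" jps | grep -q Worker; then
     echo "Spark Worker is down on $host" | mail -s "Worker down: $host" ops@example.com
   fi
 done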

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 9 June 2016 at 01:27, Rutuja Kulkarni <rutuja.kulkarn...@gmail.com>
wrote:

> Thank you for the quick response.
> So the workers section would list all the running worker nodes in the
> standalone Spark cluster?
> I was also wondering if this is the only way to retrieve worker nodes or
> is there something like a Web API or CLI I could use?
> Thanks.
>
> Regards,
> Rutuja
>
> On Wed, Jun 8, 2016 at 4:02 PM, Mich Talebzadeh <mich.talebza...@gmail.com
> > wrote:
>
>> check port 8080 on the node that you started start-master.sh
>>
>>
>>
>> [image: Inline images 2]
>>
>> HTH
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 8 June 2016 at 23:56, Rutuja Kulkarni <rutuja.kulkarn...@gmail.com>
>> wrote:
>>
>>> Hello!
>>>
>>> I'm trying to setup a standalone spark cluster and wondering how to
>>> track status of all of it's nodes. I wonder if something like Yarn REST API
>>> or HDFS CLI exists in Spark world that can provide status of nodes on such
>>> a cluster. Any pointers would be greatly appreciated.
>>>
>>> --
>>> *Regards,*
>>> *Rutuja Kulkarni*
>>>
>>>
>>>
>>
>
>
> --
> *Regards,*
> *Rutuja Kulkarni*
>
>
>


Re: [ Standalone Spark Cluster ] - Track node status

2016-06-08 Thread Rutuja Kulkarni
Thank you for the quick response.
So the workers section would list all the running worker nodes in the
standalone Spark cluster?
I was also wondering if this is the only way to retrieve worker nodes or is
there something like a Web API or CLI I could use?
Thanks.

Regards,
Rutuja

On Wed, Jun 8, 2016 at 4:02 PM, Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> check port 8080 on the node that you started start-master.sh
>
>
>
> [image: Inline images 2]
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> <https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 8 June 2016 at 23:56, Rutuja Kulkarni <rutuja.kulkarn...@gmail.com>
> wrote:
>
>> Hello!
>>
>> I'm trying to setup a standalone spark cluster and wondering how to track
>> status of all of it's nodes. I wonder if something like Yarn REST API or
>> HDFS CLI exists in Spark world that can provide status of nodes on such a
>> cluster. Any pointers would be greatly appreciated.
>>
>> --
>> *Regards,*
>> *Rutuja Kulkarni*
>>
>>
>>
>


-- 
*Regards,*
*Rutuja Kulkarni*


Re: [ Standalone Spark Cluster ] - Track node status

2016-06-08 Thread Mich Talebzadeh
Check port 8080 on the node where you started start-master.sh.
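
The same information is also served as JSON, which is easier to poll from a
script (the host name is a placeholder):

 curl http://master-host:8080/json/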



[image: Inline images 2]

HTH


Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 8 June 2016 at 23:56, Rutuja Kulkarni <rutuja.kulkarn...@gmail.com>
wrote:

> Hello!
>
> I'm trying to setup a standalone spark cluster and wondering how to track
> status of all of it's nodes. I wonder if something like Yarn REST API or
> HDFS CLI exists in Spark world that can provide status of nodes on such a
> cluster. Any pointers would be greatly appreciated.
>
> --
> *Regards,*
> *Rutuja Kulkarni*
>
>
>


[ Standalone Spark Cluster ] - Track node status

2016-06-08 Thread Rutuja Kulkarni
Hello!

I'm trying to set up a standalone Spark cluster and am wondering how to track
the status of all of its nodes. I wonder if something like the YARN REST API
or the HDFS CLI exists in the Spark world that can provide the status of the
nodes on such a cluster. Any pointers would be greatly appreciated.

-- 
*Regards,*
*Rutuja Kulkarni*


python application cluster mode in standalone spark cluster

2016-05-25 Thread Jan Sourek
As the official documentation states, 'Currently only YARN supports cluster
mode for Python applications.'
I would like to know whether work is being done, or planned, to support
cluster mode for Python applications on standalone Spark clusters?







Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Takeshi Yamamuro
Hi,

How about checking Spark survey result 2015 in
https://databricks.com/blog/2015/09/24/spark-survey-results-2015-are-now-available.html
for the statistics?

// maropu

On Fri, Apr 15, 2016 at 4:52 AM, Mark Hamstra <m...@clearstorydata.com>
wrote:

> That's also available in standalone.
>
> On Thu, Apr 14, 2016 at 12:47 PM, Alexander Pivovarov <
> apivova...@gmail.com> wrote:
>
>> Spark on Yarn supports dynamic resource allocation
>>
>> So, you can run several spark-shells / spark-submits / spark-jobserver /
>> zeppelin on one cluster without defining upfront how many executors /
>> memory you want to allocate to each app
>>
>> Great feature for regular users who just want to run Spark / Spark SQL
>>
>>
>> On Thu, Apr 14, 2016 at 12:05 PM, Sean Owen <so...@cloudera.com> wrote:
>>
>>> I don't think usage is the differentiating factor. YARN and standalone
>>> are pretty well supported. If you are only running a Spark cluster by
>>> itself with nothing else, standalone is probably simpler than setting
>>> up YARN just for Spark. However if you're running on a cluster that
>>> will host other applications, you'll need to integrate with a shared
>>> resource manager and its security model, and for anything
>>> Hadoop-related that's YARN. Standalone wouldn't make as much sense.
>>>
>>> On Thu, Apr 14, 2016 at 6:46 PM, Alexander Pivovarov
>>> <apivova...@gmail.com> wrote:
>>> > AWS EMR includes Spark on Yarn
>>> > Hortonworks and Cloudera platforms include Spark on Yarn as well
>>> >
>>> >
>>> > On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz <
>>> arkadiusz.b...@gmail.com>
>>> > wrote:
>>> >>
>>> >> Hello,
>>> >>
>>> >> Is there any statistics regarding YARN vs Standalone Spark Usage in
>>> >> production ?
>>> >>
>>> >> I would like to choose most supported and used technology in
>>> >> production for our project.
>>> >>
>>> >>
>>> >> BR,
>>> >>
>>> >> Arkadiusz Bicz
>>> >>
>>> >> -
>>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>>> >> For additional commands, e-mail: user-h...@spark.apache.org
>>> >>
>>> >
>>>
>>
>>
>


-- 
---
Takeshi Yamamuro


Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Mark Hamstra
That's also available in standalone.

On Thu, Apr 14, 2016 at 12:47 PM, Alexander Pivovarov <apivova...@gmail.com>
wrote:

> Spark on Yarn supports dynamic resource allocation
>
> So, you can run several spark-shells / spark-submits / spark-jobserver /
> zeppelin on one cluster without defining upfront how many executors /
> memory you want to allocate to each app
>
> Great feature for regular users who just want to run Spark / Spark SQL
>
>
> On Thu, Apr 14, 2016 at 12:05 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> I don't think usage is the differentiating factor. YARN and standalone
>> are pretty well supported. If you are only running a Spark cluster by
>> itself with nothing else, standalone is probably simpler than setting
>> up YARN just for Spark. However if you're running on a cluster that
>> will host other applications, you'll need to integrate with a shared
>> resource manager and its security model, and for anything
>> Hadoop-related that's YARN. Standalone wouldn't make as much sense.
>>
>> On Thu, Apr 14, 2016 at 6:46 PM, Alexander Pivovarov
>> <apivova...@gmail.com> wrote:
>> > AWS EMR includes Spark on Yarn
>> > Hortonworks and Cloudera platforms include Spark on Yarn as well
>> >
>> >
>> > On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz <
>> arkadiusz.b...@gmail.com>
>> > wrote:
>> >>
>> >> Hello,
>> >>
>> >> Is there any statistics regarding YARN vs Standalone Spark Usage in
>> >> production ?
>> >>
>> >> I would like to choose most supported and used technology in
>> >> production for our project.
>> >>
>> >>
>> >> BR,
>> >>
>> >> Arkadiusz Bicz
>> >>
>> >> -
>> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> >> For additional commands, e-mail: user-h...@spark.apache.org
>> >>
>> >
>>
>
>


Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Alexander Pivovarov
Spark on Yarn supports dynamic resource allocation

So, you can run several spark-shells / spark-submits / spark-jobserver /
zeppelin on one cluster without defining upfront how many executors /
memory you want to allocate to each app

Great feature for regular users who just want to run Spark / Spark SQL
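
For example, roughly (a sketch using the standard config keys; dynamic
allocation also needs the external shuffle service enabled on the
NodeManagers, and the jar name is a placeholder):

spark-submit --master yarn \
  --conf spark.dynamicAllocation.enabled=true \
  --conf spark.shuffle.service.enabled=true \
  myApp.jar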


On Thu, Apr 14, 2016 at 12:05 PM, Sean Owen <so...@cloudera.com> wrote:

> I don't think usage is the differentiating factor. YARN and standalone
> are pretty well supported. If you are only running a Spark cluster by
> itself with nothing else, standalone is probably simpler than setting
> up YARN just for Spark. However if you're running on a cluster that
> will host other applications, you'll need to integrate with a shared
> resource manager and its security model, and for anything
> Hadoop-related that's YARN. Standalone wouldn't make as much sense.
>
> On Thu, Apr 14, 2016 at 6:46 PM, Alexander Pivovarov
> <apivova...@gmail.com> wrote:
> > AWS EMR includes Spark on Yarn
> > Hortonworks and Cloudera platforms include Spark on Yarn as well
> >
> >
> > On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz <
> arkadiusz.b...@gmail.com>
> > wrote:
> >>
> >> Hello,
> >>
> >> Is there any statistics regarding YARN vs Standalone Spark Usage in
> >> production ?
> >>
> >> I would like to choose most supported and used technology in
> >> production for our project.
> >>
> >>
> >> BR,
> >>
> >> Arkadiusz Bicz
> >>
> >> -
> >> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> >> For additional commands, e-mail: user-h...@spark.apache.org
> >>
> >
>


Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Sean Owen
I don't think usage is the differentiating factor. YARN and standalone
are pretty well supported. If you are only running a Spark cluster by
itself with nothing else, standalone is probably simpler than setting
up YARN just for Spark. However if you're running on a cluster that
will host other applications, you'll need to integrate with a shared
resource manager and its security model, and for anything
Hadoop-related that's YARN. Standalone wouldn't make as much sense.

On Thu, Apr 14, 2016 at 6:46 PM, Alexander Pivovarov
<apivova...@gmail.com> wrote:
> AWS EMR includes Spark on Yarn
> Hortonworks and Cloudera platforms include Spark on Yarn as well
>
>
> On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz <arkadiusz.b...@gmail.com>
> wrote:
>>
>> Hello,
>>
>> Is there any statistics regarding YARN vs Standalone Spark Usage in
>> production ?
>>
>> I would like to choose most supported and used technology in
>> production for our project.
>>
>>
>> BR,
>>
>> Arkadiusz Bicz
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>




Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Mich Talebzadeh
Hi Alex,

Do you mean using Spark with Yarn-client compared to using Spark Local?

HTH

Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
<https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw>*



http://talebzadehmich.wordpress.com



On 14 April 2016 at 18:46, Alexander Pivovarov <apivova...@gmail.com> wrote:

> AWS EMR includes Spark on Yarn
> Hortonworks and Cloudera platforms include Spark on Yarn as well
>
>
> On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz <arkadiusz.b...@gmail.com>
> wrote:
>
>> Hello,
>>
>> Is there any statistics regarding YARN vs Standalone Spark Usage in
>> production ?
>>
>> I would like to choose most supported and used technology in
>> production for our project.
>>
>>
>> BR,
>>
>> Arkadiusz Bicz
>>
>> -
>> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
>>
>>
>


Re: YARN vs Standalone Spark Usage in production

2016-04-14 Thread Alexander Pivovarov
AWS EMR includes Spark on Yarn
Hortonworks and Cloudera platforms include Spark on Yarn as well


On Thu, Apr 14, 2016 at 7:29 AM, Arkadiusz Bicz <arkadiusz.b...@gmail.com>
wrote:

> Hello,
>
> Is there any statistics regarding YARN vs Standalone Spark Usage in
> production ?
>
> I would like to choose most supported and used technology in
> production for our project.
>
>
> BR,
>
> Arkadiusz Bicz
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


YARN vs Standalone Spark Usage in production

2016-04-14 Thread Arkadiusz Bicz
Hello,

Is there any statistics regarding YARN vs Standalone Spark Usage in
production ?

I would like to choose most supported and used technology in
production for our project.


BR,

Arkadiusz Bicz

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark jobs run extremely slow on yarn cluster compared to standalone spark

2016-02-14 Thread Yuval.Itzchakov
Your question lacks sufficient information for us to actually provide help.
Have you looked at the Spark UI to see which part of the graph is taking the
longest? Have you tried logging your methods?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jobs-run-extremely-slow-on-yarn-cluster-compared-to-standalone-spark-tp26215p26221.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Spark jobs run extremely slow on yarn cluster compared to standalone spark

2016-02-12 Thread pdesai
Hi there,

I am doing a POC with Spark and I have noticed that if I run my job on a
standalone Spark installation, it finishes in a second (it's a small sample
job). But when I run the same job on a Spark cluster with YARN, it takes 4-5
minutes for a simple execution.
Are there any best practices that I need to follow for Spark cluster
configuration? I have left all default settings. During spark-submit I
specify num-executors=3, executor-memory=512m, executor-cores=1.

I am using Java Spark SQL API.

Thanks,
Purvi



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-jobs-run-extremely-slow-on-yarn-cluster-compared-to-standalone-spark-tp26215.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Cannot connect to standalone spark cluster

2015-10-14 Thread Akhil Das
Open a spark-shell by:

MASTER=spark://Ellens-MacBook-Pro.local:7077 bin/spark-shell

And if it's able to connect, then check your Java project's build file and
make sure you are using the proper Spark version.
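
For example, a minimal build.sbt sketch could look like the following (a Maven
pom would be analogous; the 1.5.1 version is an assumption and simply has to
match what the standalone cluster is running):

scalaVersion := "2.10.5"

// the Spark artifact version must match the version the standalone master runs
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.5.1"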

Thanks
Best Regards

On Sat, Oct 10, 2015 at 3:07 AM, ekraffmiller <ellen.kraffmil...@gmail.com>
wrote:

> Hi,
> I'm trying to run a java application that connects to a local standalone
> spark cluster.  I start the cluster with the default configuration, using
> start-all.sh.  When I go to the web page for the cluster, it is started ok.
> I can connect to this cluster with SparkR, but when I use the same master
> URL to connect from within Java, I get an error message.
>
> I'm using Spark 1.5.
>
> Here is a snippet of the error message:
>
>  ReliableDeliverySupervisor: Association with remote system
> [akka.tcp://sparkMaster@Ellens-MacBook-Pro.local:7077] has failed, address
> is now gated for [5000] ms. Reason: [Disassociated]
> 15/10/09 17:31:41 INFO AppClient$ClientEndpoint: Connecting to master
> spark://Ellens-MacBook-Pro.local:7077...
> 15/10/09 17:31:41 WARN ReliableDeliverySupervisor: Association with remote
> system [akka.tcp://sparkMaster@Ellens-MacBook-Pro.local:7077] has failed,
> address is now gated for [5000] ms. Reason: [Disassociated]
> 15/10/09 17:32:01 INFO AppClient$ClientEndpoint: Connecting to master
> spark://Ellens-MacBook-Pro.local:7077...
> 15/10/09 17:32:01 ERROR SparkUncaughtExceptionHandler: Uncaught exception
> in
> thread Thread[appclient-registration-retry-thread,5,main]
> java.util.concurrent.RejectedExecutionException: Task
> java.util.concurrent.FutureTask@54e2b678 rejected from
> java.util.concurrent.ThreadPoolExecutor@5d9f3e0d[Running, pool size = 1,
> active threads = 1, queued tasks = 0, completed tasks = 2]
> at
>
> java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
> at
> java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)
>
> Thanks,
> Ellen
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-connect-to-standalone-spark-cluster-tp25004.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Cannot connect to standalone spark cluster

2015-10-09 Thread ekraffmiller
Hi,
I'm trying to run a java application that connects to a local standalone
spark cluster.  I start the cluster with the default configuration, using
start-all.sh.  When I go to the web page for the cluster, it is started ok. 
I can connect to this cluster with SparkR, but when I use the same master
URL to connect from within Java, I get an error message.  

I'm using Spark 1.5.

Here is a snippet of the error message:

 ReliableDeliverySupervisor: Association with remote system
[akka.tcp://sparkMaster@Ellens-MacBook-Pro.local:7077] has failed, address
is now gated for [5000] ms. Reason: [Disassociated] 
15/10/09 17:31:41 INFO AppClient$ClientEndpoint: Connecting to master
spark://Ellens-MacBook-Pro.local:7077...
15/10/09 17:31:41 WARN ReliableDeliverySupervisor: Association with remote
system [akka.tcp://sparkMaster@Ellens-MacBook-Pro.local:7077] has failed,
address is now gated for [5000] ms. Reason: [Disassociated] 
15/10/09 17:32:01 INFO AppClient$ClientEndpoint: Connecting to master
spark://Ellens-MacBook-Pro.local:7077...
15/10/09 17:32:01 ERROR SparkUncaughtExceptionHandler: Uncaught exception in
thread Thread[appclient-registration-retry-thread,5,main]
java.util.concurrent.RejectedExecutionException: Task
java.util.concurrent.FutureTask@54e2b678 rejected from
java.util.concurrent.ThreadPoolExecutor@5d9f3e0d[Running, pool size = 1,
active threads = 1, queued tasks = 0, completed tasks = 2]
at
java.util.concurrent.ThreadPoolExecutor$AbortPolicy.rejectedExecution(ThreadPoolExecutor.java:2047)
at
java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:823)

Thanks,
Ellen



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Cannot-connect-to-standalone-spark-cluster-tp25004.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Convert Simple Kafka Consumer to standalone Spark JavaStream Consumer

2015-07-21 Thread Hafsa Asif
Hi, I have a simple high-level Kafka consumer like this:
package matchinguu.kafka.consumer;


import kafka.consumer.Consumer;
import kafka.consumer.ConsumerConfig;
import kafka.consumer.ConsumerIterator;
import kafka.consumer.KafkaStream;
import kafka.javaapi.consumer.ConsumerConnector;

import java.util.*;

public class SimpleHLConsumer {

    private final ConsumerConnector consumer;
    private final String topic;

    public SimpleHLConsumer(String zookeeper, String groupId, String topic) {
        Properties props = new Properties();
        props.put("zookeeper.connect", zookeeper);
        props.put("group.id", groupId);
        props.put("zookeeper.session.timeout.ms", "500");
        props.put("zookeeper.sync.time.ms", "250");
        props.put("auto.commit.interval.ms", "1000");

        consumer = Consumer.createJavaConsumerConnector(new ConsumerConfig(props));
        this.topic = topic;
    }

    public void testConsumer() {
        Map<String, Integer> topicCount = new HashMap<String, Integer>();
        topicCount.put(topic, 1);

        Map<String, List<KafkaStream<byte[], byte[]>>> consumerStreams =
                consumer.createMessageStreams(topicCount);
        List<KafkaStream<byte[], byte[]>> streams = consumerStreams.get(topic);
        for (final KafkaStream stream : streams) {

            ConsumerIterator<byte[], byte[]> it = stream.iterator();
            while (it.hasNext()) {
                System.out.println();
                System.out.println("Message from Single Topic: " + new
                        String(it.next().message()));
            }
        }
        if (consumer != null) {
            System.out.println("Shutdown Happens");
            consumer.shutdown();
        }

    }

    public static void main(String[] args) {
        System.out.println("Consumer is now reading messages from producer");
        //String topic = args[0];
        String topic = "test";
        SimpleHLConsumer simpleHLConsumer = new
                SimpleHLConsumer("localhost:2181", "testgroup", topic);
        simpleHLConsumer.testConsumer();
    }

}

I want to get my messages through Spark Java Streaming with Kafka
integration. Can anyone help me to reform this code so that I can get same
output with Spark Kafka integration.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Convert-Simple-Kafka-Consumer-to-standalone-Spark-JavaStream-Consumer-tp23930.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Convert Simple Kafka Consumer to standalone Spark JavaStream Consumer

2015-07-21 Thread Tathagata Das
From what I understand about your code, it is getting data from different
partitions of a topic - get all data from partition 1, then from partition
2, etc. Though you have configured it to read from just one partition
(topicCount has count = 1). So I am not sure what your intention is: to read
all partitions serially, or in parallel.

If you want to start off with Kafka + Spark Streaming, I strongly suggest reading
the Kafka integration guide -
https://spark.apache.org/docs/latest/streaming-kafka-integration.html
and run the examples for the two ways
-
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/KafkaWordCount.scala
-
https://github.com/apache/spark/blob/master/examples/src/main/scala/org/apache/spark/examples/streaming/DirectKafkaWordCount.scala

Since you understand the high level consumer idea, you may want to start
with the first receiver-based approach, which uses HL consumer as well, and
takes topicCounts.
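
For reference, a minimal Scala sketch of the receiver-based approach could look
like the following (the object name is just illustrative; it assumes the same
localhost:2181 ZooKeeper, "testgroup" group id and "test" topic as your consumer,
plus the spark-streaming-kafka artifact on the classpath):

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

object SimpleStreamingHLConsumer {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("SimpleStreamingHLConsumer").setMaster("local[2]")
    // 2-second micro-batches
    val ssc = new StreamingContext(conf, Seconds(2))

    // topic -> number of receiver threads, mirroring the topicCount map in the HL consumer
    val topicMap = Map("test" -> 1)
    val messages = KafkaUtils.createStream(ssc, "localhost:2181", "testgroup", topicMap)

    // each record is a (key, message) pair; print the message like the original consumer does
    messages.map(_._2).print()

    ssc.start()
    ssc.awaitTermination()
  }
}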


On Tue, Jul 21, 2015 at 8:23 AM, Hafsa Asif hafsa.a...@matchinguu.com
wrote:

 Hi, I have  a simple High level Kafka Consumer like :
 package matchinguu.kafka.consumer;


 import kafka.consumer.Consumer;
 import kafka.consumer.ConsumerConfig;
 import kafka.consumer.ConsumerIterator;
 import kafka.consumer.KafkaStream;
 import kafka.javaapi.consumer.ConsumerConnector;

 import java.util.*;

 public class SimpleHLConsumer {

 private final ConsumerConnector consumer;
 private final String topic;

 public SimpleHLConsumer(String zookeeper, String groupId, String topic)
 {
 Properties props = new Properties();
 props.put(zookeeper.connect, zookeeper);
 props.put(group.id, groupId);
 props.put(zookeeper.session.timeout.ms, 500);
 props.put(zookeeper.sync.time.ms, 250);
 props.put(auto.commit.interval.ms, 1000);


 consumer = Consumer.createJavaConsumerConnector(new
 ConsumerConfig(props));
 this.topic = topic;
 }

 public void testConsumer() {
 MapString, Integer topicCount = new HashMapString, Integer();
 topicCount.put(topic, 1);

 MapString, Listlt;KafkaStreamlt;byte[], byte[]
 consumerStreams
 = consumer.createMessageStreams(topicCount);
 ListKafkaStreamlt;byte[], byte[] streams =
 consumerStreams.get(topic);
 for (final KafkaStream stream : streams) {

 ConsumerIteratorbyte[], byte[] it = stream.iterator();
 while (it.hasNext()) {
 System.out.println();
 System.out.println(Message from Single Topic:  + new
 String(it.next().message()));
 }
 }
 if (consumer != null) {
 System.out.println(Shutdown Happens);
 consumer.shutdown();
 }

 }

 public static void main(String[] args) {
 System.out.println(Consumer is now reading messages from
 producer);
 //String topic = args[0];
 String topic = test;
 SimpleHLConsumer simpleHLConsumer = new
 SimpleHLConsumer(localhost:2181, testgroup, topic);
 simpleHLConsumer.testConsumer();
}

 }

 I want to get my messages through Spark Java Streaming with Kafka
 integration. Can anyone help me to reform this code so that I can get same
 output with Spark Kafka integration.



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/Convert-Simple-Kafka-Consumer-to-standalone-Spark-JavaStream-Consumer-tp23930.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: When to use underlying data management layer versus standalone Spark?

2015-06-24 Thread Sandy Ryza
Hi Michael,

Spark itself is an execution engine, not a storage system.  While it has
facilities for caching data in memory, think about these the way you would
think about a process on a single machine leveraging memory - the source
data needs to be stored somewhere, and you need to be able to access it
quickly in case there's a failure.

To echo what Sonal said, it depends on the needs of your application.  If
you expect to mostly write jobs that read and write data in batch, storing
data on HDFS in a binary format like Avro or Parquet will give you the best
performance.  If other systems need random access to your data, you'd want
to consider a system like HBase or Cassandra, though these are likely to
suffer a little bit on performance and incur higher operational overhead.
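
As a small illustration of the batch option, a Spark SQL sketch along these lines
(1.4-era API; the HDFS paths and the eventType column are made up for
illustration) writes raw JSON out as Parquet and reads the Parquet copy back for
batch queries:

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object ParquetBatchSketch {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("ParquetBatchSketch"))
    val sqlContext = new SQLContext(sc)

    // convert raw JSON events into columnar Parquet, which batch jobs can scan efficiently
    val events = sqlContext.read.json("hdfs:///data/raw/events")
    events.write.parquet("hdfs:///data/events_parquet")

    // later batch jobs read the Parquet copy directly
    val parquetEvents = sqlContext.read.parquet("hdfs:///data/events_parquet")
    parquetEvents.groupBy("eventType").count().show()

    sc.stop()
  }
}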

-Sandy

On Tue, Jun 23, 2015 at 11:21 PM, Sonal Goyal sonalgoy...@gmail.com wrote:

 When you deploy spark over hadoop, you typically want to leverage the
 replication of hdfs or your data is already in hadoop. Again, if your data
 is already in Cassandra or if you want to do updateable atomic row
 operations and access to your data as well as run analytic jobs, that may
 be another case.
 On Jun 24, 2015 1:17 AM, commtech michael.leon...@opco.com wrote:

 Hi,

 I work at a large financial institution in New York. We're looking into
 Spark and trying to learn more about the deployment/use cases for
 real-time
 analytics with Spark. When would it be better to deploy standalone Spark
 versus Spark on top of a more comprehensive data management layer (Hadoop,
 Cassandra, MongoDB, etc.)? If you do deploy on top of one of these, are
 there different use cases where one of these database management layers
 are
 better or worse?

 Any color would be very helpful. Thank you in advance.

 Sincerely,
 Michael





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/When-to-use-underlying-data-management-layer-versus-standalone-Spark-tp23455.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




When to use underlying data management layer versus standalone Spark?

2015-06-23 Thread commtech
Hi,

I work at a large financial institution in New York. We're looking into
Spark and trying to learn more about the deployment/use cases for real-time
analytics with Spark. When would it be better to deploy standalone Spark
versus Spark on top of a more comprehensive data management layer (Hadoop,
Cassandra, MongoDB, etc.)? If you do deploy on top of one of these, are
there different use cases where one of these database management layers are
better or worse?

Any color would be very helpful. Thank you in advance.

Sincerely,
Michael





--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/When-to-use-underlying-data-management-layer-versus-standalone-Spark-tp23455.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: When to use underlying data management layer versus standalone Spark?

2015-06-23 Thread canan chen
I don't think this is the correct question.  Spark can be deployed on
different cluster manager frameworks like standalone, YARN and Mesos.
Spark can't run without one of these cluster manager frameworks; that means
Spark depends on a cluster manager framework.

The data management layer is upstream of Spark and independent of it,
but Spark does provide APIs to access different data management layers.
Which data store to use should depend on your upstream application;
it's not related to Spark.


On Wed, Jun 24, 2015 at 3:46 AM, commtech michael.leon...@opco.com wrote:

 Hi,

 I work at a large financial institution in New York. We're looking into
 Spark and trying to learn more about the deployment/use cases for real-time
 analytics with Spark. When would it be better to deploy standalone Spark
 versus Spark on top of a more comprehensive data management layer (Hadoop,
 Cassandra, MongoDB, etc.)? If you do deploy on top of one of these, are
 there different use cases where one of these database management layers are
 better or worse?

 Any color would be very helpful. Thank you in advance.

 Sincerely,
 Michael





 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/When-to-use-underlying-data-management-layer-versus-standalone-Spark-tp23455.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




How to start Thrift JDBC server as part of standalone spark application?

2015-04-23 Thread Vladimir Grigor
Hello,

I would like to export RDDs/DataFrames via a JDBC SQL interface from a
standalone application, for the currently stable Spark v1.3.1.

I found one way of doing it, but it requires the use of the @DeveloperApi method
HiveThriftServer2.startWithContext(sqlContext)

Is there a better, production-level approach to do that?

Full code snippet is below:
// you can run it via:
// ../spark/bin/spark-submit --master local[*] --class SimpleApp
target/scala-2.10/simple-project_2.10-1.0.jar src/test/resources/1.json
tableFromJson


import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.hive.HiveContext

object SimpleApp {

  def main(args: Array[String]) {

    if (args.length != 2) {
      Console.err.println("Usage: app source_json_file table_name")
      System.exit(1)
    }
    val sourceFile = args(0)
    val tableName = args(1)

    val sparkConf = new SparkConf().setAppName("Simple Application")
    val sc = new SparkContext(sparkConf)
    val sqlContext = new HiveContext(sc)

    val df = sqlContext.jsonFile(sourceFile)
    df.registerTempTable(tableName)

    println("Registered temp table %s for data source %s".format(tableName, sourceFile))

    HiveThriftServer2.startWithContext(sqlContext)

  }
}





Best, Vladimir Grigor


distcp problems on ec2 standalone spark cluster

2015-03-09 Thread roni
I got past the issues with the cluster not started problem by adding Yarn
to mapreduce.framework.name.
But when I try to distcp, if I use a URI with s3://path to my bucket, I
get invalid path even though the bucket exists.
If I use s3n:// it just hangs.
Did anyone else face anything like that?

I also noticed that this script puts the Cloudera Hadoop image on the
cluster. Does it matter?
Thanks
-R


Re: distcp on ec2 standalone spark cluster

2015-03-08 Thread Akhil Das
Did you follow these steps? https://wiki.apache.org/hadoop/AmazonS3  Also
make sure your jobtracker/mapreduce processes are running fine.

Thanks
Best Regards

On Sun, Mar 8, 2015 at 7:32 AM, roni roni.epi...@gmail.com wrote:

 Did you get this to work?
 I got pass the issues with the cluster not startetd problem
 I am having problem where distcp with s3 URI says incorrect forlder path
 and
 s3n:// hangs.
 stuck for 2 days :(
 Thanks
 -R



 --
 View this message in context:
 http://apache-spark-user-list.1001560.n3.nabble.com/distcp-on-ec2-standalone-spark-cluster-tp13652p21957.html
 Sent from the Apache Spark User List mailing list archive at Nabble.com.

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: distcp on ec2 standalone spark cluster

2015-03-07 Thread roni
Did you get this to work?
I got past the issues with the cluster not started problem.
I am having a problem where distcp with an s3 URI says incorrect folder path and
s3n:// hangs.
Stuck for 2 days :(
Thanks
-R



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/distcp-on-ec2-standalone-spark-cluster-tp13652p21957.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Standalone spark

2015-02-25 Thread Sean Owen
Spark and Hadoop should be listed as 'provided' dependencies in your
Maven or SBT build. That should still make them available at compile time.
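
A minimal sbt sketch of that (the artifact versions here are assumptions):

// "provided" keeps Spark and Hadoop on the compile classpath without
// bundling them into the application jar
libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"    % "1.2.1" % "provided",
  "org.apache.hadoop" %  "hadoop-client" % "2.4.0" % "provided"
)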

On Wed, Feb 25, 2015 at 10:42 PM, boci boci.b...@gmail.com wrote:
 Hi,

 I have a little question. I want to develop a spark based application, but
 spark depend to hadoop-client library. I think it's not necessary (spark
 standalone) so I excluded from sbt file.. the result is interesting. My
 trait where I create the spark context not compiled.

 The error:
 ...
  scala.reflect.internal.Types$TypeError: bad symbolic reference. A signature
 in SparkContext.class refers to term mapred
 [error] in package org.apache.hadoop which is not available.
 [error] It may be completely missing from the current classpath, or the
 version on
 [error] the classpath might be incompatible with the version used when
 compiling SparkContext.class.
 ...

 I used this class for integration test. I'm using windows and I don't want
 to using hadoop for integration test. How can I solve this?

 Thanks
 Janos


-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Standalone spark

2015-02-25 Thread boci
Thanks dude... I think I will pull up a docker container for integration
test

--
Skype: boci13, Hangout: boci.b...@gmail.com

On Thu, Feb 26, 2015 at 12:22 AM, Sean Owen so...@cloudera.com wrote:

 Yes, been on the books for a while ...
 https://issues.apache.org/jira/browse/SPARK-2356
 That one just may always be a known 'gotcha' in Windows; it's kind of
 a Hadoop gotcha. I don't know that Spark 100% works on Windows and it
 isn't tested on Windows.

 On Wed, Feb 25, 2015 at 11:05 PM, boci boci.b...@gmail.com wrote:
  Thanks your fast answer...
  in windows it's not working, because hadoop (surprise suprise) need
  winutils.exe. Without this it's not working, but if you not set the
 hadoop
  directory You simply get
 
  15/02/26 00:03:16 ERROR Shell: Failed to locate the winutils binary in
 the
  hadoop binary path
  java.io.IOException: Could not locate executable null\bin\winutils.exe in
  the Hadoop binaries.
 
  b0c1
 
 
 
 --
  Skype: boci13, Hangout: boci.b...@gmail.com
 
  On Wed, Feb 25, 2015 at 11:50 PM, Sean Owen so...@cloudera.com wrote:
 
  Spark and Hadoop should be listed as 'provided' dependency in your
  Maven or SBT build. But that should make it available at compile time.
 
  On Wed, Feb 25, 2015 at 10:42 PM, boci boci.b...@gmail.com wrote:
   Hi,
  
   I have a little question. I want to develop a spark based application,
   but
   spark depend to hadoop-client library. I think it's not necessary
 (spark
   standalone) so I excluded from sbt file.. the result is interesting.
 My
   trait where I create the spark context not compiled.
  
   The error:
   ...
scala.reflect.internal.Types$TypeError: bad symbolic reference. A
   signature
   in SparkContext.class refers to term mapred
   [error] in package org.apache.hadoop which is not available.
   [error] It may be completely missing from the current classpath, or
 the
   version on
   [error] the classpath might be incompatible with the version used when
   compiling SparkContext.class.
   ...
  
   I used this class for integration test. I'm using windows and I don't
   want
   to using hadoop for integration test. How can I solve this?
  
   Thanks
   Janos
  
 
 



Re: Standalone spark

2015-02-25 Thread Sean Owen
Yes, been on the books for a while ...
https://issues.apache.org/jira/browse/SPARK-2356
That one just may always be a known 'gotcha' in Windows; it's kind of
a Hadoop gotcha. I don't know that Spark 100% works on Windows and it
isn't tested on Windows.

On Wed, Feb 25, 2015 at 11:05 PM, boci boci.b...@gmail.com wrote:
 Thanks your fast answer...
 in windows it's not working, because hadoop (surprise suprise) need
 winutils.exe. Without this it's not working, but if you not set the hadoop
 directory You simply get

 15/02/26 00:03:16 ERROR Shell: Failed to locate the winutils binary in the
 hadoop binary path
 java.io.IOException: Could not locate executable null\bin\winutils.exe in
 the Hadoop binaries.

 b0c1


 --
 Skype: boci13, Hangout: boci.b...@gmail.com

 On Wed, Feb 25, 2015 at 11:50 PM, Sean Owen so...@cloudera.com wrote:

 Spark and Hadoop should be listed as 'provided' dependency in your
 Maven or SBT build. But that should make it available at compile time.

 On Wed, Feb 25, 2015 at 10:42 PM, boci boci.b...@gmail.com wrote:
  Hi,
 
  I have a little question. I want to develop a spark based application,
  but
  spark depend to hadoop-client library. I think it's not necessary (spark
  standalone) so I excluded from sbt file.. the result is interesting. My
  trait where I create the spark context not compiled.
 
  The error:
  ...
   scala.reflect.internal.Types$TypeError: bad symbolic reference. A
  signature
  in SparkContext.class refers to term mapred
  [error] in package org.apache.hadoop which is not available.
  [error] It may be completely missing from the current classpath, or the
  version on
  [error] the classpath might be incompatible with the version used when
  compiling SparkContext.class.
  ...
 
  I used this class for integration test. I'm using windows and I don't
  want
  to using hadoop for integration test. How can I solve this?
 
  Thanks
  Janos
 



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Whether standalone spark support kerberos?

2015-02-05 Thread Kostas Sakellis
Standalone mode does not support talking to a kerberized HDFS. If you want
to talk to a kerberized (secure) HDFS cluster, I suggest you use Spark on
YARN.

On Wed, Feb 4, 2015 at 2:29 AM, Jander g jande...@gmail.com wrote:

 Hope someone helps me. Thanks.

 On Wed, Feb 4, 2015 at 6:14 PM, Jander g jande...@gmail.com wrote:

 We have a standalone spark cluster for kerberos test.

 But when reading from hdfs, i get error output: Can't get Master Kerberos
 principal for use as renewer.

 So Whether standalone spark support kerberos? can anyone confirm it? or
 what i missed?

 Thanks in advance.

 --
 Thanks,
 Jander




 --
 Thanks,
 Jander



Re: Whether standalone spark support kerberos?

2015-02-04 Thread Jander g
Hope someone helps me. Thanks.

On Wed, Feb 4, 2015 at 6:14 PM, Jander g jande...@gmail.com wrote:

 We have a standalone spark cluster for kerberos test.

 But when reading from hdfs, i get error output: Can't get Master Kerberos
 principal for use as renewer.

 So Whether standalone spark support kerberos? can anyone confirm it? or
 what i missed?

 Thanks in advance.

 --
 Thanks,
 Jander




-- 
Thanks,
Jander


Whether standalone spark support kerberos?

2015-02-04 Thread Jander g
We have a standalone Spark cluster for a Kerberos test.

But when reading from HDFS, I get the error output: Can't get Master Kerberos
principal for use as renewer.

So does standalone Spark support Kerberos? Can anyone confirm it, or tell me
what I missed?

Thanks in advance.

-- 
Thanks,
Jander


Standalone Spark program

2014-12-18 Thread Akshat Aranya
Hi,

I am building a Spark-based service which requires initialization of a
SparkContext in a main():

def main(args: Array[String]) {
  val conf = new SparkConf(false)
    .setMaster("spark://foo.example.com:7077")
    .setAppName("foobar")

  val sc = new SparkContext(conf)
  val rdd = sc.parallelize(0 until 255)
  val res = rdd.mapPartitions(it => it).take(1)
  println(s"res=$res")
  sc.stop()
}

This code works fine via the REPL, but not as a standalone program; it causes a
ClassNotFoundException.  This has me confused about how code is shipped out
to executors.  When used via the REPL, does the mapPartitions closure, it => it,
get sent out when the REPL statement is executed?  When this code is run as
a standalone program (not via spark-submit), is the compiled code expected
to be present at the executor?

Thanks,
Akshat


Re: Standalone Spark program

2014-12-18 Thread Akhil Das
You can build a jar of your project and add it to the SparkContext
(sc.addJar("/path/to/your/project.jar")); then it will get shipped to the
workers and hence no ClassNotFoundException!
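
Applied to the main() in your program, that would look roughly like this (the
jar path is just an illustration):

def main(args: Array[String]) {
  val conf = new SparkConf(false)
    .setMaster("spark://foo.example.com:7077")
    .setAppName("foobar")
    // ship the application jar to the executors; the path is illustrative
    .setJars(Seq("/path/to/your/project.jar"))

  val sc = new SparkContext(conf)
  // alternatively, add it after the context is created:
  // sc.addJar("/path/to/your/project.jar")

  val rdd = sc.parallelize(0 until 255)
  val res = rdd.mapPartitions(it => it).take(1)
  println(s"res=$res")
  sc.stop()
}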

Thanks
Best Regards

On Thu, Dec 18, 2014 at 10:06 PM, Akshat Aranya aara...@gmail.com wrote:

 Hi,

 I am building a Spark-based service which requires initialization of a
 SparkContext in a main():

 def main(args: Array[String]) {
 val conf = new SparkConf(false)
   .setMaster(spark://foo.example.com:7077)
   .setAppName(foobar)

 val sc = new SparkContext(conf)
 val rdd = sc.parallelize(0 until 255)
 val res =  rdd.mapPartitions(it = it).take(1)
 println(sres=$res)
 sc.stop()
 }

 This code works fine via REPL, but not as a standalone program; it causes
 a ClassNotFoundException.  This has me confused about how code is shipped
 out to executors.  When using via REPL, does the mapPartitions closure,
 it=it, get sent out when the REPL statement is executed?  When this code
 is run as a standalone program (not via spark-submit), is the compiled code
 expected to be present at the the executor?

 Thanks,
 Akshat




Re: Standalone Spark program

2014-12-18 Thread Andrew Or
Hey Akshat,

What is the class that is not found, is it a Spark class or classes that
you define in your own application? If the latter, then Akhil's solution
should work (alternatively you can also pass the jar through the --jars
command line option in spark-submit).

If it's a Spark class, however, it's likely that the Spark assembly jar is
not present on the worker nodes. When you build Spark on the cluster, you
will need to rsync it to the same path on all the nodes in your cluster.
For more information, see
http://spark.apache.org/docs/latest/spark-standalone.html.

-Andrew

2014-12-18 10:29 GMT-08:00 Akhil Das ak...@sigmoidanalytics.com:

 You can build a jar of your project and add it to the sparkContext
 (sc.addJar(/path/to/your/project.jar)) then it will get shipped to the
 worker and hence no classNotfoundException!

 Thanks
 Best Regards

 On Thu, Dec 18, 2014 at 10:06 PM, Akshat Aranya aara...@gmail.com wrote:

 Hi,

 I am building a Spark-based service which requires initialization of a
 SparkContext in a main():

 def main(args: Array[String]) {
 val conf = new SparkConf(false)
   .setMaster(spark://foo.example.com:7077)
   .setAppName(foobar)

 val sc = new SparkContext(conf)
 val rdd = sc.parallelize(0 until 255)
 val res =  rdd.mapPartitions(it = it).take(1)
 println(sres=$res)
 sc.stop()
 }

 This code works fine via REPL, but not as a standalone program; it causes
 a ClassNotFoundException.  This has me confused about how code is shipped
 out to executors.  When using via REPL, does the mapPartitions closure,
 it=it, get sent out when the REPL statement is executed?  When this code
 is run as a standalone program (not via spark-submit), is the compiled code
 expected to be present at the the executor?

 Thanks,
 Akshat




Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2;

I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and
~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error
when trying to run distcp:

ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered

java.io.IOException: Cannot initialize Cluster. Please check your
configuration for mapreduce.framework.name and the correspond server
addresses.

at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)

at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:83)

at org.apache.hadoop.mapreduce.Cluster.<init>(Cluster.java:76)

at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)

at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)

at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

Any idea?

Thanks!
Tomer

On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote:
 If I recall, you should be able to start Hadoop MapReduce using
 ~/ephemeral-hdfs/sbin/start-mapred.sh.

 On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote:

 Hi,

 I would like to copy log files from s3 to the cluster's
 ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
 running on the cluster - I'm getting the exception below.

 Is there a way to activate it, or is there a spark alternative to distcp?

 Thanks,
 Tomer

 mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
 org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
 Invalid mapreduce.jobtracker.address configuration value for
 LocalJobRunner : XXX:9001

 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered

 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.

 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)

 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)

 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)

 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)

 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)

 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)

 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Frank Austin Nothaft
Tomer,

To use distcp, you need to have a Hadoop compute cluster up. start-dfs just 
restarts HDFS. I don’t have a Spark 1.0.2 cluster up right now, but there 
should be a start-mapred*.sh or start-all.sh script that will launch the Hadoop 
MapReduce cluster that you will need for distcp.

Regards,

Frank Austin Nothaft
fnoth...@berkeley.edu
fnoth...@eecs.berkeley.edu
202-340-0466

On Sep 8, 2014, at 12:28 AM, Tomer Benyamini tomer@gmail.com wrote:

 ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2;
 
 I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and
 ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error
 when trying to run distcp:
 
 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
 
 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.
 
 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
 
 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
 
 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
 
 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
 
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 
 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
 
 Any idea?
 
 Thanks!
 Tomer
 
 On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote:
 If I recall, you should be able to start Hadoop MapReduce using
 ~/ephemeral-hdfs/sbin/start-mapred.sh.
 
 On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote:
 
 Hi,
 
 I would like to copy log files from s3 to the cluster's
 ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
 running on the cluster - I'm getting the exception below.
 
 Is there a way to activate it, or is there a spark alternative to distcp?
 
 Thanks,
 Tomer
 
 mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
 org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
 Invalid mapreduce.jobtracker.address configuration value for
 LocalJobRunner : XXX:9001
 
 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
 
 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.
 
 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
 
 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
 
 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
 
 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
 
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 
 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org
 



Re: Standalone spark cluster. Can't submit job programmatically - java.io.InvalidClassException

2014-09-08 Thread DrKhu
After wasting a lot of time, I've found the problem. Even though I don't use
hadoop/hdfs in my application, the Hadoop client matters. The problem was the
hadoop-client version: it was different from the Hadoop version Spark was
built for. Spark's Hadoop version was 1.2.1, but in my application it was 2.4.

When I changed the version of the Hadoop client to 1.2.1 in my app, I was able
to execute Spark code on the cluster.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Standalone-spark-cluster-Can-t-submit-job-programmatically-java-io-InvalidClassException-tp13456p13688.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Nicholas Chammas
Tomer,

Did you try start-all.sh? It worked for me the last time I tried using
distcp, and it worked for this guy too
http://stackoverflow.com/a/18083790/877069.

Nick

On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote:

 ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2;

 I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and
 ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error
 when trying to run distcp:

 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered

 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.

 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)

 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)

 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)

 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)

 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)

 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)

 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

 Any idea?

 Thanks!
 Tomer

 On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote:
  If I recall, you should be able to start Hadoop MapReduce using
  ~/ephemeral-hdfs/sbin/start-mapred.sh.
 
  On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com
 wrote:
 
  Hi,
 
  I would like to copy log files from s3 to the cluster's
  ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
  running on the cluster - I'm getting the exception below.
 
  Is there a way to activate it, or is there a spark alternative to
 distcp?
 
  Thanks,
  Tomer
 
  mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
  org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
  Invalid mapreduce.jobtracker.address configuration value for
  LocalJobRunner : XXX:9001
 
  ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
 
  java.io.IOException: Cannot initialize Cluster. Please check your
  configuration for mapreduce.framework.name and the correspond server
  addresses.
 
  at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
 
  at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
 
  at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
 
  at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
 
  at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
 
  at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
 
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 
  at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
Still no luck, even when running stop-all.sh followed by start-all.sh.

On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas
nicholas.cham...@gmail.com wrote:
 Tomer,

 Did you try start-all.sh? It worked for me the last time I tried using
 distcp, and it worked for this guy too.

 Nick


 On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote:

 ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2;

 I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and
 ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error
 when trying to run distcp:

 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered

 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.

 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)

 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)

 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)

 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)

 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)

 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)

 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

 Any idea?

 Thanks!
 Tomer

 On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote:
  If I recall, you should be able to start Hadoop MapReduce using
  ~/ephemeral-hdfs/sbin/start-mapred.sh.
 
  On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com
  wrote:
 
  Hi,
 
  I would like to copy log files from s3 to the cluster's
  ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
  running on the cluster - I'm getting the exception below.
 
  Is there a way to activate it, or is there a spark alternative to
  distcp?
 
  Thanks,
  Tomer
 
  mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
  org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
  Invalid mapreduce.jobtracker.address configuration value for
  LocalJobRunner : XXX:9001
 
  ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
 
  java.io.IOException: Cannot initialize Cluster. Please check your
  configuration for mapreduce.framework.name and the correspond server
  addresses.
 
  at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
 
  at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
 
  at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
 
  at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
 
  at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
 
  at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
 
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 
  at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
 
  -
  To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
  For additional commands, e-mail: user-h...@spark.apache.org
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Ye Xianjin
What did you see in the log? Was there anything related to MapReduce?
Can you log into your HDFS (data) node, use jps to list all Java processes, and
confirm whether there is a TaskTracker process (or NodeManager) running with the
DataNode process?


-- 
Ye Xianjin
Sent with Sparrow (http://www.sparrowmailapp.com/?sig)


On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote:

 Still no luck, even when running stop-all.sh (http://stop-all.sh) followed by 
 start-all.sh (http://start-all.sh).
 
 On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas
 nicholas.cham...@gmail.com (mailto:nicholas.cham...@gmail.com) wrote:
  Tomer,
  
  Did you try start-all.sh (http://start-all.sh)? It worked for me the last 
  time I tried using
  distcp, and it worked for this guy too.
  
  Nick
  
  
  On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com 
  (mailto:tomer@gmail.com) wrote:
   
   ~/ephemeral-hdfs/sbin/start-mapred.sh (http://start-mapred.sh) does not 
   exist on spark-1.0.2;
   
   I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh 
   (http://stop-dfs.sh) and
   ~/ephemeral-hdfs/sbin/start-dfs.sh (http://start-dfs.sh), but still 
   getting the same error
   when trying to run distcp:
   
   ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
   
   java.io.IOException: Cannot initialize Cluster. Please check your
   configuration for mapreduce.framework.name 
   (http://mapreduce.framework.name) and the correspond server
   addresses.
   
   at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
   
   at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
   
   at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
   
   at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
   
   at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
   
   at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
   
   at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
   
   at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
   
   Any idea?
   
   Thanks!
   Tomer
   
   On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com 
   (mailto:rosenvi...@gmail.com) wrote:
If I recall, you should be able to start Hadoop MapReduce using
~/ephemeral-hdfs/sbin/start-mapred.sh (http://start-mapred.sh).

On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com 
(mailto:tomer@gmail.com)
wrote:
 
 Hi,
 
 I would like to copy log files from s3 to the cluster's
 ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
 running on the cluster - I'm getting the exception below.
 
 Is there a way to activate it, or is there a spark alternative to
 distcp?
 
 Thanks,
 Tomer
 
 mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
 org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
 Invalid mapreduce.jobtracker.address configuration value for
 LocalJobRunner : XXX:9001
 
 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
 
 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name 
 (http://mapreduce.framework.name) and the correspond server
 addresses.
 
 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
 
 at 
 org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
 
 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
 
 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
 
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 
 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 


   
   
   -
   To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
   (mailto:user-unsubscr...@spark.apache.org)
   For additional commands, e-mail: user-h...@spark.apache.org 
   (mailto:user-h...@spark.apache.org)
   
  
  
 
 
 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
 (mailto:user-unsubscr...@spark.apache.org)
 For additional commands, e-mail: user-h...@spark.apache.org 
 (mailto:user-h...@spark.apache.org)
 
 




Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Tomer Benyamini
No tasktracker or nodemanager. This is what I see:

On the master:

org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
org.apache.hadoop.hdfs.server.namenode.NameNode

On the data node (slave):

org.apache.hadoop.hdfs.server.datanode.DataNode



On Mon, Sep 8, 2014 at 6:39 PM, Ye Xianjin advance...@gmail.com wrote:
 what did you see in the log? was there anything related to mapreduce?
 can you log into your hdfs (data) node, use jps to list all java process and
 confirm whether there is a tasktracker process (or nodemanager) running with
 datanode process

 --
 Ye Xianjin
 Sent with Sparrow

 On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote:

 Still no luck, even when running stop-all.sh followed by start-all.sh.

 On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:

 Tomer,

 Did you try start-all.sh? It worked for me the last time I tried using
 distcp, and it worked for this guy too.

 Nick


 On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote:


 ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2;

 I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and
 ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error
 when trying to run distcp:

 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered

 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.

 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)

 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)

 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)

 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)

 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)

 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)

 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

 Any idea?

 Thanks!
 Tomer

 On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote:

 If I recall, you should be able to start Hadoop MapReduce using
 ~/ephemeral-hdfs/sbin/start-mapred.sh.

 On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com
 wrote:


 Hi,

 I would like to copy log files from s3 to the cluster's
 ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
 running on the cluster - I'm getting the exception below.

 Is there a way to activate it, or is there a spark alternative to
 distcp?

 Thanks,
 Tomer

 mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
 org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
 Invalid mapreduce.jobtracker.address configuration value for
 LocalJobRunner : XXX:9001

 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered

 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.

 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)

 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)

 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)

 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)

 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)

 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)

 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org


 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: distcp on ec2 standalone spark cluster

2014-09-08 Thread Ye Xianjin
Well, this means you didn't start a compute cluster, most likely because the
wrong value of mapreduce.jobtracker.address prevents the slave node from
starting the node manager. (I am not familiar with the ec2 script, so I don't
know whether the slave node has a node manager installed or not.)
Can you check the slave node's Hadoop daemon log to see whether you started
the nodemanager but it failed, or there is no nodemanager to start? The log
file location defaults to
/var/log/hadoop-xxx if my memory is correct.

Sent from my iPhone

 On Sep 9, 2014, at 0:08, Tomer Benyamini tomer@gmail.com wrote:
 
 No tasktracker or nodemanager. This is what I see:
 
 On the master:
 
 org.apache.hadoop.yarn.server.resourcemanager.ResourceManager
 org.apache.hadoop.hdfs.server.namenode.SecondaryNameNode
 org.apache.hadoop.hdfs.server.namenode.NameNode
 
 On the data node (slave):
 
 org.apache.hadoop.hdfs.server.datanode.DataNode
 
 
 
 On Mon, Sep 8, 2014 at 6:39 PM, Ye Xianjin advance...@gmail.com wrote:
 what did you see in the log? was there anything related to mapreduce?
 can you log into your hdfs (data) node, use jps to list all java process and
 confirm whether there is a tasktracker process (or nodemanager) running with
 datanode process
 
 --
 Ye Xianjin
 Sent with Sparrow
 
 On Monday, September 8, 2014 at 11:13 PM, Tomer Benyamini wrote:
 
 Still no luck, even when running stop-all.sh followed by start-all.sh.
 
 On Mon, Sep 8, 2014 at 5:57 PM, Nicholas Chammas
 nicholas.cham...@gmail.com wrote:
 
 Tomer,
 
 Did you try start-all.sh? It worked for me the last time I tried using
 distcp, and it worked for this guy too.
 
 Nick
 
 
 On Mon, Sep 8, 2014 at 3:28 AM, Tomer Benyamini tomer@gmail.com wrote:
 
 
 ~/ephemeral-hdfs/sbin/start-mapred.sh does not exist on spark-1.0.2;
 
 I restarted hdfs using ~/ephemeral-hdfs/sbin/stop-dfs.sh and
 ~/ephemeral-hdfs/sbin/start-dfs.sh, but still getting the same error
 when trying to run distcp:
 
 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
 
 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.
 
 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
 
 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
 
 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
 
 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
 
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 
 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
 
 Any idea?
 
 Thanks!
 Tomer
 
 On Sun, Sep 7, 2014 at 9:27 PM, Josh Rosen rosenvi...@gmail.com wrote:
 
 If I recall, you should be able to start Hadoop MapReduce using
 ~/ephemeral-hdfs/sbin/start-mapred.sh.
 
 On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com
 wrote:
 
 
 Hi,
 
 I would like to copy log files from s3 to the cluster's
 ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
 running on the cluster - I'm getting the exception below.
 
 Is there a way to activate it, or is there a spark alternative to
 distcp?
 
 Thanks,
 Tomer
 
 mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
 org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
 Invalid mapreduce.jobtracker.address configuration value for
 LocalJobRunner : XXX:9001
 
 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
 
 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.
 
 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
 
 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
 
 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
 
 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
 
 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
 
 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 
 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
 
-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Tomer Benyamini
Hi,

I would like to make sure I'm not exceeding the quota on the local
cluster's hdfs. I have a couple of questions:

1. How do I know the quota? Here's the output of hadoop fs -count -q,
which essentially does not tell me a lot (the columns are decoded in the
sketch after question 2):

[root@ip-172-31-7-49 ~]$ hadoop fs -count -q /
  2147483647  2147482006none inf
 4 163725412205559 /

2. What should I do to increase the quota? Should I bring down the
existing slaves and upgrade to ones with more storage? Is there a way
to add disks to existing slaves? I'm using the default m1.large slaves
set up using the spark-ec2 script.
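
For reference, and assuming current Hadoop behaviour, hadoop fs -count -q 
prints eight columns in this order: QUOTA, REMAINING_QUOTA, SPACE_QUOTA, 
REMAINING_SPACE_QUOTA, DIR_COUNT, FILE_COUNT, CONTENT_SIZE, PATHNAME; the 
"none inf" above means no space quota is set on /. A minimal sketch of 
inspecting and setting quotas, with placeholder paths and sizes:

    # show name/space quota and usage for a directory
    hadoop fs -count -q /user/logs

    # set (or raise) the space quota to a hypothetical 500 GB; requires the HDFS superuser
    hadoop dfsadmin -setSpaceQuota 500g /user/logs

    # remove the space quota again
    hadoop dfsadmin -clrSpaceQuota /user/logs

Note that quotas only cap usage inside HDFS; the physical capacity of the 
ephemeral disks is a separate matter, which is what question 2 is about.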

Thanks,
Tomer

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Ognen Duzlevski


On 9/7/2014 7:27 AM, Tomer Benyamini wrote:

2. What should I do to increase the quota? Should I bring down the
existing slaves and upgrade to ones with more storage? Is there a way
to add disks to existing slaves? I'm using the default m1.large slaves
set up using the spark-ec2 script.

Take a look at: http://www.ec2instances.info/

There you will find the available EC2 instances with their associated 
costs and how much ephemeral space they come with. Once you pick an 
instance you get only so much ephemeral space. You can always add drives 
but they will be EBS and not physically attached to the instance.
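
If EBS is acceptable, a rough sketch of adding a volume with the aws CLI 
(sizes, ids, zone and device names below are placeholders, and dfs.data.dir 
assumes a Hadoop 1.x ephemeral-hdfs):

    # create and attach a 100 GB EBS volume to a slave
    aws ec2 create-volume --size 100 --availability-zone us-east-1a
    aws ec2 attach-volume --volume-id vol-12345678 --instance-id i-12345678 --device /dev/sdf

    # on the slave: format, mount, and tell HDFS about the new directory
    mkfs.ext4 /dev/xvdf
    mkdir -p /vol-ebs/dfs
    mount /dev/xvdf /vol-ebs
    # then append /vol-ebs/dfs to dfs.data.dir in hdfs-site.xml and restart the datanode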


Ognen

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Adding quota to the ephemeral hdfs on a standalone spark cluster on ec2

2014-09-07 Thread Tomer Benyamini
Thanks! I found the hdfs ui via this port - http://[master-ip]:50070/.
It shows only 1 datanode, though, even though I have 4 slaves in my cluster.
Any idea why?
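
One way to confirm how many datanodes actually registered with the namenode 
(assuming the ephemeral-hdfs binaries are the ones to use) is:

    # prints capacity plus the list of live and dead datanodes
    ~/ephemeral-hdfs/bin/hadoop dfsadmin -report

If slaves are missing from the live list, their datanode logs usually say why 
(often a security-group or hostname issue).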

On Sun, Sep 7, 2014 at 4:29 PM, Ognen Duzlevski
ognen.duzlev...@gmail.com wrote:

 On 9/7/2014 7:27 AM, Tomer Benyamini wrote:

 2. What should I do to increase the quota? Should I bring down the
 existing slaves and upgrade to ones with more storage? Is there a way
 to add disks to existing slaves? I'm using the default m1.large slaves
 set up using the spark-ec2 script.

 Take a look at: http://www.ec2instances.info/

 There you will find the available EC2 instances with their associated costs
 and how much ephemeral space they come with. Once you pick an instance you
 get only so much ephemeral space. You can always add drives but they will be
 EBS and not physically attached to the instance.

 Ognen



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



distcp on ec2 standalone spark cluster

2014-09-07 Thread Tomer Benyamini
Hi,

I would like to copy log files from s3 to the cluster's
ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
running on the cluster - I'm getting the exception below.

Is there a way to activate it, or is there a spark alternative to distcp?

Thanks,
Tomer

mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
Invalid mapreduce.jobtracker.address configuration value for
LocalJobRunner : XXX:9001

ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered

java.io.IOException: Cannot initialize Cluster. Please check your
configuration for mapreduce.framework.name and the correspond server
addresses.

at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)

at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)

at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)

at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)

at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)

at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)

at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Tomer Benyamini
I've installed a spark standalone cluster on ec2 as defined here -
https://spark.apache.org/docs/latest/ec2-scripts.html. I'm not sure if
mr1/2 is part of this installation.


On Sun, Sep 7, 2014 at 7:25 PM, Ye Xianjin advance...@gmail.com wrote:
 Distcp requires an mr1 (or mr2) cluster to start. Do you have a mapreduce
 cluster on your hdfs?
 And from the error message, it seems that you didn't specify your jobtracker
 address.

 --
 Ye Xianjin
 Sent with Sparrow

 On Sunday, September 7, 2014 at 9:42 PM, Tomer Benyamini wrote:

 Hi,

 I would like to copy log files from s3 to the cluster's
 ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
 running on the cluster - I'm getting the exception below.

 Is there a way to activate it, or is there a spark alternative to distcp?

 Thanks,
 Tomer

 mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
 org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
 Invalid mapreduce.jobtracker.address configuration value for
 LocalJobRunner : XXX:9001

 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered

 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.

 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)

 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)

 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)

 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)

 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)

 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)

 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)




-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Nicholas Chammas
I think you need to run start-all.sh or something similar on the EC2
cluster. MR is installed but is not running by default on EC2 clusters spun
up by spark-ec2.
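
Roughly, on a spark-ec2 cluster that would look like the sketch below (script 
locations differ between AMI and Spark versions, so the exact paths are 
assumptions):

    # start the HDFS daemons (namenode + datanodes)
    ~/ephemeral-hdfs/sbin/start-dfs.sh
    # start the MapReduce daemons (jobtracker + tasktrackers); on some images this
    # script lives under bin/ instead of sbin/, or is wrapped by start-all.sh
    ~/ephemeral-hdfs/sbin/start-mapred.sh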

On Sun, Sep 7, 2014 at 12:33 PM, Tomer Benyamini tomer@gmail.com
wrote:

 I've installed a spark standalone cluster on ec2 as defined here -
 https://spark.apache.org/docs/latest/ec2-scripts.html. I'm not sure if
 mr1/2 is part of this installation.


 On Sun, Sep 7, 2014 at 7:25 PM, Ye Xianjin advance...@gmail.com wrote:
  Distcp requires a mr1(or mr2) cluster to start. Do you have a mapreduce
  cluster on your hdfs?
  And from the error message, it seems that you didn't specify your
 jobtracker
  address.
 
  --
  Ye Xianjin
  Sent with Sparrow
 
  On Sunday, September 7, 2014 at 9:42 PM, Tomer Benyamini wrote:
 
  Hi,
 
  I would like to copy log files from s3 to the cluster's
  ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
  running on the cluster - I'm getting the exception below.
 
  Is there a way to activate it, or is there a spark alternative to distcp?
 
  Thanks,
  Tomer
 
  mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
  org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
  Invalid mapreduce.jobtracker.address configuration value for
  LocalJobRunner : XXX:9001
 
  ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered
 
  java.io.IOException: Cannot initialize Cluster. Please check your
  configuration for mapreduce.framework.name and the correspond server
  addresses.
 
  at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)
 
  at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)
 
  at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)
 
  at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)
 
  at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)
 
  at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)
 
  at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)
 
  at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)
 
 
 

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




Re: distcp on ec2 standalone spark cluster

2014-09-07 Thread Josh Rosen
If I recall, you should be able to start Hadoop MapReduce using
~/ephemeral-hdfs/sbin/start-mapred.sh.
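
Once the MapReduce daemons are up, a minimal distcp run from S3 into the 
ephemeral HDFS might look like this (bucket and paths are placeholders; the 
s3n credentials are assumed to be set in core-site.xml or in the URI):

    # copy the S3 logs into the cluster's ephemeral HDFS
    ~/ephemeral-hdfs/bin/hadoop distcp s3n://my-log-bucket/logs/ hdfs:///logs/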

On Sun, Sep 7, 2014 at 6:42 AM, Tomer Benyamini tomer@gmail.com wrote:

 Hi,

 I would like to copy log files from s3 to the cluster's
 ephemeral-hdfs. I tried to use distcp, but I guess mapred is not
 running on the cluster - I'm getting the exception below.

 Is there a way to activate it, or is there a spark alternative to distcp?

 Thanks,
 Tomer

 mapreduce.Cluster (Cluster.java:initialize(114)) - Failed to use
 org.apache.hadoop.mapred.LocalClientProtocolProvider due to error:
 Invalid mapreduce.jobtracker.address configuration value for
 LocalJobRunner : XXX:9001

 ERROR tools.DistCp (DistCp.java:run(126)) - Exception encountered

 java.io.IOException: Cannot initialize Cluster. Please check your
 configuration for mapreduce.framework.name and the correspond server
 addresses.

 at org.apache.hadoop.mapreduce.Cluster.initialize(Cluster.java:121)

 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:83)

 at org.apache.hadoop.mapreduce.Cluster.init(Cluster.java:76)

 at org.apache.hadoop.tools.DistCp.createMetaFolderPath(DistCp.java:352)

 at org.apache.hadoop.tools.DistCp.execute(DistCp.java:146)

 at org.apache.hadoop.tools.DistCp.run(DistCp.java:118)

 at org.apache.hadoop.util.ToolRunner.run(ToolRunner.java:70)

 at org.apache.hadoop.tools.DistCp.main(DistCp.java:374)

 -
 To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
 For additional commands, e-mail: user-h...@spark.apache.org




can't submit my application on standalone spark cluster

2014-08-06 Thread Andres Gomez Ferrer
Hi all,

My name is Andres and I'm starting to use Apache Spark.

I try to submit my spark.jar to my cluster using this:

spark-submit --class net.redborder.spark.RedBorderApplication --master 
spark://pablo02:7077 redborder-spark-selfcontained.jar

But when I did, my worker died, and my driver did too!

This is my driver log:

[INFO] 2014-08-06 06:30:12,025 [Driver-akka.actor.default-dispatcher-3]  
akka.event.slf4j.Slf4jLogger applyOrElse - Slf4jLogger started
[INFO] 2014-08-06 06:30:12,061 [Driver-akka.actor.default-dispatcher-3]  
Remoting apply$mcV$sp - Starting remoting
[ERROR] 2014-08-06 06:30:12,089 [Driver-akka.actor.default-dispatcher-6]  
akka.actor.ActorSystemImpl apply$mcV$sp - Uncaught fatal error from thread 
[Driver-akka.actor.default-dispatcher-3] shutting down ActorSystem [Driver]
java.lang.VerifyError: (class: 
org/jboss/netty/channel/socket/nio/NioWorkerPool, method: createWorker 
signature: 
(Ljava/util/concurrent/Executor;)Lorg/jboss/netty/channel/socket/nio/AbstractNioWorker;)
 Wrong return type in function
at 
akka.remote.transport.netty.NettyTransport.init(NettyTransport.scala:282)
at 
akka.remote.transport.netty.NettyTransport.init(NettyTransport.scala:239)
at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown 
Source)
at java.lang.reflect.Constructor.newInstance(Unknown Source)
at 
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
at scala.util.Try$.apply(Try.scala:161)
at 
akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
at 
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at 
akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
at scala.util.Success.flatMap(Try.scala:200)
at 
akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)
at akka.remote.EndpointManager$$anonfun$8.apply(Remoting.scala:618)
at akka.remote.EndpointManager$$anonfun$8.apply(Remoting.scala:610)
at 
scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
at 
scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
at 
akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:610)
at 
akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:450)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at 
akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at 
scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at 
scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at 
scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
[INFO] 2014-08-06 06:30:12,093 [Driver-akka.actor.default-dispatcher-5]  
akka.remote.RemoteActorRefProvider$RemotingTerminator apply$mcV$sp - Shutting 
down remote daemon.
[INFO] 2014-08-06 06:30:12,095 [Driver-akka.actor.default-dispatcher-5]  
akka.remote.RemoteActorRefProvider$RemotingTerminator apply$mcV$sp - Remote 
daemon shut down; proceeding with flushing remote transports.
[INFO] 2014-08-06 06:30:12,102 [Driver-akka.actor.default-dispatcher-3]  
Remoting apply$mcV$sp - Remoting shut down
[INFO] 2014-08-06 06:30:12,104 [Driver-akka.actor.default-dispatcher-3]  
akka.remote.RemoteActorRefProvider$RemotingTerminator apply$mcV$sp - Remoting 
shut down.
[ERROR] [08/06/2014 06:30:22.065] [main] [Remoting] Remoting error: [Startup 
timed out] [
akka.remote.RemoteTransportException: Startup timed out
at 
akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129)
at akka.remote.Remoting.start(Remoting.scala:191)
at 
akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
at akka.actor.ActorSystemImpl.start(ActorSystem.scala:588)
at akka.actor.ActorSystem$.apply(ActorSystem.scala:111)
  

Re: can't submit my application on standalone spark cluster

2014-08-06 Thread Akhil Das
Looks like a netty conflict there; most likely you have multiple versions of
the netty jars on the classpath (e.g.
netty-3.6.6.Final.jar, netty-3.2.2.Final.jar, netty-all-4.0.13.Final.jar);
you only require 3.6.6, I believe. A quick fix would be to remove the rest
of them.
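
One quick way to see which netty versions actually end up on the classpath, 
assuming the assembly jar from the original mail and a Maven build for the 
dependency-tree check, is:

    # list netty entries bundled into the self-contained jar
    jar tf redborder-spark-selfcontained.jar | grep -i netty

    # if it is a Maven build, find which dependencies pull in which netty
    mvn dependency:tree | grep -i netty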

Thanks
Best Regards


On Wed, Aug 6, 2014 at 3:05 PM, Andres Gomez Ferrer ago...@redborder.net
wrote:

 Hi all,

 My name is Andres and I'm starting to use Apache Spark.

 I try to submit my spark.jar to my cluster using this:

 spark-submit --class net.redborder.spark.RedBorderApplication --master
 spark://pablo02:7077 redborder-spark-selfcontained.jar

 But when I did it .. My worker die .. and my driver too!

 This is my driver log:

 [INFO] 2014-08-06 06:30:12,025 [Driver-akka.actor.default-dispatcher-3]
  akka.event.slf4j.Slf4jLogger applyOrElse - Slf4jLogger started
 [INFO] 2014-08-06 06:30:12,061 [Driver-akka.actor.default-dispatcher-3]
  Remoting apply$mcV$sp - Starting remoting
 [ERROR] 2014-08-06 06:30:12,089 [Driver-akka.actor.default-dispatcher-6]
  akka.actor.ActorSystemImpl apply$mcV$sp - Uncaught fatal error from thread
 [Driver-akka.actor.default-dispatcher-3] shutting down ActorSystem [Driver]
 java.lang.VerifyError: (class:
 org/jboss/netty/channel/socket/nio/NioWorkerPool, method: createWorker
 signature:
 (Ljava/util/concurrent/Executor;)Lorg/jboss/netty/channel/socket/nio/AbstractNioWorker;)
 Wrong return type in function
 at
 akka.remote.transport.netty.NettyTransport.init(NettyTransport.scala:282)
 at
 akka.remote.transport.netty.NettyTransport.init(NettyTransport.scala:239)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
 at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
 Source)
 at java.lang.reflect.Constructor.newInstance(Unknown Source)
 at
 akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
 at scala.util.Try$.apply(Try.scala:161)
 at
 akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
 at
 akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
 at
 akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
 at scala.util.Success.flatMap(Try.scala:200)
 at
 akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)
 at akka.remote.EndpointManager$$anonfun$8.apply(Remoting.scala:618)
 at akka.remote.EndpointManager$$anonfun$8.apply(Remoting.scala:610)
 at
 scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
 at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
 at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
 at
 scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
 at
 akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:610)
 at
 akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:450)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
 at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
 at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
 at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
 at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 [INFO] 2014-08-06 06:30:12,093 [Driver-akka.actor.default-dispatcher-5]
  akka.remote.RemoteActorRefProvider$RemotingTerminator apply$mcV$sp -
 Shutting down remote daemon.
 [INFO] 2014-08-06 06:30:12,095 [Driver-akka.actor.default-dispatcher-5]
  akka.remote.RemoteActorRefProvider$RemotingTerminator apply$mcV$sp -
 Remote daemon shut down; proceeding with flushing remote transports.
 [INFO] 2014-08-06 06:30:12,102 [Driver-akka.actor.default-dispatcher-3]
  Remoting apply$mcV$sp - Remoting shut down
 [INFO] 2014-08-06 06:30:12,104 [Driver-akka.actor.default-dispatcher-3]
  akka.remote.RemoteActorRefProvider$RemotingTerminator apply$mcV$sp -
 Remoting shut down.
 [ERROR] [08/06/2014 06:30:22.065] [main] [Remoting] Remoting error:
 [Startup timed out] [
 akka.remote.RemoteTransportException: Startup timed out
 at
 akka.remote.Remoting.akka$remote$Remoting$$notifyError(Remoting.scala:129)
 at akka.remote.Remoting.start(Remoting.scala:191)
 at
 akka.remote.RemoteActorRefProvider.init(RemoteActorRefProvider.scala:184)
 at akka.actor.ActorSystemImpl._start$lzycompute(ActorSystem.scala:579)
 at akka.actor.ActorSystemImpl._start(ActorSystem.scala:577)
 at 

Re: can't submit my application on standalone spark cluster

2014-08-06 Thread Andrew Or
Hi Andres,

If you're using the EC2 scripts to start your standalone cluster, you can
use ~/spark-ec2/copy-dir --delete ~/spark to sync your jars across the
cluster. Note that you will need to restart the Master and the Workers
afterwards, first through sbin/stop-all.sh and then sbin/start-all.sh. If you're
not using the EC2 scripts, you will have to rsync the directory manually
(copy-dir just calls rsync internally).
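
Put together, a minimal sketch, assuming the default spark-ec2 layout under 
~/spark and ~/spark-ec2:

    # push the local spark directory (including the replaced jars) to every slave
    ~/spark-ec2/copy-dir --delete ~/spark

    # restart the standalone master and workers so they pick up the new jars
    ~/spark/sbin/stop-all.sh
    ~/spark/sbin/start-all.sh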

-Andrew


2014-08-06 2:39 GMT-07:00 Akhil Das ak...@sigmoidanalytics.com:

 Looks like a netty conflict there, most likely you are having mutiple
 versions of netty jars (eg:
 netty-3.6.6.Final.jar, netty-3.2.2.Final.jar, netty-all-4.0.13.Final.jar),
 you only require 3.6.6 i believe. a quick fix would be to remove the rest
 of them.

 Thanks
 Best Regards


 On Wed, Aug 6, 2014 at 3:05 PM, Andres Gomez Ferrer ago...@redborder.net
 wrote:

 Hi all,

 My name is Andres and I'm starting to use Apache Spark.

 I try to submit my spark.jar to my cluster using this:

 spark-submit --class net.redborder.spark.RedBorderApplication --master
 spark://pablo02:7077 redborder-spark-selfcontained.jar

 But when I did it .. My worker die .. and my driver too!

 This is my driver log:

 [INFO] 2014-08-06 06:30:12,025 [Driver-akka.actor.default-dispatcher-3]
  akka.event.slf4j.Slf4jLogger applyOrElse - Slf4jLogger started
 [INFO] 2014-08-06 06:30:12,061 [Driver-akka.actor.default-dispatcher-3]
  Remoting apply$mcV$sp - Starting remoting
 [ERROR] 2014-08-06 06:30:12,089 [Driver-akka.actor.default-dispatcher-6]
  akka.actor.ActorSystemImpl apply$mcV$sp - Uncaught fatal error from thread
 [Driver-akka.actor.default-dispatcher-3] shutting down ActorSystem [Driver]
 java.lang.VerifyError: (class:
 org/jboss/netty/channel/socket/nio/NioWorkerPool, method: createWorker
 signature:
 (Ljava/util/concurrent/Executor;)Lorg/jboss/netty/channel/socket/nio/AbstractNioWorker;)
 Wrong return type in function
  at
 akka.remote.transport.netty.NettyTransport.init(NettyTransport.scala:282)
 at
 akka.remote.transport.netty.NettyTransport.init(NettyTransport.scala:239)
  at sun.reflect.NativeConstructorAccessorImpl.newInstance0(Native Method)
 at sun.reflect.NativeConstructorAccessorImpl.newInstance(Unknown Source)
  at sun.reflect.DelegatingConstructorAccessorImpl.newInstance(Unknown
 Source)
 at java.lang.reflect.Constructor.newInstance(Unknown Source)
  at
 akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$2.apply(DynamicAccess.scala:78)
 at scala.util.Try$.apply(Try.scala:161)
  at
 akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:73)
 at
 akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
  at
 akka.actor.ReflectiveDynamicAccess$$anonfun$createInstanceFor$3.apply(DynamicAccess.scala:84)
 at scala.util.Success.flatMap(Try.scala:200)
  at
 akka.actor.ReflectiveDynamicAccess.createInstanceFor(DynamicAccess.scala:84)
 at akka.remote.EndpointManager$$anonfun$8.apply(Remoting.scala:618)
  at akka.remote.EndpointManager$$anonfun$8.apply(Remoting.scala:610)
 at
 scala.collection.TraversableLike$WithFilter$$anonfun$map$2.apply(TraversableLike.scala:722)
  at scala.collection.Iterator$class.foreach(Iterator.scala:727)
 at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
  at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
 at scala.collection.AbstractIterable.foreach(Iterable.scala:54)
  at
 scala.collection.TraversableLike$WithFilter.map(TraversableLike.scala:721)
 at
 akka.remote.EndpointManager.akka$remote$EndpointManager$$listens(Remoting.scala:610)
  at
 akka.remote.EndpointManager$$anonfun$receive$2.applyOrElse(Remoting.scala:450)
 at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
  at akka.actor.ActorCell.invoke(ActorCell.scala:456)
 at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
  at akka.dispatch.Mailbox.run(Mailbox.scala:219)
 at
 akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
 at
 scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
  at
 scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
 at
 scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
 [INFO] 2014-08-06 06:30:12,093 [Driver-akka.actor.default-dispatcher-5]
  akka.remote.RemoteActorRefProvider$RemotingTerminator apply$mcV$sp -
 Shutting down remote daemon.
 [INFO] 2014-08-06 06:30:12,095 [Driver-akka.actor.default-dispatcher-5]
  akka.remote.RemoteActorRefProvider$RemotingTerminator apply$mcV$sp -
 Remote daemon shut down; proceeding with flushing remote transports.
 [INFO] 2014-08-06 06:30:12,102 [Driver-akka.actor.default-dispatcher-3]
  Remoting apply$mcV$sp - Remoting shut down
 [INFO] 2014-08-06 06:30:12,104 [Driver-akka.actor.default-dispatcher-3]
  akka.remote.RemoteActorRefProvider$RemotingTerminator apply$mcV$sp -
 Remoting shut down.