Re: How to enable core dump in spark

2016-06-02 Thread prateek arora

Please help me to solve my problem.

Regards
Prateek



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/How-to-enable-core-dump-in-spark-tp27065p27081.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



KafkaLog4jAppender in spark executor and driver does not work well

2016-06-02 Thread jian he
Hi,

I have the same problem as described in http://stackoverflow.com/questions/32843186/custom-log4j-appender-in-spark-executor, on Spark 1.6.1. The Spark driver also has this issue:

log4j:ERROR Could not instantiate class [kafka.producer.KafkaLog4jAppender].
java.lang.ClassNotFoundException: kafka.producer.KafkaLog4jAppender
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	at java.lang.Class.forName0(Native Method)
	at java.lang.Class.forName(Class.java:190)
	at org.apache.log4j.helpers.Loader.loadClass(Loader.java:198)
	at org.apache.log4j.helpers.OptionConverter.instantiateByClassName(OptionConverter.java:327)
	at org.apache.log4j.helpers.OptionConverter.instantiateByKey(OptionConverter.java:124)
	at org.apache.log4j.PropertyConfigurator.parseAppender(PropertyConfigurator.java:785)
	at org.apache.log4j.PropertyConfigurator.parseCategory(PropertyConfigurator.java:768)
	at org.apache.log4j.PropertyConfigurator.parseCatsAndRenderers(PropertyConfigurator.java:672)
	at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:516)
	at org.apache.log4j.PropertyConfigurator.doConfigure(PropertyConfigurator.java:580)
	at org.apache.log4j.helpers.OptionConverter.selectAndConfigure(OptionConverter.java:526)
	at org.apache.log4j.LogManager.<clinit>(LogManager.java:127)
	at org.apache.spark.Logging$class.initializeLogging(Logging.scala:121)
	at org.apache.spark.Logging$class.initializeIfNecessary(Logging.scala:106)
	at org.apache.spark.Logging$class.log(Logging.scala:50)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.log(CoarseGrainedExecutorBackend.scala:138)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.run(CoarseGrainedExecutorBackend.scala:149)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend$.main(CoarseGrainedExecutorBackend.scala:253)
	at org.apache.spark.executor.CoarseGrainedExecutorBackend.main(CoarseGrainedExecutorBackend.scala)
log4j:ERROR Could not instantiate appender named "KAFKA".

Can anyone help me?

——
Thank you.
jian he
hejian...@gmail.com
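
(A note on one commonly suggested direction for this error, sketched with placeholder paths: the ClassNotFoundException usually means the jar that contains kafka.producer.KafkaLog4jAppender is not on the JVM classpath at the moment log4j initializes, which happens before jars passed via --jars are added (visible in the stack trace above, where logging is initialized inside CoarseGrainedExecutorBackend.main). Putting that jar on the driver and executor classpaths directly is one thing worth trying; the jar name below is an assumption, not taken from this thread.)

spark-submit \
  --conf "spark.driver.extraClassPath=/path/to/kafka_2.10-0.8.x.jar" \
  --conf "spark.executor.extraClassPath=/path/to/kafka_2.10-0.8.x.jar" \
  ... rest of the submit command ...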



-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Chris Fregly
I recently powered through this Spark + Elasticsearch integration as well.

You can see this and many other Spark integrations with the PANCAKE STACK
here:  https://github.com/fluxcapacitor/pipeline

All configs can be found here:
https://github.com/fluxcapacitor/pipeline/tree/master/config

In particular, the Stanford CoreNLP + Spark ML Pipeline integration was the
most difficult, but we finally got it working with some hard-coding and
finger-crossing!


On Thu, Jun 2, 2016 at 4:09 PM, Nick Pentreath 
wrote:

> Fair enough.
>
> However, if you take a look at the deployment guide (
> http://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies)
> you will see that the generally advised approach is to package your app
> dependencies into a fat JAR and submit (possibly using the --jars option
> too). This also means you specify the Scala and other library versions in
> your project pom.xml or sbt file, avoiding having to manually decide which
> artefact to include on your classpath  :)
>
> On Thu, 2 Jun 2016 at 16:06 Kevin Burton  wrote:
>
>> Yeah.. thanks Nick. Figured that out since your last email... I deleted
>> the 2.10 by accident but then put 2+2 together.
>>
>> Got it working now.
>>
>> Still sticking to my story that it's somewhat complicated to setup :)
>>
>> Kevin
>>
>> On Thu, Jun 2, 2016 at 3:59 PM, Nick Pentreath 
>> wrote:
>>
>>> Which Scala version is Spark built against? I'd guess it's 2.10 since
>>> you're using spark-1.6, and you're using the 2.11 jar for es-hadoop.
>>>
>>>
>>> On Thu, 2 Jun 2016 at 15:50 Kevin Burton  wrote:
>>>
 Thanks.

 I'm trying to run it in a standalone cluster with an existing / large
 100 node ES install.

 I'm using the standard 1.6.1 -2.6 distribution with
 elasticsearch-hadoop-2.3.2...

 I *think* I'm only supposed to use the
 elasticsearch-spark_2.11-2.3.2.jar with it...

 but now I get the following exception:


 java.lang.NoSuchMethodError:
 scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
 at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:52)
 at
 org.elasticsearch.spark.package$SparkRDDFunctions.saveToEs(package.scala:37)
 at
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40)
 at
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
 at
 $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
 at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:55)
 at $iwC$$iwC$$iwC$$iwC$$iwC.(:57)
 at $iwC$$iwC$$iwC$$iwC.(:59)
 at $iwC$$iwC$$iwC.(:61)
 at $iwC$$iwC.(:63)
 at $iwC.(:65)
 at (:67)
 at .(:71)
 at .()
 at .(:7)
 at .()
 at $print()
 at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
 at
 sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
 at
 sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
 at java.lang.reflect.Method.invoke(Method.java:497)
 at
 org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
 at
 org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
 at
 org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
 at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
 at
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
 at
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
 at
 org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
 at
 org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
 at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
 at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
 at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
 at org.apache.spark.repl.SparkILoop.org
 $apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
 at
 org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
 at
 scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
 at org.apache.spark.repl.SparkILoop.org
 

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Nick Pentreath
Fair enough.

However, if you take a look at the deployment guide (
http://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies)
you will see that the generally advised approach is to package your app
dependencies into a fat JAR and submit (possibly using the --jars option
too). This also means you specify the Scala and other library versions in
your project pom.xml or sbt file, avoiding having to manually decide which
artefact to include on your classpath  :)
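
(As a concrete illustration of the fat-JAR approach described above, a sketch only: the artifact coordinates and versions are assumptions based on the versions mentioned in this thread, and sbt-assembly is assumed as the assembly plugin.)

// build.sbt (sketch): pin the Scala version to match the Spark build and
// mark Spark "provided" so it is not bundled into the fat JAR.
scalaVersion := "2.10.6"

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"          % "1.6.1" % "provided",
  "org.apache.spark"  %% "spark-sql"           % "1.6.1" % "provided",
  "org.elasticsearch" %% "elasticsearch-spark" % "2.3.2"
)

// "sbt assembly" then produces a single fat JAR to hand to spark-submit.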

On Thu, 2 Jun 2016 at 16:06 Kevin Burton  wrote:

> Yeah.. thanks Nick. Figured that out since your last email... I deleted
> the 2.10 by accident but then put 2+2 together.
>
> Got it working now.
>
> Still sticking to my story that it's somewhat complicated to setup :)
>
> Kevin
>
> On Thu, Jun 2, 2016 at 3:59 PM, Nick Pentreath 
> wrote:
>
>> Which Scala version is Spark built against? I'd guess it's 2.10 since
>> you're using spark-1.6, and you're using the 2.11 jar for es-hadoop.
>>
>>
>> On Thu, 2 Jun 2016 at 15:50 Kevin Burton  wrote:
>>
>>> Thanks.
>>>
>>> I'm trying to run it in a standalone cluster with an existing / large
>>> 100 node ES install.
>>>
>>> I'm using the standard 1.6.1 -2.6 distribution with
>>> elasticsearch-hadoop-2.3.2...
>>>
>>> I *think* I'm only supposed to use the
>>> elasticsearch-spark_2.11-2.3.2.jar with it...
>>>
>>> but now I get the following exception:
>>>
>>>
>>> java.lang.NoSuchMethodError:
>>> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>>> at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:52)
>>> at
>>> org.elasticsearch.spark.package$SparkRDDFunctions.saveToEs(package.scala:37)
>>> at
>>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40)
>>> at
>>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>>> at $iwC$$iwC$$iwC$$iwC$$iwC.(:57)
>>> at $iwC$$iwC$$iwC$$iwC.(:59)
>>> at $iwC$$iwC$$iwC.(:61)
>>> at $iwC$$iwC.(:63)
>>> at $iwC.(:65)
>>> at (:67)
>>> at .(:71)
>>> at .()
>>> at .(:7)
>>> at .()
>>> at $print()
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>> at
>>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>>> at
>>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>>> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>>> at
>>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>>> at
>>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>>> at
>>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
>>> at
>>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>>> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>>> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>>> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>>> at org.apache.spark.repl.SparkILoop.org
>>> $apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>>> at
>>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>>> at
>>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>>> at
>>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>>> at
>>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>>> at org.apache.spark.repl.SparkILoop.org
>>> $apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
>>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
>>> at org.apache.spark.repl.Main$.main(Main.scala:31)
>>> at org.apache.spark.repl.Main.main(Main.scala)
>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>> at
>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>> at
>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>> at
>>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>>> at
>>> 

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
Yeah... thanks Nick. I figured that out since your last email... I deleted the
2.10 jar by accident but then put 2 + 2 together.

Got it working now.

Still sticking to my story that it's somewhat complicated to set up :)

Kevin

On Thu, Jun 2, 2016 at 3:59 PM, Nick Pentreath 
wrote:

> Which Scala version is Spark built against? I'd guess it's 2.10 since
> you're using spark-1.6, and you're using the 2.11 jar for es-hadoop.
>
>
> On Thu, 2 Jun 2016 at 15:50 Kevin Burton  wrote:
>
>> Thanks.
>>
>> I'm trying to run it in a standalone cluster with an existing / large 100
>> node ES install.
>>
>> I'm using the standard 1.6.1 -2.6 distribution with
>> elasticsearch-hadoop-2.3.2...
>>
>> I *think* I'm only supposed to use the
>> elasticsearch-spark_2.11-2.3.2.jar with it...
>>
>> but now I get the following exception:
>>
>>
>> java.lang.NoSuchMethodError:
>> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>> at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:52)
>> at
>> org.elasticsearch.spark.package$SparkRDDFunctions.saveToEs(package.scala:37)
>> at
>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40)
>> at
>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:55)
>> at $iwC$$iwC$$iwC$$iwC$$iwC.(:57)
>> at $iwC$$iwC$$iwC$$iwC.(:59)
>> at $iwC$$iwC$$iwC.(:61)
>> at $iwC$$iwC.(:63)
>> at $iwC.(:65)
>> at (:67)
>> at .(:71)
>> at .()
>> at .(:7)
>> at .()
>> at $print()
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at
>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>> at
>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>> at
>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>> at
>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>> at
>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
>> at
>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>> at org.apache.spark.repl.SparkILoop.org
>> $apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>> at
>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>> at
>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>> at
>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>> at
>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>> at org.apache.spark.repl.SparkILoop.org
>> $apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
>> at org.apache.spark.repl.Main$.main(Main.scala:31)
>> at org.apache.spark.repl.Main.main(Main.scala)
>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>> at
>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>> at
>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>> at java.lang.reflect.Method.invoke(Method.java:497)
>> at
>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>
>>
>> On Thu, Jun 2, 2016 at 3:45 PM, Nick Pentreath 
>> wrote:
>>
>>> Hey there
>>>
>>> When I used es-hadoop, I just pulled in the dependency into my pom.xml,
>>> with spark as a "provided" dependency, and built a fat jar with assembly.
>>>
>>> Then with spark-submit use the --jars option to include your assembly
>>> jar (IIRC I sometimes also needed to use --driver-classpath too, but
>>> perhaps not with recent Spark 

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Nick Pentreath
Which Scala version is Spark built against? I'd guess it's 2.10 since
you're using spark-1.6, and you're using the 2.11 jar for es-hadoop.
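
(An editorial aside: a quick way to confirm which Scala version a given Spark distribution was built against is to ask its own REPL rather than guessing from the download name, e.g.

  // in spark-shell; the pre-built 1.6.x downloads report a 2.10.x version here
  scala> scala.util.Properties.versionString
)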


On Thu, 2 Jun 2016 at 15:50 Kevin Burton  wrote:

> Thanks.
>
> I'm trying to run it in a standalone cluster with an existing / large 100
> node ES install.
>
> I'm using the standard 1.6.1 -2.6 distribution with
> elasticsearch-hadoop-2.3.2...
>
> I *think* I'm only supposed to use the
> elasticsearch-spark_2.11-2.3.2.jar with it...
>
> but now I get the following exception:
>
>
> java.lang.NoSuchMethodError:
> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
> at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:52)
> at
> org.elasticsearch.spark.package$SparkRDDFunctions.saveToEs(package.scala:37)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:40)
> at
> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:45)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:47)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:49)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:51)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:53)
> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.(:55)
> at $iwC$$iwC$$iwC$$iwC$$iwC.(:57)
> at $iwC$$iwC$$iwC$$iwC.(:59)
> at $iwC$$iwC$$iwC.(:61)
> at $iwC$$iwC.(:63)
> at $iwC.(:65)
> at (:67)
> at .(:71)
> at .()
> at .(:7)
> at .()
> at $print()
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
> at
> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
> at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
> at
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
> at
> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
> at org.apache.spark.repl.SparkILoop.org
> $apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at
> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
> at
> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
> at org.apache.spark.repl.SparkILoop.org
> $apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
> at org.apache.spark.repl.Main$.main(Main.scala:31)
> at org.apache.spark.repl.Main.main(Main.scala)
> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
> at
> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
> at java.lang.reflect.Method.invoke(Method.java:497)
> at
> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
> at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>
>
> On Thu, Jun 2, 2016 at 3:45 PM, Nick Pentreath 
> wrote:
>
>> Hey there
>>
>> When I used es-hadoop, I just pulled in the dependency into my pom.xml,
>> with spark as a "provided" dependency, and built a fat jar with assembly.
>>
>> Then with spark-submit use the --jars option to include your assembly jar
>> (IIRC I sometimes also needed to use --driver-classpath too, but perhaps
>> not with recent Spark versions).
>>
>>
>>
>> On Thu, 2 Jun 2016 at 15:34 Kevin Burton  wrote:
>>
>>> I'm trying to get spark 1.6.1 to work with 2.3.2... needless to say it's
>>> not super easy.
>>>
>>> I wish there was an easier way to get this stuff to work.. Last time I
>>> tried to use spark more I was having similar problems with classpath setup
>>> and Cassandra.
>>>
>>> Seems a huge opportunity to make this easier 

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
Thanks.

I'm trying to run it in a standalone cluster with an existing / large 100
node ES install.

I'm using the standard 1.6.1 (Hadoop 2.6) distribution with
elasticsearch-hadoop-2.3.2...

I *think* I'm only supposed to use the
elasticsearch-spark_2.11-2.3.2.jar with it...

but now I get the following exception:


java.lang.NoSuchMethodError:
scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:52)
at
org.elasticsearch.spark.package$SparkRDDFunctions.saveToEs(package.scala:37)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
at $iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
at $iwC$$iwC$$iwC.<init>(<console>:61)
at $iwC$$iwC.<init>(<console>:63)
at $iwC.<init>(<console>:65)
at <init>(<console>:67)
at .<init>(<console>:71)
at .<clinit>(<console>)
at .<init>(<console>:7)
at .<clinit>(<console>)
at $print(<console>)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
at
org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
at
org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at
org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
at
scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
at org.apache.spark.repl.SparkILoop.org
$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
at org.apache.spark.repl.Main$.main(Main.scala:31)
at org.apache.spark.repl.Main.main(Main.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at
org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)


On Thu, Jun 2, 2016 at 3:45 PM, Nick Pentreath 
wrote:

> Hey there
>
> When I used es-hadoop, I just pulled in the dependency into my pom.xml,
> with spark as a "provided" dependency, and built a fat jar with assembly.
>
> Then with spark-submit use the --jars option to include your assembly jar
> (IIRC I sometimes also needed to use --driver-classpath too, but perhaps
> not with recent Spark versions).
>
>
>
> On Thu, 2 Jun 2016 at 15:34 Kevin Burton  wrote:
>
>> I'm trying to get spark 1.6.1 to work with 2.3.2... needless to say it's
>> not super easy.
>>
>> I wish there was an easier way to get this stuff to work.. Last time I
>> tried to use spark more I was having similar problems with classpath setup
>> and Cassandra.
>>
>> Seems a huge opportunity to make this easier for new developers.  This
>> stuff isn't rocket science but it can (needlessly) waste a ton of time.
>>
>> ... anyway... I'm have since figured out I have to specific *specific*
>> jars from the elasticsearch-hadoop distribution and use those.
>>
>> Right now I'm using :
>>
>>
>> 

Re: Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Nick Pentreath
Hey there

When I used es-hadoop, I just pulled the dependency into my pom.xml,
with Spark as a "provided" dependency, and built a fat jar with assembly.

Then, with spark-submit, use the --jars option to include your assembly jar
(IIRC I sometimes also needed to use --driver-class-path too, but perhaps
not with recent Spark versions).
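
(A sketch of that submit command, with placeholder class name, master URL and jar path, not something taken verbatim from this thread:

spark-submit \
  --class com.example.MyEsJob \
  --master spark://master:7077 \
  myesjob-assembly-0.1.jar

If the driver still cannot see the es-hadoop classes, --jars and/or --driver-class-path pointing at the same assembly jar can be added, per the caveat above.)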



On Thu, 2 Jun 2016 at 15:34 Kevin Burton  wrote:

> I'm trying to get spark 1.6.1 to work with 2.3.2... needless to say it's
> not super easy.
>
> I wish there was an easier way to get this stuff to work.. Last time I
> tried to use spark more I was having similar problems with classpath setup
> and Cassandra.
>
> Seems a huge opportunity to make this easier for new developers.  This
> stuff isn't rocket science but it can (needlessly) waste a ton of time.
>
> ... anyway... I'm have since figured out I have to specific *specific*
> jars from the elasticsearch-hadoop distribution and use those.
>
> Right now I'm using :
>
>
> SPARK_CLASSPATH=/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar:/usr/share/apache-spark/lib/*
>
> ... but I"m getting:
>
> java.lang.NoClassDefFoundError: Could not initialize class
> org.elasticsearch.hadoop.util.Version
> at
> org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376)
> at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
> at
> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
> at
> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>
> ... but I think its caused by this:
>
> 16/06/03 00:26:48 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
> localhost): java.lang.Error: Multiple ES-Hadoop versions detected in the
> classpath; please use only one
> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar
>
> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar
>
> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar
>
> at org.elasticsearch.hadoop.util.Version.<clinit>(Version.java:73)
> at
> org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376)
> at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
> at
> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
> at
> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
> at org.apache.spark.scheduler.Task.run(Task.scala:89)
> at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
> at
> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
> at
> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
> at java.lang.Thread.run(Thread.java:745)
>
> .. still tracking this down but was wondering if there is someting obvious
> I'm dong wrong.  I'm going to take out elasticsearch-hadoop-2.3.2.jar and
> try again.
>
> Lots of trial and error here :-/
>
> Kevin
>
> --
>
> We’re hiring if you know of any awesome Java Devops or Linux Operations
> Engineers!
>
> Founder/CEO Spinn3r.com
> Location: *San Francisco, CA*
> blog: http://burtonator.wordpress.com
> … or check out my Google+ profile
> 
>
>


Classpath hell and Elasticsearch 2.3.2...

2016-06-02 Thread Kevin Burton
I'm trying to get Spark 1.6.1 to work with Elasticsearch 2.3.2... needless to
say, it's not super easy.

I wish there were an easier way to get this stuff to work. Last time I
tried to use Spark more, I had similar problems with classpath setup
and Cassandra.

This seems like a huge opportunity to make things easier for new developers.
This stuff isn't rocket science, but it can (needlessly) waste a ton of time.

... anyway... I have since figured out that I have to pick *specific* jars
from the elasticsearch-hadoop distribution and use those.

Right now I'm using :

SPARK_CLASSPATH=/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar:/usr/share/apache-spark/lib/*

... but I"m getting:

java.lang.NoClassDefFoundError: Could not initialize class
org.elasticsearch.hadoop.util.Version
at
org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
at
org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
at
org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)

... but I think its caused by this:

16/06/03 00:26:48 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0,
localhost): java.lang.Error: Multiple ES-Hadoop versions detected in the
classpath; please use only one
jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar
jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar
jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar

at org.elasticsearch.hadoop.util.Version.<clinit>(Version.java:73)
at
org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376)
at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
at
org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
at
org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)

... still tracking this down, but I was wondering if there is something obvious
I'm doing wrong. I'm going to take out elasticsearch-hadoop-2.3.2.jar and
try again.
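
(For reference, the resolution reached elsewhere in this thread appears to be keeping exactly one es-hadoop artifact on the classpath, matched to the Scala version of the Spark build. A sketch of that, assuming a Scala 2.10 build of Spark and assuming a _2.10 jar ships in the same lib directory as the _2.11 one:

SPARK_CLASSPATH=/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.10-2.3.2.jar:/usr/share/apache-spark/lib/*

The "Multiple ES-Hadoop versions detected" error above is exactly this check firing when more than one of those jars is present.)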

Lots of trial and error here :-/

Kevin

-- 

We’re hiring if you know of any awesome Java Devops or Linux Operations
Engineers!

Founder/CEO Spinn3r.com
Location: *San Francisco, CA*
blog: http://burtonator.wordpress.com
… or check out my Google+ profile



Re: how to increase threads per executor

2016-06-02 Thread Mich Talebzadeh
Interesting. A VM with one core!

One simple test:

Can you try running with

--executor-cores=1

and see if it works OK, please?



Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 2 June 2016 at 23:15, Andres M Jimenez T  wrote:

> Mich, thanks for your time,
>
>
> i am launching spark-submit as follows:
>
>
> bin/spark-submit --class com.example.SparkStreamingImpl --master
> spark://dev1.dev:7077 --verbose --driver-memory 1g --executor-memory 1g
> --conf "spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote
> -Dcom.sun.management.jmxremote.port=8090
> -Dcom.sun.management.jmxremote.rmi.port=8091
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false" --conf
> "spark.executor.extraJavaOptions=-Dcom.sun.management.jmxremote
> -Dcom.sun.management.jmxremote.port=8092
> -Dcom.sun.management.jmxremote.rmi.port=8093
> -Dcom.sun.management.jmxremote.authenticate=false
> -Dcom.sun.management.jmxremote.ssl=false" --conf
> "spark.scheduler.mode=FAIR" --conf /home/Processing.jar
>
>
> When i use --executor-cores=12 i get "Initial job has not accepted any
> resources; check your cluster UI to ensure that workers are registered and
> have sufficient resources".
>
>
> This, because my nodes are single core, but i want to use more than one
> thread per core, is this possible?
>
>
> root@dev1:/home/spark-1.6.1-bin-hadoop2.6# lscpu
> Architecture:  x86_64
> CPU op-mode(s):32-bit, 64-bit
> Byte Order:Little Endian
> CPU(s):1
> On-line CPU(s) list:   0
> Thread(s) per core:1
> Core(s) per socket:1
> Socket(s): 1
> NUMA node(s):  1
> Vendor ID: GenuineIntel
> CPU family:6
> Model: 58
> Model name:Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
> Stepping:  0
> CPU MHz:   2999.999
> BogoMIPS:  5999.99
> Hypervisor vendor: VMware
> Virtualization type:   full
> L1d cache: 32K
> L1i cache: 32K
> L2 cache:  256K
> L3 cache:  25600K
> NUMA node0 CPU(s): 0
>
>
> Thanks
>
>
>
> --
> *From:* Mich Talebzadeh 
> *Sent:* Thursday, June 2, 2016 5:00 PM
> *To:* Andres M Jimenez T
> *Cc:* user@spark.apache.org
> *Subject:* Re: how to increase threads per executor
>
> What are passing as parameters to Spark-submit?
>
>
> ${SPARK_HOME}/bin/spark-submit \
> --executor-cores=12 \
>
> Also check
>
> http://spark.apache.org/docs/latest/configuration.html
>
>
> Execution Behavior/spark.executor.cores
>
>
> HTH
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>
> On 2 June 2016 at 17:29, Andres M Jimenez T  wrote:
>
>> Hi,
>>
>>
>> I am working with Spark 1.6.1, using kafka direct connect for streaming
>> data.
>>
>> Using spark scheduler and 3 slaves.
>>
>> Kafka topic is partitioned with a value of 10.
>>
>>
>> The problem i have is, there is only one thread per executor running my
>> function (logic implementation).
>>
>>
>> Can anybody tell me how can i increase threads per executor to get better
>> use of CPUs?
>>
>>
>> Thanks
>>
>>
>> Here is the code i have implemented:
>>
>>
>> *Driver*:
>>
>>
>> JavaStreamingContext ssc = new JavaStreamingContext(conf, new
>> Duration(1));
>>
>> //prepare streaming from kafka
>>
>> Set topicsSet = new
>> HashSet<>(Arrays.asList("stage1-in,stage1-retry".split(",")));
>>
>> Map kafkaParams = new HashMap<>();
>>
>> kafkaParams.put("metadata.broker.list", kafkaBrokers);
>>
>> kafkaParams.put("group.id", SparkStreamingImpl.class.getName());
>>
>>
>> JavaPairInputDStream inputMessages =
>> KafkaUtils.createDirectStream(
>>
>> ssc,
>>
>> String.class,
>>
>> String.class,
>>
>> StringDecoder.class,
>>
>> StringDecoder.class,
>>
>> kafkaParams,
>>
>> topicsSet
>>
>> );
>>
>>
>> inputMessages.foreachRDD(new ForeachRDDFunction());
>>
>>
>> *ForeachFunction*:
>>
>>
>> class ForeachFunction implements VoidFunction> {
>>
>> private static final Counter foreachConcurrent =
>> ProcessingMetrics.metrics.counter( "foreach-concurrency" );
>>
>> public ForeachFunction() {
>>
>> LOG.info("Creating a new 

Re: how to increase threads per executor

2016-06-02 Thread Andres M Jimenez T
Mich, thanks for your time,


I am launching spark-submit as follows:


bin/spark-submit --class com.example.SparkStreamingImpl --master 
spark://dev1.dev:7077 --verbose --driver-memory 1g --executor-memory 1g --conf 
"spark.driver.extraJavaOptions=-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.port=8090 
-Dcom.sun.management.jmxremote.rmi.port=8091 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false" --conf 
"spark.executor.extraJavaOptions=-Dcom.sun.management.jmxremote 
-Dcom.sun.management.jmxremote.port=8092 
-Dcom.sun.management.jmxremote.rmi.port=8093 
-Dcom.sun.management.jmxremote.authenticate=false 
-Dcom.sun.management.jmxremote.ssl=false" --conf "spark.scheduler.mode=FAIR" 
--conf /home/Processing.jar


When I use --executor-cores=12 I get "Initial job has not accepted any 
resources; check your cluster UI to ensure that workers are registered and have 
sufficient resources".


This is because my nodes are single-core, but I want to use more than one thread 
per core. Is this possible?


root@dev1:/home/spark-1.6.1-bin-hadoop2.6# lscpu
Architecture:  x86_64
CPU op-mode(s):32-bit, 64-bit
Byte Order:Little Endian
CPU(s):1
On-line CPU(s) list:   0
Thread(s) per core:1
Core(s) per socket:1
Socket(s): 1
NUMA node(s):  1
Vendor ID: GenuineIntel
CPU family:6
Model: 58
Model name:Intel(R) Xeon(R) CPU E5-2690 v2 @ 3.00GHz
Stepping:  0
CPU MHz:   2999.999
BogoMIPS:  5999.99
Hypervisor vendor: VMware
Virtualization type:   full
L1d cache: 32K
L1i cache: 32K
L2 cache:  256K
L3 cache:  25600K
NUMA node0 CPU(s): 0



Thanks
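
(An editorial aside on the single-core question: in standalone mode the core count a worker advertises is only a scheduling slot count, not a binding to physical CPUs, so a worker can be told to offer more slots than the machine has cores. A sketch, with an illustrative value:

# conf/spark-env.sh on each worker node
export SPARK_WORKER_CORES=4

After restarting the workers, an executor given --executor-cores 4 can run four tasks concurrently on that single-core VM, at the cost of the tasks time-sharing the one physical core.)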




From: Mich Talebzadeh 
Sent: Thursday, June 2, 2016 5:00 PM
To: Andres M Jimenez T
Cc: user@spark.apache.org
Subject: Re: how to increase threads per executor

What are passing as parameters to Spark-submit?


${SPARK_HOME}/bin/spark-submit \
--executor-cores=12 \

Also check

http://spark.apache.org/docs/latest/configuration.html



Execution Behavior/spark.executor.cores


HTH



Dr Mich Talebzadeh



LinkedIn  
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 2 June 2016 at 17:29, Andres M Jimenez T 
> wrote:

Hi,


I am working with Spark 1.6.1, using kafka direct connect for streaming data.

Using spark scheduler and 3 slaves.

Kafka topic is partitioned with a value of 10.


The problem i have is, there is only one thread per executor running my 
function (logic implementation).


Can anybody tell me how can i increase threads per executor to get better use 
of CPUs?


Thanks


Here is the code i have implemented:


Driver:


JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1));

//prepare streaming from kafka

Set<String> topicsSet = new 
HashSet<>(Arrays.asList("stage1-in,stage1-retry".split(",")));

Map<String, String> kafkaParams = new HashMap<>();

kafkaParams.put("metadata.broker.list", kafkaBrokers);

kafkaParams.put("group.id", 
SparkStreamingImpl.class.getName());


JavaPairInputDStream<String, String> inputMessages = 
KafkaUtils.createDirectStream(

ssc,

String.class,

String.class,

StringDecoder.class,

StringDecoder.class,

kafkaParams,

topicsSet

);


inputMessages.foreachRDD(new ForeachRDDFunction());


ForeachFunction:


class ForeachFunction implements VoidFunction<Tuple2<String, String>> {

private static final Counter foreachConcurrent = 
ProcessingMetrics.metrics.counter( "foreach-concurrency" );

public ForeachFunction() {

LOG.info("Creating a new ForeachFunction");

}


public void call(Tuple2<String, String> t) throws Exception {

foreachConcurrent.inc();

LOG.info("processing message [" + t._1() + "]");

try {

Thread.sleep(1000);

} catch (Exception e) { }

foreachConcurrent.dec();

}

}


ForeachRDDFunction:


class ForeachRDDFunction implements VoidFunction<JavaPairRDD<String, String>> {

private static final Counter foreachRDDConcurrent = 
ProcessingMetrics.metrics.counter( "foreachRDD-concurrency" );

private ForeachFunction foreachFunction = new ForeachFunction();

public ForeachRDDFunction() {

LOG.info("Creating a new ForeachRDDFunction");

}


public void call(JavaPairRDD<String, String> t) throws Exception {

foreachRDDConcurrent.inc();

LOG.info("call from inputMessages.foreachRDD with [" + t.partitions().size() + 
"] 

Re: MLLIB, Random Forest and user defined loss function?

2016-06-02 Thread Yan Burdett
I have a similar question about distance function for KMeans. I believe
only Euclidean distance is currently supported.


On Thursday, June 2, 2016, xweb  wrote:

> Does MLLIB allow user to specify own loss functions?
> Specially need it for Random forests.
>
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-Random-Forest-and-user-defined-loss-function-tp27080.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org 
> For additional commands, e-mail: user-h...@spark.apache.org 
>
>


MLLIB, Random Forest and user defined loss function?

2016-06-02 Thread xweb
Does MLlib allow the user to specify their own loss functions?
I specifically need this for random forests.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/MLLIB-Random-Forest-and-user-defined-loss-function-tp27080.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Stream reading from database using spark streaming

2016-06-02 Thread Mich Talebzadeh
OK, that is fine. So the source is an IMDB, something like Oracle TimesTen,
which I have worked with before.
The second source is some organised data (I assume you mean structured,
tabular data).


   1. Data is read from source one, the IMDB. The assumption is that the data
   is not going to change within the batch interval
   2. Data is read from source two (you will confirm what it is), and again
   that data is not going to change within the batch interval
   3. You then want to register each RDD as a temp table and run SQL on the
   temp tables?

An example something like below

val sparkConf = new SparkConf().
 setAppName("CEP_streaming_with_JDBC").
 set("spark.driver.allowMultipleContexts", "true").
 set("spark.hadoop.validateOutputSpecs", "false")

  val sc = new SparkContext(sparkConf)
  // Create sqlContext based on HiveContext
  val sqlContext = new HiveContext(sc)
  import sqlContext.implicits._
  val HiveContext = new org.apache.spark.sql.hive.HiveContext(sc)
  HiveContext.sql("use oraclehadoop")
  var _ORACLEserver : String = "jdbc:oracle:thin:@rhes564:1521:mydb12"
  var _username : String = "xyz"
  var _password : String = "xyz123"

  // Get data from Oracle table
  val d = HiveContext.load("jdbc",
  Map("url" -> _ORACLEserver,
  "dbtable" -> "(SELECT amount_sold, time_id, TO_CHAR(channel_id) AS
channel_id FROM scratchpad.sales)",
  "user" -> _username,
  "password" -> _password))

d.registerTempTable("tmp")
val ssc = new StreamingContext(sparkConf, Seconds(2))
ssc.checkpoint("checkpoint")
val kafkaParams = Map[String, String]("bootstrap.servers" ->
"rhes564:9092", "schema.registry.url" -> "http://rhes564:8081;,
"zookeeper.connect" -> "rhes564:2181", "group.id" ->
"CEP_streaming_with_JDBC" )
val topics = Set("newtopic")
val dstream = KafkaUtils.createDirectStream[String, String, StringDecoder,
StringDecoder](ssc, kafkaParams, topics)
dstream.cache()
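
A possible continuation of this sketch (the message layout and column names below are assumptions, not taken from the thread): register each micro-batch as a temp table next to "tmp" and join them with SQL. Both tables must be registered against the same context, hence HiveContext on both sides.

dstream.foreachRDD { rdd =>
  if (!rdd.isEmpty()) {
    // assume the Kafka message value is "channel_id,amount"
    val streamDF = HiveContext.createDataFrame(
      rdd.map(_._2.split(',')).map(a => (a(0), a(1).toDouble))
    ).toDF("channel_id", "amount")
    streamDF.registerTempTable("stream_tmp")
    HiveContext.sql(
      """SELECT t.channel_id, SUM(s.amount) AS stream_total, SUM(t.amount_sold) AS sales_total
         FROM tmp t JOIN stream_tmp s ON t.channel_id = s.channel_id
         GROUP BY t.channel_id""").show()
  }
}
ssc.start()
ssc.awaitTermination()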




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 2 June 2016 at 20:45, Zakaria Hili  wrote:

> Ted Yu
> That's not Spark Streaming, it's just a simple batch job.
>
> Mich Talebzadeh
> In fact, I'm not working with MySQL; I'm working with an in-memory NewSQL
> database. I said MySQL because anything that is compatible with MySQL will
> work with the NewSQL database, which is very efficient for reading.
> Why Spark Streaming with a database? The answer: I have another stream
> processing job that pushes (organized) data into the in-memory database, and
> Spark should read this data and use DataFrames + machine learning for
> prediction.
>
>
>
>
> 2016-06-02 19:12 GMT+02:00 Mich Talebzadeh :
>
>> I don't understand this.  How are you going to read from RDBMS database,
>> through JDBC?
>>
>> How often are you going to sample the transactional tables?
>>
>> You may find that a JDBC connection will take longer than your sliding
>> window length.
>>
>> Is this for real time analytics?
>>
>> Thanks
>>
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>> On 2 June 2016 at 18:08, Ted Yu  wrote:
>>
>>>
>>> http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/
>>>
>>>
>>> https://spark.apache.org/docs/1.6.1/api/scala/index.html#org.apache.spark.rdd.JdbcRDD
>>>
>>> FYI
>>>
>>> On Thu, Jun 2, 2016 at 6:26 AM, Zakaria Hili  wrote:
>>>
 I want to use spark streaming to read data from RDBMS database like
 mysql.

 but I don't know how to do this using JavaStreamingContext

  JavaStreamingContext jssc = new JavaStreamingContext(conf, 
 Durations.milliseconds(500));DataFrame df = jssc. ??

 I search in the internet but I didn't find anything

 thank you in advance.
 ᐧ

>>>
>>>
>>
> ᐧ
>


Re: ImportError: No module named numpy

2016-06-02 Thread Eike von Seggern
Hi,

are you using Spark on one machine or many?

If on many, are you sure numpy is correctly installed on all machines?

To check that the environment is set-up correctly, you can try something
like

import os
pythonpaths = sc.range(10).map(lambda i:
os.environ.get("PYTHONPATH")).collect()
print(pythonpaths)

HTH

Eike

2016-06-02 15:32 GMT+02:00 Bhupendra Mishra :

> did not resolved. :(
>
> On Thu, Jun 2, 2016 at 3:01 PM, Sergio Fernández 
> wrote:
>
>>
>> On Thu, Jun 2, 2016 at 9:59 AM, Bhupendra Mishra <
>> bhupendra.mis...@gmail.com> wrote:
>>>
>>> and i have already exported environment variable in spark-env.sh as
>>> follows.. error still there  error: ImportError: No module named numpy
>>>
>>> export PYSPARK_PYTHON=/usr/bin/python
>>>
>>
>> According the documentation at
>> http://spark.apache.org/docs/latest/configuration.html#environment-variables
>> the PYSPARK_PYTHON environment variable is for poniting to the Python
>> interpreter binary.
>>
>> If you check the programming guide
>> https://spark.apache.org/docs/0.9.0/python-programming-guide.html#installing-and-configuring-pyspark
>> it says you need to add your custom path to PYTHONPATH (the script
>> automatically adds the bin/pyspark there).
>>
>> So typically in Linux you would need to add the following (assuming you
>> installed numpy there):
>>
>> export PYTHONPATH=$PYTHONPATH:/usr/lib/python2.7/dist-packages
>>
>> Hope that helps.
>>
>>
>>
>>
>>> On Thu, Jun 2, 2016 at 12:04 AM, Julio Antonio Soto de Vicente <
>>> ju...@esbet.es> wrote:
>>>
 Try adding to spark-env.sh (renaming if you still have it with
 .template at the end):

 PYSPARK_PYTHON=/path/to/your/bin/python

 Where your bin/python is your actual Python environment with Numpy
 installed.


 El 1 jun 2016, a las 20:16, Bhupendra Mishra <
 bhupendra.mis...@gmail.com> escribió:

 I have numpy installed but where I should setup PYTHONPATH?


 On Wed, Jun 1, 2016 at 11:39 PM, Sergio Fernández 
 wrote:

> sudo pip install numpy
>
> On Wed, Jun 1, 2016 at 5:56 PM, Bhupendra Mishra <
> bhupendra.mis...@gmail.com> wrote:
>
>> Thanks .
>> How can this be resolved?
>>
>> On Wed, Jun 1, 2016 at 9:02 PM, Holden Karau 
>> wrote:
>>
>>> Generally this means numpy isn't installed on the system or your
>>> PYTHONPATH has somehow gotten pointed somewhere odd,
>>>
>>> On Wed, Jun 1, 2016 at 8:31 AM, Bhupendra Mishra <
>>> bhupendra.mis...@gmail.com> wrote:
>>>
 If any one please can help me with following error.

  File
 "/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/mllib/__init__.py",
 line 25, in 

 ImportError: No module named numpy


 Thanks in advance!


>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernan...@redlink.co
> w: http://redlink.co
>


>>>
>>
>>
>> --
>> Sergio Fernández
>> Partner Technology Manager
>> Redlink GmbH
>> m: +43 6602747925
>> e: sergio.fernan...@redlink.co
>> w: http://redlink.co
>>
>
>


-- 

*Jan Eike von Seggern*
Data Scientist

*Sevenval Technologies GmbH *

FRONT-END-EXPERTS SINCE 1999

Köpenicker Straße 154 | 10997 Berlin

office   +49 30 707 190 - 229
mail eike.segg...@sevenval.com

www.sevenval.com

Registered office: Köln, HRB 79823
Management: Jan Webering (CEO), Thorsten May, Sascha Langfus,
Joern-Carlos Kuntze

*We increase the return on investment of your mobile and web projects.
Contact us:* http://roi.sevenval.com/


Re: Seeking advice on realtime querying over JDBC

2016-06-02 Thread Mich Talebzadeh
What is the source of your data? Is it an RDBMS database plus topics
streamed via Kafka from other sources?




Dr Mich Talebzadeh



LinkedIn * 
https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
*



http://talebzadehmich.wordpress.com



On 2 June 2016 at 18:58, Cody Koeninger  wrote:

> Why are you wanting to expose spark over jdbc as opposed to just
> inserting the records from kafka into a jdbc compatible data store?
>
> On Thu, Jun 2, 2016 at 12:47 PM, Sunita Arvind 
> wrote:
> > Hi Experts,
> >
> > We are trying to get a kafka stream ingested in Spark and expose the
> > registered table over JDBC for querying. Here are some questions:
> > 1. Spark Streaming supports single context per application right? If I
> have
> > multiple customers and would like to create a kafka topic for each of
> them
> > and 1 streaming context for every topic is this doable? As per the
> current
> > spark documentation,
> >
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#initializing-streamingcontext
> > I can have only 1 active streaming context at a time. Is there no way
> around
> > that? The use case here is, if I am looking at a 5 min window, the window
> > should have records for that customer only, which is possible only by
> having
> > customer specific streaming context.
> >
> > 2. If I am able to create multiple contexts in this fashion, can I
> register
> > them as temp tables in my application and expose them over JDBC. Going by
> >
> https://forums.databricks.com/questions/1464/how-to-configure-thrift-server-to-use-a-custom-spa.html
> ,
> > looks like I can connect the thrift server to a single sparkSQL Context.
> > Having multiple streaming contexts means I automatically have multiple
> SQL
> > contexts?
> >
> > 3. Can I use SQLContext or do I need to have HiveContext in order to see
> the
> > tables registered via Spark application through the JDBC?
> >
> > regards
> > Sunita
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> For additional commands, e-mail: user-h...@spark.apache.org
>
>


Re: Seeking advice on realtime querying over JDBC

2016-06-02 Thread Cody Koeninger
Why are you wanting to expose Spark over JDBC, as opposed to just
inserting the records from Kafka into a JDBC-compatible data store?

On Thu, Jun 2, 2016 at 12:47 PM, Sunita Arvind  wrote:
> Hi Experts,
>
> We are trying to get a kafka stream ingested in Spark and expose the
> registered table over JDBC for querying. Here are some questions:
> 1. Spark Streaming supports single context per application right? If I have
> multiple customers and would like to create a kafka topic for each of them
> and 1 streaming context for every topic is this doable? As per the current
> spark documentation,
> http://spark.apache.org/docs/latest/streaming-programming-guide.html#initializing-streamingcontext
> I can have only 1 active streaming context at a time. Is there no way around
> that? The use case here is, if I am looking at a 5 min window, the window
> should have records for that customer only, which is possible only by having
> customer specific streaming context.
>
> 2. If I am able to create multiple contexts in this fashion, can I register
> them as temp tables in my application and expose them over JDBC. Going by
> https://forums.databricks.com/questions/1464/how-to-configure-thrift-server-to-use-a-custom-spa.html,
> looks like I can connect the thrift server to a single sparkSQL Context.
> Having multiple streaming contexts means I automatically have multiple SQL
> contexts?
>
> 3. Can I use SQLContext or do I need to have HiveContext in order to see the
> tables registered via Spark application through the JDBC?
>
> regards
> Sunita

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Seeking advice on realtime querying over JDBC

2016-06-02 Thread Sunita Arvind
Hi Experts,

We are trying to get a kafka stream ingested in Spark and expose the
registered table over JDBC for querying. Here are some questions:
1. Spark Streaming supports a single context per application, right? If I have
multiple customers and would like to create a kafka topic for each of them
and 1 streaming context for every topic is this doable? As per the current
spark documentation,
http://spark.apache.org/docs/latest/streaming-programming-guide.html#initializing-streamingcontext
I can have only 1 active streaming context at a time. Is there no way
around that? The use case here is, if I am looking at a 5 min window, the
window should have records for that customer only, which is possible only
by having a customer-specific streaming context.

2. If I am able to create multiple contexts in this fashion, can I register
them as temp tables in my application and expose them over JDBC. Going by
https://forums.databricks.com/questions/1464/how-to-configure-thrift-server-to-use-a-custom-spa.html,
looks like I can connect the thrift server to a single sparkSQL Context.
Having multiple streaming contexts means I automatically have multiple SQL
contexts?

3. Can I use SQLContext or do I need to have HiveContext in order to see
the tables registered via Spark application through the JDBC?

regards
Sunita
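
For reference, a minimal Scala sketch of the temp-table part of this (questions 2
and 3). It assumes a single shared HiveContext, a StreamingContext named ssc, and a
DStream of (customerId, value) pairs already built from Kafka; the names are
placeholders and it does not address the multiple-streaming-context question:

import org.apache.spark.sql.hive.HiveContext
import org.apache.spark.sql.hive.thriftserver.HiveThriftServer2

val hiveContext = new HiveContext(ssc.sparkContext)
import hiveContext.implicits._

// Expose the latest micro-batch to JDBC clients as a temp table.
customerStream.foreachRDD { rdd =>
  val df = rdd.toDF("customerId", "value")
  df.registerTempTable("customer_events")
}

// Start the Thrift server against the same context so the temp table is
// visible over JDBC (temp tables live only inside this context).
HiveThriftServer2.startWithContext(hiveContext)

Temp tables only exist inside the context that registered them, so the Thrift server
has to be started with startWithContext against that same context, and
startWithContext takes a HiveContext rather than a plain SQLContext (question 3).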


Re: Stream reading from database using spark streaming

2016-06-02 Thread Mich Talebzadeh
I don't understand this.  How are you going to read from RDBMS database,
through JDBC?

How often are you going to sample the transactional tables?

You may find that a JDBC connection will take longer than your sliding
window length.

Is this for real time analytics?

Thanks



Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 2 June 2016 at 18:08, Ted Yu  wrote:

>
> http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/
>
>
> https://spark.apache.org/docs/1.6.1/api/scala/index.html#org.apache.spark.rdd.JdbcRDD
>
> FYI
>
> On Thu, Jun 2, 2016 at 6:26 AM, Zakaria Hili  wrote:
>
>> I want to use spark streaming to read data from RDBMS database like mysql.
>>
>> but I don't know how to do this using JavaStreamingContext
>>
>>  JavaStreamingContext jssc = new JavaStreamingContext(conf, 
>> Durations.milliseconds(500));DataFrame df = jssc. ??
>>
>> I search in the internet but I didn't find anything
>>
>> thank you in advance.
>>
>
>


Re: Stream reading from database using spark streaming

2016-06-02 Thread Ted Yu
http://www.sparkexpert.com/2015/03/28/loading-database-data-into-spark-using-data-sources-api/

https://spark.apache.org/docs/1.6.1/api/scala/index.html#org.apache.spark.rdd.JdbcRDD

FYI

On Thu, Jun 2, 2016 at 6:26 AM, Zakaria Hili  wrote:

> I want to use spark streaming to read data from RDBMS database like mysql.
>
> but I don't know how to do this using JavaStreamingContext
>
>  JavaStreamingContext jssc = new JavaStreamingContext(conf, 
> Durations.milliseconds(500));DataFrame df = jssc. ??
>
> I search in the internet but I didn't find anything
>
> thank you in advance.
>
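
A minimal Scala sketch of the DataFrame-based approach from the first link above
(Spark 1.6). Spark Streaming has no built-in JDBC receiver, so a table like this is
usually re-read periodically (for example inside foreachRDD or transform); the URL,
credentials and table name below are placeholders:

import java.util.Properties
import org.apache.spark.sql.SQLContext

val sqlContext = new SQLContext(sc)

val props = new Properties()
props.setProperty("user", "dbuser")
props.setProperty("password", "dbpass")

// Load the current contents of a MySQL table as a DataFrame
// (the MySQL JDBC driver jar must be on the classpath).
val df = sqlContext.read.jdbc("jdbc:mysql://dbhost:3306/mydb", "events", props)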


Re: how to increase threads per executor

2016-06-02 Thread Mich Talebzadeh
What are passing as parameters to Spark-submit?


${SPARK_HOME}/bin/spark-submit \
--executor-cores=12 \

Also check

http://spark.apache.org/docs/latest/configuration.html

Execution Behavior/spark.executor.cores


HTH


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 2 June 2016 at 17:29, Andres M Jimenez T  wrote:

> Hi,
>
>
> I am working with Spark 1.6.1, using kafka direct connect for streaming
> data.
>
> Using spark scheduler and 3 slaves.
>
> Kafka topic is partitioned with a value of 10.
>
>
> The problem i have is, there is only one thread per executor running my
> function (logic implementation).
>
>
> Can anybody tell me how can i increase threads per executor to get better
> use of CPUs?
>
>
> Thanks
>
>
> Here is the code i have implemented:
>
>
> *Driver*:
>
>
> JavaStreamingContext ssc = new JavaStreamingContext(conf, new
> Duration(1));
>
> //prepare streaming from kafka
>
> Set topicsSet = new
> HashSet<>(Arrays.asList("stage1-in,stage1-retry".split(",")));
>
> Map kafkaParams = new HashMap<>();
>
> kafkaParams.put("metadata.broker.list", kafkaBrokers);
>
> kafkaParams.put("group.id", SparkStreamingImpl.class.getName());
>
>
> JavaPairInputDStream inputMessages =
> KafkaUtils.createDirectStream(
>
> ssc,
>
> String.class,
>
> String.class,
>
> StringDecoder.class,
>
> StringDecoder.class,
>
> kafkaParams,
>
> topicsSet
>
> );
>
>
> inputMessages.foreachRDD(new ForeachRDDFunction());
>
>
> *ForeachFunction*:
>
>
> class ForeachFunction implements VoidFunction> {
>
> private static final Counter foreachConcurrent =
> ProcessingMetrics.metrics.counter( "foreach-concurrency" );
>
> public ForeachFunction() {
>
> LOG.info("Creating a new ForeachFunction");
>
> }
>
>
> public void call(Tuple2 t) throws Exception {
>
> foreachConcurrent.inc();
>
> LOG.info("processing message [" + t._1() + "]");
>
> try {
>
> Thread.sleep(1000);
>
> } catch (Exception e) { }
>
> foreachConcurrent.dec();
>
> }
>
> }
>
>
> *ForeachRDDFunction*:
>
>
> class ForeachRDDFunction implements VoidFunction String>> {
>
> private static final Counter foreachRDDConcurrent =
> ProcessingMetrics.metrics.counter( "foreachRDD-concurrency" );
>
> private ForeachFunction foreachFunction = new ForeachFunction();
>
> public ForeachRDDFunction() {
>
> LOG.info("Creating a new ForeachRDDFunction");
>
> }
>
>
> public void call(JavaPairRDD t) throws Exception {
>
> foreachRDDConcurrent.inc();
>
> LOG.info("call from inputMessages.foreachRDD with [" +
> t.partitions().size() + "] partitions");
>
> for (Partition p : t.partitions()) {
>
> if (p instanceof KafkaRDDPartition){
>
> LOG.info("partition [" + p.index() + "] with count [" +
> ((KafkaRDDPartition) p).count() + "]");
>
> }
>
> }
>
> t.foreachAsync(foreachFunction);
>
> foreachRDDConcurrent.dec();
>
> }
>
> }
>
>
> *The log from driver that tells me my RDD is partitioned to process in
> parallel*:
>
>
> [Stage 70:>  (3 + 3) / 20][Stage 71:>  (0 + 0) / 20][Stage 72:>  (0 + 0) /
> 20]16/06/02 08:32:10 INFO SparkStreamingImpl: call from
> inputMessages.foreachRDD with [20] partitions
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [0] with count [24]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [1] with count [0]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [2] with count [0]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [3] with count [19]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [4] with count [19]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [5] with count [20]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [6] with count [0]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [7] with count [23]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [8] with count [21]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [9] with count [0]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [10] with count [0]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [11] with count [0]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [12] with count [0]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [13] with count [26]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [14] with count [0]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [15] with count [27]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [16] with count [0]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [17] with count [16]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [18] with count [15]
>
> 16/06/02 08:32:10 INFO SparkStreamingImpl: partition [19] with count [0]
>
>
> *The log from one of executors showing 

how to increase threads per executor

2016-06-02 Thread Andres M Jimenez T
Hi,


I am working with Spark 1.6.1, using kafka direct connect for streaming data.

Using spark scheduler and 3 slaves.

Kafka topic is partitioned with a value of 10.


The problem I have is that only one thread per executor is running my
function (logic implementation).


Can anybody tell me how I can increase the number of threads per executor to get
better use of the CPUs?


Thanks


Here is the code i have implemented:


Driver:


JavaStreamingContext ssc = new JavaStreamingContext(conf, new Duration(1));

//prepare streaming from kafka

Set<String> topicsSet = new 
HashSet<>(Arrays.asList("stage1-in,stage1-retry".split(",")));

Map<String, String> kafkaParams = new HashMap<>();

kafkaParams.put("metadata.broker.list", kafkaBrokers);

kafkaParams.put("group.id", SparkStreamingImpl.class.getName());


JavaPairInputDStream<String, String> inputMessages = 
KafkaUtils.createDirectStream(

ssc,

String.class,

String.class,

StringDecoder.class,

StringDecoder.class,

kafkaParams,

topicsSet

);


inputMessages.foreachRDD(new ForeachRDDFunction());


ForeachFunction:


class ForeachFunction implements VoidFunction<Tuple2<String, String>> {

private static final Counter foreachConcurrent = 
ProcessingMetrics.metrics.counter( "foreach-concurrency" );

public ForeachFunction() {

LOG.info("Creating a new ForeachFunction");

}


public void call(Tuple2<String, String> t) throws Exception {

foreachConcurrent.inc();

LOG.info("processing message [" + t._1() + "]");

try {

Thread.sleep(1000);

} catch (Exception e) { }

foreachConcurrent.dec();

}

}


ForeachRDDFunction:


class ForeachRDDFunction implements VoidFunction<JavaPairRDD<String, String>> {

private static final Counter foreachRDDConcurrent = 
ProcessingMetrics.metrics.counter( "foreachRDD-concurrency" );

private ForeachFunction foreachFunction = new ForeachFunction();

public ForeachRDDFunction() {

LOG.info("Creating a new ForeachRDDFunction");

}


public void call(JavaPairRDD<String, String> t) throws Exception {

foreachRDDConcurrent.inc();

LOG.info("call from inputMessages.foreachRDD with [" + t.partitions().size() + 
"] partitions");

for (Partition p : t.partitions()) {

if (p instanceof KafkaRDDPartition){

LOG.info("partition [" + p.index() + "] with count [" + ((KafkaRDDPartition) 
p).count() + "]");

}

}

t.foreachAsync(foreachFunction);

foreachRDDConcurrent.dec();

}

}


The log from driver that tells me my RDD is partitioned to process in parallel:


[Stage 70:>  (3 + 3) / 20][Stage 71:>  (0 + 0) / 20][Stage 72:>  (0 + 0) / 
20]16/06/02 08:32:10 INFO SparkStreamingImpl: call from 
inputMessages.foreachRDD with [20] partitions

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [0] with count [24]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [1] with count [0]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [2] with count [0]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [3] with count [19]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [4] with count [19]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [5] with count [20]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [6] with count [0]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [7] with count [23]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [8] with count [21]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [9] with count [0]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [10] with count [0]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [11] with count [0]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [12] with count [0]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [13] with count [26]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [14] with count [0]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [15] with count [27]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [16] with count [0]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [17] with count [16]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [18] with count [15]

16/06/02 08:32:10 INFO SparkStreamingImpl: partition [19] with count [0]


The log from one of executors showing exactly one message per second was 
processed (only by one thread):


16/06/02 08:32:46 INFO SparkStreamingImpl: processing message 
[f2b22bb9-3bd8-4e5b-b9fb-afa7e8c4deb8]

16/06/02 08:32:47 INFO SparkStreamingImpl: processing message 
[e267cde2-ffea-4f7a-9934-f32a3b7218cc]

16/06/02 08:32:48 INFO SparkStreamingImpl: processing message 
[f055fe3c-0f72-4f41-9a31-df544f1e1cd3]

16/06/02 08:32:49 INFO SparkStreamingImpl: processing message 
[854faaa5-0abe-49a2-b13a-c290a3720b0e]

16/06/02 08:32:50 INFO SparkStreamingImpl: processing message 
[1bc0a141-b910-45fe-9881-e2066928fbc6]

16/06/02 08:32:51 INFO SparkStreamingImpl: processing message 
[67fb99c6-1ca1-4dfb-bffe-43b927fdec07]

16/06/02 08:32:52 INFO SparkStreamingImpl: processing message 
[de7d5934-bab2-4019-917e-c339d864ba18]

16/06/02 08:32:53 INFO SparkStreamingImpl: processing 
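
Following the --executor-cores suggestion in the reply earlier in this digest: each
Kafka partition becomes one task, and an executor runs roughly up to
spark.executor.cores tasks at a time, so the usual pattern is to raise the executor
cores and do the per-record work inside foreachPartition. A Scala sketch only; the
stream and the processing logic are placeholders:

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

val conf = new SparkConf()
  .setAppName("stage1-streaming")
  .set("spark.executor.cores", "4")   // or pass --executor-cores 4 to spark-submit

val ssc = new StreamingContext(conf, Seconds(10))

def process(key: String, value: String): Unit = {
  // placeholder for the real per-record logic
}

// 'messages' stands for the Kafka direct stream of (key, value) pairs.
messages.foreachRDD { rdd =>
  rdd.foreachPartition { records =>
    // One task per Kafka partition; tasks run in parallel across the
    // executors' cores, so the per-record work here is parallelised.
    records.foreach { case (key, value) => process(key, value) }
  }
}

ssc.start()
ssc.awaitTermination()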

Partitioning Data to optimize combineByKey

2016-06-02 Thread Nathan Case
Hello,

I am trying to process a dataset that is approximately 2 tb using a cluster
with 4.5 tb of ram.  The data is in parquet format and is initially loaded
into a dataframe.  A subset of the data is then queried for and converted
to RDD for more complicated processing.  The first stage of that processing
is to mapToPair to use each rows id as the key in a tuple.  Then the data
goes through a combineByKey operation to group all values with the same
key.  This operation always exceeds the maximum cluster memory and the job
eventually fails.  While it is shuffling there is a lot of "spilling
in-memory map to disk" messages.  I am wondering if I were to have the data
initially partitioned such that all the rows with the same id resided
within the same partition if it would need to do left shuffling and perform
correctly.

To do the initial load I am using:

sqlContext.read().parquet(inputPathArray).repartition(1, new
Column("id"));

I am not sure if this is the correct way to partition a dataframe, so my first
question is: is the above correct?

My next question is that when I go from the dataframe to rdd using:

JavaRDD locationsForSpecificKey = sqlc.sql("SELECT * FROM
standardlocationrecords WHERE customerID = " + customerID + " AND
partnerAppID = " + partnerAppID)
.toJavaRDD().map(new LocationRecordFromRow()::apply);

is the partition scheme from the dataframe preserved or do I need to
repartition after doing a mapToPair using:

rdd.partitionBy and passing in a custom HashPartitioner that uses the hash
of the ID field.

My goal is to reduce the shuffling when doing the final combineByKey to
prevent the job from running out of memory and failing.  Any help would be
greatly appreciated.

Thanks,
Nathan
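
On the partitioning questions: an RDD obtained via .toJavaRDD().map(...) carries no
Partitioner (map does not preserve one), so the usual way to limit the combineByKey
shuffle is to partitionBy the pair RDD with the same partitioner that combineByKey
will use. A Scala sketch; the Record type, field names and partition count are
placeholders:

import org.apache.spark.HashPartitioner
import org.apache.spark.storage.StorageLevel

case class Record(id: String)                     // placeholder record type

val partitioner = new HashPartitioner(2000)       // tune to the data size

val byId = records                                // RDD[Record], a placeholder
  .map(r => (r.id, r))                            // key by the id used for grouping
  .partitionBy(partitioner)                       // the only shuffle
  .persist(StorageLevel.MEMORY_AND_DISK)

// Same partitioner, so combineByKey combines within partitions and does not
// shuffle again.
val grouped = byId.combineByKey(
  (v: Record) => List(v),
  (acc: List[Record], v: Record) => v :: acc,
  (a: List[Record], b: List[Record]) => a ::: b,
  partitioner
)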


How to generate seeded random numbers in GraphX Pregel API vertex procedure?

2016-06-02 Thread Roman Pastukhov
As far as I understand, best way to generate seeded random numbers in Spark
is to use mapPartititons with a seeded Random instance for each partition.
But graph.pregel in GraphX does not have anything similar to mapPartitions.

Can something like this be done in GraphX Pregel API?
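
One workaround, sketched below (not a GraphX feature): derive a deterministic seed
inside the vertex program from a global seed and the vertex id, so no
partition-level state is needed. The graph is assumed to be Graph[Double, Double]
and the update/message logic is a placeholder:

import scala.util.Random
import org.apache.spark.graphx._

val globalSeed = 42L

val result = graph.pregel(0.0, maxIterations = 10)(
  // vprog: the seed depends only on (globalSeed, vertex id), so runs are reproducible
  (id: VertexId, attr: Double, msg: Double) => {
    val rng = new Random(globalSeed ^ id)
    attr + msg * rng.nextDouble()
  },
  // sendMsg / mergeMsg: placeholders for the real message logic
  triplet => Iterator((triplet.dstId, triplet.srcAttr)),
  (a, b) => a + b
)

If different randomness is needed in every superstep, a superstep counter carried in
the vertex attribute can be mixed into the seed as well.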


Re: ImportError: No module named numpy

2016-06-02 Thread nguyen duc tuan
You should set both PYSPARK_DRIVER_PYTHON and PYSPARK_PYTHON to the path of
your Python interpreter.
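
For example, in conf/spark-env.sh (the path below is a placeholder for whichever
Python environment actually has NumPy installed):

export PYSPARK_PYTHON=/path/to/python-with-numpy/bin/python
export PYSPARK_DRIVER_PYTHON=/path/to/python-with-numpy/bin/python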

2016-06-02 20:32 GMT+07:00 Bhupendra Mishra :

> did not resolved. :(
>
> On Thu, Jun 2, 2016 at 3:01 PM, Sergio Fernández 
> wrote:
>
>>
>> On Thu, Jun 2, 2016 at 9:59 AM, Bhupendra Mishra <
>> bhupendra.mis...@gmail.com> wrote:
>>>
>>> and i have already exported environment variable in spark-env.sh as
>>> follows.. error still there  error: ImportError: No module named numpy
>>>
>>> export PYSPARK_PYTHON=/usr/bin/python
>>>
>>
>> According the documentation at
>> http://spark.apache.org/docs/latest/configuration.html#environment-variables
>> the PYSPARK_PYTHON environment variable is for poniting to the Python
>> interpreter binary.
>>
>> If you check the programming guide
>> https://spark.apache.org/docs/0.9.0/python-programming-guide.html#installing-and-configuring-pyspark
>> it says you need to add your custom path to PYTHONPATH (the script
>> automatically adds the bin/pyspark there).
>>
>> So typically in Linux you would need to add the following (assuming you
>> installed numpy there):
>>
>> export PYTHONPATH=$PYTHONPATH:/usr/lib/python2.7/dist-packages
>>
>> Hope that helps.
>>
>>
>>
>>
>>> On Thu, Jun 2, 2016 at 12:04 AM, Julio Antonio Soto de Vicente <
>>> ju...@esbet.es> wrote:
>>>
 Try adding to spark-env.sh (renaming if you still have it with
 .template at the end):

 PYSPARK_PYTHON=/path/to/your/bin/python

 Where your bin/python is your actual Python environment with Numpy
 installed.


 El 1 jun 2016, a las 20:16, Bhupendra Mishra <
 bhupendra.mis...@gmail.com> escribió:

 I have numpy installed but where I should setup PYTHONPATH?


 On Wed, Jun 1, 2016 at 11:39 PM, Sergio Fernández 
 wrote:

> sudo pip install numpy
>
> On Wed, Jun 1, 2016 at 5:56 PM, Bhupendra Mishra <
> bhupendra.mis...@gmail.com> wrote:
>
>> Thanks .
>> How can this be resolved?
>>
>> On Wed, Jun 1, 2016 at 9:02 PM, Holden Karau 
>> wrote:
>>
>>> Generally this means numpy isn't installed on the system or your
>>> PYTHONPATH has somehow gotten pointed somewhere odd,
>>>
>>> On Wed, Jun 1, 2016 at 8:31 AM, Bhupendra Mishra <
>>> bhupendra.mis...@gmail.com> wrote:
>>>
 If any one please can help me with following error.

  File
 "/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/mllib/__init__.py",
 line 25, in 

 ImportError: No module named numpy


 Thanks in advance!


>>>
>>>
>>> --
>>> Cell : 425-233-8271
>>> Twitter: https://twitter.com/holdenkarau
>>>
>>
>>
>
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernan...@redlink.co
> w: http://redlink.co
>


>>>
>>
>>
>> --
>> Sergio Fernández
>> Partner Technology Manager
>> Redlink GmbH
>> m: +43 6602747925
>> e: sergio.fernan...@redlink.co
>> w: http://redlink.co
>>
>
>


Re: ImportError: No module named numpy

2016-06-02 Thread Bhupendra Mishra
It still did not resolve the issue. :(

On Thu, Jun 2, 2016 at 3:01 PM, Sergio Fernández  wrote:

>
> On Thu, Jun 2, 2016 at 9:59 AM, Bhupendra Mishra <
> bhupendra.mis...@gmail.com> wrote:
>>
>> and i have already exported environment variable in spark-env.sh as
>> follows.. error still there  error: ImportError: No module named numpy
>>
>> export PYSPARK_PYTHON=/usr/bin/python
>>
>
> According the documentation at
> http://spark.apache.org/docs/latest/configuration.html#environment-variables
> the PYSPARK_PYTHON environment variable is for poniting to the Python
> interpreter binary.
>
> If you check the programming guide
> https://spark.apache.org/docs/0.9.0/python-programming-guide.html#installing-and-configuring-pyspark
> it says you need to add your custom path to PYTHONPATH (the script
> automatically adds the bin/pyspark there).
>
> So typically in Linux you would need to add the following (assuming you
> installed numpy there):
>
> export PYTHONPATH=$PYTHONPATH:/usr/lib/python2.7/dist-packages
>
> Hope that helps.
>
>
>
>
>> On Thu, Jun 2, 2016 at 12:04 AM, Julio Antonio Soto de Vicente <
>> ju...@esbet.es> wrote:
>>
>>> Try adding to spark-env.sh (renaming if you still have it with .template
>>> at the end):
>>>
>>> PYSPARK_PYTHON=/path/to/your/bin/python
>>>
>>> Where your bin/python is your actual Python environment with Numpy
>>> installed.
>>>
>>>
>>> El 1 jun 2016, a las 20:16, Bhupendra Mishra 
>>> escribió:
>>>
>>> I have numpy installed but where I should setup PYTHONPATH?
>>>
>>>
>>> On Wed, Jun 1, 2016 at 11:39 PM, Sergio Fernández 
>>> wrote:
>>>
 sudo pip install numpy

 On Wed, Jun 1, 2016 at 5:56 PM, Bhupendra Mishra <
 bhupendra.mis...@gmail.com> wrote:

> Thanks .
> How can this be resolved?
>
> On Wed, Jun 1, 2016 at 9:02 PM, Holden Karau 
> wrote:
>
>> Generally this means numpy isn't installed on the system or your
>> PYTHONPATH has somehow gotten pointed somewhere odd,
>>
>> On Wed, Jun 1, 2016 at 8:31 AM, Bhupendra Mishra <
>> bhupendra.mis...@gmail.com> wrote:
>>
>>> If any one please can help me with following error.
>>>
>>>  File
>>> "/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/mllib/__init__.py",
>>> line 25, in 
>>>
>>> ImportError: No module named numpy
>>>
>>>
>>> Thanks in advance!
>>>
>>>
>>
>>
>> --
>> Cell : 425-233-8271
>> Twitter: https://twitter.com/holdenkarau
>>
>
>


 --
 Sergio Fernández
 Partner Technology Manager
 Redlink GmbH
 m: +43 6602747925
 e: sergio.fernan...@redlink.co
 w: http://redlink.co

>>>
>>>
>>
>
>
> --
> Sergio Fernández
> Partner Technology Manager
> Redlink GmbH
> m: +43 6602747925
> e: sergio.fernan...@redlink.co
> w: http://redlink.co
>


Re: Container preempted by scheduler - Spark job error

2016-06-02 Thread Ted Yu
Not much information in the attachment.

There was TimeoutException w.r.t. BlockManagerMaster.removeRdd().

Any chance of more logs ?

Thanks

On Thu, Jun 2, 2016 at 2:07 AM, Vishnu Nair  wrote:

> Hi Ted
>
> We use Hadoop 2.6 & Spark 1.3.1. I also attached the error file to this
> mail, please have a look at it.
>
> Thanks
>
> On Thu, Jun 2, 2016 at 11:51 AM, Ted Yu  wrote:
>
>> Can you show the error in bit more detail ?
>>
>> Which release of hadoop / Spark are you using ?
>>
>> Is CapacityScheduler being used ?
>>
>> Thanks
>>
>> On Thu, Jun 2, 2016 at 1:32 AM, Prabeesh K.  wrote:
>>
>>> Hi I am using the below command to run a spark job and I get an error
>>> like "Container preempted by scheduler"
>>>
>>> I am not sure if it's related to the wrong usage of Memory:
>>>
>>> nohup ~/spark1.3/bin/spark-submit \ --num-executors 50 \ --master yarn \
>>> --deploy-mode cluster \ --queue adhoc \ --driver-memory 18G \
>>> --executor-memory 12G \ --class main.ru..bigdata.externalchurn.Main
>>> \ --conf "spark.task.maxFailures=100" \ --conf
>>> "spark.yarn.max.executor.failures=1" \ --conf "spark.executor.cores=1"
>>> \ --conf "spark.akka.frameSize=50" \ --conf
>>> "spark.storage.memoryFraction=0.5" \ --conf
>>> "spark.driver.maxResultSize=10G" \
>>> ~/external-flow/externalChurn-1.0-SNAPSHOT-shaded.jar \
>>> prepareTraining=true \ prepareTrainingMNP=true \ prepareMap=false \
>>> bouldozerMode=true \ &> ~/external-flow/run.log & echo "STARTED" tail -f
>>> ~/external-flow/run.log
>>>
>>> Thanks,
>>>
>>>
>>>
>>>
>>>
>>
>


Stream reading from database using spark streaming

2016-06-02 Thread Zakaria Hili
I want to use Spark Streaming to read data from an RDBMS database such as MySQL,

but I don't know how to do this using JavaStreamingContext:

 JavaStreamingContext jssc = new JavaStreamingContext(conf,
Durations.milliseconds(500));DataFrame df = jssc. ??

I searched the internet but didn't find anything.

thank you in advance.


Re: Ignore features in Random Forest

2016-06-02 Thread Neha Mehta
Thanks Yuhao.

Regards,
Neha

On Thu, Jun 2, 2016 at 11:51 AM, Yuhao Yang  wrote:

> Hi Neha,
>
> This looks like a feature engineering task. I think VectorSlicer can help
> with your case. Please refer to
> http://spark.apache.org/docs/latest/ml-features.html#vectorslicer .
>
> Regards,
> Yuhao
>
> 2016-06-01 21:18 GMT+08:00 Neha Mehta :
>
>> Hi,
>>
>> I am performing Regression using Random Forest. In my input vector, I
>> want the algorithm to ignore certain columns/features while training the
>> classifier and also while prediction. These are basically Id columns. I
>> checked the documentation and could not find any information on the same.
>>
>> Request help with the same.
>>
>> Thanks & Regards,
>> Neha
>>
>
>


Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-02 Thread Mich Talebzadeh
thanks for that.

I will have a look

Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com



On 2 June 2016 at 10:46, Elliot West  wrote:

> Related to this, there exists an API in Hive to simplify the integrations
> of other frameworks with Hive's ACID feature:
>
> See:
> https://cwiki.apache.org/confluence/display/Hive/HCatalog+Streaming+Mutation+API
>
> It contains code for maintaining heartbeats, handling locks and
> transactions, and submitting mutations in a distributed environment.
>
> We have used it to write to transactional tables from Cascading based
> processes.
>
> Elliot.
>
>
> On 2 June 2016 at 09:54, Mich Talebzadeh 
> wrote:
>
>>
>> Hi,
>>
>> Spark does not support transactions because as I understand there is a
>> piece in the execution side that needs to send heartbeats to Hive metastore
>> saying a transaction is still alive". That has not been implemented in
>> Spark yet to my knowledge."
>>
>> Any idea on the timelines when we are going to have support for
>> transactions in Spark for Hive ORC tables. This will really be useful.
>>
>>
>> Thanks,
>>
>>
>> Dr Mich Talebzadeh
>>
>>
>>
>> LinkedIn * 
>> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
>> *
>>
>>
>>
>> http://talebzadehmich.wordpress.com
>>
>>
>>
>
>


Re: Container preempted by scheduler - Spark job error

2016-06-02 Thread Jacek Laskowski
Hi,

Few things for closer examination:

* is yarn master URL accepted in 1.3? I thought it was only in later
releases. Since you're seeing the issue it seems it does work.

* I've never seen confs specified as a single string. Can you check in
the Web UI that they're applied?

* what about this  in class?

Can you publish the application logs from the driver and executors?

Jacek
On 2 Jun 2016 10:33 a.m., "Prabeesh K."  wrote:

> Hi I am using the below command to run a spark job and I get an error like
> "Container preempted by scheduler"
>
> I am not sure if it's related to the wrong usage of Memory:
>
> nohup ~/spark1.3/bin/spark-submit \ --num-executors 50 \ --master yarn \
> --deploy-mode cluster \ --queue adhoc \ --driver-memory 18G \
> --executor-memory 12G \ --class main.ru..bigdata.externalchurn.Main
> \ --conf "spark.task.maxFailures=100" \ --conf
> "spark.yarn.max.executor.failures=1" \ --conf "spark.executor.cores=1"
> \ --conf "spark.akka.frameSize=50" \ --conf
> "spark.storage.memoryFraction=0.5" \ --conf
> "spark.driver.maxResultSize=10G" \
> ~/external-flow/externalChurn-1.0-SNAPSHOT-shaded.jar \
> prepareTraining=true \ prepareTrainingMNP=true \ prepareMap=false \
> bouldozerMode=true \ &> ~/external-flow/run.log & echo "STARTED" tail -f
> ~/external-flow/run.log
>
> Thanks,
>
>
>
>
>


Re: Spark support for update/delete operations on Hive ORC transactional tables

2016-06-02 Thread Elliot West
Related to this, there exists an API in Hive to simplify the integrations
of other frameworks with Hive's ACID feature:

See:
https://cwiki.apache.org/confluence/display/Hive/HCatalog+Streaming+Mutation+API

It contains code for maintaining heartbeats, handling locks and
transactions, and submitting mutations in a distributed environment.

We have used it to write to transactional tables from Cascading based
processes.

Elliot.


On 2 June 2016 at 09:54, Mich Talebzadeh  wrote:

>
> Hi,
>
> Spark does not support transactions because as I understand there is a
> piece in the execution side that needs to send heartbeats to Hive metastore
> saying a transaction is still alive". That has not been implemented in
> Spark yet to my knowledge."
>
> Any idea on the timelines when we are going to have support for
> transactions in Spark for Hive ORC tables. This will really be useful.
>
>
> Thanks,
>
>
> Dr Mich Talebzadeh
>
>
>
> LinkedIn * 
> https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw
> *
>
>
>
> http://talebzadehmich.wordpress.com
>
>
>


Re: ImportError: No module named numpy

2016-06-02 Thread Sergio Fernández
On Thu, Jun 2, 2016 at 9:59 AM, Bhupendra Mishra  wrote:
>
> and i have already exported environment variable in spark-env.sh as
> follows.. error still there  error: ImportError: No module named numpy
>
> export PYSPARK_PYTHON=/usr/bin/python
>

According to the documentation at
http://spark.apache.org/docs/latest/configuration.html#environment-variables
the PYSPARK_PYTHON environment variable is for pointing to the Python
interpreter binary.

If you check the programming guide
https://spark.apache.org/docs/0.9.0/python-programming-guide.html#installing-and-configuring-pyspark
it says you need to add your custom path to PYTHONPATH (the script
automatically adds the bin/pyspark there).

So typically in Linux you would need to add the following (assuming you
installed numpy there):

export PYTHONPATH=$PYTHONPATH:/usr/lib/python2.7/dist-packages

Hope that helps.




> On Thu, Jun 2, 2016 at 12:04 AM, Julio Antonio Soto de Vicente <
> ju...@esbet.es> wrote:
>
>> Try adding to spark-env.sh (renaming if you still have it with .template
>> at the end):
>>
>> PYSPARK_PYTHON=/path/to/your/bin/python
>>
>> Where your bin/python is your actual Python environment with Numpy
>> installed.
>>
>>
>> El 1 jun 2016, a las 20:16, Bhupendra Mishra 
>> escribió:
>>
>> I have numpy installed but where I should setup PYTHONPATH?
>>
>>
>> On Wed, Jun 1, 2016 at 11:39 PM, Sergio Fernández 
>> wrote:
>>
>>> sudo pip install numpy
>>>
>>> On Wed, Jun 1, 2016 at 5:56 PM, Bhupendra Mishra <
>>> bhupendra.mis...@gmail.com> wrote:
>>>
 Thanks .
 How can this be resolved?

 On Wed, Jun 1, 2016 at 9:02 PM, Holden Karau 
 wrote:

> Generally this means numpy isn't installed on the system or your
> PYTHONPATH has somehow gotten pointed somewhere odd,
>
> On Wed, Jun 1, 2016 at 8:31 AM, Bhupendra Mishra <
> bhupendra.mis...@gmail.com> wrote:
>
>> If any one please can help me with following error.
>>
>>  File
>> "/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/mllib/__init__.py",
>> line 25, in 
>>
>> ImportError: No module named numpy
>>
>>
>> Thanks in advance!
>>
>>
>
>
> --
> Cell : 425-233-8271
> Twitter: https://twitter.com/holdenkarau
>


>>>
>>>
>>> --
>>> Sergio Fernández
>>> Partner Technology Manager
>>> Redlink GmbH
>>> m: +43 6602747925
>>> e: sergio.fernan...@redlink.co
>>> w: http://redlink.co
>>>
>>
>>
>


-- 
Sergio Fernández
Partner Technology Manager
Redlink GmbH
m: +43 6602747925
e: sergio.fernan...@redlink.co
w: http://redlink.co


Fwd: Container preempted by scheduler - Spark job error

2016-06-02 Thread Prabeesh K.
Hi Ted

We use Hadoop 2.6 & Spark 1.3.1. I also attached the error file to this
mail, please have a look at it.

Thanks

On Thu, Jun 2, 2016 at 11:51 AM, Ted Yu  wrote:

> Can you show the error in bit more detail ?
>
> Which release of hadoop / Spark are you using ?
>
> Is CapacityScheduler being used ?
>
> Thanks
>
> On Thu, Jun 2, 2016 at 1:32 AM, Prabeesh K.  wrote:
>
>> Hi I am using the below command to run a spark job and I get an error
>> like "Container preempted by scheduler"
>>
>> I am not sure if it's related to the wrong usage of Memory:
>>
>> nohup ~/spark1.3/bin/spark-submit \ --num-executors 50 \ --master yarn \
>> --deploy-mode cluster \ --queue adhoc \ --driver-memory 18G \
>> --executor-memory 12G \ --class main.ru..bigdata.externalchurn.Main
>> \ --conf "spark.task.maxFailures=100" \ --conf
>> "spark.yarn.max.executor.failures=1" \ --conf "spark.executor.cores=1"
>> \ --conf "spark.akka.frameSize=50" \ --conf
>> "spark.storage.memoryFraction=0.5" \ --conf
>> "spark.driver.maxResultSize=10G" \
>> ~/external-flow/externalChurn-1.0-SNAPSHOT-shaded.jar \
>> prepareTraining=true \ prepareTrainingMNP=true \ prepareMap=false \
>> bouldozerMode=true \ &> ~/external-flow/run.log & echo "STARTED" tail -f
>> ~/external-flow/run.log
>>
>> Thanks,
>>
>>
>>
>>
>>
>


spark-error
Description: Binary data

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org

Spark support for update/delete operations on Hive ORC transactional tables

2016-06-02 Thread Mich Talebzadeh
Hi,

Spark does not support transactions because, as I understand it, there is a
piece on the execution side that needs to send heartbeats to the Hive metastore
saying "a transaction is still alive". That has not been implemented in
Spark yet, to my knowledge.

Any idea on the timelines when we are going to have support for
transactions in Spark for Hive ORC tables. This will really be useful.


Thanks,


Dr Mich Talebzadeh



LinkedIn: https://www.linkedin.com/profile/view?id=AAEWh2gBxianrbJd6zP6AcPCCdOABUrV8Pw



http://talebzadehmich.wordpress.com


Re: Container preempted by scheduler - Spark job error

2016-06-02 Thread Ted Yu
Can you show the error in bit more detail ?

Which release of hadoop / Spark are you using ?

Is CapacityScheduler being used ?

Thanks

On Thu, Jun 2, 2016 at 1:32 AM, Prabeesh K.  wrote:

> Hi I am using the below command to run a spark job and I get an error like
> "Container preempted by scheduler"
>
> I am not sure if it's related to the wrong usage of Memory:
>
> nohup ~/spark1.3/bin/spark-submit \ --num-executors 50 \ --master yarn \
> --deploy-mode cluster \ --queue adhoc \ --driver-memory 18G \
> --executor-memory 12G \ --class main.ru..bigdata.externalchurn.Main
> \ --conf "spark.task.maxFailures=100" \ --conf
> "spark.yarn.max.executor.failures=1" \ --conf "spark.executor.cores=1"
> \ --conf "spark.akka.frameSize=50" \ --conf
> "spark.storage.memoryFraction=0.5" \ --conf
> "spark.driver.maxResultSize=10G" \
> ~/external-flow/externalChurn-1.0-SNAPSHOT-shaded.jar \
> prepareTraining=true \ prepareTrainingMNP=true \ prepareMap=false \
> bouldozerMode=true \ &> ~/external-flow/run.log & echo "STARTED" tail -f
> ~/external-flow/run.log
>
> Thanks,
>
>
>
>
>


Container preempted by scheduler - Spark job error

2016-06-02 Thread Prabeesh K.
Hi I am using the below command to run a spark job and I get an error like
"Container preempted by scheduler"

I am not sure if it's related to the wrong usage of Memory:

nohup ~/spark1.3/bin/spark-submit \
  --num-executors 50 \
  --master yarn \
  --deploy-mode cluster \
  --queue adhoc \
  --driver-memory 18G \
  --executor-memory 12G \
  --class main.ru..bigdata.externalchurn.Main \
  --conf "spark.task.maxFailures=100" \
  --conf "spark.yarn.max.executor.failures=1" \
  --conf "spark.executor.cores=1" \
  --conf "spark.akka.frameSize=50" \
  --conf "spark.storage.memoryFraction=0.5" \
  --conf "spark.driver.maxResultSize=10G" \
  ~/external-flow/externalChurn-1.0-SNAPSHOT-shaded.jar \
  prepareTraining=true \
  prepareTrainingMNP=true \
  prepareMap=false \
  bouldozerMode=true \
  &> ~/external-flow/run.log &
echo "STARTED"
tail -f ~/external-flow/run.log

Thanks,


Re: ImportError: No module named numpy

2016-06-02 Thread Bhupendra Mishra
It's RHEL,

and I have already exported the environment variable in spark-env.sh as
follows; the error is still there: ImportError: No module named numpy

export PYSPARK_PYTHON=/usr/bin/python

thanks

On Thu, Jun 2, 2016 at 12:04 AM, Julio Antonio Soto de Vicente <
ju...@esbet.es> wrote:

> Try adding to spark-env.sh (renaming if you still have it with .template
> at the end):
>
> PYSPARK_PYTHON=/path/to/your/bin/python
>
> Where your bin/python is your actual Python environment with Numpy
> installed.
>
>
> El 1 jun 2016, a las 20:16, Bhupendra Mishra 
> escribió:
>
> I have numpy installed but where I should setup PYTHONPATH?
>
>
> On Wed, Jun 1, 2016 at 11:39 PM, Sergio Fernández 
> wrote:
>
>> sudo pip install numpy
>>
>> On Wed, Jun 1, 2016 at 5:56 PM, Bhupendra Mishra <
>> bhupendra.mis...@gmail.com> wrote:
>>
>>> Thanks .
>>> How can this be resolved?
>>>
>>> On Wed, Jun 1, 2016 at 9:02 PM, Holden Karau 
>>> wrote:
>>>
 Generally this means numpy isn't installed on the system or your
 PYTHONPATH has somehow gotten pointed somewhere odd,

 On Wed, Jun 1, 2016 at 8:31 AM, Bhupendra Mishra <
 bhupendra.mis...@gmail.com> wrote:

> If any one please can help me with following error.
>
>  File
> "/opt/mapr/spark/spark-1.6.1/python/lib/pyspark.zip/pyspark/mllib/__init__.py",
> line 25, in 
>
> ImportError: No module named numpy
>
>
> Thanks in advance!
>
>


 --
 Cell : 425-233-8271
 Twitter: https://twitter.com/holdenkarau

>>>
>>>
>>
>>
>> --
>> Sergio Fernández
>> Partner Technology Manager
>> Redlink GmbH
>> m: +43 6602747925
>> e: sergio.fernan...@redlink.co
>> w: http://redlink.co
>>
>
>


Fwd: Beeline - Spark thrift server user retrieval Issue

2016-06-02 Thread pooja mehta
-- Forwarded message --
From: pooja mehta 
Date: Thu, Jun 2, 2016 at 1:25 PM
Subject: Fwd: Beeline - Spark thrift server user retrieval Issue
To: user-subscr...@spark.apache.org



-- Forwarded message --
From: pooja mehta 
Date: Thu, Jun 2, 2016 at 1:24 PM
Subject: Beeline - Spark thrift server user retrieval Issue
To: user-i...@spark.apache.org, user-...@spark.apache.org


Hi,

I am using the Spark Beeline client. When I log in as a new user (let's
suppose user2), connect to port 10015 (the Spark Thrift Server), and use
 userName = UserGroupInformation.getCurrentUser().getShortUserName();
it always gives the hive user, not user2.

Is there any way to get user2?

However, when I connect to port 1 (the Hive Thrift Server) via Spark
Beeline, it gives user2.


Re: --driver-cores for Standalone and YARN only?! What about Mesos?

2016-06-02 Thread Holden Karau
Also seems like this might be better suited for dev@

On Thursday, June 2, 2016, Sun Rui  wrote:

> yes, I think you can fire a JIRA issue for this.
> But why removing the default value. Seems the default core is 1 according
> to
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala#L110
>
> On Jun 2, 2016, at 05:18, Jacek Laskowski  > wrote:
>
> Hi,
>
> I'm reviewing the code of spark-submit and can see that although
> --driver-cores is said to be for Standalone and YARN only, it is
> applicable for Mesos [1].
>
> ➜  spark git:(master) ✗ ./bin/spark-shell --help
> Usage: ./bin/spark-shell [options]
> ...
> Spark standalone with cluster deploy mode only:
>  --driver-cores NUM  Cores for driver (Default: 1).
> ...
> YARN-only:
>  --driver-cores NUM  Number of cores used by the driver, only
> in cluster mode
>  (Default: 1).
>
> I think Mesos has been overlooked (as it's not even included in the
> --help). I also can't find that the default number of cores for the
> driver for the option is 1.
>
> I can see few things to fix:
>
> 1. Have --driver-cores in the "main" help with no separation for
> standalone and YARN.
> 2. Add note that it works only for cluster deploy mode.
> 3. Remove (Default: 1)
>
> Please confirm (or fix) my understanding before I file a JIRA issue.
> Thanks!
>
> [1]
> https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L475-L476
>
> Pozdrawiam,
> Jacek Laskowski
> 
> https://medium.com/@jaceklaskowski/
> Mastering Apache Spark http://bit.ly/mastering-apache-spark
> Follow me at https://twitter.com/jaceklaskowski
>
> -
> To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
> 
> For additional commands, e-mail: user-h...@spark.apache.org
> 
>
>
>
>
>
>


-- 
Cell : 425-233-8271
Twitter: https://twitter.com/holdenkarau


Re: --driver-cores for Standalone and YARN only?! What about Mesos?

2016-06-02 Thread Sun Rui
yes, I think you can fire a JIRA issue for this.
But why remove the default value? It seems the default is 1 core, according to 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/rest/mesos/MesosRestServer.scala#L110


On Jun 2, 2016, at 05:18, Jacek Laskowski  wrote:


Hi,

I'm reviewing the code of spark-submit and can see that although
--driver-cores is said to be for Standalone and YARN only, it is
applicable for Mesos [1].

➜  spark git:(master) ✗ ./bin/spark-shell --help
Usage: ./bin/spark-shell [options]
...
Spark standalone with cluster deploy mode only:
 --driver-cores NUM  Cores for driver (Default: 1).
...
YARN-only:
 --driver-cores NUM  Number of cores used by the driver, only
in cluster mode
 (Default: 1).

I think Mesos has been overlooked (as it's not even included in the
--help). I also can't find that the default number of cores for the
driver for the option is 1.

I can see few things to fix:

1. Have --driver-cores in the "main" help with no separation for
standalone and YARN.
2. Add note that it works only for cluster deploy mode.
3. Remove (Default: 1)

Please confirm (or fix) my understanding before I file a JIRA issue. Thanks!

[1] 
https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/deploy/SparkSubmit.scala#L475-L476

Pozdrawiam,
Jacek Laskowski

https://medium.com/@jaceklaskowski/
Mastering Apache Spark http://bit.ly/mastering-apache-spark
Follow me at https://twitter.com/jaceklaskowski

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org





Re: spark-submit hive connection through spark Initial job has not accepted any resources

2016-06-02 Thread vinayak
Hi Herman,

This error comes when you have started your master but no worker has been
added to your cluster. Please check in the Spark master UI whether any worker
has registered with the master.

Also check whether your driver code sets configuration.setMaster(local[]);
if it does, remove it and pass the Spark master URL through the spark-submit command.

Thnx



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/spark-submit-hive-connection-through-spark-Initial-job-has-not-accepted-any-resources-tp24993p27074.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: get and append file name in record being reading

2016-06-02 Thread Sun Rui
You can use RDD.wholeTextFiles().

For example, suppose all your files are under /tmp/ABC_input/,

val rdd  = sc.wholeTextFiles("file:///tmp/ABC_input")
val rdd1 = rdd.flatMap { case (path, content) => 
  val fileName = new java.io.File(path).getName
  content.split("\n").map { line => (line, fileName) }
}
val df = sqlContext.createDataFrame(rdd1).toDF("line", "file")
> On Jun 2, 2016, at 03:13, Vikash Kumar  wrote:
> 
> 100,abc,299
> 200,xyz,499



Spark Streaming join

2016-06-02 Thread karthik tunga
Hi,

I have a scenario where I need to join a DStream with an RDD. This is to add
some metadata info to incoming events. This is fairly straightforward.

What I also want to do is refresh this metadata RDD on a fixed schedule (or
when the underlying HDFS file changes). I want to "expire" and reload this RDD
every, say, 10 minutes.

Is this possible ?

Apologies if this has been asked before.

Cheers,
Karthik
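
One common pattern, sketched below: reload the metadata RDD lazily inside
transform() once a refresh interval has elapsed, so every micro-batch joins against
the most recently loaded copy. The HDFS path, the key/value parsing and the events
stream (a DStream of objects with a key field) are all placeholders:

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD

object MetadataCache {
  private val refreshMs = 10 * 60 * 1000L            // reload every 10 minutes
  @volatile private var cached: RDD[(String, String)] = _
  @volatile private var loadedAt = 0L

  def get(sc: SparkContext): RDD[(String, String)] = synchronized {
    val now = System.currentTimeMillis()
    if (cached == null || now - loadedAt > refreshMs) {
      if (cached != null) cached.unpersist()
      cached = sc.textFile("hdfs:///path/to/metadata")  // placeholder path
        .map { line =>
          val Array(k, v) = line.split(",", 2)          // placeholder parsing
          (k, v)
        }
        .cache()
      loadedAt = now
    }
    cached
  }
}

// Join every micro-batch against the (possibly refreshed) metadata.
val enriched = events.transform { rdd =>
  rdd.map(e => (e.key, e)).join(MetadataCache.get(rdd.sparkContext))
}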


Re: Ignore features in Random Forest

2016-06-02 Thread Yuhao Yang
Hi Neha,

This looks like a feature engineering task. I think VectorSlicer can help
with your case. Please refer to
http://spark.apache.org/docs/latest/ml-features.html#vectorslicer .

Regards,
Yuhao

2016-06-01 21:18 GMT+08:00 Neha Mehta :

> Hi,
>
> I am performing Regression using Random Forest. In my input vector, I want
> the algorithm to ignore certain columns/features while training the
> classifier and also while prediction. These are basically Id columns. I
> checked the documentation and could not find any information on the same.
>
> Request help with the same.
>
> Thanks & Regards,
> Neha
>
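
A minimal Scala sketch along the lines Yuhao suggests (spark.ml VectorSlicer); the
column names and the indices of the non-Id features are placeholders:

import org.apache.spark.ml.feature.VectorSlicer

// Keep only the positions holding real features; drop the positions holding Id columns.
val slicer = new VectorSlicer()
  .setInputCol("features")
  .setOutputCol("selectedFeatures")
  .setIndices(Array(2, 3, 4, 5))        // placeholder: indices of the non-Id features

val sliced = slicer.transform(dataset)  // 'dataset' is the original DataFrame

// Then train the Random Forest on "selectedFeatures" instead of "features".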