I recently powered through this Spark + Elasticsearch integration as well.

You can see this + many other Spark integrations with the PANCAKE STACK
<http://pancake-stack.com> here: https://github.com/fluxcapacitor/pipeline

All configs can be found here:
https://github.com/fluxcapacitor/pipeline/tree/master/config

In particular, the Stanford CoreNLP + Spark ML Pipeline integration was the
most difficult, but we finally got it working with some hard-coding and
finger-crossing!


On Thu, Jun 2, 2016 at 4:09 PM, Nick Pentreath <nick.pentre...@gmail.com>
wrote:

> Fair enough.
>
> However, if you take a look at the deployment guide (
> http://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies)
> you will see that the generally advised approach is to package your app
> dependencies into a fat JAR and submit that (possibly using the --jars
> option too). This also means you specify the Scala and other library
> versions in your project pom.xml or sbt file, avoiding having to manually
> decide which artefact to include on your classpath :)
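>
> For what it's worth, here's a minimal build.sbt sketch of that setup (the
> versions are assumptions based on this thread: Spark 1.6.1 on Scala 2.10
> with es-hadoop 2.3.2, packaged by the sbt-assembly plugin):
>
>   // Spark is "provided": the cluster supplies it at runtime, so only
>   // the es-hadoop connector lands in the fat JAR that `sbt assembly`
>   // produces.
>   scalaVersion := "2.10.6"
>
>   libraryDependencies ++= Seq(
>     "org.apache.spark"  %% "spark-core"          % "1.6.1" % "provided",
>     "org.elasticsearch" %% "elasticsearch-spark" % "2.3.2"
>   )
>
> The %% operator appends the Scala binary suffix, so this resolves to
> elasticsearch-spark_2.10 and avoids exactly the 2.10/2.11 mix-up discussed
> below.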
>
> On Thu, 2 Jun 2016 at 16:06 Kevin Burton <bur...@spinn3r.com> wrote:
>
>> Yeah... thanks Nick. Figured that out since your last email... I deleted
>> the 2.10 jar by accident but then put 2+2 together.
>>
>> Got it working now.
>>
>> Still sticking to my story that it's somewhat complicated to set up :)
>>
>> Kevin
>>
>> On Thu, Jun 2, 2016 at 3:59 PM, Nick Pentreath <nick.pentre...@gmail.com>
>> wrote:
>>
>>> Which Scala version is Spark built against? I'd guess it's 2.10 since
>>> you're using spark-1.6, and you're using the 2.11 jar for es-hadoop.
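>>>
>>> A quick way to check from the shell (a sketch; the exact patch version
>>> will vary, but a Spark 1.6 build typically reports 2.10.x):
>>>
>>>   scala> scala.util.Properties.versionString
>>>   res0: String = version 2.10.5
>>>
>>> That Predef.ArrowAssoc NoSuchMethodError below is the classic symptom of
>>> mixing 2.10 and 2.11 binaries: the -> syntax desugars to that method,
>>> and its erased signature differs between the two Scala versions.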
>>>
>>>
>>> On Thu, 2 Jun 2016 at 15:50 Kevin Burton <bur...@spinn3r.com> wrote:
>>>
>>>> Thanks.
>>>>
>>>> I'm trying to run it in a standalone cluster with an existing, large
>>>> 100-node ES install.
>>>>
>>>> I'm using the standard 1.6.1-hadoop2.6 distribution with
>>>> elasticsearch-hadoop-2.3.2...
>>>>
>>>> I *think* I'm only supposed to use the
>>>> elasticsearch-spark_2.11-2.3.2.jar with it...
>>>>
>>>> but now I get the following exception:
>>>>
>>>>
>>>> java.lang.NoSuchMethodError:
>>>> scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>>>> at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:52)
>>>> at
>>>> org.elasticsearch.spark.package$SparkRDDFunctions.saveToEs(package.scala:37)
>>>> at
>>>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40)
>>>> at
>>>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
>>>> at
>>>> $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
>>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
>>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
>>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
>>>> at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
>>>> at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
>>>> at $iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
>>>> at $iwC$$iwC$$iwC.<init>(<console>:61)
>>>> at $iwC$$iwC.<init>(<console>:63)
>>>> at $iwC.<init>(<console>:65)
>>>> at <init>(<console>:67)
>>>> at .<init>(<console>:71)
>>>> at .<clinit>(<console>)
>>>> at .<init>(<console>:7)
>>>> at .<clinit>(<console>)
>>>> at $print(<console>)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>>> at
>>>> org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>>>> at
>>>> org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>>>> at
>>>> org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>>>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>>>> at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>>>> at
>>>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>>>> at
>>>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>>>> at
>>>> org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
>>>> at
>>>> org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>>>> at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>>>> at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>>>> at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>>>> at
>>>> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>>>> at
>>>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>>>> at
>>>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>>>> at
>>>> org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>>>> at
>>>> scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>>>> at
>>>> org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
>>>> at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
>>>> at org.apache.spark.repl.Main$.main(Main.scala:31)
>>>> at org.apache.spark.repl.Main.main(Main.scala)
>>>> at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>> at
>>>> sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>> at
>>>> sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>> at java.lang.reflect.Method.invoke(Method.java:497)
>>>> at
>>>> org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>>>> at
>>>> org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>>>> at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>>>> at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>>>> at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
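>>>>
>>>> (For context, the call that triggers this is the connector's standard
>>>> saveToEs; roughly the following in the spark-shell, where the index/type
>>>> name is a made-up placeholder:
>>>>
>>>>   import org.elasticsearch.spark._   // adds saveToEs to RDDs
>>>>
>>>>   val rdd = sc.makeRDD(Seq(Map("title" -> "hello")))
>>>>   rdd.saveToEs("test/docs")          // hypothetical index/type
>>>>
>>>> ... and it dies at the first -> inside EsSpark.saveToEs.)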
>>>>
>>>>
>>>> On Thu, Jun 2, 2016 at 3:45 PM, Nick Pentreath <
>>>> nick.pentre...@gmail.com> wrote:
>>>>
>>>>> Hey there
>>>>>
>>>>> When I used es-hadoop, I just pulled the dependency into my pom.xml,
>>>>> with spark as a "provided" dependency, and built a fat jar with
>>>>> assembly.
>>>>>
>>>>> Then, with spark-submit, use the --jars option to include your assembly
>>>>> jar (IIRC I sometimes also needed --driver-classpath too, but perhaps
>>>>> not with recent Spark versions).
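>>>>>
>>>>> Something like this (a sketch; the paths and class name are
>>>>> placeholders):
>>>>>
>>>>>   spark-submit \
>>>>>     --class com.example.MyEsJob \
>>>>>     --jars /path/to/es-deps-assembly.jar \
>>>>>     /path/to/my-app.jar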
>>>>>
>>>>>
>>>>>
>>>>> On Thu, 2 Jun 2016 at 15:34 Kevin Burton <bur...@spinn3r.com> wrote:
>>>>>
>>>>>> I'm trying to get Spark 1.6.1 to work with es-hadoop 2.3.2... needless
>>>>>> to say it's not super easy.
>>>>>>
>>>>>> I wish there was an easier way to get this stuff to work. The last time
>>>>>> I tried to use Spark, I ran into similar problems with classpath setup
>>>>>> and Cassandra.
>>>>>>
>>>>>> Seems like a huge opportunity to make this easier for new developers.
>>>>>> This stuff isn't rocket science, but it can (needlessly) waste a ton of
>>>>>> time.
>>>>>>
>>>>>> ... anyway... I have since figured out that I have to pick *specific*
>>>>>> jars from the elasticsearch-hadoop distribution and use those.
>>>>>>
>>>>>> Right now I'm using :
>>>>>>
>>>>>>
>>>>>> SPARK_CLASSPATH=/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar:/usr/share/apache-spark/lib/*
>>>>>>
>>>>>> ... but I'm getting:
>>>>>>
>>>>>> java.lang.NoClassDefFoundError: Could not initialize class
>>>>>> org.elasticsearch.hadoop.util.Version
>>>>>> at
>>>>>> org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376)
>>>>>> at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
>>>>>> at
>>>>>> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
>>>>>> at
>>>>>> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
>>>>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>>>>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>>>> at
>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>>>>> at
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>> at
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>
>>>>>> ... but I think it's caused by this:
>>>>>>
>>>>>> 16/06/03 00:26:48 WARN TaskSetManager: Lost task 0.0 in stage 0.0
>>>>>> (TID 0, localhost): java.lang.Error: Multiple ES-Hadoop versions detected
>>>>>> in the classpath; please use only one
>>>>>>
>>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar
>>>>>>
>>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar
>>>>>>
>>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar
>>>>>>
>>>>>> at org.elasticsearch.hadoop.util.Version.<clinit>(Version.java:73)
>>>>>> at
>>>>>> org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376)
>>>>>> at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
>>>>>> at
>>>>>> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
>>>>>> at
>>>>>> org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
>>>>>> at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>>>>> at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>>>> at
>>>>>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>>>>> at
>>>>>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>> at
>>>>>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>> at java.lang.Thread.run(Thread.java:745)
>>>>>>
>>>>>> ... still tracking this down, but was wondering if there is something
>>>>>> obvious I'm doing wrong.  I'm going to take out
>>>>>> elasticsearch-hadoop-2.3.2.jar and try again.
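>>>>>>
>>>>>> e.g. trimming the setting above down to a single connector jar (a
>>>>>> sketch, keeping the same paths):
>>>>>>
>>>>>> SPARK_CLASSPATH=/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar:/usr/share/apache-spark/lib/*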
>>>>>>
>>>>>> Lots of trial and error here :-/
>>>>>>
>>>>>> Kevin
>>>>>>
>>>>>> --
>>>>>>
>>>>>> We’re hiring if you know of any awesome Java Devops or Linux
>>>>>> Operations Engineers!
>>>>>>
>>>>>> Founder/CEO Spinn3r.com
>>>>>> Location: *San Francisco, CA*
>>>>>> blog: http://burtonator.wordpress.com
>>>>>> … or check out my Google+ profile
>>>>>> <https://plus.google.com/102718274791889610666/posts>
>>>>>>
>>>>>>
>>>>
>>>>
>>
>>


-- 
*Chris Fregly*
Research Scientist @ PipelineIO
San Francisco, CA
http://pipeline.io
