I recently powered through this Spark + Elasticsearch integration as well.
You can see this + many other Spark integrations with the PANCAKE STACK <http://pancake-stack.com> here: https://github.com/fluxcapacitor/pipeline

All configs are found here: https://github.com/fluxcapacitor/pipeline/tree/master/config

In particular, the Stanford CoreNLP + Spark ML Pipeline integration was the most difficult, but we finally got it working with some hard-coding and finger-crossing!

On Thu, Jun 2, 2016 at 4:09 PM, Nick Pentreath <nick.pentre...@gmail.com> wrote:

> Fair enough.
>
> However, if you take a look at the deployment guide (
> http://spark.apache.org/docs/latest/submitting-applications.html#bundling-your-applications-dependencies)
> you will see that the generally advised approach is to package your app
> dependencies into a fat JAR and submit (possibly using the --jars option
> too). This also means you specify the Scala and other library versions in
> your project pom.xml or sbt file, avoiding having to manually decide which
> artefact to include on your classpath :)
>
> On Thu, 2 Jun 2016 at 16:06 Kevin Burton <bur...@spinn3r.com> wrote:
>
>> Yeah.. thanks Nick. Figured that out since your last email... I deleted
>> the 2.10 by accident but then put 2+2 together.
>>
>> Got it working now.
>>
>> Still sticking to my story that it's somewhat complicated to set up :)
>>
>> Kevin
>>
>> On Thu, Jun 2, 2016 at 3:59 PM, Nick Pentreath <nick.pentre...@gmail.com>
>> wrote:
>>
>>> Which Scala version is Spark built against? I'd guess it's 2.10 since
>>> you're using spark-1.6, and you're using the 2.11 jar for es-hadoop.
>>>
>>> On Thu, 2 Jun 2016 at 15:50 Kevin Burton <bur...@spinn3r.com> wrote:
>>>
>>>> Thanks.
>>>>
>>>> I'm trying to run it in a standalone cluster with an existing / large
>>>> 100-node ES install.
>>>>
>>>> I'm using the standard 1.6.1-2.6 distribution with
>>>> elasticsearch-hadoop-2.3.2...
>>>>
>>>> I *think* I'm only supposed to use the
>>>> elasticsearch-spark_2.11-2.3.2.jar with it...
>>>>
>>>> but now I get the following exception:
>>>>
>>>> java.lang.NoSuchMethodError: scala.Predef$.ArrowAssoc(Ljava/lang/Object;)Ljava/lang/Object;
>>>>   at org.elasticsearch.spark.rdd.EsSpark$.saveToEs(EsSpark.scala:52)
>>>>   at org.elasticsearch.spark.package$SparkRDDFunctions.saveToEs(package.scala:37)
>>>>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:40)
>>>>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:45)
>>>>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
>>>>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
>>>>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
>>>>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
>>>>   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
>>>>   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
>>>>   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:59)
>>>>   at $iwC$$iwC$$iwC.<init>(<console>:61)
>>>>   at $iwC$$iwC.<init>(<console>:63)
>>>>   at $iwC.<init>(<console>:65)
>>>>   at <init>(<console>:67)
>>>>   at .<init>(<console>:71)
>>>>   at .<clinit>(<console>)
>>>>   at .<init>(<console>:7)
>>>>   at .<clinit>(<console>)
>>>>   at $print(<console>)
>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>   at java.lang.reflect.Method.invoke(Method.java:497)
>>>>   at org.apache.spark.repl.SparkIMain$ReadEvalPrint.call(SparkIMain.scala:1065)
>>>>   at org.apache.spark.repl.SparkIMain$Request.loadAndRun(SparkIMain.scala:1346)
>>>>   at org.apache.spark.repl.SparkIMain.loadAndRunReq$1(SparkIMain.scala:840)
>>>>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:871)
>>>>   at org.apache.spark.repl.SparkIMain.interpret(SparkIMain.scala:819)
>>>>   at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:857)
>>>>   at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>>>>   at org.apache.spark.repl.SparkILoop.reallyInterpret$1(SparkILoop.scala:875)
>>>>   at org.apache.spark.repl.SparkILoop.interpretStartingWith(SparkILoop.scala:902)
>>>>   at org.apache.spark.repl.SparkILoop.command(SparkILoop.scala:814)
>>>>   at org.apache.spark.repl.SparkILoop.processLine$1(SparkILoop.scala:657)
>>>>   at org.apache.spark.repl.SparkILoop.innerLoop$1(SparkILoop.scala:665)
>>>>   at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$loop(SparkILoop.scala:670)
>>>>   at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply$mcZ$sp(SparkILoop.scala:997)
>>>>   at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>>>>   at org.apache.spark.repl.SparkILoop$$anonfun$org$apache$spark$repl$SparkILoop$$process$1.apply(SparkILoop.scala:945)
>>>>   at scala.tools.nsc.util.ScalaClassLoader$.savingContextLoader(ScalaClassLoader.scala:135)
>>>>   at org.apache.spark.repl.SparkILoop.org$apache$spark$repl$SparkILoop$$process(SparkILoop.scala:945)
>>>>   at org.apache.spark.repl.SparkILoop.process(SparkILoop.scala:1059)
>>>>   at org.apache.spark.repl.Main$.main(Main.scala:31)
>>>>   at org.apache.spark.repl.Main.main(Main.scala)
>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>>>>   at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>>>>   at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>>>>   at java.lang.reflect.Method.invoke(Method.java:497)
>>>>   at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
>>>>   at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
>>>>   at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
>>>>   at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
>>>>   at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
>>>>
>>>> On Thu, Jun 2, 2016 at 3:45 PM, Nick Pentreath <nick.pentre...@gmail.com>
>>>> wrote:
>>>>
>>>>> Hey there
>>>>>
>>>>> When I used es-hadoop, I just pulled the dependency into my
>>>>> pom.xml, with Spark as a "provided" dependency, and built a fat jar with
>>>>> assembly.
>>>>>
>>>>> Then with spark-submit, use the --jars option to include your assembly
>>>>> jar (IIRC I sometimes also needed to use --driver-classpath too, but
>>>>> perhaps not with recent Spark versions).
>>>>>
>>>>> On Thu, 2 Jun 2016 at 15:34 Kevin Burton <bur...@spinn3r.com> wrote:
>>>>>
>>>>>> I'm trying to get spark 1.6.1 to work with 2.3.2... needless to say
>>>>>> it's not super easy.
>>>>>>
>>>>>> I wish there was an easier way to get this stuff to work. Last time
>>>>>> I tried to use Spark more, I was having similar problems with classpath
>>>>>> setup and Cassandra.
>>>>>>
>>>>>> Seems a huge opportunity to make this easier for new developers.
>>>>>> This stuff isn't rocket science, but it can (needlessly) waste a ton of
>>>>>> time.
>>>>>>
>>>>>> ... anyway... I have since figured out that I have to pick *specific*
>>>>>> jars from the elasticsearch-hadoop distribution and use those.
>>>>>>
>>>>>> Right now I'm using:
>>>>>>
>>>>>> SPARK_CLASSPATH=/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar:/usr/share/apache-spark/lib/*
>>>>>>
>>>>>> ...
>>>>>> but I'm getting:
>>>>>>
>>>>>> java.lang.NoClassDefFoundError: Could not initialize class org.elasticsearch.hadoop.util.Version
>>>>>>   at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376)
>>>>>>   at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
>>>>>>   at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
>>>>>>   at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
>>>>>>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>>>>>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>>>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>
>>>>>> ... but I think it's caused by this:
>>>>>>
>>>>>> 16/06/03 00:26:48 WARN TaskSetManager: Lost task 0.0 in stage 0.0
>>>>>> (TID 0, localhost): java.lang.Error: Multiple ES-Hadoop versions detected
>>>>>> in the classpath; please use only one
>>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-2.3.2.jar
>>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.11-2.3.2.jar
>>>>>> jar:file:/usr/share/elasticsearch-hadoop/lib/elasticsearch-hadoop-mr-2.3.2.jar
>>>>>>   at org.elasticsearch.hadoop.util.Version.<clinit>(Version.java:73)
>>>>>>   at org.elasticsearch.hadoop.rest.RestService.createWriter(RestService.java:376)
>>>>>>   at org.elasticsearch.spark.rdd.EsRDDWriter.write(EsRDDWriter.scala:40)
>>>>>>   at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
>>>>>>   at org.elasticsearch.spark.rdd.EsSpark$$anonfun$saveToEs$1.apply(EsSpark.scala:67)
>>>>>>   at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>>>>>>   at org.apache.spark.scheduler.Task.run(Task.scala:89)
>>>>>>   at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>>>>>>   at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>>>>>>   at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>>>>>>   at java.lang.Thread.run(Thread.java:745)
>>>>>>
>>>>>> ... still tracking this down, but was wondering if there is something
>>>>>> obvious I'm doing wrong. I'm going to take out
>>>>>> elasticsearch-hadoop-2.3.2.jar and try again.
>>>>>>
>>>>>> Lots of trial and error here :-/
>>>>>>
>>>>>> Kevin
>>>>>>
>>>>>> --
>>>>>>
>>>>>> We’re hiring if you know of any awesome Java Devops or Linux
>>>>>> Operations Engineers!
>>>>>>
>>>>>> Founder/CEO Spinn3r.com
>>>>>> Location: *San Francisco, CA*
>>>>>> blog: http://burtonator.wordpress.com
>>>>>> … or check out my Google+ profile
>>>>>> <https://plus.google.com/102718274791889610666/posts>

--
*Chris Fregly*
Research Scientist @ PipelineIO
San Francisco, CA
http://pipeline.io
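For anyone who finds this thread later: the fat-JAR route Nick describes above might look roughly like this in sbt. This is only a sketch, and the coordinates are assumptions based on the versions mentioned in the thread (the stock Spark 1.6.1 download, which is built against Scala 2.10, and es-hadoop 2.3.2):

```scala
// build.sbt (sketch): bundle the ES connector into your app jar, and mark
// Spark itself "provided" so spark-submit supplies it at runtime instead.
scalaVersion := "2.10.6" // must match the Scala version Spark was built against

libraryDependencies ++= Seq(
  "org.apache.spark"  %% "spark-core"          % "1.6.1" % "provided",
  // %% appends _2.10 automatically, selecting the matching cross-built artifact
  "org.elasticsearch" %% "elasticsearch-spark" % "2.3.2"
)
```

With the sbt-assembly plugin, `sbt assembly` then yields a single jar to pass via `spark-submit --jars`, which sidesteps the manual SPARK_CLASSPATH bookkeeping shown earlier in the thread.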
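Alternatively, staying with Kevin's classpath approach, the two errors quoted above suggest the same pair of fixes: put exactly one ES-Hadoop artifact on the classpath (the "Multiple ES-Hadoop versions detected" error lists all three jars precisely because each one contains the ES-Hadoop core classes), and pick the jar matching Spark's Scala version, which is 2.10 for the stock 1.6.1 build. A sketch, assuming the distribution's lib/ directory also ships a _2.10 jar alongside the _2.11 one:

```shell
# One ES-Hadoop jar only, and the _2.10 build to match Spark 1.6.1's Scala:
spark-shell --jars /usr/share/elasticsearch-hadoop/lib/elasticsearch-spark_2.10-2.3.2.jar
```

You can confirm which Scala version a Spark build ships with by checking the scala-library jar under its lib/ directory, or by evaluating `util.Properties.versionString` inside spark-shell.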