[hadoop] Getting elasticsearch-hadoop working with Shark

Max Lang Tue, 18 Feb 2014 18:03:26 -0800

I set everything up on an EC2 cluster following this guide:
https://github.com/amplab/shark/wiki/Running-Shark-on-EC2
I've copied the elasticsearch-hadoop jars into the Hive lib directory, and
Elasticsearch is running on localhost:9200. I'm running Shark in a screen
session with --service sharkserver and connecting to it at the same time
using shark -h localhost.


Unfortunately, when I attempt to write data into elasticsearch, it fails. 
Here's an example:

[localhost:10000] shark> CREATE EXTERNAL TABLE wiki (id BIGINT, title STRING,
last_modified STRING, xml STRING, text STRING) ROW FORMAT DELIMITED
FIELDS TERMINATED BY '\t' LOCATION 's3n://spark-data/wikipedia-sample/';
Time taken (including network latency): 0.159 seconds
14/02/19 01:23:33 INFO CliDriver: Time taken (including network latency): 0.159 seconds

[localhost:10000] shark> SELECT title FROM wiki LIMIT 1; 
Alpokalja 
Time taken (including network latency): 2.23 seconds 
14/02/19 01:23:48 INFO CliDriver: Time taken (including network latency): 2.23 seconds

[localhost:10000] shark> CREATE EXTERNAL TABLE es_wiki (id BIGINT, title STRING,
last_modified STRING, xml STRING, text STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'wikipedia/article');
Time taken (including network latency): 0.061 seconds
14/02/19 01:33:51 INFO CliDriver: Time taken (including network latency): 0.061 seconds

[localhost:10000] shark> INSERT OVERWRITE TABLE es_wiki SELECT w.id, w.title,
w.last_modified, w.xml, w.text FROM wiki w;
[Hive Error]: Query returned non-zero code: 9, cause: FAILED: Execution Error, return code -101 from shark.execution.SparkTask
Time taken (including network latency): 3.575 seconds
14/02/19 01:34:42 INFO CliDriver: Time taken (including network latency): 3.575 seconds

The stack trace looks like this:

org.apache.hadoop.hive.ql.metadata.HiveException (org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Out of nodes and retries; caught exception)

org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:602)
shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:84)
shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:81)
scala.collection.Iterator$class.foreach(Iterator.scala:772)
scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
shark.execution.FileSinkOperator.processPartition(FileSinkOperator.scala:81)
shark.execution.FileSinkOperator$.writeFiles$1(FileSinkOperator.scala:207)
shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:211)
shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:211)
org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:107)
org.apache.spark.scheduler.Task.run(Task.scala:53)
org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:215)
org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:50)
org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
java.lang.Thread.run(Thread.java:744)

I should be using Hive 0.9.0, Shark 0.8.1, Elasticsearch 1.0.0, Hadoop
1.0.4, and Java 1.7.0_51.

Based on my cursory look at the Hadoop and elasticsearch-hadoop sources, it
looks like Hive is just rethrowing an IOException it's getting from Spark,
and elasticsearch-hadoop is just hitting those exceptions.
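
The "Out of nodes and retries" message may indicate that the process doing
the write could not open a connection to any Elasticsearch node. One
possibility, which the trace does not confirm, is that the Spark tasks run
on worker machines where "localhost" refers to the worker itself rather than
to the host running Elasticsearch. A minimal sketch for checking TCP
reachability from a given machine, using only the Python standard library;
the function name es_reachable is hypothetical and not part of any
Elasticsearch tooling:

```python
import socket


def es_reachable(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


if __name__ == "__main__":
    # localhost:9200 is the address mentioned in the setup description.
    print("localhost:9200 reachable:", es_reachable("localhost", 9200))
```

Running this on each worker would only show whether the port accepts
connections from that machine; it says nothing about whether
elasticsearch-hadoop is configured to use that address.
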

I suppose my questions are: does this look like an issue with my
Elasticsearch/elasticsearch-hadoop config? And has anyone gotten
Elasticsearch working with Spark/Shark?

Any ideas/insights are appreciated.

Thanks,
Max
