The error indicates a network problem - namely es-hadoop cannot connect to Elasticsearch on the default HTTP
endpoint (localhost:9200). Can you double-check whether that's indeed the case (using curl or even telnet on
that port)? Maybe the firewall prevents any connections from being made...
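A quick connectivity probe along these lines can rule the firewall in or out (localhost:9200 is the default; adjust host and port if your cluster is configured differently):

```shell
# Probe the default Elasticsearch HTTP endpoint.
# -s: silent, -m 5: five-second timeout so a firewall drop fails fast
# instead of hanging.
if curl -s -m 5 "http://localhost:9200/" >/dev/null 2>&1; then
  status="reachable"
else
  status="unreachable - check that Elasticsearch is bound to this interface and that no firewall blocks the port"
fi
echo "Elasticsearch on localhost:9200 is ${status}"
```

If the endpoint is up, the same curl without `>/dev/null` should print the cluster's JSON banner (name, version, etc.).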
Also, you could try using the latest Hive (0.12) and a more recent Hadoop, such as
1.1.2 or 1.2.1.
Additionally, can you enable TRACE logging in your job for the es-hadoop packages org.elasticsearch.hadoop.rest and
org.elasticsearch.hadoop.mr and report back?
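For reference, a sketch of what that looks like in log4j.properties (assuming your job picks up the standard Hadoop/Hive log4j configuration; adjust if you wire logging differently):

```properties
# Raise es-hadoop REST and MapReduce integration logging to TRACE
log4j.logger.org.elasticsearch.hadoop.rest=TRACE
log4j.logger.org.elasticsearch.hadoop.mr=TRACE
```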
Thanks,
On 19/02/2014 4:03 AM, Max Lang wrote:
I set everything up using this guide:
https://github.com/amplab/shark/wiki/Running-Shark-on-EC2 on an ec2 cluster.
I've
copied the elasticsearch-hadoop jars into the hive lib directory and I have
elasticsearch running on localhost:9200. I'm
running shark in a screen session with --service screenserver and connecting to
it at the same time using shark -h
localhost.
Unfortunately, when I attempt to write data into elasticsearch, it fails.
Here's an example:
[localhost:10000] shark> CREATE EXTERNAL TABLE wiki (id BIGINT, title STRING, last_modified STRING, xml STRING,
text STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://spark-data/wikipedia-sample/';
Time taken (including network latency): 0.159 seconds
14/02/19 01:23:33 INFO CliDriver: Time taken (including network latency): 0.159 seconds
[localhost:10000] shark> SELECT title FROM wiki LIMIT 1;
Alpokalja
Time taken (including network latency): 2.23 seconds
14/02/19 01:23:48 INFO CliDriver: Time taken (including network latency): 2.23 seconds
[localhost:10000] shark> CREATE EXTERNAL TABLE es_wiki (id BIGINT, title STRING, last_modified STRING, xml STRING,
text STRING) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource'='wikipedia/article');
Time taken (including network latency): 0.061 seconds
14/02/19 01:33:51 INFO CliDriver: Time taken (including network latency): 0.061 seconds
[localhost:10000] shark> INSERT OVERWRITE TABLE es_wiki SELECT w.id, w.title, w.last_modified, w.xml, w.text FROM wiki w;
[Hive Error]: Query returned non-zero code: 9, cause: FAILED: Execution Error, return code -101 from shark.execution.SparkTask
Time taken (including network latency): 3.575 seconds
14/02/19 01:34:42 INFO CliDriver: Time taken (including network latency): 3.575 seconds
*The stack trace looks like this:*
org.apache.hadoop.hive.ql.metadata.HiveException
(org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException:
Out of nodes and retries; caught exception)
  org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:602)
  shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:84)
  shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:81)
  scala.collection.Iterator$class.foreach(Iterator.scala:772)
  scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
  shark.execution.FileSinkOperator.processPartition(FileSinkOperator.scala:81)
  shark.execution.FileSinkOperator$.writeFiles$1(FileSinkOperator.scala:207)
  shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:211)
  shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:211)
  org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:107)
  org.apache.spark.scheduler.Task.run(Task.scala:53)
  org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:215)
  org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:50)
  org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
  java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
  java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
  java.lang.Thread.run(Thread.java:744)
I should be using Hive 0.9.0, Shark 0.8.1, Elasticsearch 1.0.0, Hadoop 1.0.4,
and Java 1.7.0_51.
Based on my cursory look at the Hadoop and elasticsearch-hadoop sources, it
looks like Hive is just rethrowing an IOException it's getting from Spark,
and elasticsearch-hadoop is just hitting those exceptions.
I suppose my questions are: Does this look like an issue with my
ES/elasticsearch-hadoop config? And has anyone gotten
elasticsearch working with Spark/Shark?
Any ideas/insights are appreciated.
Thanks,
Max
--
You received this message because you are subscribed to the Google Groups
"elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email
to
elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit
https://groups.google.com/d/msgid/elasticsearch/9486faff-3eaf-4344-8931-3121bbc5d9c7%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
--
Costin