I managed to get it working on EC2 without issue this time. The biggest difference was that this time I set up a dedicated ES machine. Is it possible that, because I was using a cluster with slaves, the slaves couldn't reach the ES instance on the master when I used "localhost"? Or do all the requests go through the master?
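In case it helps the next reader: as far as I can tell from the es-hadoop docs, each Hive/Shark task opens its own HTTP connection to whatever `es.nodes` points at (it defaults to localhost:9200), so on a multi-node cluster each slave would have been dialing its own localhost rather than the master. A minimal sketch of pinning the table from this thread to the dedicated ES machine; the hostname is a placeholder, not something from the thread:

```sql
-- Same es_wiki table as in the thread below, but with es.nodes set
-- explicitly so every worker targets the dedicated ES host instead of
-- its own localhost. 'es-host.internal' is an example hostname.
CREATE EXTERNAL TABLE es_wiki (id BIGINT, title STRING, last_modified STRING,
                               xml STRING, text STRING)
STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler'
TBLPROPERTIES('es.resource' = 'wikipedia/article',
              'es.nodes'    = 'es-host.internal:9200');
```

This is a DDL/config fragment and needs a live Hive + Elasticsearch cluster to run.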
On Wednesday, February 19, 2014 2:35:40 PM UTC-8, Costin Leau wrote:
> Hi,
>
> Setting logging in Hive/Hadoop can be tricky, since the log4j settings need to be picked up by the running JVM; otherwise you won't see anything. Take a look at this link on how to tell Hive to use your logging settings [1].
>
> For the next release, we might introduce dedicated exceptions, for the simple fact that some libraries, like Hive, swallow the stack trace, and it's unclear what the issue is, which makes the exception (IllegalStateException) ambiguous.
>
> Let me know how it goes and whether you encounter any issues with Shark. Or if you don't :)
>
> Thanks!
>
> [1] https://cwiki.apache.org/confluence/display/Hive/GettingStarted#GettingStarted-ErrorLogs
>
> On 20/02/2014 12:02 AM, Max Lang wrote:
> > Hey Costin,
> >
> > Thanks for the swift reply. I abandoned EC2 to take that out of the equation and managed to get everything working locally using the latest version of everything (though I realized just now I'm still on Hive 0.9). I'm guessing you're right about some port connection issue, because I definitely had ES running on that machine.
> >
> > I changed hive-log4j.properties and added:
> >
> >     #custom logging levels
> >     #log4j.logger.xxx=DEBUG
> >     log4j.logger.org.elasticsearch.hadoop.rest=TRACE
> >     log4j.logger.org.elasticsearch.hadoop.mr=TRACE
> >
> > But I didn't see any trace logging. Hopefully I can get it working on EC2 without issue, but, for the future, is this the correct way to set TRACE logging?
> > Oh and, for reference, I tried running without ES up and I got the following exceptions:
> >
> >     2014-02-19 13:46:08,803 ERROR shark.SharkDriver (Logging.scala:logError(64)) - FAILED: Hive Internal Error: java.lang.IllegalStateException(Cannot discover Elasticsearch version)
> >     java.lang.IllegalStateException: Cannot discover Elasticsearch version
> >         at org.elasticsearch.hadoop.hive.EsStorageHandler.init(EsStorageHandler.java:101)
> >         at org.elasticsearch.hadoop.hive.EsStorageHandler.configureOutputJobProperties(EsStorageHandler.java:83)
> >         at org.apache.hadoop.hive.ql.plan.PlanUtils.configureJobPropertiesForStorageHandler(PlanUtils.java:706)
> >         at org.apache.hadoop.hive.ql.plan.PlanUtils.configureOutputJobPropertiesForStorageHandler(PlanUtils.java:675)
> >         at org.apache.hadoop.hive.ql.exec.FileSinkOperator.augmentPlan(FileSinkOperator.java:764)
> >         at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.putOpInsertMap(SemanticAnalyzer.java:1518)
> >         at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genFileSinkPlan(SemanticAnalyzer.java:4337)
> >         at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPostGroupByBodyPlan(SemanticAnalyzer.java:6207)
> >         at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genBodyPlan(SemanticAnalyzer.java:6138)
> >         at org.apache.hadoop.hive.ql.parse.SemanticAnalyzer.genPlan(SemanticAnalyzer.java:6764)
> >         at shark.parse.SharkSemanticAnalyzer.analyzeInternal(SharkSemanticAnalyzer.scala:149)
> >         at org.apache.hadoop.hive.ql.parse.BaseSemanticAnalyzer.analyze(BaseSemanticAnalyzer.java:244)
> >         at shark.SharkDriver.compile(SharkDriver.scala:215)
> >         at org.apache.hadoop.hive.ql.Driver.compile(Driver.java:336)
> >         at org.apache.hadoop.hive.ql.Driver.run(Driver.java:895)
> >         at shark.SharkCliDriver.processCmd(SharkCliDriver.scala:324)
> >         at org.apache.hadoop.hive.cli.CliDriver.processLine(CliDriver.java:406)
> >         at shark.SharkCliDriver$.main(SharkCliDriver.scala:232)
> >         at shark.SharkCliDriver.main(SharkCliDriver.scala)
> >     Caused by: java.io.IOException: Out of nodes and retries; caught exception
> >         at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:81)
> >         at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:221)
> >         at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:205)
> >         at org.elasticsearch.hadoop.rest.RestClient.execute(RestClient.java:209)
> >         at org.elasticsearch.hadoop.rest.RestClient.get(RestClient.java:103)
> >         at org.elasticsearch.hadoop.rest.RestClient.esVersion(RestClient.java:274)
> >         at org.elasticsearch.hadoop.rest.InitializationUtils.discoverEsVersion(InitializationUtils.java:84)
> >         at org.elasticsearch.hadoop.hive.EsStorageHandler.init(EsStorageHandler.java:99)
> >         ... 18 more
> >     Caused by: java.net.ConnectException: Connection refused
> >         at java.net.PlainSocketImpl.socketConnect(Native Method)
> >         at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
> >         at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
> >         at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
> >         at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:391)
> >         at java.net.Socket.connect(Socket.java:579)
> >         at java.net.Socket.connect(Socket.java:528)
> >         at java.net.Socket.<init>(Socket.java:425)
> >         at java.net.Socket.<init>(Socket.java:280)
> >         at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:80)
> >         at org.apache.commons.httpclient.protocol.DefaultProtocolSocketFactory.createSocket(DefaultProtocolSocketFactory.java:122)
> >         at org.apache.commons.httpclient.HttpConnection.open(HttpConnection.java:707)
> >         at org.apache.commons.httpclient.HttpMethodDirector.executeWithRetry(HttpMethodDirector.java:387)
> >         at org.apache.commons.httpclient.HttpMethodDirector.executeMethod(HttpMethodDirector.java:171)
> >         at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:397)
> >         at org.apache.commons.httpclient.HttpClient.executeMethod(HttpClient.java:323)
> >         at org.elasticsearch.hadoop.rest.commonshttp.CommonsHttpTransport.execute(CommonsHttpTransport.java:160)
> >         at org.elasticsearch.hadoop.rest.NetworkClient.execute(NetworkClient.java:74)
> >         ... 25 more
> >
> > Let me know if there's anything in particular you'd like me to try on EC2.
> >
> > (For posterity, the versions I used were: hadoop 2.2.0, hive 0.9.0, shark 8.1, spark 8.1, es-hadoop 1.3.0.M2, java 1.7.0_15, scala 2.9.3, elasticsearch 1.0.0)
> >
> > Thanks again,
> > Max
> >
> > On Tuesday, February 18, 2014 10:16:38 PM UTC-8, Costin Leau wrote:
> > > The error indicates a network error: namely, es-hadoop cannot connect to Elasticsearch on the default (localhost:9200) HTTP port. Can you double-check whether that's indeed the case (using curl or even telnet on that port)? Maybe the firewall prevents any connections from being made...
> > > Also, you could try using the latest Hive, 0.12, and a more recent Hadoop such as 1.1.2 or 1.2.1.
> > >
> > > Additionally, can you enable TRACE logging in your job on the es-hadoop packages org.elasticsearch.hadoop.rest and org.elasticsearch.hadoop.mr and report back?
> > >
> > > Thanks,
> > >
> > > On 19/02/2014 4:03 AM, Max Lang wrote:
> > > > I set everything up on an EC2 cluster using this guide: https://github.com/amplab/shark/wiki/Running-Shark-on-EC2. I've copied the elasticsearch-hadoop jars into the Hive lib directory and I have Elasticsearch running on localhost:9200. I'm running Shark in a screen session with --service screenserver and connecting to it at the same time using shark -h localhost.
> > > > Unfortunately, when I attempt to write data into Elasticsearch, it fails. Here's an example:
> > > >
> > > >     [localhost:10000] shark> CREATE EXTERNAL TABLE wiki (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING) ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t' LOCATION 's3n://spark-data/wikipedia-sample/';
> > > >     Time taken (including network latency): 0.159 seconds
> > > >     14/02/19 01:23:33 INFO CliDriver: Time taken (including network latency): 0.159 seconds
> > > >
> > > >     [localhost:10000] shark> SELECT title FROM wiki LIMIT 1;
> > > >     Alpokalja
> > > >     Time taken (including network latency): 2.23 seconds
> > > >     14/02/19 01:23:48 INFO CliDriver: Time taken (including network latency): 2.23 seconds
> > > >
> > > >     [localhost:10000] shark> CREATE EXTERNAL TABLE es_wiki (id BIGINT, title STRING, last_modified STRING, xml STRING, text STRING) STORED BY 'org.elasticsearch.hadoop.hive.EsStorageHandler' TBLPROPERTIES('es.resource' = 'wikipedia/article');
> > > >     Time taken (including network latency): 0.061 seconds
> > > >     14/02/19 01:33:51 INFO CliDriver: Time taken (including network latency): 0.061 seconds
> > > >
> > > >     [localhost:10000] shark> INSERT OVERWRITE TABLE es_wiki SELECT w.id, w.title, w.last_modified, w.xml, w.text FROM wiki w;
> > > >     [Hive Error]: Query returned non-zero code: 9, cause: FAILED: Execution Error, return code -101 from shark.execution.SparkTask
> > > >     Time taken (including network latency): 3.575 seconds
> > > >     14/02/19 01:34:42 INFO CliDriver: Time taken (including network latency): 3.575 seconds
> > > >
> > > > The stack trace looks like this:
> > > >
> > > >     org.apache.hadoop.hive.ql.metadata.HiveException (org.apache.hadoop.hive.ql.metadata.HiveException: java.io.IOException: Out of nodes and retries; caught exception)
> > > >         at org.apache.hadoop.hive.ql.exec.FileSinkOperator.processOp(FileSinkOperator.java:602)
> > > >         at shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:84)
> > > >         at shark.execution.FileSinkOperator$$anonfun$processPartition$1.apply(FileSinkOperator.scala:81)
> > > >         at scala.collection.Iterator$class.foreach(Iterator.scala:772)
> > > >         at scala.collection.Iterator$$anon$19.foreach(Iterator.scala:399)
> > > >         at shark.execution.FileSinkOperator.processPartition(FileSinkOperator.scala:81)
> > > >         at shark.execution.FileSinkOperator$.writeFiles$1(FileSinkOperator.scala:207)
> > > >         at shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:211)
> > > >         at shark.execution.FileSinkOperator$$anonfun$executeProcessFileSinkPartition$1.apply(FileSinkOperator.scala:211)
> > > >         at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:107)
> > > >         at org.apache.spark.scheduler.Task.run(Task.scala:53)
> > > >         at org.apache.spark.executor.Executor$TaskRunner$$anonfun$run$1.apply$mcV$sp(Executor.scala:215)
> > > >         at org.apache.spark.deploy.SparkHadoopUtil.runAsUser(SparkHadoopUtil.scala:50)
> > > >         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:182)
> > > >         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> > > >         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> > > >         at java.lang.Thread.run(Thread.java:744)
> > > >
> > > > I should be using Hive 0.9.0, Shark 0.8.1, Elasticsearch 1.0.0, Hadoop 1.0.4, and Java 1.7.0_51.
> > > > Based on my cursory look at the hadoop and elasticsearch-hadoop sources, it looks like Hive is just rethrowing an IOException it's getting from Spark, and elasticsearch-hadoop is just hitting those exceptions.
> > > > I suppose my questions are: does this look like an issue with my ES/elasticsearch-hadoop config? And has anyone gotten Elasticsearch working with Spark/Shark?
> > > > Any ideas/insights are appreciated.
> > > > Thanks,
> > > > Max
> > > >
> > > > --
> > > > You received this message because you are subscribed to the Google Groups "elasticsearch" group.
> > > > To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearc...@googlegroups.com.
> > > > To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/9486faff-3eaf-4344-8931-3121bbc5d9c7%40googlegroups.com.
> > > > For more options, visit https://groups.google.com/groups/opt_out.
> > >
> > > --
> > > Costin
> >
> > --
> > You received this message because you are subscribed to the Google Groups "elasticsearch" group.
> > To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearc...@googlegroups.com.
> > To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/86187c3a-0974-4d10-9689-e83da788c04a%40googlegroups.com.
> > For more options, visit https://groups.google.com/groups/opt_out.
>
> --
> Costin

--
You received this message because you are subscribed to the Google Groups "elasticsearch" group.
To unsubscribe from this group and stop receiving emails from it, send an email to elasticsearch+unsubscr...@googlegroups.com.
To view this discussion on the web visit https://groups.google.com/d/msgid/elasticsearch/e29e342d-de74-4ed6-93d4-875fc728c5a5%40googlegroups.com.
For more options, visit https://groups.google.com/groups/opt_out.
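For anyone landing on this thread later: the curl/telnet check Costin suggests is worth running from every Hadoop node, not just the master, since that is where the connections originate. A minimal sketch, assuming the default host/port used throughout the thread:

```shell
# From each Hadoop node, confirm Elasticsearch answers on the REST port.
# localhost:9200 is the es-hadoop default; substitute your dedicated ES host.
out=$(curl -s --max-time 5 "http://localhost:9200/" || echo "unreachable")
# A live node returns a small JSON document (including its version number);
# "unreachable" corresponds to the ConnectException in the traces above.
echo "$out"
```

If this prints "unreachable" on a slave but JSON on the master, that would match the localhost-resolution theory at the top of this message.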