Re: RE: ElasticSearch for Spark times out
Is your ES cluster reachable from your Spark cluster through the network / firewall? Can you run the same query from the Spark master and slave nodes via curl or one of the other clients? It seems odd that GC issues would be a problem for the scan but not when running the query from a browser plugin... it sounds like it could be a network issue.

— Sent from Mailbox

On Thu, Apr 23, 2015 at 5:11 AM, Otis Gospodnetic wrote:
> [snip]
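One way to run those checks from the master and each slave node; the host, port, and index name below are placeholders, not values from this thread:

```shell
# Basic reachability: a healthy node answers on 9200 with cluster metadata.
curl -s -m 5 "http://es-host:9200/"

# Cluster health: status should be green or yellow.
curl -s -m 5 "http://es-host:9200/_cluster/health?pretty"

# Approximate the connector's scan: open a scroll with the same page size (50).
curl -s -m 5 "http://es-host:9200/my_index/_search?search_type=scan&scroll=1m&size=50" \
  -d '{"query": {"match_all": {}}}'
```

If these hang or time out from a worker node but succeed from your workstation, the problem is network/firewall rather than Spark or ES.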
Re: RE: ElasticSearch for Spark times out
Hi,

If you get an ES response back in 1-5 seconds, that's pretty slow. Are these ES aggregation queries? Costin may be right about GC possibly causing the timeouts. SPM <http://sematext.com/spm/> can give you all the key Spark and Elasticsearch metrics, including various JVM metrics. If the problem is GC, you'll see it. If you monitor both the Spark side and the ES side, you should be able to find the correlation with SPM.

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/

On Wed, Apr 22, 2015 at 5:43 PM, Costin Leau wrote:
> [snip]
Re: RE: ElasticSearch for Spark times out
Hi,

First off, for Elasticsearch questions it's worth pinging the Elastic mailing list, as that one is more closely monitored than this one.

Back to your question: Jeetendra is right that the exception indicates no data is flowing back to the es-connector and Spark. The default timeout is 1m [1], which should be more than enough for a typical scenario. As a side note, the scroll size is 50 per task (so 150 items suggests 3 tasks).

Once the query is made, scrolling the documents is fast - likely something else is causing the connection to time out. In such cases, you can enable logging on the REST package and see what type of data transfer occurs between ES and Spark.

Do note that if a GC occurs, it can freeze Elasticsearch (or Spark), which might trigger the timeout. Consider monitoring Elasticsearch during the query and see whether anything jumps - in particular the memory pressure.

Hope this helps,

[1] http://www.elastic.co/guide/en/elasticsearch/hadoop/master/configuration.html#_network

On 4/22/15 10:44 PM, Adrian Mocanu wrote:
> [snip]
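For reference, the timeout and scroll-size knobs Costin mentions are es-hadoop settings documented on the configuration page linked above; a sketch with illustrative values (the host is a placeholder):

```scala
// Connector settings relevant to the scroll-timeout discussion;
// the comments give the documented defaults.
val esSettings = Map(
  "es.nodes"        -> "es-host:9200", // placeholder host
  "es.http.timeout" -> "5m",           // HTTP/read timeout; the default is 1m
  "es.scroll.size"  -> "500"           // hits per scroll request, per task; the default is 50
)
// These are normally passed through SparkConf, e.g. conf.setAll(esSettings).
// To watch the actual data transfer, enable verbose logging on the REST
// package, e.g. log4j.logger.org.elasticsearch.hadoop.rest=TRACE
esSettings.foreach { case (k, v) => println(s"$k=$v") }
```

Raising es.http.timeout only hides the symptom if the real cause is GC or the network; it is worth checking those first.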
RE: ElasticSearch for Spark times out
Hi,

Thanks for the help. My ES is up.

Out of curiosity, do you know what the timeout value is? There are probably other things happening to cause the timeout; I don't think my ES is that slow, but it's possible that ES is taking too long to find the data. What I see happening is that it uses scroll to get the data from ES, about 150 items at a time. The usual delay when I perform the same query from a browser plugin ranges from 1-5 sec.

Thanks

From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: April 22, 2015 3:09 PM
To: Adrian Mocanu
Cc: u...@spark.incubator.apache.org
Subject: Re: ElasticSearch for Spark times out

> [snip]
Re: ElasticSearch for Spark times out
Basically, a read timeout means that no data arrived within the specified receive timeout period.

A few things I would suggest:

1. Is your ES cluster up and running?
2. If 1 is yes, then reduce the size of the index - make it a few KB - and then test.

On 23 April 2015 at 00:19, Adrian Mocanu <amoc...@verticalscope.com> wrote:
> [snip]
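To make the read-timeout semantics concrete: the client gives up not because the connection drops, but because no bytes arrive within the receive window. A minimal sketch with plain sockets (no ES involved), where the server accepts the connection but never writes:

```scala
import java.net.{ServerSocket, Socket, SocketTimeoutException}

// Server bound to any free port; the client connects, but the server
// never sends a byte, so the client's blocking read exceeds its timeout.
val server   = new ServerSocket(0)
val client   = new Socket("localhost", server.getLocalPort)
val accepted = server.accept()

client.setSoTimeout(200) // receive timeout: 200 ms
val timedOut =
  try { client.getInputStream.read(); false }
  catch { case _: SocketTimeoutException => true }

println(s"read timed out: $timedOut")
accepted.close(); client.close(); server.close()
```

This is exactly the condition surfaced in the stack trace below: SocketInputStream.read blocks, the SO_TIMEOUT expires, and SocketTimeoutException("Read timed out") propagates up through the HTTP client.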
RE: ElasticSearch for Spark times out
Hi,

The gist of it is this: I have data indexed into ES. Each index stores monthly data, and the query gets data for some date range (across several ES indexes, or within one if the date range falls within a single month). Then I merge these RDDs into an uberRdd, perform some operations, and print the result with foreach. The simplified code is below.

    // esRdds: one RDD per month, holding the mention counts per post
    val esRdds = (startDate.getYear to endDate.getYear).flatMap { year =>
      val sMonth = if (year == startDate.getYear) startDate.getMonthOfYear else 1
      val eMonth = if (year == endDate.getYear) endDate.getMonthOfYear else 12
      (sMonth to eMonth).map { i =>
        sc.esRDD(s"$year-${i.formatted("%02d")}_nlpindex/nlp",
                 ESQueries.generateQueryString(Some(startDate), Some(endDate), mentionsToFind, siteNames))
          .map { case (str, map) => unwrapAndCountMentionsPerPost(map) }
      }
    }
    // merge the monthly RDDs into one, counting each RDD exactly once
    val uberRdd = esRdds.reduce(_ ++ _)
    // ... some operations elided ...
    uberRdd.foreach(x => println(x))

From: Jeetendra Gangele [mailto:gangele...@gmail.com]
Sent: April 22, 2015 2:52 PM
To: Adrian Mocanu
Cc: u...@spark.incubator.apache.org
Subject: Re: ElasticSearch for Spark times out

> [snip]
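The month/index enumeration in the code above can be pulled out and checked on its own, independent of Spark and ES. In this sketch, plain Int year/month values stand in for the Joda-Time dates, and monthlyIndexes is an illustrative helper name, not from the original post:

```scala
// Enumerate (year, month) pairs between two dates and build the index names
// the snippet above targets; "_nlpindex/nlp" follows the naming in the post.
def monthlyIndexes(startYear: Int, startMonth: Int,
                   endYear: Int, endMonth: Int): Seq[String] =
  (startYear to endYear).flatMap { year =>
    val s = if (year == startYear) startMonth else 1
    val e = if (year == endYear) endMonth else 12
    (s to e).map(m => f"$year-$m%02d_nlpindex/nlp")
  }

// A range spanning a year boundary yields one index name per month:
println(monthlyIndexes(2014, 11, 2015, 2).mkString(", "))
```

Testing this piece in isolation helps separate "wrong index names / empty indexes" from genuine connector timeouts.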
Re: ElasticSearch for Spark times out
Will you be able to paste the code?

On 23 April 2015 at 00:19, Adrian Mocanu wrote:
> Hi,
>
> I use the ElasticSearch package for Spark and very often it times out reading data from ES into an RDD.
> How can I keep the connection alive (why doesn't it? Bug?)
>
> Here's the exception I get:
>
> org.elasticsearch.hadoop.serialization.EsHadoopSerializationException: java.net.SocketTimeoutException: Read timed out
>     at org.elasticsearch.hadoop.serialization.json.JacksonJsonParser.nextToken(JacksonJsonParser.java:86) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>     at org.elasticsearch.hadoop.serialization.ParsingUtils.doSeekToken(ParsingUtils.java:70) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>     at org.elasticsearch.hadoop.serialization.ParsingUtils.seek(ParsingUtils.java:58) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>     at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:149) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>     at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:102) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>     at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:81) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>     at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:314) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>     at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:76) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>     at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:46) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library.jar:na]
>     at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) ~[scala-library.jar:na]
>     at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) ~[scala-library.jar:na]
>     at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library.jar:na]
>     at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library.jar:na]
>     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library.jar:na]
>     at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65) ~[spark-core_2.10-1.1.0.jar:1.1.0]
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) ~[spark-core_2.10-1.1.0.jar:1.1.0]
>     at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) ~[spark-core_2.10-1.1.0.jar:1.1.0]
>     at org.apache.spark.scheduler.Task.run(Task.scala:54) ~[spark-core_2.10-1.1.0.jar:1.1.0]
>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) ~[spark-core_2.10-1.1.0.jar:1.1.0]
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
>     at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
> Caused by: java.net.SocketTimeoutException: Read timed out
>     at java.net.SocketInputStream.socketRead0(Native Method) ~[na:1.7.0_75]
>     at java.net.SocketInputStream.read(SocketInputStream.java:152) ~[na:1.7.0_75]
>     at java.net.SocketInputStream.read(SocketInputStream.java:122) ~[na:1.7.0_75]
>     at java.io.BufferedInputStream.read1(BufferedInputStream.java:273) ~[na:1.7.0_75]
>     at java.io.BufferedInputStream.read(BufferedInputStream.java:334) ~[na:1.7.0_75]
>     at org.apache.commons.httpclient.WireLogInputStream.read(WireLogInputStream.java:69) ~[commons-httpclient-3.1.jar:na]
>     at org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:170) ~[commons-httpclient-3.1.jar:na]
>     at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[na:1.7.0_75]
>     at org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:108) ~[commons-httpclient-3.1.jar:na]
>     at org.elasticsearch.hadoop.rest.DelegatingInputStream.read(DelegatingInputStream.java:57) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>     at org.codehaus.jackson.impl.Utf8StreamParser.loadMore(Utf8StreamParser.java:172) ~[jackson-core-asl-1.9.11.jar:1.9.11]
>     ...