Re: RE: ElasticSearch for Spark times out

2015-04-22 Thread Nick Pentreath
Is your ES cluster reachable from your Spark cluster via network / firewall? Can you run the same query from the Spark master and slave nodes via curl / one of the other clients?
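
For example, something like this (a minimal sketch; "es-host:9200" is a placeholder for your ES address) exercises the ES HTTP endpoint from the executors rather than just the driver:

  // Hit the ES REST endpoint from the executors; a firewall problem
  // shows up as a connect/read exception here rather than mid-scroll.
  import scala.io.Source
  val esUrl = "http://es-host:9200/_cluster/health"  // placeholder address
  sc.parallelize(1 to sc.defaultParallelism)
    .map(_ => Source.fromURL(esUrl).mkString)
    .collect()
    .foreach(println)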




Seems odd that GC issues would be a problem during the scan but not when running the query from a browser plugin... Sounds like it could be a network issue.



—
Sent from Mailbox


Re: RE: ElasticSearch for Spark times out

2015-04-22 Thread Otis Gospodnetic
Hi,

If you get an ES response back in 1-5 seconds, that's pretty slow.  Are these
ES aggregation queries?  Costin may be right about GC possibly causing
timeouts.  SPM <http://sematext.com/spm/> can give you all Spark and all
key Elasticsearch metrics, including various JVM metrics.  If the problem
is GC, you'll see it.  If you monitor both the Spark side and the ES side, you
should be able to find some correlation with SPM.
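
If you don't have a monitoring tool wired up yet, a bare-bones alternative (a sketch; these are the standard JVM GC-logging flags, passed through Spark's spark.executor.extraJavaOptions setting) is to log GC activity in the executor logs:

  import org.apache.spark.SparkConf

  // Surface GC pauses in the executor stderr logs.
  val conf = new SparkConf()
    .set("spark.executor.extraJavaOptions",
         "-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps")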

Otis
--
Monitoring * Alerting * Anomaly Detection * Centralized Log Management
Solr & Elasticsearch Support * http://sematext.com/


Re: RE: ElasticSearch for Spark times out

2015-04-22 Thread Costin Leau

Hi,

First off, for Elasticsearch questions it is worth pinging the Elastic mailing list, as it is more closely monitored than this one.

Back to your question, Jeetendra is right that the exception indicates no data is flowing back to the es-connector and Spark. The default timeout is 1m [1], which should be more than enough for a typical scenario. As a side note, the scroll size is 50 per task (so 150 suggests 3 tasks).
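
For reference, both of those are plain configuration properties [1]; a minimal sketch of overriding them when building the context (the property names are from the configuration page linked below; the values here are only examples):

  import org.apache.spark.{SparkConf, SparkContext}
  import org.elasticsearch.spark._

  val conf = new SparkConf()
    .setAppName("es-scroll-example")
    .set("es.nodes", "es-host:9200")  // placeholder ES address
    .set("es.http.timeout", "5m")     // REST request timeout; the 1m default mentioned above
    .set("es.scroll.size", "100")     // documents per scroll request, per task; default 50
  val sc = new SparkContext(conf)
  val rdd = sc.esRDD("myindex/mytype", "?q=*")  // hypothetical index/type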

Once the query is made, scrolling the documents is fast - likely there's something else at hand that causes the connection to time out. In such cases, you can enable logging on the REST package and see what type of data transfer occurs between ES and Spark.
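
One way to turn that on (a sketch assuming log4j 1.x, which Spark 1.x ships with; org.elasticsearch.hadoop.rest is the connector's REST package):

  import org.apache.log4j.{Level, Logger}

  // Trace the REST traffic between the connector and Elasticsearch.
  Logger.getLogger("org.elasticsearch.hadoop.rest").setLevel(Level.TRACE)

The same setting can go in log4j.properties on the driver and executors.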

Do note that if a GC occurs, it can freeze Elasticsearch (or Spark), which might trigger the timeout. Consider monitoring Elasticsearch during the query and see whether anything jumps - in particular the memory pressure.

Hope this helps,

[1] 
http://www.elastic.co/guide/en/elasticsearch/hadoop/master/configuration.html#_network


RE: ElasticSearch for Spark times out

2015-04-22 Thread Adrian Mocanu
Hi

Thanks for the help. My ES is up.
Out of curiosity, do you know what the timeout value is? There are probably other things happening to cause the timeout; I don't think my ES is that slow, but it's possible that ES is taking too long to find the data. What I see happening is that it uses scroll to get the data from ES, about 150 items at a time. The usual delay when I perform the same query from a browser plugin ranges from 1-5 sec.

Thanks


Re: ElasticSearch for Spark times out

2015-04-22 Thread Jeetendra Gangele
Basically, a read timeout means that no data arrived within the specified receive-timeout period.

A few things I would suggest:
1. Is your ES cluster up and running?
2. If it is, reduce the size of the index to a few KB and then test again.


RE: ElasticSearch for Spark times out

2015-04-22 Thread Adrian Mocanu
Hi
The gist of it is this:
I have data indexed into ES. Each index stores monthly data, and the query gets data for some date range (across several ES indexes, or within one index if the date range falls within one month). Then I merge these RDDs into an uberRdd, perform some operations, and print the results with a foreach.

The simplified code is below.

{
  // esRdds: a list of RDDs with mention counts per post, one RDD per month
  val esRdds = (startDate.getYear to endDate.getYear).flatMap { year =>
    val sMonth = if (year == startDate.getYear) startDate.getMonthOfYear else 1
    val eMonth = if (year == endDate.getYear) endDate.getMonthOfYear else 12
    (sMonth to eMonth).map { i =>
      sc.esRDD(s"$year-${i.formatted("%02d")}_nlpindex/nlp",
          ESQueries.generateQueryString(Some(startDate), Some(endDate), mentionsToFind, siteNames))
        .map { case (str, map) => unwrapAndCountMentionsPerPost(map) }
    }
  }

  // merge the monthly RDDs into a single RDD
  val uberRdd = esRdds.reduce(_ ++ _)

  // ... some map/join operations elided ...
  uberRdd.foreach(x => println(x))
}
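
As a side note, the same merge can be done in one call once the month list gets long (a sketch; semantically the same union, just built in a single step):

  // Union all the monthly RDDs at once instead of folding with ++.
  val uberRdd = sc.union(esRdds)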


Re: ElasticSearch for Spark times out

2015-04-22 Thread Jeetendra Gangele
Will you be able to paste the code?

On 23 April 2015 at 00:19, Adrian Mocanu  wrote:

> Hi
>
> I use the ElasticSearch package for Spark and very often it times out
> reading data from ES into an RDD.
>
> How can I keep the connection alive (why doesn't it? Bug?)
>
> Here's the exception I get:
>
> org.elasticsearch.hadoop.serialization.EsHadoopSerializationException: java.net.SocketTimeoutException: Read timed out
>         at org.elasticsearch.hadoop.serialization.json.JacksonJsonParser.nextToken(JacksonJsonParser.java:86) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>         at org.elasticsearch.hadoop.serialization.ParsingUtils.doSeekToken(ParsingUtils.java:70) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>         at org.elasticsearch.hadoop.serialization.ParsingUtils.seek(ParsingUtils.java:58) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>         at org.elasticsearch.hadoop.serialization.ScrollReader.readHit(ScrollReader.java:149) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>         at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:102) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>         at org.elasticsearch.hadoop.serialization.ScrollReader.read(ScrollReader.java:81) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>         at org.elasticsearch.hadoop.rest.RestRepository.scroll(RestRepository.java:314) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>         at org.elasticsearch.hadoop.rest.ScrollQuery.hasNext(ScrollQuery.java:76) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>         at org.elasticsearch.spark.rdd.AbstractEsRDDIterator.hasNext(AbstractEsRDDIterator.scala:46) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library.jar:na]
>         at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371) ~[scala-library.jar:na]
>         at scala.collection.Iterator$$anon$14.hasNext(Iterator.scala:388) ~[scala-library.jar:na]
>         at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:327) ~[scala-library.jar:na]
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727) ~[scala-library.jar:na]
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157) ~[scala-library.jar:na]
>         at org.apache.spark.shuffle.hash.HashShuffleWriter.write(HashShuffleWriter.scala:65) ~[spark-core_2.10-1.1.0.jar:1.1.0]
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:68) ~[spark-core_2.10-1.1.0.jar:1.1.0]
>         at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41) ~[spark-core_2.10-1.1.0.jar:1.1.0]
>         at org.apache.spark.scheduler.Task.run(Task.scala:54) ~[spark-core_2.10-1.1.0.jar:1.1.0]
>         at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177) ~[spark-core_2.10-1.1.0.jar:1.1.0]
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_75]
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_75]
>         at java.lang.Thread.run(Thread.java:745) [na:1.7.0_75]
> Caused by: java.net.SocketTimeoutException: Read timed out
>         at java.net.SocketInputStream.socketRead0(Native Method) ~[na:1.7.0_75]
>         at java.net.SocketInputStream.read(SocketInputStream.java:152) ~[na:1.7.0_75]
>         at java.net.SocketInputStream.read(SocketInputStream.java:122) ~[na:1.7.0_75]
>         at java.io.BufferedInputStream.read1(BufferedInputStream.java:273) ~[na:1.7.0_75]
>         at java.io.BufferedInputStream.read(BufferedInputStream.java:334) ~[na:1.7.0_75]
>         at org.apache.commons.httpclient.WireLogInputStream.read(WireLogInputStream.java:69) ~[commons-httpclient-3.1.jar:na]
>         at org.apache.commons.httpclient.ContentLengthInputStream.read(ContentLengthInputStream.java:170) ~[commons-httpclient-3.1.jar:na]
>         at java.io.FilterInputStream.read(FilterInputStream.java:133) ~[na:1.7.0_75]
>         at org.apache.commons.httpclient.AutoCloseInputStream.read(AutoCloseInputStream.java:108) ~[commons-httpclient-3.1.jar:na]
>         at org.elasticsearch.hadoop.rest.DelegatingInputStream.read(DelegatingInputStream.java:57) ~[elasticsearch-hadoop-2.1.0.Beta3.jar:2.1.0.Beta3]
>         at org.codehaus.jackson.impl.Utf8StreamParser.loadMore(Utf8StreamParser.java:172) ~[jackson-core-asl-1.9.11.jar:1.9.11]
>         at org.cod