Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-27 Thread Gourav Sengupta
>> org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>> java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>> java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>> java.lang.Thread.run(Thread.java:745)
>>
>>
>> On Tue, Jan 26, 2016 at 4:26 PM, Erisa Dervishi <erisa...@gmail.com>
>> wrote:
>>
>>> I have quite a different situation though.
>>> My job works fine for S3 files (avro format) up to 1G. It starts to hang
>>> for files larger than that size (1.5G for example)
>>>
>>> This is how I am creating the RDD:
>>>
>>> val rdd: RDD[T] = ctx.newAPIHadoopFile[AvroKey[T], NullWritable,
>>> AvroKeyInputFormat[T]](s"s3n://path-to-avro-file")
>>>
>>> Because of dependency issues, I had to use an older version of Spark,
>>> and the job was hanging while reading from S3, but I have now upgraded to
>>> Spark 1.5.2 and it seems reading from S3 works fine (first succeeded task
>>> in the attached screenshot, which takes 42 s).
>>>
>>> But then it gets stuck. The attached screenshot shows 24 running tasks
>>> that hang forever (with a "Running" status) even though I am just doing
>>> rdd.count() (initially it was a groupBy and I thought that was causing the
>>> issue, but even with just counting the lines of the file it hangs).
>>>
>>> Any suggestion is appreciated,
>>> Erisa


Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-27 Thread Erisa Dervishi


Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-26 Thread Gourav Sengupta
Hi,

Are you creating the RDDs using the textFile option? Can you please let me
know the following:
1. Number of partitions
2. Number of files
3. Time taken to create the RDD's
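For anyone following along, the three numbers can be gathered with a sketch like the following (the input path is hypothetical, and S3 credentials are assumed to be configured already):

```scala
import org.apache.hadoop.fs.Path
import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical path; substitute the real S3 location.
val input = "s3n://my-bucket/path/to/files/*"
val sc    = new SparkContext(new SparkConf().setAppName("rdd-stats"))

val t0  = System.nanoTime
val rdd = sc.textFile(input)
val numPartitions = rdd.partitions.length        // 1. number of partitions
val elapsedS      = (System.nanoTime - t0) / 1e9 // 3. time to create the RDD
                                                 //    (textFile is lazy, so asking
                                                 //    for the partitions forces
                                                 //    the S3 listing)
// 2. number of files matching the same glob, via the Hadoop FileSystem API
val numFiles = new Path(input).getFileSystem(sc.hadoopConfiguration)
  .globStatus(new Path(input)).length

println(s"partitions=$numPartitions files=$numFiles rddCreation=${elapsedS}s")
```

If the job hangs before these numbers even print, the S3 listing itself is the suspect rather than the read tasks.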


Regards,
Gourav Sengupta




Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-26 Thread Erisa Dervishi


Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-26 Thread Gourav Sengupta
Hi,

Are you creating RDDs out of the data?



Regards,
Gourav



Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-26 Thread Erisa Dervishi
Hi,
I am kind of in your situation now while trying to read from S3.
Were you able to find a workaround in the end?

Thanks,
Erisa





Re: Spark task hangs infinitely when accessing S3 from AWS

2016-01-26 Thread aecc
Sorry, I have not been able to solve the issue. I used speculation mode as a
workaround.
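For readers who want to try the same workaround, speculative execution is enabled through a handful of Spark properties; a sketch follows (the tuning values are illustrative defaults, not the ones aecc used):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// Speculative execution re-launches suspiciously slow tasks on other
// executors, so a single hung S3 read no longer wedges the whole stage.
val conf = new SparkConf()
  .setAppName("s3-read-with-speculation")
  .set("spark.speculation", "true")           // off by default
  .set("spark.speculation.interval", "100ms") // how often to check for stragglers
  .set("spark.speculation.quantile", "0.75")  // fraction of tasks that must finish first
  .set("spark.speculation.multiplier", "1.5") // "slow" = 1.5x the median task time
val sc = new SparkContext(conf)
```

Note that the hung original task still occupies an executor core; speculation only lets a duplicate copy of the task finish the work.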



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-task-hangs-infinitely-when-accessing-S3-from-AWS-tp25289p26068.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.

-
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org



Re: Spark task hangs infinitely when accessing S3 from AWS

2015-11-12 Thread aecc
Any hints?



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-task-hangs-infinitely-when-accessing-S3-from-AWS-tp25289p25365.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark task hangs infinitely when accessing S3 from AWS

2015-11-12 Thread Michael Cutler
Reading files directly from Amazon S3 can be frustrating, especially if
you're dealing with a large number of input files. Could you please
elaborate more on your use case? Does the S3 bucket in question already
contain a large number of files?

The implementation of the * wildcard operator in S3 input paths requires an
AWS S3 API call to list everything based on the common prefix; so if your
input is something like:

  s3://my-bucket*.json

then the prefix "///" will be passed to the API and the listing
should be fairly efficient.

However if you're doing something more adventurous like;

  s3://my-bucket/*/*/*/*.json

There is no common prefix to give the API here; it will literally list
every object in the bucket and then filter client-side to find anything
that matches "*.json". These types of requests are prone to timeouts and
other intermittent issues, as well as taking a ridiculous amount of time
before the job can start.
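Concretely, the difference looks like this (the bucket name and layout are made up for illustration, and an existing SparkContext `sc` is assumed):

```scala
// The literal prefix "logs/2015/11/" can be sent to the S3 LIST API, so
// only that subtree of keys is enumerated server-side:
val cheap = sc.textFile("s3n://my-bucket/logs/2015/11/*.json")

// No literal prefix before the first wildcard: the client must list every
// object in the bucket and filter the key names itself, which is slow and
// more likely to hit intermittent failures:
val expensive = sc.textFile("s3n://my-bucket/*/*/*/*.json")
```

Pushing as much of the fixed path as possible to the left of the first wildcard keeps the listing on the server side.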


Re: Spark task hangs infinitely when accessing S3 from AWS

2015-11-12 Thread aecc
Some other stats:

The number of files I have in the folder is 48.
The number of partitions used when reading data is 7315.
The maximum size of a file to read is 14G.
The size of the folder is around 270G.
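A back-of-envelope check of these numbers (sizes approximated as binary GiB/MiB; this is a math sketch, not code from the thread):

```scala
val folderBytes = 270L << 30  // ~270G total in the folder
val partitions  = 7315
// Average bytes per partition, expressed in MiB (integer arithmetic).
val perPartitionMiB = folderBytes / partitions >> 20
// perPartitionMiB comes out around 37, i.e. reasonably small splits; the
// hang is therefore unlikely to be caused by oversized partitions alone.
```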



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-task-hangs-infinitely-when-accessing-S3-from-AWS-tp25289p25367.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Re: Spark task hangs infinitely when accessing S3 from AWS

2015-11-09 Thread aecc
Any help on this? This is really blocking me and I have not found any
feasible solution yet.

Thanks.



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-task-hangs-infinitely-when-accessing-S3-from-AWS-tp25289p25327.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.




Spark task hangs infinitely when accessing S3 from AWS

2015-11-05 Thread aecc
Hi guys, when reading data from S3 on AWS using Spark 1.5.1, one of the
tasks hangs while reading data, in a way that cannot be reliably reproduced:
sometimes it hangs, sometimes it doesn't.

This is the thread dump from the hung task:

"Executor task launch worker-3" daemon prio=10 tid=0x7f419c023000
nid=0x6548 runnable [0x7f425df2b000]
   java.lang.Thread.State: RUNNABLE
at java.net.SocketInputStream.socketRead0(Native Method)
at java.net.SocketInputStream.read(SocketInputStream.java:152)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:934)
- locked <0x7f42c373b4d8> (a java.lang.Object)
at
sun.security.ssl.SSLSocketImpl.performInitialHandshake(SSLSocketImpl.java:1332)
- locked <0x7f42c373b610> (a java.lang.Object)
at
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1359)
at
sun.security.ssl.SSLSocketImpl.startHandshake(SSLSocketImpl.java:1343)
at
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:533)
at
org.apache.http.conn.ssl.SSLSocketFactory.connectSocket(SSLSocketFactory.java:401)
at
org.apache.http.impl.conn.DefaultClientConnectionOperator.openConnection(DefaultClientConnectionOperator.java:177)
at
org.apache.http.impl.conn.ManagedClientConnectionImpl.open(ManagedClientConnectionImpl.java:304)
at
org.apache.http.impl.client.DefaultRequestDirector.tryConnect(DefaultRequestDirector.java:610)
at
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:445)
at
org.apache.http.impl.client.AbstractHttpClient.doExecute(AbstractHttpClient.java:863)
at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:82)
at
org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:57)
at
com.amazonaws.http.AmazonHttpClient.executeHelper(AmazonHttpClient.java:384)
at
com.amazonaws.http.AmazonHttpClient.execute(AmazonHttpClient.java:232)
at
com.amazonaws.services.s3.AmazonS3Client.invoke(AmazonS3Client.java:3528)
at
com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:976)
at
com.amazonaws.services.s3.AmazonS3Client.getObjectMetadata(AmazonS3Client.java:956)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:892)
at
org.apache.hadoop.fs.s3a.S3AFileSystem.getFileStatus(S3AFileSystem.java:77)
at org.apache.avro.mapred.FsInput.<init>(FsInput.java:37)
at
org.apache.avro.mapreduce.AvroRecordReaderBase.createSeekableInput(AvroRecordReaderBase.java:171)
at
org.apache.avro.mapreduce.AvroRecordReaderBase.initialize(AvroRecordReaderBase.java:87)
at
org.apache.spark.rdd.NewHadoopRDD$$anon$1.<init>(NewHadoopRDD.scala:153)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:124)
at org.apache.spark.rdd.NewHadoopRDD.compute(NewHadoopRDD.scala:65)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)

These are the only settings I pass manually (besides the S3 credentials):

--conf spark.driver.maxResultSize=4g \
--conf spark.akka.frameSize=500 \
--conf spark.hadoop.fs.s3a.connection.maximum=500 \

I'm using aws-java-sdk-1.7.4.jar and hadoop-aws-2.7.1.jar to be able to read
data from AWS.
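For reference, hadoop-aws 2.7.x also exposes S3A timeout and retry knobs that can turn an indefinite socket hang into a task that fails and is retried. The following is a sketch combining the settings above with those knobs; the timeout values are illustrative assumptions, not settings from this thread, and whether they help in this exact case is untested:

```scala
import org.apache.spark.SparkConf

val conf = new SparkConf()
  // Settings from the original post.
  .set("spark.driver.maxResultSize", "4g")
  .set("spark.akka.frameSize", "500")
  .set("spark.hadoop.fs.s3a.connection.maximum", "500")
  // Assumed mitigation: bound how long a single S3 socket operation may
  // block (values in milliseconds in hadoop-aws 2.7.x).
  .set("spark.hadoop.fs.s3a.connection.timeout", "60000")
  .set("spark.hadoop.fs.s3a.connection.establish.timeout", "5000")
  // Assumed mitigation: cap client-side retries before a hard failure.
  .set("spark.hadoop.fs.s3a.attempts.maximum", "10")
```

With bounded timeouts, a wedged SSL handshake like the one in the thread dump above should eventually surface as an exception instead of a permanently "Running" task.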

I have been struggling with this issue for a long time. The only workaround I
have been able to find is to use Spark speculation, but that is no longer a
feasible solution for me.

Thanks



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-task-hangs-infinitely-when-accessing-S3-from-AWS-tp25289.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.
