RE: Understanding Spark S3 Read Performance

2023-05-16 Thread info
Hi,

For clarification, are those 12 / 14 minutes cumulative CPU time or wall clock
time? How many executors executed those 1 / 375 tasks?

Cheers,
Enrico

 Original message 
From: Shashank Rao
Date: 16.05.23 19:48 (GMT+01:00)
To: user@spark.apache.org
Subject: Understanding Spark S3 Read Performance


Understanding Spark S3 Read Performance

2023-05-16 Thread Shashank Rao
Hi,
I'm trying to set up a Spark pipeline which reads data from S3 and writes
it into Google BigQuery.

Environment Details:
---
Java 8
AWS EMR-6.10.0
Spark v3.3.1
2 m5.xlarge executor nodes


S3 Directory structure:
---
bucket-name:
|---folder1:
    |---folder2:
        |---file1.jsonl
        |---file2.jsonl
        ...
        |---file12000.jsonl


Each file is of size 1.6 MB and there are a total of 12,000 files.

The code to read the source data looks like this:

spark.sparkContext.hadoopConfiguration.set("fs.s3a.access.key",)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.secret.key",)
spark.sparkContext.hadoopConfiguration.set("fs.s3a.endpoint","
s3.amazonaws.com")
Dataset data = spark.read().option("recursiveFileLookup",
"true").json("s3a://bucket-name/folder1/folder2/")


Now here's my problem:
This triggers two jobs: a) Listing Leaf Nodes, b) Json Read

[image: Screenshot 2023-05-16 at 23.00.23.png]

The list job takes around 12 mins and has 10,000 partitions/tasks. The read
job takes 14 mins and has 375 partitions/tasks.

1. What's going on here? Why does the list job take so much time? I was able
to write a simple Java program using the AWS SDK that listed all the files in
the exact same S3 directory in 5 seconds.
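
For reference, this is roughly what that standalone listing looks like (a
sketch, not my exact code; AWS SDK for Java v1, with region and credential
handling simplified):

import com.amazonaws.services.s3.AmazonS3;
import com.amazonaws.services.s3.AmazonS3ClientBuilder;
import com.amazonaws.services.s3.model.ListObjectsV2Request;
import com.amazonaws.services.s3.model.ListObjectsV2Result;

public class ListFiles {
    public static void main(String[] args) {
        AmazonS3 s3 = AmazonS3ClientBuilder.standard().withRegion("us-east-1").build();
        ListObjectsV2Request req = new ListObjectsV2Request()
                .withBucketName("bucket-name")
                .withPrefix("folder1/folder2/");
        long count = 0;
        ListObjectsV2Result result;
        do {
            result = s3.listObjectsV2(req);
            // Each LIST call returns up to 1,000 keys, so 12,000 files is ~12 calls.
            count += result.getObjectSummaries().size();
            req.setContinuationToken(result.getNextContinuationToken());
        } while (result.isTruncated());
        System.out.println("Listed " + count + " objects");
    }
}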

2. Why are there 10,000 partitions/tasks for listing? Can this be
configured / changed? Is there any optimisation that can be done to reduce
the time taken over here?
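
(For context, the closest-looking settings I have found so far are
spark.sql.sources.parallelPartitionDiscovery.threshold and
spark.sql.sources.parallelPartitionDiscovery.parallelism, but I have not
verified that they drive this listing job. If they do, I would expect to tune
them at session creation, roughly like this:)

SparkSession spark = SparkSession.builder()
        // assumption: these two settings govern the parallel file-listing job
        .config("spark.sql.sources.parallelPartitionDiscovery.threshold", "64")
        .config("spark.sql.sources.parallelPartitionDiscovery.parallelism", "100")
        .getOrCreate();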

3. What is the second job doing? Why are there 375 partitions in that job?

If this works out, I actually need to run the pipeline on a much larger
data set of around 200,000 files and it doesn't look like it will scale
very well.

Please note, modifying the source data is not an option that I have. Hence,
I cannot merge multiple small files into a single large file.

Any help is appreciated.


-- 
Thanks,
Shashank Rao


Re: Spark & S3 - Introducing random values into key names

2018-03-08 Thread Subhash Sriram
Thanks, Vadim! That helps and makes sense. I don't think we have a number of 
keys so large that we have to worry about it. If we do, I think I would go with 
an approach similar to what you suggested.

Thanks again,
Subhash 

Sent from my iPhone

> On Mar 8, 2018, at 11:56 AM, Vadim Semenov  wrote:
> 
> You need to put randomness into the beginning of the key, if you put it other 
> than into the beginning, it's not guaranteed that you're going to have good 
> performance.
> 
> The way we achieved this is by writing to HDFS first, and then having a 
> custom DistCp implemented using Spark that copies parquet files using random 
> keys,
> and then saves the list of resulting keys to S3, and when we want to use 
> those parquet files, we just need to load the listing file, and then take 
> keys from it and pass them into the loader.
> 
> You only need to do this when you have way too many files, if the number of 
> keys you operate is reasonably small (let's say, in thousands), you won't get 
> any benefits.
> 
> Also, S3 buckets have internal optimizations, and over time they adjust to 
> the workload, i.e. some additional underlying partitions get added, 
> some splits happen, etc.
> If you want to have good performance from start, you would need to use 
> randomization, yes.
> Or alternatively, you can contact AWS and tell them about the naming schema 
> that you're going to have (but it must be set in stone), and then they can 
> try to pre-optimize the bucket for you.
> 
>> On Thu, Mar 8, 2018 at 11:42 AM, Subhash Sriram  
>> wrote:
>> Hey Spark user community,
>> 
>> I am writing Parquet files from Spark to S3 using S3a. I was reading this 
>> article about improving S3 bucket performance, specifically about how it can 
>> help to introduce randomness to your key names so that data is written to 
>> different partitions.
>> 
>> https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
>> 
>> Is there a straight forward way to accomplish this randomness in Spark via 
>> the DataSet API? The only thing that I could think of would be to actually 
>> split the large set into multiple sets (based on row boundaries), and then 
>> write each one with the random key name.
>> 
>> Is there an easier way that I am missing?
>> 
>> Thanks in advance!
>> Subhash
>> 
>> 
> 


Re: Spark & S3 - Introducing random values into key names

2018-03-08 Thread Vadim Semenov
You need to put randomness into the beginning of the key, if you put it
other than into the beginning, it's not guaranteed that you're going to
have good performance.

The way we achieved this is by writing to HDFS first, and then having a
custom DistCp implemented using Spark that copies parquet files using
random keys,
and then saves the list of resulting keys to S3, and when we want to use
those parquet files, we just need to load the listing file, and then take
keys from it and pass them into the loader.
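
To make that copy step concrete, here is a rough sketch in Java (not our
actual code: the staging path, bucket name and prefix length are made up,
error handling is omitted, and executors are assumed to already have S3A
credentials configured):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileStatus;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;
import org.apache.spark.sql.Encoders;
import org.apache.spark.sql.SparkSession;

import java.net.URI;
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.UUID;

public class RandomKeyCopy {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder().appName("random-key-copy").getOrCreate();

    // 1. List the parquet files that the job wrote to HDFS first.
    Configuration conf = spark.sparkContext().hadoopConfiguration();
    Path staging = new Path("hdfs:///staging/output");
    FileSystem hdfs = FileSystem.get(URI.create(staging.toString()), conf);
    List<String> files = new ArrayList<>();
    for (FileStatus st : hdfs.listStatus(staging)) {
      if (st.isFile()) {
        files.add(st.getPath().toString());
      }
    }

    // 2. Copy each file to S3 under a random prefix, collecting the resulting keys.
    List<String> keys = spark.createDataset(files, Encoders.STRING())
        .javaRDD()
        .mapPartitions((Iterator<String> it) -> {
          Configuration c = new Configuration();
          List<String> written = new ArrayList<>();
          while (it.hasNext()) {
            Path src = new Path(it.next());
            // the randomness goes at the *start* of the key
            String dst = "s3a://my-bucket/" + UUID.randomUUID().toString().substring(0, 8)
                + "/" + src.getName();
            Path dstPath = new Path(dst);
            FileUtil.copy(src.getFileSystem(c), src, dstPath.getFileSystem(c), dstPath, false, c);
            written.add(dst);
          }
          return written.iterator();
        })
        .collect();

    // 3. Save the listing so that readers can load these exact keys later.
    spark.createDataset(keys, Encoders.STRING())
        .write().text("s3a://my-bucket/_listings/" + UUID.randomUUID());

    spark.stop();
  }
}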

You only need to do this when you have way too many files, if the number of
keys you operate is reasonably small (let's say, in thousands), you won't
get any benefits.

Also, S3 buckets have internal optimizations, and over time they adjust to
the workload, i.e. some additional underlying partitions get added,
some splits happen, etc.
If you want to have good performance from start, you would need to use
randomization, yes.
Or alternatively, you can contact AWS and tell them about the naming schema
that you're going to have (but it must be set in stone), and then they can
try to pre-optimize the bucket for you.

On Thu, Mar 8, 2018 at 11:42 AM, Subhash Sriram 
wrote:

> Hey Spark user community,
>
> I am writing Parquet files from Spark to S3 using S3a. I was reading this
> article about improving S3 bucket performance, specifically about how it
> can help to introduce randomness to your key names so that data is written
> to different partitions.
>
> https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/
>
> Is there a straight forward way to accomplish this randomness in Spark via
> the DataSet API? The only thing that I could think of would be to actually
> split the large set into multiple sets (based on row boundaries), and then
> write each one with the random key name.
>
> Is there an easier way that I am missing?
>
> Thanks in advance!
> Subhash
>
>
>


Spark & S3 - Introducing random values into key names

2018-03-08 Thread Subhash Sriram
Hey Spark user community,

I am writing Parquet files from Spark to S3 using S3a. I was reading this
article about improving S3 bucket performance, specifically about how it
can help to introduce randomness to your key names so that data is written
to different partitions.

https://aws.amazon.com/premiumsupport/knowledge-center/s3-bucket-performance-improve/

Is there a straight forward way to accomplish this randomness in Spark via
the DataSet API? The only thing that I could think of would be to actually
split the large set into multiple sets (based on row boundaries), and then
write each one with the random key name.
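
In other words, something along these lines, just as a sketch to show what I
mean (Java; "df" stands in for an existing Dataset<Row>, and the split ratios
and bucket/prefix naming are placeholders):

// Split the large Dataset into a few pieces and write each piece
// under its own random key prefix.
Dataset<Row>[] parts = df.randomSplit(new double[]{0.25, 0.25, 0.25, 0.25});
for (Dataset<Row> part : parts) {
    String prefix = java.util.UUID.randomUUID().toString().substring(0, 8);
    part.write().parquet("s3a://my-bucket/" + prefix + "/output/");
}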

Is there an easier way that I am missing?

Thanks in advance!
Subhash


Re: Spark <--> S3 flakiness

2017-05-18 Thread Steve Loughran

On 18 May 2017, at 05:29, lucas.g...@gmail.com 
wrote:

Steve, just to clarify:

"FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way 
better on high-performance reads, especially if you are working with column 
data and can set the fs.s3a.experimental.fadvise=random option. "

Are you talking about the hadoop-aws lib or hadoop itself.  I see that spark is 
currently only pre-built against hadoop 2.7.

all the hadoop JARs. It's a big move, and really I'd hold off for it except in 
the special case : spark standalone on my desktop

Most of our failures are on write, the other fix I've seen advertised has been: 
"fileoutputcommitter.algorithm.version=2"

this eliminates the big rename() in job commit, renaming the work of individual 
tasks at the end of each task commit.
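
(For reference, that option is normally passed through as a Hadoop property,
e.g. in spark-defaults.conf as below; but as noted next, setting it does not
address the underlying problem on S3.)

spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version 2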

It doesn't do anything for problems writing data, and it still has a 
fundamental flaw: to rename everything in a "directory tree", you need to be 
able to list all objects under a path, which utterly depends on consistent 
directory listings. Amazon S3 doesn't offer that: you can create a file, then 
list the bucket *and not see the file*. Similarly, after deletion it may be 
listed, but not be there any more. Without that consistent listing, you don't 
get reliable renames, hence output.

It's possible that you may not even notice the fact that data hasn't been 
copied over.

Ryan's committer avoids this problem by using the local filesystem and HDFS 
cluster as the consistent stores, and using uncompleted S3A multipart uploads 
to eliminate the rename at the end

https://github.com/rdblue/s3committer

see also: https://www.youtube.com/watch?v=8F2Jqw5_OnI



Still doing some reading and will start testing in the next day or so.

Thanks!

Gary

On 17 May 2017 at 03:19, Steve Loughran 
> wrote:

On 17 May 2017, at 06:00, lucas.g...@gmail.com 
wrote:

Steve, thanks for the reply.  Digging through all the documentation now.

Much appreciated!



FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way 
better on high-performance reads, especially if you are working with column 
data and can set the fs.s3a.experimental.fadvise=random option.

That's in apache Hadoop 2.8, HDP 2.5+, and I suspect also the latest versions 
of CDH, even if their docs don't mention it

https://hortonworks.github.io/hdp-aws/s3-performance/
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/spark_s3.html


On 16 May 2017 at 10:10, Steve Loughran 
> wrote:

On 11 May 2017, at 06:07, lucas.g...@gmail.com 
wrote:

Hi users, we have a bunch of pyspark jobs that are using S3 for loading / 
intermediate steps and final output of parquet files.

Please don't, not without a committer specially written to work against S3 in 
the presence of failures.You are at risk of things going wrong and you not even 
noticing.

The only one that I trust to do this right now is; 
https://github.com/rdblue/s3committer


see also : https://github.com/apache/spark/blob/master/docs/cloud-integration.md



We're running into the following issues on a semi regular basis:
* These are intermittent errors, IE we have about 300 jobs that run nightly... 
And a fairly random but small-ish percentage of them fail with the following 
classes of errors.

S3 write errors

"ERROR Utils: Aborting task
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404, AWS 
Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS Error 
Message: Not Found, S3 Extended Request ID: BlaBlahEtc="

"Py4JJavaError: An error occurred while calling o43.parquet.
: com.amazonaws.services.s3.model.MultiObjectDeleteException: Status Code: 0, 
AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS Error 
Message: One or more objects could not be deleted, S3 Extended Request ID: null"


S3 Read Errors:

[Stage 1:=>   (27 + 4) / 
31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage 1.0 (TID 
11)
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:196)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at 
org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
at 

Re: Spark <--> S3 flakiness

2017-05-17 Thread lucas.g...@gmail.com
Steve, just to clarify:

"FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is
way better on high-performance reads, especially if you are working with
column data and can set the fs.s3a.experimental.fadvise=random option. "

Are you talking about the hadoop-aws lib or hadoop itself?  I see that
Spark is currently only pre-built against Hadoop 2.7.

Most of our failures are on write, the other fix I've seen advertised has
been: "fileoutputcommitter.algorithm.version=2"

Still doing some reading and will start testing in the next day or so.

Thanks!

Gary

On 17 May 2017 at 03:19, Steve Loughran  wrote:

>
> On 17 May 2017, at 06:00, lucas.g...@gmail.com wrote:
>
> Steve, thanks for the reply.  Digging through all the documentation now.
>
> Much appreciated!
>
>
>
> FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is
> way better on high-performance reads, especially if you are working with
> column data and can set the fs.s3a.experimental.fadvise=random option.
>
> That's in apache Hadoop 2.8, HDP 2.5+, and I suspect also the latest
> versions of CDH, even if their docs don't mention it
>
> https://hortonworks.github.io/hdp-aws/s3-performance/
> https://www.cloudera.com/documentation/enterprise/5-9-x/topics/spark_s3.html
>
>
> On 16 May 2017 at 10:10, Steve Loughran  wrote:
>
>>
>> On 11 May 2017, at 06:07, lucas.g...@gmail.com wrote:
>>
>> Hi users, we have a bunch of pyspark jobs that are using S3 for loading /
>> intermediate steps and final output of parquet files.
>>
>>
>> Please don't, not without a committer specially written to work against
>> S3 in the presence of failures.You are at risk of things going wrong and
>> you not even noticing.
>>
>> The only one that I trust to do this right now is;
>> https://github.com/rdblue/s3committer
>>
>>
>> see also: https://github.com/apache/spark/blob/master/docs/cloud-integration.md
>>
>>
>>
>> We're running into the following issues on a semi regular basis:
>> * These are intermittent errors, IE we have about 300 jobs that run
>> nightly... And a fairly random but small-ish percentage of them fail with
>> the following classes of errors.
>>
>>
>> *S3 write errors *
>>
>>> "ERROR Utils: Aborting task
>>> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404,
>>> AWS Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS
>>> Error Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
>>>
>>
>>
>>> "Py4JJavaError: An error occurred while calling o43.parquet.
>>> : com.amazonaws.services.s3.model.MultiObjectDeleteException: Status
>>> Code: 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS
>>> Error Message: One or more objects could not be deleted, S3 Extended
>>> Request ID: null"
>>
>>
>>
>>
>> *S3 Read Errors: *
>>
>>> [Stage 1:=>   (27 +
>>> 4) / 31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage
>>> 1.0 (TID 11)
>>> java.net.SocketException: Connection reset
>>> at java.net.SocketInputStream.read(SocketInputStream.java:196)
>>> at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
>>> at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
>>> at sun.security.ssl.InputRecord.read(InputRecord.java:509)
>>> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
>>> at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
>>> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
>>> at org.apache.http.impl.io.AbstractSessionInputBuffer.read(Abst
>>> ractSessionInputBuffer.java:198)
>>> at org.apache.http.impl.io.ContentLengthInputStream.read(Conten
>>> tLengthInputStream.java:178)
>>> at org.apache.http.impl.io.ContentLengthInputStream.read(Conten
>>> tLengthInputStream.java:200)
>>> at org.apache.http.impl.io.ContentLengthInputStream.close(Conte
>>> ntLengthInputStream.java:103)
>>> at org.apache.http.conn.BasicManagedEntity.streamClosed(BasicMa
>>> nagedEntity.java:168)
>>> at org.apache.http.conn.EofSensorInputStream.checkClose(EofSens
>>> orInputStream.java:228)
>>> at org.apache.http.conn.EofSensorInputStream.close(EofSensorInp
>>> utStream.java:174)
>>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
>>> at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream
>>> .java:187)
>>
>>
>>
>> We have literally tons of logs we can add but it would make the email
>> unwieldy big.  If it would be helpful I'll drop them in a pastebin or
>> something.
>>
>> Our config is along the lines of:
>>
>>- spark-2.1.0-bin-hadoop2.7

Re: Spark <--> S3 flakiness

2017-05-17 Thread Steve Loughran

On 17 May 2017, at 06:00, lucas.g...@gmail.com 
wrote:

Steve, thanks for the reply.  Digging through all the documentation now.

Much appreciated!



FWIW, if you can move up to the Hadoop 2.8 version of the S3A client it is way 
better on high-performance reads, especially if you are working with column 
data and can set the fs.s3a.experimental.fadvise=random option.

That's in apache Hadoop 2.8, HDP 2.5+, and I suspect also the latest versions 
of CDH, even if their docs don't mention it

https://hortonworks.github.io/hdp-aws/s3-performance/
https://www.cloudera.com/documentation/enterprise/5-9-x/topics/spark_s3.html
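
(Roughly, that can be set per-job along these lines; untested here:)

  --conf spark.hadoop.fs.s3a.experimental.fadvise=random

or by setting fs.s3a.experimental.fadvise=random in core-site.xml.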


On 16 May 2017 at 10:10, Steve Loughran 
> wrote:

On 11 May 2017, at 06:07, lucas.g...@gmail.com 
wrote:

Hi users, we have a bunch of pyspark jobs that are using S3 for loading / 
intermediate steps and final output of parquet files.

Please don't, not without a committer specially written to work against S3 in 
the presence of failures.You are at risk of things going wrong and you not even 
noticing.

The only one that I trust to do this right now is; 
https://github.com/rdblue/s3committer


see also : https://github.com/apache/spark/blob/master/docs/cloud-integration.md



We're running into the following issues on a semi regular basis:
* These are intermittent errors, IE we have about 300 jobs that run nightly... 
And a fairly random but small-ish percentage of them fail with the following 
classes of errors.

S3 write errors

"ERROR Utils: Aborting task
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404, AWS 
Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS Error 
Message: Not Found, S3 Extended Request ID: BlaBlahEtc="

"Py4JJavaError: An error occurred while calling o43.parquet.
: com.amazonaws.services.s3.model.MultiObjectDeleteException: Status Code: 0, 
AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS Error 
Message: One or more objects could not be deleted, S3 Extended Request ID: null"


S3 Read Errors:

[Stage 1:=>   (27 + 4) / 
31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage 1.0 (TID 
11)
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:196)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at 
org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
at 
org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at 
org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
at 
org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
at 
org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:168)
at 
org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
at 
org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:187)


We have literally tons of logs we can add but it would make the email unwieldy 
big.  If it would be helpful I'll drop them in a pastebin or something.

Our config is along the lines of:

  *   spark-2.1.0-bin-hadoop2.7
  *   '--packages 
com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 
pyspark-shell'

You should have the Hadoop 2.7 JARs on your CP, as s3a on 2.6 wasn't ready to 
play with. In particular, in a close() call it reads to the end of the stream, 
which is a performance killer on large files. That stack trace you see is from 
that same phase of operation, so should go away too.

Hadoop 2.7.3 depends on Amazon SDK 1.7.4; trying to use a different one will 
probably cause link errors.
http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.3

Also: make sure Joda time >= 2.8.1 for Java 8

If you go up to 2.8.0, and you still see the errors, file something against 
HADOOP in JIRA


Given the stack overflow / googling 

Re: Spark <--> S3 flakiness

2017-05-16 Thread lucas.g...@gmail.com
Steve, thanks for the reply.  Digging through all the documentation now.

Much appreciated!



On 16 May 2017 at 10:10, Steve Loughran  wrote:

>
> On 11 May 2017, at 06:07, lucas.g...@gmail.com wrote:
>
> Hi users, we have a bunch of pyspark jobs that are using S3 for loading /
> intermediate steps and final output of parquet files.
>
>
> Please don't, not without a committer specially written to work against S3
> in the presence of failures.You are at risk of things going wrong and you
> not even noticing.
>
> The only one that I trust to do this right now is;
> https://github.com/rdblue/s3committer
>
>
> see also: https://github.com/apache/spark/blob/master/docs/cloud-integration.md
>
>
>
> We're running into the following issues on a semi regular basis:
> * These are intermittent errors, IE we have about 300 jobs that run
> nightly... And a fairly random but small-ish percentage of them fail with
> the following classes of errors.
>
>
> *S3 write errors *
>
>> "ERROR Utils: Aborting task
>> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404, AWS
>> Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS Error
>> Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
>>
>
>
>> "Py4JJavaError: An error occurred while calling o43.parquet.
>> : com.amazonaws.services.s3.model.MultiObjectDeleteException: Status
>> Code: 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS
>> Error Message: One or more objects could not be deleted, S3 Extended
>> Request ID: null"
>
>
>
>
> *S3 Read Errors: *
>
>> [Stage 1:=>   (27 +
>> 4) / 31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage
>> 1.0 (TID 11)
>> java.net.SocketException: Connection reset
>> at java.net.SocketInputStream.read(SocketInputStream.java:196)
>> at java.net.SocketInputStream.read(SocketInputStream.java:122)
>> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
>> at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
>> at sun.security.ssl.InputRecord.read(InputRecord.java:509)
>> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
>> at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
>> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
>> at org.apache.http.impl.io.AbstractSessionInputBuffer.read(
>> AbstractSessionInputBuffer.java:198)
>> at org.apache.http.impl.io.ContentLengthInputStream.read(
>> ContentLengthInputStream.java:178)
>> at org.apache.http.impl.io.ContentLengthInputStream.read(
>> ContentLengthInputStream.java:200)
>> at org.apache.http.impl.io.ContentLengthInputStream.close(
>> ContentLengthInputStream.java:103)
>> at org.apache.http.conn.BasicManagedEntity.streamClosed(
>> BasicManagedEntity.java:168)
>> at org.apache.http.conn.EofSensorInputStream.checkClose(
>> EofSensorInputStream.java:228)
>> at org.apache.http.conn.EofSensorInputStream.close(
>> EofSensorInputStream.java:174)
>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
>> at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:187)
>
>
>
> We have literally tons of logs we can add but it would make the email
> unwieldy big.  If it would be helpful I'll drop them in a pastebin or
> something.
>
> Our config is along the lines of:
>
>- spark-2.1.0-bin-hadoop2.7
>- '--packages 
> com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
>pyspark-shell'
>
>
> You should have the Hadoop 2.7 JARs on your CP, as s3a on 2.6 wasn't ready
> to play with. In particular, in a close() call it reads to the end of the
> stream, which is a performance killer on large files. That stack trace you
> see is from that same phase of operation, so should go away too.
>
> Hadoop 2.7.3 depends on Amazon SDK 1.7.4; trying to use a different one
> will probably cause link errors.
> http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.3
>
> Also: make sure Joda time >= 2.8.1 for Java 8
>
> If you go up to 2.8.0, and you still see the errors, file something
> against HADOOP in JIRA
>
>
> Given the stack overflow / googling I've been doing I know we're not the
> only org with these issues but I haven't found a good set of solutions in
> those spaces yet.
>
> Thanks!
>
> Gary Lucas
>
>
>


Re: Spark <--> S3 flakiness

2017-05-16 Thread Steve Loughran

On 11 May 2017, at 06:07, lucas.g...@gmail.com 
wrote:

Hi users, we have a bunch of pyspark jobs that are using S3 for loading / 
intermediate steps and final output of parquet files.

Please don't, not without a committer specially written to work against S3 in
the presence of failures. You are at risk of things going wrong and not even
noticing.

The only one that I trust to do this right now is:
https://github.com/rdblue/s3committer


see also : https://github.com/apache/spark/blob/master/docs/cloud-integration.md



We're running into the following issues on a semi regular basis:
* These are intermittent errors, IE we have about 300 jobs that run nightly... 
And a fairly random but small-ish percentage of them fail with the following 
classes of errors.

S3 write errors

"ERROR Utils: Aborting task
com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404, AWS 
Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS Error 
Message: Not Found, S3 Extended Request ID: BlaBlahEtc="

"Py4JJavaError: An error occurred while calling o43.parquet.
: com.amazonaws.services.s3.model.MultiObjectDeleteException: Status Code: 0, 
AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS Error 
Message: One or more objects could not be deleted, S3 Extended Request ID: null"


S3 Read Errors:

[Stage 1:=>   (27 + 4) / 
31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage 1.0 (TID 
11)
java.net.SocketException: Connection reset
at java.net.SocketInputStream.read(SocketInputStream.java:196)
at java.net.SocketInputStream.read(SocketInputStream.java:122)
at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
at sun.security.ssl.InputRecord.read(InputRecord.java:509)
at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
at 
org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
at 
org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
at 
org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
at 
org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
at 
org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:168)
at 
org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
at 
org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at java.io.FilterInputStream.close(FilterInputStream.java:181)
at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:187)


We have literally tons of logs we can add but it would make the email unwieldy 
big.  If it would be helpful I'll drop them in a pastebin or something.

Our config is along the lines of:

  *   spark-2.1.0-bin-hadoop2.7
  *   '--packages 
com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0 
pyspark-shell'

You should have the Hadoop 2.7 JARs on your CP, as s3a on 2.6 wasn't ready to 
play with. In particular, in a close() call it reads to the end of the stream, 
which is a performance killer on large files. That stack trace you see is from 
that same phase of operation, so should go away too.

Hadoop 2.7.3 depends on Amazon SDK 1.7.4; trying to use a different one will 
probably cause link errors.
http://mvnrepository.com/artifact/org.apache.hadoop/hadoop-aws/2.7.3

Also: make sure Joda time >= 2.8.1 for Java 8
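
(That is, something along these lines for the --packages list; versions are
indicative only, check them against your cluster:)

--packages org.apache.hadoop:hadoop-aws:2.7.3,com.amazonaws:aws-java-sdk:1.7.4,joda-time:joda-time:2.9.9 pyspark-shell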

If you go up to 2.8.0, and you still see the errors, file something against 
HADOOP in JIRA


Given the stack overflow / googling I've been doing I know we're not the only 
org with these issues but I haven't found a good set of solutions in those 
spaces yet.

Thanks!

Gary Lucas



Re: Spark <--> S3 flakiness

2017-05-14 Thread Gourav Sengupta
Are you running EMR?

On Sun, May 14, 2017 at 4:59 AM, Miguel Morales 
wrote:

> Some things just didn't work as i had first expected it.  For example,
> when writing from a spark collection to an alluxio destination didn't
> persist them to s3 automatically.
>
> I remember having to use the alluxio library directly to force the
> files to persist to s3 after spark finished writing to alluxio.
>
> On Fri, May 12, 2017 at 6:52 AM, Gene Pang  wrote:
> > Hi,
> >
> > Yes, you can use Alluxio with Spark to read/write to S3. Here is a blog
> post
> > on Spark + Alluxio + S3, and here is some documentation for configuring
> > Alluxio + S3 and configuring Spark + Alluxio.
> >
> > You mentioned that it required a lot of effort to get working. May I ask
> > what you ran into, and how you got it to work?
> >
> > Thanks,
> > Gene
> >
> > On Thu, May 11, 2017 at 11:55 AM, Miguel Morales <
> therevolti...@gmail.com>
> > wrote:
> >>
> >> Might want to try to use gzip as opposed to parquet.  The only way i
> >> ever reliably got parquet to work on S3 is by using Alluxio as a
> >> buffer, but it's a decent amount of work.
> >>
> >> On Thu, May 11, 2017 at 11:50 AM, lucas.g...@gmail.com
> >>  wrote:
> >> > Also, and this is unrelated to the actual question... Why don't these
> >> > messages show up in the archive?
> >> >
> >> > http://apache-spark-user-list.1001560.n3.nabble.com/
> >> >
> >> > Ideally I'd want to post a link to our internal wiki for these
> >> > questions,
> >> > but can't find them in the archive.
> >> >
> >> > On 11 May 2017 at 07:16, lucas.g...@gmail.com 
> >> > wrote:
> >> >>
> >> >> Looks like this isn't viable in spark 2.0.0 (and greater I presume).
> >> >> I'm
> >> >> pretty sure I came across this blog and ignored it due to that.
> >> >>
> >> >> Any other thoughts?  The linked tickets in:
> >> >> https://issues.apache.org/jira/browse/SPARK-10063
> >> >> https://issues.apache.org/jira/browse/HADOOP-13786
> >> >> https://issues.apache.org/jira/browse/HADOOP-9565 look relevant too.
> >> >>
> >> >> On 10 May 2017 at 22:24, Miguel Morales 
> >> >> wrote:
> >> >>>
> >> >>> Try using the DirectParquetOutputCommiter:
> >> >>> http://dev.sortable.com/spark-directparquetoutputcommitter/
> >> >>>
> >> >>> On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com
> >> >>>  wrote:
> >> >>> > Hi users, we have a bunch of pyspark jobs that are using S3 for
> >> >>> > loading
> >> >>> > /
> >> >>> > intermediate steps and final output of parquet files.
> >> >>> >
> >> >>> > We're running into the following issues on a semi regular basis:
> >> >>> > * These are intermittent errors, IE we have about 300 jobs that
> run
> >> >>> > nightly... And a fairly random but small-ish percentage of them
> fail
> >> >>> > with
> >> >>> > the following classes of errors.
> >> >>> >
> >> >>> > S3 write errors
> >> >>> >
> >> >>> >> "ERROR Utils: Aborting task
> >> >>> >> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code:
> >> >>> >> 404,
> >> >>> >> AWS
> >> >>> >> Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null,
> >> >>> >> AWS
> >> >>> >> Error
> >> >>> >> Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
> >> >>> >
> >> >>> >
> >> >>> >>
> >> >>> >> "Py4JJavaError: An error occurred while calling o43.parquet.
> >> >>> >> : com.amazonaws.services.s3.model.MultiObjectDeleteException:
> >> >>> >> Status
> >> >>> >> Code:
> >> >>> >> 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null,
> >> >>> >> AWS
> >> >>> >> Error
> >> >>> >> Message: One or more objects could not be deleted, S3 Extended
> >> >>> >> Request
> >> >>> >> ID:
> >> >>> >> null"
> >> >>> >
> >> >>> >
> >> >>> >
> >> >>> > S3 Read Errors:
> >> >>> >
> >> >>> >> [Stage 1:=>
> >> >>> >> (27
> >> >>> >> + 4)
> >> >>> >> / 31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in
> >> >>> >> stage
> >> >>> >> 1.0
> >> >>> >> (TID 11)
> >> >>> >> java.net.SocketException: Connection reset
> >> >>> >> at java.net.SocketInputStream.read(SocketInputStream.java:196)
> >> >>> >> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> >> >>> >> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
> >> >>> >> at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:
> 554)
> >> >>> >> at sun.security.ssl.InputRecord.read(InputRecord.java:509)
> >> >>> >> at
> >> >>> >> sun.security.ssl.SSLSocketImpl.readRecord(
> SSLSocketImpl.java:927)
> >> >>> >> at
> >> >>> >>
> >> >>> >> sun.security.ssl.SSLSocketImpl.readDataRecord(
> SSLSocketImpl.java:884)
> >> >>> >> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
> >> >>> >> at
> >> >>> >>
> >> >>> >>
> >> >>> >> org.apache.http.impl.io.AbstractSessionInputBuffer.read(
> AbstractSessionInputBuffer.java:198)
> >> >>> >> at
> >> >>> >>
> >> >>> >>
> >> >>> >> 

Re: Spark <--> S3 flakiness

2017-05-13 Thread Miguel Morales
Some things just didn't work as I had first expected. For example, writing
from a Spark collection to an Alluxio destination didn't persist the files
to S3 automatically.

I remember having to use the alluxio library directly to force the
files to persist to s3 after spark finished writing to alluxio.

On Fri, May 12, 2017 at 6:52 AM, Gene Pang  wrote:
> Hi,
>
> Yes, you can use Alluxio with Spark to read/write to S3. Here is a blog post
> on Spark + Alluxio + S3, and here is some documentation for configuring
> Alluxio + S3 and configuring Spark + Alluxio.
>
> You mentioned that it required a lot of effort to get working. May I ask
> what you ran into, and how you got it to work?
>
> Thanks,
> Gene
>
> On Thu, May 11, 2017 at 11:55 AM, Miguel Morales 
> wrote:
>>
>> Might want to try to use gzip as opposed to parquet.  The only way i
>> ever reliably got parquet to work on S3 is by using Alluxio as a
>> buffer, but it's a decent amount of work.
>>
>> On Thu, May 11, 2017 at 11:50 AM, lucas.g...@gmail.com
>>  wrote:
>> > Also, and this is unrelated to the actual question... Why don't these
>> > messages show up in the archive?
>> >
>> > http://apache-spark-user-list.1001560.n3.nabble.com/
>> >
>> > Ideally I'd want to post a link to our internal wiki for these
>> > questions,
>> > but can't find them in the archive.
>> >
>> > On 11 May 2017 at 07:16, lucas.g...@gmail.com 
>> > wrote:
>> >>
>> >> Looks like this isn't viable in spark 2.0.0 (and greater I presume).
>> >> I'm
>> >> pretty sure I came across this blog and ignored it due to that.
>> >>
>> >> Any other thoughts?  The linked tickets in:
>> >> https://issues.apache.org/jira/browse/SPARK-10063
>> >> https://issues.apache.org/jira/browse/HADOOP-13786
>> >> https://issues.apache.org/jira/browse/HADOOP-9565 look relevant too.
>> >>
>> >> On 10 May 2017 at 22:24, Miguel Morales 
>> >> wrote:
>> >>>
>> >>> Try using the DirectParquetOutputCommiter:
>> >>> http://dev.sortable.com/spark-directparquetoutputcommitter/
>> >>>
>> >>> On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com
>> >>>  wrote:
>> >>> > Hi users, we have a bunch of pyspark jobs that are using S3 for
>> >>> > loading
>> >>> > /
>> >>> > intermediate steps and final output of parquet files.
>> >>> >
>> >>> > We're running into the following issues on a semi regular basis:
>> >>> > * These are intermittent errors, IE we have about 300 jobs that run
>> >>> > nightly... And a fairly random but small-ish percentage of them fail
>> >>> > with
>> >>> > the following classes of errors.
>> >>> >
>> >>> > S3 write errors
>> >>> >
>> >>> >> "ERROR Utils: Aborting task
>> >>> >> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code:
>> >>> >> 404,
>> >>> >> AWS
>> >>> >> Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null,
>> >>> >> AWS
>> >>> >> Error
>> >>> >> Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
>> >>> >
>> >>> >
>> >>> >>
>> >>> >> "Py4JJavaError: An error occurred while calling o43.parquet.
>> >>> >> : com.amazonaws.services.s3.model.MultiObjectDeleteException:
>> >>> >> Status
>> >>> >> Code:
>> >>> >> 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null,
>> >>> >> AWS
>> >>> >> Error
>> >>> >> Message: One or more objects could not be deleted, S3 Extended
>> >>> >> Request
>> >>> >> ID:
>> >>> >> null"
>> >>> >
>> >>> >
>> >>> >
>> >>> > S3 Read Errors:
>> >>> >
>> >>> >> [Stage 1:=>
>> >>> >> (27
>> >>> >> + 4)
>> >>> >> / 31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in
>> >>> >> stage
>> >>> >> 1.0
>> >>> >> (TID 11)
>> >>> >> java.net.SocketException: Connection reset
>> >>> >> at java.net.SocketInputStream.read(SocketInputStream.java:196)
>> >>> >> at java.net.SocketInputStream.read(SocketInputStream.java:122)
>> >>> >> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
>> >>> >> at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
>> >>> >> at sun.security.ssl.InputRecord.read(InputRecord.java:509)
>> >>> >> at
>> >>> >> sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
>> >>> >> at
>> >>> >>
>> >>> >> sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
>> >>> >> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
>> >>> >> at
>> >>> >>
>> >>> >>
>> >>> >> org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
>> >>> >> at
>> >>> >>
>> >>> >>
>> >>> >> org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
>> >>> >> at
>> >>> >>
>> >>> >>
>> >>> >> org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
>> >>> >> at
>> >>> >>
>> >>> >>
>> >>> >> org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
>> >>> >> at
>> >>> >>

Re: Spark <--> S3 flakiness

2017-05-12 Thread Gene Pang
Hi,

Yes, you can use Alluxio with Spark to read/write to S3. Here is a blog post
on Spark + Alluxio + S3, and here is some documentation for configuring
Alluxio + S3 and configuring Spark + Alluxio.

You mentioned that it required a lot of effort to get working. May I ask
what you ran into, and how you got it to work?

Thanks,
Gene

On Thu, May 11, 2017 at 11:55 AM, Miguel Morales 
wrote:

> Might want to try to use gzip as opposed to parquet.  The only way i
> ever reliably got parquet to work on S3 is by using Alluxio as a
> buffer, but it's a decent amount of work.
>
> On Thu, May 11, 2017 at 11:50 AM, lucas.g...@gmail.com
>  wrote:
> > Also, and this is unrelated to the actual question... Why don't these
> > messages show up in the archive?
> >
> > http://apache-spark-user-list.1001560.n3.nabble.com/
> >
> > Ideally I'd want to post a link to our internal wiki for these questions,
> > but can't find them in the archive.
> >
> > On 11 May 2017 at 07:16, lucas.g...@gmail.com 
> wrote:
> >>
> >> Looks like this isn't viable in spark 2.0.0 (and greater I presume).
> I'm
> >> pretty sure I came across this blog and ignored it due to that.
> >>
> >> Any other thoughts?  The linked tickets in:
> >> https://issues.apache.org/jira/browse/SPARK-10063
> >> https://issues.apache.org/jira/browse/HADOOP-13786
> >> https://issues.apache.org/jira/browse/HADOOP-9565 look relevant too.
> >>
> >> On 10 May 2017 at 22:24, Miguel Morales 
> wrote:
> >>>
> >>> Try using the DirectParquetOutputCommiter:
> >>> http://dev.sortable.com/spark-directparquetoutputcommitter/
> >>>
> >>> On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com
> >>>  wrote:
> >>> > Hi users, we have a bunch of pyspark jobs that are using S3 for
> loading
> >>> > /
> >>> > intermediate steps and final output of parquet files.
> >>> >
> >>> > We're running into the following issues on a semi regular basis:
> >>> > * These are intermittent errors, IE we have about 300 jobs that run
> >>> > nightly... And a fairly random but small-ish percentage of them fail
> >>> > with
> >>> > the following classes of errors.
> >>> >
> >>> > S3 write errors
> >>> >
> >>> >> "ERROR Utils: Aborting task
> >>> >> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code:
> 404,
> >>> >> AWS
> >>> >> Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS
> >>> >> Error
> >>> >> Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
> >>> >
> >>> >
> >>> >>
> >>> >> "Py4JJavaError: An error occurred while calling o43.parquet.
> >>> >> : com.amazonaws.services.s3.model.MultiObjectDeleteException:
> Status
> >>> >> Code:
> >>> >> 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null,
> AWS
> >>> >> Error
> >>> >> Message: One or more objects could not be deleted, S3 Extended
> Request
> >>> >> ID:
> >>> >> null"
> >>> >
> >>> >
> >>> >
> >>> > S3 Read Errors:
> >>> >
> >>> >> [Stage 1:=>
>  (27
> >>> >> + 4)
> >>> >> / 31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in
> stage
> >>> >> 1.0
> >>> >> (TID 11)
> >>> >> java.net.SocketException: Connection reset
> >>> >> at java.net.SocketInputStream.read(SocketInputStream.java:196)
> >>> >> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> >>> >> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
> >>> >> at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
> >>> >> at sun.security.ssl.InputRecord.read(InputRecord.java:509)
> >>> >> at sun.security.ssl.SSLSocketImpl.readRecord(
> SSLSocketImpl.java:927)
> >>> >> at
> >>> >> sun.security.ssl.SSLSocketImpl.readDataRecord(
> SSLSocketImpl.java:884)
> >>> >> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
> >>> >> at
> >>> >>
> >>> >> org.apache.http.impl.io.AbstractSessionInputBuffer.read(
> AbstractSessionInputBuffer.java:198)
> >>> >> at
> >>> >>
> >>> >> org.apache.http.impl.io.ContentLengthInputStream.read(
> ContentLengthInputStream.java:178)
> >>> >> at
> >>> >>
> >>> >> org.apache.http.impl.io.ContentLengthInputStream.read(
> ContentLengthInputStream.java:200)
> >>> >> at
> >>> >>
> >>> >> org.apache.http.impl.io.ContentLengthInputStream.close(
> ContentLengthInputStream.java:103)
> >>> >> at
> >>> >>
> >>> >> org.apache.http.conn.BasicManagedEntity.streamClosed(
> BasicManagedEntity.java:168)
> >>> >> at
> >>> >>
> >>> >> org.apache.http.conn.EofSensorInputStream.checkClose(
> EofSensorInputStream.java:228)
> >>> >> at
> >>> >>
> >>> >> org.apache.http.conn.EofSensorInputStream.close(
> EofSensorInputStream.java:174)
> >>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> 

Re: Spark <--> S3 flakiness

2017-05-11 Thread Vadim Semenov
Use the official mailing list archive

http://mail-archives.apache.org/mod_mbox/spark-user/201705.mbox/%3ccajyeq0gh1fbhbajb9gghognhqouogydba28lnn262hfzzgf...@mail.gmail.com%3e

On Thu, May 11, 2017 at 2:50 PM, lucas.g...@gmail.com 
wrote:

> Also, and this is unrelated to the actual question... Why don't these
> messages show up in the archive?
>
> http://apache-spark-user-list.1001560.n3.nabble.com/
>
> Ideally I'd want to post a link to our internal wiki for these questions,
> but can't find them in the archive.
>
> On 11 May 2017 at 07:16, lucas.g...@gmail.com 
> wrote:
>
>> Looks like this isn't viable in spark 2.0.0 (and greater I presume).  I'm
>> pretty sure I came across this blog and ignored it due to that.
>>
>> Any other thoughts?  The linked tickets in:
>> https://issues.apache.org/jira/browse/SPARK-10063
>> https://issues.apache.org/jira/browse/HADOOP-13786
>> https://issues.apache.org/jira/browse/HADOOP-9565 look relevant too.
>>
>> On 10 May 2017 at 22:24, Miguel Morales  wrote:
>>
>>> Try using the DirectParquetOutputCommiter:
>>> http://dev.sortable.com/spark-directparquetoutputcommitter/
>>>
>>> On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com
>>>  wrote:
>>> > Hi users, we have a bunch of pyspark jobs that are using S3 for
>>> loading /
>>> > intermediate steps and final output of parquet files.
>>> >
>>> > We're running into the following issues on a semi regular basis:
>>> > * These are intermittent errors, IE we have about 300 jobs that run
>>> > nightly... And a fairly random but small-ish percentage of them fail
>>> with
>>> > the following classes of errors.
>>> >
>>> > S3 write errors
>>> >
>>> >> "ERROR Utils: Aborting task
>>> >> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404,
>>> AWS
>>> >> Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS
>>> Error
>>> >> Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
>>> >
>>> >
>>> >>
>>> >> "Py4JJavaError: An error occurred while calling o43.parquet.
>>> >> : com.amazonaws.services.s3.model.MultiObjectDeleteException: Status
>>> Code:
>>> >> 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS
>>> Error
>>> >> Message: One or more objects could not be deleted, S3 Extended
>>> Request ID:
>>> >> null"
>>> >
>>> >
>>> >
>>> > S3 Read Errors:
>>> >
>>> >> [Stage 1:=>
>>>  (27 + 4)
>>> >> / 31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in
>>> stage 1.0
>>> >> (TID 11)
>>> >> java.net.SocketException: Connection reset
>>> >> at java.net.SocketInputStream.read(SocketInputStream.java:196)
>>> >> at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>> >> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
>>> >> at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
>>> >> at sun.security.ssl.InputRecord.read(InputRecord.java:509)
>>> >> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
>>> >> at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.
>>> java:884)
>>> >> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
>>> >> at
>>> >> org.apache.http.impl.io.AbstractSessionInputBuffer.read(Abst
>>> ractSessionInputBuffer.java:198)
>>> >> at
>>> >> org.apache.http.impl.io.ContentLengthInputStream.read(Conten
>>> tLengthInputStream.java:178)
>>> >> at
>>> >> org.apache.http.impl.io.ContentLengthInputStream.read(Conten
>>> tLengthInputStream.java:200)
>>> >> at
>>> >> org.apache.http.impl.io.ContentLengthInputStream.close(Conte
>>> ntLengthInputStream.java:103)
>>> >> at
>>> >> org.apache.http.conn.BasicManagedEntity.streamClosed(BasicMa
>>> nagedEntity.java:168)
>>> >> at
>>> >> org.apache.http.conn.EofSensorInputStream.checkClose(EofSens
>>> orInputStream.java:228)
>>> >> at
>>> >> org.apache.http.conn.EofSensorInputStream.close(EofSensorInp
>>> utStream.java:174)
>>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> >> at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
>>> >> at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream
>>> .java:187)
>>> >
>>> >
>>> >
>>> > We have literally tons of logs we can add but it would make the email
>>> > unwieldy big.  If it would be helpful I'll drop them in a pastebin or
>>> > something.
>>> >
>>> > Our config is along the lines of:
>>> >
>>> > spark-2.1.0-bin-hadoop2.7
>>> > '--packages
>>> > com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
>>> > pyspark-shell'
>>> >
>>> > Given the stack overflow / googling I've been doing I know we're not
>>> the
>>> > only org with these issues but I haven't found a good set of solutions
>>> in
>>> 

Re: Spark <--> S3 flakiness

2017-05-11 Thread Miguel Morales
Might want to try to use gzip as opposed to parquet.  The only way i
ever reliably got parquet to work on S3 is by using Alluxio as a
buffer, but it's a decent amount of work.

On Thu, May 11, 2017 at 11:50 AM, lucas.g...@gmail.com
 wrote:
> Also, and this is unrelated to the actual question... Why don't these
> messages show up in the archive?
>
> http://apache-spark-user-list.1001560.n3.nabble.com/
>
> Ideally I'd want to post a link to our internal wiki for these questions,
> but can't find them in the archive.
>
> On 11 May 2017 at 07:16, lucas.g...@gmail.com  wrote:
>>
>> Looks like this isn't viable in spark 2.0.0 (and greater I presume).  I'm
>> pretty sure I came across this blog and ignored it due to that.
>>
>> Any other thoughts?  The linked tickets in:
>> https://issues.apache.org/jira/browse/SPARK-10063
>> https://issues.apache.org/jira/browse/HADOOP-13786
>> https://issues.apache.org/jira/browse/HADOOP-9565 look relevant too.
>>
>> On 10 May 2017 at 22:24, Miguel Morales  wrote:
>>>
>>> Try using the DirectParquetOutputCommiter:
>>> http://dev.sortable.com/spark-directparquetoutputcommitter/
>>>
>>> On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com
>>>  wrote:
>>> > Hi users, we have a bunch of pyspark jobs that are using S3 for loading
>>> > /
>>> > intermediate steps and final output of parquet files.
>>> >
>>> > We're running into the following issues on a semi regular basis:
>>> > * These are intermittent errors, IE we have about 300 jobs that run
>>> > nightly... And a fairly random but small-ish percentage of them fail
>>> > with
>>> > the following classes of errors.
>>> >
>>> > S3 write errors
>>> >
>>> >> "ERROR Utils: Aborting task
>>> >> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404,
>>> >> AWS
>>> >> Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS
>>> >> Error
>>> >> Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
>>> >
>>> >
>>> >>
>>> >> "Py4JJavaError: An error occurred while calling o43.parquet.
>>> >> : com.amazonaws.services.s3.model.MultiObjectDeleteException: Status
>>> >> Code:
>>> >> 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS
>>> >> Error
>>> >> Message: One or more objects could not be deleted, S3 Extended Request
>>> >> ID:
>>> >> null"
>>> >
>>> >
>>> >
>>> > S3 Read Errors:
>>> >
>>> >> [Stage 1:=>   (27
>>> >> + 4)
>>> >> / 31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage
>>> >> 1.0
>>> >> (TID 11)
>>> >> java.net.SocketException: Connection reset
>>> >> at java.net.SocketInputStream.read(SocketInputStream.java:196)
>>> >> at java.net.SocketInputStream.read(SocketInputStream.java:122)
>>> >> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
>>> >> at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
>>> >> at sun.security.ssl.InputRecord.read(InputRecord.java:509)
>>> >> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
>>> >> at
>>> >> sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
>>> >> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
>>> >> at
>>> >>
>>> >> org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
>>> >> at
>>> >>
>>> >> org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
>>> >> at
>>> >>
>>> >> org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
>>> >> at
>>> >>
>>> >> org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
>>> >> at
>>> >>
>>> >> org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:168)
>>> >> at
>>> >>
>>> >> org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
>>> >> at
>>> >>
>>> >> org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
>>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>>> >> at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
>>> >> at
>>> >> org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:187)
>>> >
>>> >
>>> >
>>> > We have literally tons of logs we can add but it would make the email
>>> > unwieldy big.  If it would be helpful I'll drop them in a pastebin or
>>> > something.
>>> >
>>> > Our config is along the lines of:
>>> >
>>> > spark-2.1.0-bin-hadoop2.7
>>> > '--packages
>>> > com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
>>> > pyspark-shell'
>>> >
>>> > Given the stack overflow / googling I've been doing I know we're not
>>> > the
>>> > only org with these issues but I 

Re: Spark <--> S3 flakiness

2017-05-11 Thread lucas.g...@gmail.com
Also, and this is unrelated to the actual question... Why don't these
messages show up in the archive?

http://apache-spark-user-list.1001560.n3.nabble.com/

Ideally I'd want to post a link to our internal wiki for these questions,
but can't find them in the archive.

On 11 May 2017 at 07:16, lucas.g...@gmail.com  wrote:

> Looks like this isn't viable in spark 2.0.0 (and greater I presume).  I'm
> pretty sure I came across this blog and ignored it due to that.
>
> Any other thoughts?  The linked tickets in: https://issues.apache.org/
> jira/browse/SPARK-10063 https://issues.apache.org/jira/browse/HADOOP-13786
>  https://issues.apache.org/jira/browse/HADOOP-9565 look relevant too.
>
> On 10 May 2017 at 22:24, Miguel Morales  wrote:
>
>> Try using the DirectParquetOutputCommiter:
>> http://dev.sortable.com/spark-directparquetoutputcommitter/
>>
>> On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com
>>  wrote:
>> > Hi users, we have a bunch of pyspark jobs that are using S3 for loading
>> /
>> > intermediate steps and final output of parquet files.
>> >
>> > We're running into the following issues on a semi regular basis:
>> > * These are intermittent errors, IE we have about 300 jobs that run
>> > nightly... And a fairly random but small-ish percentage of them fail
>> with
>> > the following classes of errors.
>> >
>> > S3 write errors
>> >
>> >> "ERROR Utils: Aborting task
>> >> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404,
>> AWS
>> >> Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS
>> Error
>> >> Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
>> >
>> >
>> >>
>> >> "Py4JJavaError: An error occurred while calling o43.parquet.
>> >> : com.amazonaws.services.s3.model.MultiObjectDeleteException: Status
>> Code:
>> >> 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS
>> Error
>> >> Message: One or more objects could not be deleted, S3 Extended Request
>> ID:
>> >> null"
>> >
>> >
>> >
>> > S3 Read Errors:
>> >
>> >> [Stage 1:=>   (27
>> + 4)
>> >> / 31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage
>> 1.0
>> >> (TID 11)
>> >> java.net.SocketException: Connection reset
>> >> at java.net.SocketInputStream.read(SocketInputStream.java:196)
>> >> at java.net.SocketInputStream.read(SocketInputStream.java:122)
>> >> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
>> >> at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
>> >> at sun.security.ssl.InputRecord.read(InputRecord.java:509)
>> >> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
>> >> at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.
>> java:884)
>> >> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
>> >> at
>> >> org.apache.http.impl.io.AbstractSessionInputBuffer.read(Abst
>> ractSessionInputBuffer.java:198)
>> >> at
>> >> org.apache.http.impl.io.ContentLengthInputStream.read(Conten
>> tLengthInputStream.java:178)
>> >> at
>> >> org.apache.http.impl.io.ContentLengthInputStream.read(Conten
>> tLengthInputStream.java:200)
>> >> at
>> >> org.apache.http.impl.io.ContentLengthInputStream.close(Conte
>> ntLengthInputStream.java:103)
>> >> at
>> >> org.apache.http.conn.BasicManagedEntity.streamClosed(BasicMa
>> nagedEntity.java:168)
>> >> at
>> >> org.apache.http.conn.EofSensorInputStream.checkClose(EofSens
>> orInputStream.java:228)
>> >> at
>> >> org.apache.http.conn.EofSensorInputStream.close(EofSensorInp
>> utStream.java:174)
>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> >> at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
>> >> at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream
>> .java:187)
>> >
>> >
>> >
>> > We have literally tons of logs we can add but it would make the email
>> > unwieldy big.  If it would be helpful I'll drop them in a pastebin or
>> > something.
>> >
>> > Our config is along the lines of:
>> >
>> > spark-2.1.0-bin-hadoop2.7
>> > '--packages
>> > com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
>> > pyspark-shell'
>> >
>> > Given the stack overflow / googling I've been doing I know we're not the
>> > only org with these issues but I haven't found a good set of solutions
>> in
>> > those spaces yet.
>> >
>> > Thanks!
>> >
>> > Gary Lucas
>>
>
>


Re: Spark <--> S3 flakiness

2017-05-11 Thread lucas.g...@gmail.com
Looks like this isn't viable in spark 2.0.0 (and greater I presume).  I'm
pretty sure I came across this blog and ignored it due to that.

Any other thoughts?  The linked tickets in:
https://issues.apache.org/jira/browse/SPARK-10063
https://issues.apache.org/jira/browse/HADOOP-13786
https://issues.apache.org/jira/browse/HADOOP-9565 look relevant too.
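An editorial aside, not something suggested in the thread: one mitigation that shows up around those tickets is switching Hadoop's FileOutputCommitter to its v2 algorithm, which moves task output into the destination at task commit instead of renaming everything again at job commit. It shortens, but does not remove, the window where S3's eventual consistency bites. A minimal sketch, assuming Hadoop 2.7+ (older releases ignore the property):

  // Scala equivalent of what the pyspark jobs would set; can also be passed as
  //   --conf spark.hadoop.mapreduce.fileoutputcommitter.algorithm.version=2
  spark.sparkContext.hadoopConfiguration
    .set("mapreduce.fileoutputcommitter.algorithm.version", "2")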

On 10 May 2017 at 22:24, Miguel Morales  wrote:

> Try using the DirectParquetOutputCommiter:
> http://dev.sortable.com/spark-directparquetoutputcommitter/
>
> On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com
>  wrote:
> > Hi users, we have a bunch of pyspark jobs that are using S3 for loading /
> > intermediate steps and final output of parquet files.
> >
> > We're running into the following issues on a semi regular basis:
> > * These are intermittent errors, IE we have about 300 jobs that run
> > nightly... And a fairly random but small-ish percentage of them fail with
> > the following classes of errors.
> >
> > S3 write errors
> >
> >> "ERROR Utils: Aborting task
> >> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404,
> AWS
> >> Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS
> Error
> >> Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
> >
> >
> >>
> >> "Py4JJavaError: An error occurred while calling o43.parquet.
> >> : com.amazonaws.services.s3.model.MultiObjectDeleteException: Status
> Code:
> >> 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS
> Error
> >> Message: One or more objects could not be deleted, S3 Extended Request
> ID:
> >> null"
> >
> >
> >
> > S3 Read Errors:
> >
> >> [Stage 1:=>   (27
> + 4)
> >> / 31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage
> 1.0
> >> (TID 11)
> >> java.net.SocketException: Connection reset
> >> at java.net.SocketInputStream.read(SocketInputStream.java:196)
> >> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> >> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
> >> at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
> >> at sun.security.ssl.InputRecord.read(InputRecord.java:509)
> >> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
> >> at sun.security.ssl.SSLSocketImpl.readDataRecord(
> SSLSocketImpl.java:884)
> >> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
> >> at
> >> org.apache.http.impl.io.AbstractSessionInputBuffer.read(
> AbstractSessionInputBuffer.java:198)
> >> at
> >> org.apache.http.impl.io.ContentLengthInputStream.read(
> ContentLengthInputStream.java:178)
> >> at
> >> org.apache.http.impl.io.ContentLengthInputStream.read(
> ContentLengthInputStream.java:200)
> >> at
> >> org.apache.http.impl.io.ContentLengthInputStream.close(
> ContentLengthInputStream.java:103)
> >> at
> >> org.apache.http.conn.BasicManagedEntity.streamClosed(
> BasicManagedEntity.java:168)
> >> at
> >> org.apache.http.conn.EofSensorInputStream.checkClose(
> EofSensorInputStream.java:228)
> >> at
> >> org.apache.http.conn.EofSensorInputStream.close(
> EofSensorInputStream.java:174)
> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> >> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> >> at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
> >> at org.apache.hadoop.fs.s3a.S3AInputStream.close(
> S3AInputStream.java:187)
> >
> >
> >
> > We have literally tons of logs we can add but it would make the email
> > unwieldy big.  If it would be helpful I'll drop them in a pastebin or
> > something.
> >
> > Our config is along the lines of:
> >
> > spark-2.1.0-bin-hadoop2.7
> > '--packages
> > com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
> > pyspark-shell'
> >
> > Given the stack overflow / googling I've been doing I know we're not the
> > only org with these issues but I haven't found a good set of solutions in
> > those spaces yet.
> >
> > Thanks!
> >
> > Gary Lucas
>


Re: Spark <--> S3 flakiness

2017-05-10 Thread Miguel Morales
Try using the DirectParquetOutputCommitter:
http://dev.sortable.com/spark-directparquetoutputcommitter/
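For readers still on Spark 1.x, the configuration that blog post describes looks roughly like the sketch below. It does not apply to the Spark 2.1.0 setup in this thread: the class was removed in 2.0 (SPARK-10063, linked elsewhere in the thread), and the fully-qualified name shown here is the 1.6.x location, which differs from earlier 1.x releases, so treat it as an assumption to verify:

  // spark-shell on Spark 1.x, where sqlContext is predefined.
  // Direct commit writes output in place and skips the rename-based commit,
  // at the cost of weaker guarantees when tasks fail or speculate.
  sqlContext.setConf(
    "spark.sql.parquet.output.committer.class",
    "org.apache.spark.sql.execution.datasources.parquet.DirectParquetOutputCommitter")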

On Wed, May 10, 2017 at 10:07 PM, lucas.g...@gmail.com
 wrote:
> Hi users, we have a bunch of pyspark jobs that are using S3 for loading /
> intermediate steps and final output of parquet files.
>
> We're running into the following issues on a semi regular basis:
> * These are intermittent errors, IE we have about 300 jobs that run
> nightly... And a fairly random but small-ish percentage of them fail with
> the following classes of errors.
>
> S3 write errors
>
>> "ERROR Utils: Aborting task
>> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404, AWS
>> Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS Error
>> Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
>
>
>>
>> "Py4JJavaError: An error occurred while calling o43.parquet.
>> : com.amazonaws.services.s3.model.MultiObjectDeleteException: Status Code:
>> 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS Error
>> Message: One or more objects could not be deleted, S3 Extended Request ID:
>> null"
>
>
>
> S3 Read Errors:
>
>> [Stage 1:=>   (27 + 4)
>> / 31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage 1.0
>> (TID 11)
>> java.net.SocketException: Connection reset
>> at java.net.SocketInputStream.read(SocketInputStream.java:196)
>> at java.net.SocketInputStream.read(SocketInputStream.java:122)
>> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
>> at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
>> at sun.security.ssl.InputRecord.read(InputRecord.java:509)
>> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
>> at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
>> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
>> at
>> org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
>> at
>> org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
>> at
>> org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
>> at
>> org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
>> at
>> org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:168)
>> at
>> org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
>> at
>> org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> at java.io.FilterInputStream.close(FilterInputStream.java:181)
>> at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
>> at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:187)
>
>
>
> We have literally tons of logs we can add but it would make the email
> unwieldy big.  If it would be helpful I'll drop them in a pastebin or
> something.
>
> Our config is along the lines of:
>
> spark-2.1.0-bin-hadoop2.7
> '--packages
> com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
> pyspark-shell'
>
> Given the stack overflow / googling I've been doing I know we're not the
> only org with these issues but I haven't found a good set of solutions in
> those spaces yet.
>
> Thanks!
>
> Gary Lucas




Spark <--> S3 flakiness

2017-05-10 Thread lucas.g...@gmail.com
Hi users, we have a bunch of pyspark jobs that are using S3 for loading /
intermediate steps and final output of parquet files.

We're running into the following issues on a semi regular basis:
* These are intermittent errors, IE we have about 300 jobs that run
nightly... And a fairly random but small-ish percentage of them fail with
the following classes of errors.


*S3 write errors*

> "ERROR Utils: Aborting task
> com.amazonaws.services.s3.model.AmazonS3Exception: Status Code: 404, AWS
> Service: Amazon S3, AWS Request ID: 2D3RP, AWS Error Code: null, AWS Error
> Message: Not Found, S3 Extended Request ID: BlaBlahEtc="
>


> "Py4JJavaError: An error occurred while calling o43.parquet.
> : com.amazonaws.services.s3.model.MultiObjectDeleteException: Status Code:
> 0, AWS Service: null, AWS Request ID: null, AWS Error Code: null, AWS Error
> Message: One or more objects could not be deleted, S3 Extended Request ID:
> null"




*S3 Read Errors:*

> [Stage 1:=>   (27 + 4)
> / 31]17/05/10 16:25:23 ERROR Executor: Exception in task 10.0 in stage 1.0
> (TID 11)
> java.net.SocketException: Connection reset
> at java.net.SocketInputStream.read(SocketInputStream.java:196)
> at java.net.SocketInputStream.read(SocketInputStream.java:122)
> at sun.security.ssl.InputRecord.readFully(InputRecord.java:442)
> at sun.security.ssl.InputRecord.readV3Record(InputRecord.java:554)
> at sun.security.ssl.InputRecord.read(InputRecord.java:509)
> at sun.security.ssl.SSLSocketImpl.readRecord(SSLSocketImpl.java:927)
> at sun.security.ssl.SSLSocketImpl.readDataRecord(SSLSocketImpl.java:884)
> at sun.security.ssl.AppInputStream.read(AppInputStream.java:102)
> at
> org.apache.http.impl.io.AbstractSessionInputBuffer.read(AbstractSessionInputBuffer.java:198)
> at
> org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:178)
> at
> org.apache.http.impl.io.ContentLengthInputStream.read(ContentLengthInputStream.java:200)
> at
> org.apache.http.impl.io.ContentLengthInputStream.close(ContentLengthInputStream.java:103)
> at
> org.apache.http.conn.BasicManagedEntity.streamClosed(BasicManagedEntity.java:168)
> at
> org.apache.http.conn.EofSensorInputStream.checkClose(EofSensorInputStream.java:228)
> at
> org.apache.http.conn.EofSensorInputStream.close(EofSensorInputStream.java:174)
> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> at java.io.FilterInputStream.close(FilterInputStream.java:181)
> at com.amazonaws.services.s3.model.S3Object.close(S3Object.java:203)
> at org.apache.hadoop.fs.s3a.S3AInputStream.close(S3AInputStream.java:187)



We have literally tons of logs we can add, but they would make this email
unwieldy. If it would be helpful I'll drop them in a pastebin or something.

Our config is along the lines of:

   - spark-2.1.0-bin-hadoop2.7
   - '--packages
   com.amazonaws:aws-java-sdk:1.10.34,org.apache.hadoop:hadoop-aws:2.6.0
   pyspark-shell'
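As an editorial aside (not something raised in the thread): for transient read-side failures like the connection reset above, the S3A client's retry and connection settings are the usual first knobs; they do not address the commit-time 404 / MultiObjectDelete errors. Property names come from the Hadoop S3A filesystem, and the values are only illustrative:

  // Scala sketch; a pyspark job would set the same Hadoop properties.
  val hc = spark.sparkContext.hadoopConfiguration
  hc.set("fs.s3a.attempts.maximum", "20")               // AWS SDK retries per request
  hc.set("fs.s3a.connection.maximum", "100")            // HTTP connection pool size
  hc.set("fs.s3a.connection.timeout", "200000")         // socket timeout, milliseconds
  hc.set("fs.s3a.connection.establish.timeout", "5000") // connect timeout, milliseconds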

Given the stack overflow / googling I've been doing I know we're not the
only org with these issues but I haven't found a good set of solutions in
those spaces yet.

Thanks!

Gary Lucas


Re: Spark S3

2016-10-11 Thread Abhinay Mehta
Hi Selvam,

Is your 35GB parquet file split up into multiple S3 objects or just one big
Parquet file?

If it's just one big file, then I believe only one executor will be able to
work on it until some job action partitions the data into smaller chunks.
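On the note in the original question about inspecting partitions: the split count Spark chose is visible through the DataFrame's underlying RDD. A quick spark-shell sketch (the bucket path is a placeholder; use sqlContext.read on 1.x):

  val df = spark.read.parquet("s3a://some-bucket/some-prefix/")
  df.rdd.getNumPartitions            // how many read tasks Spark will schedule
  val wider = df.repartition(200)    // explicit shuffle if a different count is needed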



On 11 October 2016 at 06:03, Selvam Raman  wrote:

> I mentioned parquet as input format.
> On Oct 10, 2016 11:06 PM, "ayan guha"  wrote:
>
>> It really depends on the input format used.
>> On 11 Oct 2016 08:46, "Selvam Raman"  wrote:
>>
>>> Hi,
>>>
>>> How spark reads data from s3 and runs parallel task.
>>>
>>> Assume I have a s3 bucket size of 35 GB( parquet file).
>>>
>>> How the sparksession will read the data and process the data parallel.
>>> How it splits the s3 data and assign to each executor task.
>>>
>>> ​Please share me your points.
>>>
>>> Note:
>>> if we have RDD , then we can look at the partitions.size or length to
>>> check how many partition for a file. But how this will be accomplished in
>>> terms of S3 bucket.​
>>>
>>> --
>>> Selvam Raman
>>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>>
>>


Re: Spark S3

2016-10-10 Thread Selvam Raman
I mentioned parquet as input format.
On Oct 10, 2016 11:06 PM, "ayan guha"  wrote:

> It really depends on the input format used.
> On 11 Oct 2016 08:46, "Selvam Raman"  wrote:
>
>> Hi,
>>
>> How spark reads data from s3 and runs parallel task.
>>
>> Assume I have a s3 bucket size of 35 GB( parquet file).
>>
>> How the sparksession will read the data and process the data parallel.
>> How it splits the s3 data and assign to each executor task.
>>
>> ​Please share me your points.
>>
>> Note:
>> if we have RDD , then we can look at the partitions.size or length to
>> check how many partition for a file. But how this will be accomplished in
>> terms of S3 bucket.​
>>
>> --
>> Selvam Raman
>> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>>
>


Re: Spark S3

2016-10-10 Thread ayan guha
It really depends on the input format used.
On 11 Oct 2016 08:46, "Selvam Raman"  wrote:

> Hi,
>
> How spark reads data from s3 and runs parallel task.
>
> Assume I have a s3 bucket size of 35 GB( parquet file).
>
> How the sparksession will read the data and process the data parallel. How
> it splits the s3 data and assign to each executor task.
>
> ​Please share me your points.
>
> Note:
> if we have RDD , then we can look at the partitions.size or length to
> check how many partition for a file. But how this will be accomplished in
> terms of S3 bucket.​
>
> --
> Selvam Raman
> "லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"
>


Spark S3

2016-10-10 Thread Selvam Raman
Hi,

How does Spark read data from S3 and run tasks in parallel?

Assume I have an S3 bucket holding a 35 GB Parquet file.

How will the SparkSession read the data and process it in parallel? How does
it split the S3 data and assign it to each executor task?

Please share your thoughts.

Note:
If we have an RDD, we can look at partitions.size (or length) to check how
many partitions a file has. But how is this accomplished in terms of an S3
bucket?

-- 
Selvam Raman
"லஞ்சம் தவிர்த்து நெஞ்சம் நிமிர்த்து"


Re: Spark S3 Performance

2014-11-24 Thread Nitay Joffe
Andrei, Ashish,

To be clear, I don't think it's *counting* the entire file. It just seems,
from the logging and the timing, that it does a GET of the entire file,
then figures out it only needs certain blocks, and does another GET of
only the specific block.

Regarding # partitions - I think I see now that it has to do with Hadoop's
block size being set at 64MB. This is not a big deal to me; the main issue
is the first one: why is every worker doing a call to get the entire file,
followed by the *real* call to get only the specific partitions it needs?

Best,

- Nitay
Founder  CTO
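An aside on the 64MB figure above: for s3n:// URLs that split size comes from the fs.s3n.block.size property in Hadoop's core-default.xml, so raising it makes FileInputFormat produce fewer, larger splits. A minimal sketch with an illustrative value, set before the RDD is created:

  // Report 256MB "blocks" for s3n objects, so a ~1.2GB file yields ~5 splits
  // instead of 21.
  sc.hadoopConfiguration.set("fs.s3n.block.size", (256L * 1024 * 1024).toString)
  val rdd = sc.textFile("s3n://mybucket/myfile")
  rdd.partitions.size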


On Sat, Nov 22, 2014 at 8:28 PM, Andrei faithlessfri...@gmail.com wrote:

 Concerning your second question, I believe you try to set number of
 partitions with something like this:

 rdd = sc.textFile(..., 8)

 but things like `textFile()` don't actually take fixed number of
 partitions. Instead, they expect *minimal* number of partitions. Since in
 your file you have 21 blocks of data, it creates exactly 21 worker (which
 is greater than 8, as expected). To set exact number of partitions, use
 `repartition()` or its full version - `coalesce()` (see example [1])

 [1]:
 http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#coalesce



 On Sat, Nov 22, 2014 at 10:04 PM, Ashish Rangole arang...@gmail.com
 wrote:

 What makes you think that each executor is reading the whole file? If
 that is the case then the count value returned to the driver will be actual
 X NumOfExecutors. Is that the case when compared with actual lines in the
 input file? If the count returned is same as actual then you probably don't
 have an extra read problem.

 I also see this in your logs which indicates that it is a read that
 starts from an offset and reading one split size (64MB) worth of data:

 14/11/20 15:39:45 [Executor task launch worker-1 ] INFO HadoopRDD: Input
 split: s3n://mybucket/myfile:335544320+67108864
 On Nov 22, 2014 7:23 AM, Nitay Joffe ni...@actioniq.co wrote:

 Err I meant #1 :)

 - Nitay
 Founder  CTO


 On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe ni...@actioniq.co wrote:

 Anyone have any thoughts on this? Trying to understand especially #2 if
 it's a legit bug or something I'm doing wrong.

 - Nitay
 Founder  CTO


 On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe ni...@actioniq.co
 wrote:

 I have a simple S3 job to read a text file and do a line count.
 Specifically I'm doing *sc.textFile(s3n://mybucket/myfile).count*.The
 file is about 1.2GB. My setup is standalone spark cluster with 4 workers
 each with 2 cores / 16GB ram. I'm using branch-1.2 code built against
 hadoop 2.4 (though I'm not actually using HDFS, just straight S3 = 
 Spark).

 The whole count is taking on the order of a couple of minutes, which
 seems extremely slow.
 I've been looking into it and so far have noticed two things, hoping
 the community has seen this before and knows what to do...

 1) Every executor seems to make an S3 call to read the *entire file* 
 before
 making another call to read just it's split. Here's a paste I've cleaned 
 up
 to show just one task: http://goo.gl/XCfyZA. I've verified this
 happens in every task. It is taking a long time (40-50 seconds), I don't
 see why it is doing this?
 2) I've tried a few numPartitions parameters. When I make the
 parameter anything below 21 it seems to get ignored. Under the hood
 FileInputFormat is doing something that always ends up with at least 21
 partitions of ~64MB or so. I've also tried 40, 60, and 100 partitions and
 have seen that the performance only gets worse as I increase it beyond 21.
 I would like to try 8 just to see, but again I don't see how to force it 
 to
 go below 21.

 Thanks for the help,
 - Nitay
 Founder  CTO







Re: Spark S3 Performance

2014-11-24 Thread Daniil Osipov
Can you verify that its reading the entire file on each worker using
network monitoring stats? If it does, that would be a bug in my opinion.

On Mon, Nov 24, 2014 at 2:06 PM, Nitay Joffe ni...@actioniq.co wrote:

 Andrei, Ashish,

 To be clear, I don't think it's *counting* the entire file. It just seems
 from the logging and the timing that it is doing a get of the entire file,
 then figures out it only needs some certain blocks, does another get of
 only the specific block.

 Regarding # partitions - I think I see now it has to do with Hadoop's
 block size being set at 64MB. This is not a big deal to me, the main issue
 is the first one, why is every worker doing a call to get the entire file
 followed by the *real* call to get only the specific partitions it needs.

 Best,

 - Nitay
 Founder  CTO


 On Sat, Nov 22, 2014 at 8:28 PM, Andrei faithlessfri...@gmail.com wrote:

 Concerning your second question, I believe you try to set number of
 partitions with something like this:

 rdd = sc.textFile(..., 8)

 but things like `textFile()` don't actually take fixed number of
 partitions. Instead, they expect *minimal* number of partitions. Since
 in your file you have 21 blocks of data, it creates exactly 21 worker
 (which is greater than 8, as expected). To set exact number of partitions,
 use `repartition()` or its full version - `coalesce()` (see example [1])

 [1]:
 http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#coalesce



 On Sat, Nov 22, 2014 at 10:04 PM, Ashish Rangole arang...@gmail.com
 wrote:

 What makes you think that each executor is reading the whole file? If
 that is the case then the count value returned to the driver will be actual
 X NumOfExecutors. Is that the case when compared with actual lines in the
 input file? If the count returned is same as actual then you probably don't
 have an extra read problem.

 I also see this in your logs which indicates that it is a read that
 starts from an offset and reading one split size (64MB) worth of data:

 14/11/20 15:39:45 [Executor task launch worker-1 ] INFO HadoopRDD: Input
 split: s3n://mybucket/myfile:335544320+67108864
 On Nov 22, 2014 7:23 AM, Nitay Joffe ni...@actioniq.co wrote:

 Err I meant #1 :)

 - Nitay
 Founder  CTO


 On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe ni...@actioniq.co
 wrote:

 Anyone have any thoughts on this? Trying to understand especially #2
 if it's a legit bug or something I'm doing wrong.

 - Nitay
 Founder  CTO


 On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe ni...@actioniq.co
 wrote:

 I have a simple S3 job to read a text file and do a line count.
 Specifically I'm doing *sc.textFile(s3n://mybucket/myfile).count*.The
 file is about 1.2GB. My setup is standalone spark cluster with 4 workers
 each with 2 cores / 16GB ram. I'm using branch-1.2 code built against
 hadoop 2.4 (though I'm not actually using HDFS, just straight S3 = 
 Spark).

 The whole count is taking on the order of a couple of minutes, which
 seems extremely slow.
 I've been looking into it and so far have noticed two things, hoping
 the community has seen this before and knows what to do...

 1) Every executor seems to make an S3 call to read the *entire file* 
 before
 making another call to read just it's split. Here's a paste I've cleaned 
 up
 to show just one task: http://goo.gl/XCfyZA. I've verified this
 happens in every task. It is taking a long time (40-50 seconds), I don't
 see why it is doing this?
 2) I've tried a few numPartitions parameters. When I make the
 parameter anything below 21 it seems to get ignored. Under the hood
 FileInputFormat is doing something that always ends up with at least 21
 partitions of ~64MB or so. I've also tried 40, 60, and 100 partitions and
 have seen that the performance only gets worse as I increase it beyond 
 21.
 I would like to try 8 just to see, but again I don't see how to force it 
 to
 go below 21.

 Thanks for the help,
 - Nitay
 Founder  CTO








Re: Spark S3 Performance

2014-11-22 Thread Nitay Joffe
Anyone have any thoughts on this? Trying to understand especially #2 if
it's a legit bug or something I'm doing wrong.

- Nitay
Founder  CTO


On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe ni...@actioniq.co wrote:

 I have a simple S3 job to read a text file and do a line count.
 Specifically I'm doing *sc.textFile(s3n://mybucket/myfile).count*.The
 file is about 1.2GB. My setup is standalone spark cluster with 4 workers
 each with 2 cores / 16GB ram. I'm using branch-1.2 code built against
 hadoop 2.4 (though I'm not actually using HDFS, just straight S3 = Spark).

 The whole count is taking on the order of a couple of minutes, which seems
 extremely slow.
 I've been looking into it and so far have noticed two things, hoping the
 community has seen this before and knows what to do...

 1) Every executor seems to make an S3 call to read the *entire file* before
 making another call to read just it's split. Here's a paste I've cleaned up
 to show just one task: http://goo.gl/XCfyZA. I've verified this happens
 in every task. It is taking a long time (40-50 seconds), I don't see why it
 is doing this?
 2) I've tried a few numPartitions parameters. When I make the parameter
 anything below 21 it seems to get ignored. Under the hood FileInputFormat
 is doing something that always ends up with at least 21 partitions of ~64MB
 or so. I've also tried 40, 60, and 100 partitions and have seen that the
 performance only gets worse as I increase it beyond 21. I would like to try
 8 just to see, but again I don't see how to force it to go below 21.

 Thanks for the help,
 - Nitay
 Founder  CTO




Re: Spark S3 Performance

2014-11-22 Thread Nitay Joffe
Err I meant #1 :)

- Nitay
Founder  CTO


On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe ni...@actioniq.co wrote:

 Anyone have any thoughts on this? Trying to understand especially #2 if
 it's a legit bug or something I'm doing wrong.

 - Nitay
 Founder  CTO


 On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe ni...@actioniq.co wrote:

 I have a simple S3 job to read a text file and do a line count.
 Specifically I'm doing *sc.textFile(s3n://mybucket/myfile).count*.The
 file is about 1.2GB. My setup is standalone spark cluster with 4 workers
 each with 2 cores / 16GB ram. I'm using branch-1.2 code built against
 hadoop 2.4 (though I'm not actually using HDFS, just straight S3 = Spark).

 The whole count is taking on the order of a couple of minutes, which
 seems extremely slow.
 I've been looking into it and so far have noticed two things, hoping the
 community has seen this before and knows what to do...

 1) Every executor seems to make an S3 call to read the *entire file* before
 making another call to read just it's split. Here's a paste I've cleaned up
 to show just one task: http://goo.gl/XCfyZA. I've verified this happens
 in every task. It is taking a long time (40-50 seconds), I don't see why it
 is doing this?
 2) I've tried a few numPartitions parameters. When I make the parameter
 anything below 21 it seems to get ignored. Under the hood FileInputFormat
 is doing something that always ends up with at least 21 partitions of ~64MB
 or so. I've also tried 40, 60, and 100 partitions and have seen that the
 performance only gets worse as I increase it beyond 21. I would like to try
 8 just to see, but again I don't see how to force it to go below 21.

 Thanks for the help,
 - Nitay
 Founder  CTO





Re: Spark S3 Performance

2014-11-22 Thread Andrei
Not that I'm a professional user of Amazon services, but I have a guess about
your performance issues. From [1], there are two different filesystems over
S3:

 - native, which behaves just like regular files (scheme: s3n)
 - block-based, which looks more like HDFS (scheme: s3)

Since you use s3n in your URL, each Spark worker seems to treat the file
as an unsplittable piece of data and downloads it all (though, probably,
applying functions to specific regions only). If I understand it right,
using s3 instead will allow Spark workers to see the data as a sequence of
blocks and download each block separately.

But either way, using S3 implies loss of data locality, so data will be
transferred to workers instead of code being transferred to data. Given
a data size of 1.2GB, consider also storing the data in Hadoop's HDFS instead
of S3 (as far as I remember, Amazon allows using both at the same time).

Please, let us know if it works.


[1]: https://wiki.apache.org/hadoop/AmazonS3
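A minimal spark-shell sketch of the two schemes described above, using the legacy Hadoop credential properties (keys are placeholders). Note the block-based s3:// filesystem can only read data that was originally written through it, so it is not a drop-in swap for an existing bucket of plain objects:

  // Native filesystem: objects are ordinary files (what the thread uses).
  sc.hadoopConfiguration.set("fs.s3n.awsAccessKeyId", "<access-key>")
  sc.hadoopConfiguration.set("fs.s3n.awsSecretAccessKey", "<secret-key>")
  val nativeRdd = sc.textFile("s3n://mybucket/myfile")

  // Block-based filesystem: HDFS-style blocks stored inside the bucket.
  sc.hadoopConfiguration.set("fs.s3.awsAccessKeyId", "<access-key>")
  sc.hadoopConfiguration.set("fs.s3.awsSecretAccessKey", "<secret-key>")
  val blockRdd = sc.textFile("s3://mybucket/myfile")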

On Sat, Nov 22, 2014 at 6:21 PM, Nitay Joffe ni...@actioniq.co wrote:

 Err I meant #1 :)

 - Nitay
 Founder  CTO


 On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe ni...@actioniq.co wrote:

 Anyone have any thoughts on this? Trying to understand especially #2 if
 it's a legit bug or something I'm doing wrong.

 - Nitay
 Founder  CTO


 On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe ni...@actioniq.co wrote:

 I have a simple S3 job to read a text file and do a line count.
 Specifically I'm doing *sc.textFile(s3n://mybucket/myfile).count*.The
 file is about 1.2GB. My setup is standalone spark cluster with 4 workers
 each with 2 cores / 16GB ram. I'm using branch-1.2 code built against
 hadoop 2.4 (though I'm not actually using HDFS, just straight S3 = Spark).

 The whole count is taking on the order of a couple of minutes, which
 seems extremely slow.
 I've been looking into it and so far have noticed two things, hoping the
 community has seen this before and knows what to do...

 1) Every executor seems to make an S3 call to read the *entire file* before
 making another call to read just it's split. Here's a paste I've cleaned up
 to show just one task: http://goo.gl/XCfyZA. I've verified this happens
 in every task. It is taking a long time (40-50 seconds), I don't see why it
 is doing this?
 2) I've tried a few numPartitions parameters. When I make the parameter
 anything below 21 it seems to get ignored. Under the hood FileInputFormat
 is doing something that always ends up with at least 21 partitions of ~64MB
 or so. I've also tried 40, 60, and 100 partitions and have seen that the
 performance only gets worse as I increase it beyond 21. I would like to try
 8 just to see, but again I don't see how to force it to go below 21.

 Thanks for the help,
 - Nitay
 Founder  CTO






Re: Spark S3 Performance

2014-11-22 Thread Andrei
Concerning your second question, I believe you are trying to set the number
of partitions with something like this:

rdd = sc.textFile(..., 8)

but things like `textFile()` don't actually take a fixed number of
partitions. Instead, they expect a *minimal* number of partitions. Since
your file has 21 blocks of data, it creates exactly 21 partitions (which
is greater than 8, as expected). To set an exact number of partitions, use
`repartition()` or its more general form, `coalesce()` (see example [1]).

[1]:
http://homepage.cs.latrobe.edu.au/zhe/ZhenHeSparkRDDAPIExamples.html#coalesce
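A spark-shell sketch of that difference, using the 21-split file from this thread (partitions.size is the way to inspect the count on these Spark versions):

  val rdd = sc.textFile("s3n://mybucket/myfile", 8)  // 8 is only a *minimum* hint
  rdd.partitions.size                                // 21: one per ~64MB input split

  val eight = rdd.coalesce(8)                        // merge splits without a shuffle
  eight.partitions.size                              // 8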



On Sat, Nov 22, 2014 at 10:04 PM, Ashish Rangole arang...@gmail.com wrote:

 What makes you think that each executor is reading the whole file? If that
 is the case then the count value returned to the driver will be actual X
 NumOfExecutors. Is that the case when compared with actual lines in the
 input file? If the count returned is same as actual then you probably don't
 have an extra read problem.

 I also see this in your logs which indicates that it is a read that starts
 from an offset and reading one split size (64MB) worth of data:

 14/11/20 15:39:45 [Executor task launch worker-1 ] INFO HadoopRDD: Input
 split: s3n://mybucket/myfile:335544320+67108864
 On Nov 22, 2014 7:23 AM, Nitay Joffe ni...@actioniq.co wrote:

 Err I meant #1 :)

 - Nitay
 Founder  CTO


 On Sat, Nov 22, 2014 at 10:20 AM, Nitay Joffe ni...@actioniq.co wrote:

 Anyone have any thoughts on this? Trying to understand especially #2 if
 it's a legit bug or something I'm doing wrong.

 - Nitay
 Founder  CTO


 On Thu, Nov 20, 2014 at 11:54 AM, Nitay Joffe ni...@actioniq.co wrote:

 I have a simple S3 job to read a text file and do a line count.
 Specifically I'm doing *sc.textFile(s3n://mybucket/myfile).count*.The
 file is about 1.2GB. My setup is standalone spark cluster with 4 workers
 each with 2 cores / 16GB ram. I'm using branch-1.2 code built against
 hadoop 2.4 (though I'm not actually using HDFS, just straight S3 = Spark).

 The whole count is taking on the order of a couple of minutes, which
 seems extremely slow.
 I've been looking into it and so far have noticed two things, hoping
 the community has seen this before and knows what to do...

 1) Every executor seems to make an S3 call to read the *entire file* before
 making another call to read just it's split. Here's a paste I've cleaned up
 to show just one task: http://goo.gl/XCfyZA. I've verified this
 happens in every task. It is taking a long time (40-50 seconds), I don't
 see why it is doing this?
 2) I've tried a few numPartitions parameters. When I make the parameter
 anything below 21 it seems to get ignored. Under the hood FileInputFormat
 is doing something that always ends up with at least 21 partitions of ~64MB
 or so. I've also tried 40, 60, and 100 partitions and have seen that the
 performance only gets worse as I increase it beyond 21. I would like to try
 8 just to see, but again I don't see how to force it to go below 21.

 Thanks for the help,
 - Nitay
 Founder  CTO






Spark S3 Performance

2014-11-20 Thread Nitay Joffe
I have a simple S3 job to read a text file and do a line count.
Specifically I'm doing *sc.textFile("s3n://mybucket/myfile").count*. The
file is about 1.2GB. My setup is a standalone Spark cluster with 4 workers,
each with 2 cores / 16GB RAM. I'm using branch-1.2 code built against
Hadoop 2.4 (though I'm not actually using HDFS, just straight S3 => Spark).

The whole count is taking on the order of a couple of minutes, which seems
extremely slow.
I've been looking into it and so far have noticed two things, hoping the
community has seen this before and knows what to do...

1) Every executor seems to make an S3 call to read the *entire file* before
making another call to read just its split. Here's a paste I've cleaned up
to show just one task: http://goo.gl/XCfyZA. I've verified this happens in
every task. It takes a long time (40-50 seconds), and I don't see why it is
doing this.
2) I've tried a few numPartitions parameters. When I make the parameter
anything below 21 it seems to get ignored. Under the hood, FileInputFormat
is doing something that always ends up with at least 21 partitions of ~64MB
or so. I've also tried 40, 60, and 100 partitions and have seen that the
performance only gets worse as I increase it beyond 21. I would like to try
8 just to see, but again I don't see how to force it to go below 21.

Thanks for the help,
- Nitay
Founder  CTO


SPARK S3 LZO input; worker stuck

2014-07-13 Thread hassan
Hi

I'm trying to read LZO-compressed files from S3 using Spark. The LZO files
are not indexed. The Spark job starts to read the files just fine, but after a
while it just hangs. No network throughput. I have to restart the worker
process to get it back up. Any idea what could be causing this? We were
using uncompressed files before and that worked just fine; we went with
compression to reduce S3 storage.

Any help would be appreciated. 

Thanks!



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/SPARK-S3-LZO-input-worker-stuck-tp9584.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.


Spark S3 LZO input files

2014-07-03 Thread hassan
I'm trying to read input files from S3. The files are compressed using LZO;
i.e., from spark-shell,

sc.textFile("s3n://path/xx.lzo").first returns 'String = �LZO?'

Spark does not uncompress the data from the file. I am using Cloudera
Manager 5, with CDH 5.0.2. I've already installed the 'GPLEXTRAS' parcel and
have included 'opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/hadoop-lzo.jar'
and '/opt/cloudera/parcels/GPLEXTRAS/lib/hadoop/lib/native/' in
SPARK_CLASS_PATH. What am I missing?
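An editorial sketch of one common workaround when textFile won't decompress LZO: go through hadoop-lzo's input format explicitly instead of relying on codec auto-detection. The com.hadoop.mapreduce.LzoTextInputFormat class ships in the hadoop-lzo jar referenced above; whether this addresses the classpath issue here is an assumption, not a confirmed fix:

  import com.hadoop.mapreduce.LzoTextInputFormat   // from hadoop-lzo.jar
  import org.apache.hadoop.io.{LongWritable, Text}

  val lines = sc.newAPIHadoopFile(
      "s3n://path/xx.lzo",
      classOf[LzoTextInputFormat],
      classOf[LongWritable],
      classOf[Text])
    .map { case (_, line) => line.toString }

  lines.first()   // should now be decompressed text rather than the raw LZO header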



--
View this message in context: 
http://apache-spark-user-list.1001560.n3.nabble.com/Spark-S3-LZO-input-files-tp8706.html
Sent from the Apache Spark User List mailing list archive at Nabble.com.