Not sure what you mean by "a consistency layer on top." Any explanation would 
be greatly appreciated!

Paul

_________________________________________________________________________________________________

Paul Tremblay
Analytics Specialist
THE BOSTON CONSULTING GROUP
Tel. + ▪ Mobile +

_________________________________________________________________________________________________

From: Steve Loughran [mailto:ste...@hortonworks.com]
Sent: Friday, January 27, 2017 3:20 AM
To: VND Tremblay, Paul
Cc: Neil Jonkers; Takeshi Yamamuro; user@spark.apache.org
Subject: Re: spark 2.02 error when writing to s3

OK

Nobody should be committing output directly to S3 without something adding a 
consistency layer on top, not if you want reliable (as in "doesn't 
lose/corrupt data" reliable) work.
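
To illustrate (just a sketch, not specific to your setup): one common pattern on 
EMR is to have the job write to HDFS on the cluster and copy the finished output 
to S3 as a separate step, so nothing commits directly to S3. The staging path and 
"df" below are hypothetical names, not from this thread.

    # Sketch only: write to cluster-local HDFS first, then copy the completed
    # directory to S3 out-of-band (e.g. with s3-dist-cp on EMR).
    staging_path = "hdfs:///tmp/revenue_model_staging"   # hypothetical location
    df.write.csv(path=staging_path, sep="|", header=True)
    # then, outside Spark:
    #   s3-dist-cp --src hdfs:///tmp/revenue_model_staging --dest s3://<your-bucket>/revenue_model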

On 26 Jan 2017, at 19:09, VND Tremblay, Paul 
<tremblay.p...@bcg.com<mailto:tremblay.p...@bcg.com>> wrote:

This seems to have done the trick, although I am not positive. If I have time, 
I'll test spinning up a cluster with and without consistent view to pinpoint 
the error.

_________________________________________________________________________________________________

Paul Tremblay
Analytics Specialist
THE BOSTON CONSULTING GROUP
Tel. + ▪ Mobile +

_________________________________________________________________________________________________


From: Neil Jonkers [mailto:neilod...@gmail.com]
Sent: Friday, January 20, 2017 11:39 AM
To: Steve Loughran; VND Tremblay, Paul
Cc: Takeshi Yamamuro; user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: spark 2.02 error when writing to s3

Can you test by enabling EMRFS consistent view and using an s3:// URI?

http://docs.aws.amazon.com/emr/latest/ManagementGuide/enable-consistent-view.html
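
If it helps, a sketch of enabling consistent view when the cluster is created, 
via boto3 (the release label, instance types and roles below are placeholders, 
not your actual settings):

    import boto3

    # Sketch: enable EMRFS consistent view at cluster-creation time through the
    # emrfs-site classification. All names below are placeholders.
    emr = boto3.client("emr", region_name="us-east-1")
    emr.run_job_flow(
        Name="spark-s3-write-test",
        ReleaseLabel="emr-5.2.0",          # assumed release; pick one with Spark 2.x
        Applications=[{"Name": "Spark"}],
        Configurations=[
            {"Classification": "emrfs-site",
             "Properties": {"fs.s3.consistent": "true"}},
        ],
        Instances={
            "MasterInstanceType": "m4.large",
            "SlaveInstanceType": "m4.large",
            "InstanceCount": 3,
            "KeepJobFlowAliveWhenNoSteps": True,
        },
        JobFlowRole="EMR_EC2_DefaultRole",
        ServiceRole="EMR_DefaultRole",
    )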

-------- Original message --------
From: Steve Loughran
Date:20/01/2017 21:17 (GMT+02:00)
To: "VND Tremblay, Paul"
Cc: Takeshi Yamamuro ,user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: spark 2.02 error when writing to s3

AWS S3 is eventually consistent: even after something is deleted, a LIST/GET 
call may still show it. You may be seeing that effect: even after the DELETE has 
got rid of the files, a listing still sees something there. And I suspect the time 
it takes for the listing to "go away" will depend on the total number of entries 
underneath, as there are more deletion markers ("tombstones") to propagate around 
S3.

Try deleting the path and then waiting a short period.
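
Something like this from the driver, for example (a sketch using boto3; the 
bucket and prefix are whatever your job writes to, and the timeouts are 
arbitrary):

    import time
    import boto3

    def wait_until_prefix_gone(bucket, prefix, timeout=60, poll=2):
        """Poll the S3 listing until no keys remain under prefix, or give up.

        Sketch only: the idea is to wait for the delete's tombstones to
        propagate before Spark tries to write to the same path.
        """
        s3 = boto3.client("s3")
        deadline = time.time() + timeout
        while time.time() < deadline:
            resp = s3.list_objects_v2(Bucket=bucket, Prefix=prefix, MaxKeys=1)
            if resp.get("KeyCount", 0) == 0:
                return True
            time.sleep(poll)
        return False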


On 20 Jan 2017, at 18:54, VND Tremblay, Paul 
<tremblay.p...@bcg.com<mailto:tremblay.p...@bcg.com>> wrote:

I am using an EMR cluster, and the latest Spark version offered is 2.0.2. The link 
below indicates that another user had the same problem, which seems unresolved.

Thanks

Paul

_________________________________________________________________________________________________

Paul Tremblay
Analytics Specialist
THE BOSTON CONSULTING GROUP
Tel. + ▪ Mobile +

_________________________________________________________________________________________________



From: Takeshi Yamamuro [mailto:linguin....@gmail.com]
Sent: Thursday, January 19, 2017 9:27 PM
To: VND Tremblay, Paul
Cc: user@spark.apache.org<mailto:user@spark.apache.org>
Subject: Re: spark 2.02 error when writing to s3

Hi,

Do you get the same exception in v2.1.0 as well?
Anyway, I think another user reported the same error:
https://www.mail-archive.com/user@spark.apache.org/msg60882.html

// maropu


On Fri, Jan 20, 2017 at 5:15 AM, VND Tremblay, Paul 
<tremblay.p...@bcg.com<mailto:tremblay.p...@bcg.com>> wrote:
I have come across a problem when writing CSV files to S3 in Spark 2.0.2. The 
problem does not exist in Spark 1.6.


19:09:20 Caused by: java.io.IOException: File already 
exists:s3://stx-apollo-pr-datascience-internal/revenue_model/part-r-00025-c48a0d52-9600-4495-913c-64ae6bf888bd.csv


My code is this:

new_rdd\
    .map(add_date_diff)\
    .map(sid_offer_days)\
    .groupByKey()\
    .map(custom_sort)\
    .map(before_rev_date)\
    .map(lambda x, num_weeks=args.num_weeks: create_columns(x, num_weeks))\
    .toDF()\
    .write.csv(
        sep="|",
        header=True,
        nullValue='',
        quote=None,
        path=path
    )

In order to get the path (the last argument), I call this function:

def _get_s3_write(test):
    if s3_utility.s3_data_already_exists(_get_write_bucket_name(), _get_s3_write_dir(test)):
        s3_utility.remove_s3_dir(_get_write_bucket_name(), _get_s3_write_dir(test))
    return make_s3_path(_get_write_bucket_name(), _get_s3_write_dir(test))

In other words, I am removing the directory if it exists before I write.
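
As an aside, a sketch of the same idea using the writer's built-in overwrite mode 
instead of a manual delete (df stands for the DataFrame built above; I have not 
verified whether this avoids the error):

    # Sketch: let Spark overwrite the existing directory rather than deleting
    # it by hand before the write.
    df.write.mode("overwrite").csv(
        path=path,
        sep="|",
        header=True,
        nullValue='',
        quote=None
    )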

Notes:

* If I use a small set of data, then I don't get the error

* If I use Spark 1.6, I don't get the error

* If I read in a simple dataframe and then write to S3, I still get the error 
(without doing any transformations)

* If I do the previous step with a smaller set of data, I don't get the error.

* I am using PySpark with Python 2.7

* The thread at this link: 
https://forums.aws.amazon.com/thread.jspa?threadID=152470 indicates the 
problem is caused by a sync issue: with large datasets, Spark tries to write 
the same file multiple times, which causes the error. The suggestion is to turn off 
speculation, but I believe speculation is turned off by default in PySpark (a 
sketch of setting it explicitly is after these notes).
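
For completeness, a sketch of turning speculation off explicitly (this is just 
something to test, not a confirmed fix):

    from pyspark import SparkConf
    from pyspark.sql import SparkSession

    # Sketch: disable speculative execution explicitly, in case re-executed
    # tasks are what try to write the same part file twice.
    conf = SparkConf().set("spark.speculation", "false")
    spark = SparkSession.builder.config(conf=conf).getOrCreate()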

Thanks!

Paul


_____________________________________________________________________________________________________

Paul Tremblay
Analytics Specialist

THE BOSTON CONSULTING GROUP
STL ▪

Tel. + ▪ Mobile +
tremblay.p...@bcg.com<mailto:tremblay.p...@bcg.com>
_____________________________________________________________________________________________________

Read BCG's latest insights, analysis, and viewpoints at 
bcgperspectives.com<http://www.bcgperspectives.com/>

________________________________

The Boston Consulting Group, Inc.

This e-mail message may contain confidential and/or privileged information. If 
you are not an addressee or otherwise authorized to receive this message, you 
should not use, copy, disclose or take any action based on this e-mail or any 
information contained in the message. If you have received this material in 
error, please advise the sender immediately by reply e-mail and delete this 
message. Thank you.



--
---
Takeshi Yamamuro
