[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16139919#comment-16139919 ]

Steve Loughran commented on SPARK-21797:
----------------------------------------

If you are using s3:// URLs then it's the AWS team's problem. If you were using 
s3a://, then it'd be something you'd ask the Hadoop team to look at, but we'd 
say no, as:

* It's a niche use case.
* It's really slow, as in "read() takes so long that other parts of the system 
will start to think your worker is hanging". Which means that if you have 
speculative execution turned on, the scheduler kicks off other workers to read 
the same data.
* It's a very, very expensive way to work with data: $0.03/GB retrieved, which 
ramps up fast once multiple Spark workers start reading the same datasets in 
parallel. For example, ten workers each pulling back a 1 TB dataset is roughly 
10 x 1000 GB x $0.03/GB = $300 per pass.
* Finally, it's been rejected on the server with a 403 response. That's Amazon 
S3 saying "no", not any of the clients.

You shouldn't be trying to process data direct from Glacier. Restore it to 
regular S3 storage, or copy it to a transient HDFS cluster, maybe as part of an 
Oozie or Airflow workflow.
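
A minimal sketch of that restore step, assuming the AWS SDK for Java (v1) is on 
the classpath; the bucket and prefix are the reporter's, and listing pagination 
and restore-status polling are left out:

{code:java}
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.{ListObjectsV2Request, RestoreObjectRequest}
import scala.collection.JavaConverters._

val s3 = AmazonS3ClientBuilder.defaultClient()
val listing = s3.listObjectsV2(
  new ListObjectsV2Request()
    .withBucketName("my-bucket")
    .withPrefix("my-dataset/dt=2017-07-01/"))

// Ask S3 to restore each archived object for 7 days. The restore itself
// completes asynchronously, typically hours later, so a job has to wait
// before it can read the data back.
for (obj <- listing.getObjectSummaries.asScala
     if obj.getStorageClass == "GLACIER") {
  s3.restoreObject(new RestoreObjectRequest("my-bucket", obj.getKey, 7))
}
{code}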

I'd be curious about the full stack trace you see if you do try this with 
s3a://, even though it'll still be a WONTFIX. We could at least go for a more 
meaningful exception translation, and the retry logic needs to know that this 
failure won't go away if you try again.

> spark cannot read partitioned data in S3 that are partly in glacier
> -------------------------------------------------------------------
>
>                 Key: SPARK-21797
>                 URL: https://issues.apache.org/jira/browse/SPARK-21797
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Boris Clémençon 
>              Labels: glacier, partitions, read, s3
>
> I have a dataset in Parquet in S3, partitioned by date (dt), with the oldest 
> dates stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/    [in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/    [in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/    [not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/    [not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that are not in 
> Glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I get the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like a partitioned dataset when some of its 
> partitions are in Glacier. I could always read each date specifically, add 
> the dt column with the date just read, and reduce(_ union _) at the end 
> (sketched below), but that is not pretty and it should not be necessary.
> Is there any tip to read the available data in the datastore even with the 
> old data in Glacier?
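
A minimal sketch of that per-date workaround, assuming the non-Glacier dates 
are known up front (the paths and dates are the reporter's):

{code:java}
import org.apache.spark.sql.functions.lit

// Read each non-Glacier partition directory on its own, re-attach the dt
// column (it lives in the directory name, not in the Parquet files), and
// union the per-date frames.
val dates = (10 to 24).map(d => f"2017-07-$d%02d")
val df = dates
  .map { dt =>
    spark.read
      .parquet(s"s3://my-bucket/my-dataset/dt=$dt/")
      .withColumn("dt", lit(dt))
  }
  .reduce(_ union _)
{code}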


