[jira] [Comment Edited] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

Steve Loughran (JIRA) Thu, 24 Aug 2017 10:57:18 -0700

    [ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16140408#comment-16140408
 ]


Steve Loughran edited comment on SPARK-21797 at 8/24/17 5:56 PM:
-----------------------------------------------------------------

This is happening deep in the Amazon EMR team's closed-source 
{{EmrFileSystem}}, so it's nothing anyone here at the ASF can deal with 
directly. I'm confident S3A will handle it pretty similarly though, failing 
either in the open() call or shortly afterwards, in the first read(). All we 
could do there is convert the failure to a more meaningful error, or actually 
check whether the file is valid at open() time and, again, fail meaningfully.

At the Spark level, it's because Parquet is trying to read the footer of every 
file in parallel.
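
One way to keep that footer scan away from the Glacier partitions is to hand 
Spark only the partition directories you want, plus the {{basePath}} option 
so that {{dt}} is still discovered as a partition column. This is a minimal, 
untested sketch and not something verified on EMR. The object name 
{{ReadLivePartitions}} and the helper {{partitionPaths}} are hypothetical, the 
bucket and dates come from the issue description, and it assumes every 
directory in the range exists and is not in Glacier (as far as I know, a 
missing path makes the read fail).

```scala
import java.time.LocalDate
import org.apache.spark.sql.SparkSession

object ReadLivePartitions {
  // Hypothetical helper: one partition directory per day in [from, to],
  // e.g. "s3://my-bucket/my-dataset/dt=2017-07-15/".
  def partitionPaths(base: String, from: LocalDate, to: LocalDate): List[String] =
    Iterator.iterate(from)(_.plusDays(1))
      .takeWhile(d => !d.isAfter(to))
      .map(d => s"${base}dt=$d/")   // LocalDate.toString is ISO yyyy-MM-dd
      .toList

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("read-live-partitions").getOrCreate()
    val base = "s3://my-bucket/my-dataset/"

    // Assumption: all of these directories exist and none is in Glacier.
    val paths = partitionPaths(base, LocalDate.parse("2017-07-15"), LocalDate.parse("2017-07-24"))

    // basePath tells partition discovery where the dataset root is, so the
    // dt column is kept even though only sub-directories are listed.
    val df = spark.read.option("basePath", base).parquet(paths: _*)
    df.printSchema()
  }
}
```

Compared with reading each date separately and unioning the results, this 
keeps a single read and lets partition discovery supply the {{dt}} column.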

The good news: you can tell Spark to ignore files it can't read. I believe this 
might be a quick workaround:
{code}
spark.sql.files.ignoreCorruptFiles=true
{code}
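
For reference, a minimal sketch of setting that option when building the 
session, assuming a standalone Scala application. The object name 
{{GlacierWorkaround}} and the app name are hypothetical, and the path and 
dates are copied from the issue description. Whether the flag actually 
swallows the {{InvalidObjectState}} failure on EMR is not verified here.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object GlacierWorkaround {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("glacier-workaround")
      // Ask Spark SQL to skip files it fails to read instead of failing the job.
      .config("spark.sql.files.ignoreCorruptFiles", "true")
      .getOrCreate()

    // Same query as in the issue description.
    val df = spark.read.parquet("s3://my-bucket/my-dataset/")
      .where(col("dt").between("2017-07-15", "2017-08-24"))

    df.show(5)
  }
}
```

In spark-shell, where {{spark}} already exists, the equivalent would be 
{{spark.conf.set("spark.sql.files.ignoreCorruptFiles", "true")}} before the 
read. Note that skipped files are dropped silently, so a genuinely corrupt 
file in a live partition would go unnoticed too.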

Let us know what happens.



was (Author: ste...@apache.org):
This is happening deep the Amazon EMR team's closed source {{EmrFileSystem}}, 
so nothing anyone here at the ASF can deal with directly; I'm confident S3A 
will handle it pretty similarly though, either in the open() call or shortly 
afterwards, in the first read(). All we could do there is convert to a more 
meaningful error, or actually check to see if the file is valid at open() time 
& again, fail meaningfully

At the Spark level, it's because Parquet is trying to read the footer of every 
file in parallel

the good news, you can tell Spark to ignore files it can't read. I believe this 
might be a quick workaround:
{code}
spark.sql.files.ignoreCorruptFiles=true
{code}

Let us know what happens


> spark cannot read partitioned data in S3 that are partly in glacier
> -------------------------------------------------------------------
>
>                 Key: SPARK-21797
>                 URL: https://issues.apache.org/jira/browse/SPARK-21797
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>         Environment: Amazon EMR
>            Reporter: Boris Clémençon 
>              Labels: glacier, partitions, read, s3
>
> I have a Parquet dataset in S3, partitioned by date (dt), with the oldest 
> dates stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/    [in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/    [in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/    [not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/    [not in glacier]
> {noformat}
> I want to read this dataset, but only a subset of dates that are not yet in 
> Glacier, e.g.:
> {code:java}
> import org.apache.spark.sql.functions.col  // needed for col("dt")
>
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I get this exception:
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like partitioned datasets when some partitions 
> are in Glacier. I could always read each date specifically, add the column 
> with the current date and reduce(_ union _) at the end, but that is not 
> pretty and it should not be necessary.
> Is there any tip for reading the available data in the datastore even with 
> old data in Glacier?



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
