[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Steve Loughran updated SPARK-21797:
-----------------------------------
    Environment: Amazon EMR

> spark cannot read partitioned data in S3 that are partly in glacier
> -------------------------------------------------------------------
>
>                 Key: SPARK-21797
>                 URL: https://issues.apache.org/jira/browse/SPARK-21797
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 2.2.0
>         Environment: Amazon EMR
>            Reporter: Boris Clémençon
>              Labels: glacier, partitions, read, s3
>
> I have a Parquet dataset in S3 partitioned by date (dt), with the oldest dates
> stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/    [in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/    [in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/    [not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/    [not in glacier]
> {noformat}
> I want to read this dataset, but only a subset of dates that are not yet in
> Glacier, e.g.:
> {code:java}
> import org.apache.spark.sql.functions.col
>
> // Date bounds are compared against the string-valued partition column dt.
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I get the following exception:
> {noformat}
> java.io.IOException:
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
> The operation is not valid for the object's storage class (Service: Amazon
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID:
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like partitioned datasets when some partitions are
> in Glacier. I could always read each date specifically, add a column with
> the current date, and reduce(_ union _) at the end, but that is not pretty and
> it should not be necessary.
> Is there any tip for reading the available data in the datastore even with old
> data in Glacier?
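
The per-date workaround mentioned in the report can be made less awkward by passing only the wanted partition directories to the reader, so that Spark is never pointed at the Glacier-backed prefixes. The following is a minimal sketch, not a confirmed fix for this issue: the object name PartitionPaths and the helper dailyPaths are hypothetical, the dt=YYYY-MM-DD layout is taken from the report, and whether this avoids the InvalidObjectState error on EMR is an assumption that has not been tested here.

```scala
import java.time.LocalDate

// Hypothetical helper (not part of Spark): lists the partition directories
// for every day in [from, to], inclusive, using the dt=YYYY-MM-DD layout
// described in the report.
object PartitionPaths {
  def dailyPaths(base: String, from: LocalDate, to: LocalDate): List[String] = {
    // Normalise the base path so it always ends with a single slash.
    val root = if (base.endsWith("/")) base else base + "/"
    Iterator
      .iterate(from)(_.plusDays(1))   // from, from+1, from+2, ...
      .takeWhile(d => !d.isAfter(to)) // stop after the end date
      .map(d => s"${root}dt=$d/")     // LocalDate prints as yyyy-MM-dd
      .toList
  }
}

// Usage with a SparkSession named `spark` in scope (for example in spark-shell).
// The "basePath" option keeps dt as a partition column even though only
// sub-directories are listed:
//
//   val base  = "s3://my-bucket/my-dataset/"
//   val paths = PartitionPaths.dailyPaths(
//     base, LocalDate.parse("2017-07-15"), LocalDate.parse("2017-07-24"))
//   val X = spark.read.option("basePath", base).parquet(paths: _*)
```

Note that every listed directory has to exist; a date range that extends past the last partition would need to be trimmed or filtered first, since Spark rejects paths it cannot find.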