[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2019-04-02 Thread vaquar khan (JIRA)


[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808388#comment-16808388
 ] 

vaquar khan commented on SPARK-21797:
-

{quote}The issue is related to the AWS storage class, not Apache Spark; even AWS 
Athena gives the same error when trying to read data from Glacier.

If you archive objects using the Glacier storage option, you must inspect the 
storage class of an object before you attempt to retrieve it. The customary GET 
request will work as expected if the object is stored in S3 Standard or Reduced 
Redundancy (RRS) storage. It will fail (with a 403 error) if the object is 
archived in Glacier. In this case, you must use the RESTORE operation 
(described below) to make your data available in S3.
{quote}
Source: [https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/]
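For illustration (bucket, key, and retention period here are hypothetical), the RESTORE operation referenced above can be issued with the AWS CLI; the object only becomes readable through a normal GET once the restore completes, typically after several hours:

```shell
# Request a temporary restore of one Glacier-archived object (illustrative key).
aws s3api restore-object \
  --bucket my-bucket \
  --key "my-dataset/dt=2017-07-01/part-00000.parquet" \
  --restore-request '{"Days": 7, "GlacierJobParameters": {"Tier": "Standard"}}'

# Check restore progress via the object's metadata.
aws s3api head-object \
  --bucket my-bucket \
  --key "my-dataset/dt=2017-07-01/part-00000.parquet" \
  --query Restore
```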
 

> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
> Environment: Amazon EMR
>Reporter: Boris Clémençon 
>Priority: Major
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in Parquet on S3, partitioned by date (dt), with the oldest 
> dates stored in AWS Glacier to save some money. For instance, we have:
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that are not yet in 
> Glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like a partitioned dataset when some partitions 
> are in Glacier. I could always read each date individually, add the dt column, 
> and reduce(_ union _) at the end, but that is not pretty and it should not be 
> necessary.
> Is there any way to read the available data in the dataset even when older 
> data is in Glacier?
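A common workaround for the situation above is to avoid listing the dataset root at all: enumerate only the partition directories in the wanted date range and pass them to the reader explicitly, so Glacier-archived prefixes are never touched. A minimal sketch of the path construction (the helper name is ours; the bucket/layout follow the example above):

```python
from datetime import date, timedelta

def partition_paths(base: str, start: date, end: date) -> list[str]:
    """Build explicit dt= partition paths for [start, end], inclusive."""
    days = (end - start).days
    return [f"{base}/dt={start + timedelta(days=i)}" for i in range(days + 1)]

paths = partition_paths("s3://my-bucket/my-dataset", date(2017, 7, 10), date(2017, 7, 12))
print(paths)
# These paths could then be handed to the reader, e.g.
#   spark.read.option("basePath", "s3://my-bucket/my-dataset").parquet(*paths)
```

Because the reader never lists the dataset root, the archived dt= prefixes are never opened and the 403 is avoided entirely.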



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org



[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-10-12 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201917#comment-16201917
 ] 

Steve Loughran commented on SPARK-21797:


Update: in HADOOP-14874 I've noted we could use the existing 
{{FileSystem.getContentSummary(Path)}} API to return {{StorageType.ARCHIVE}} 
for glaciated data. You'd need a way of filtering the listing of source files 
to strip out everything of archive type, but then, yes, you could skip the data 
in Glacier.
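The filtering described here could look roughly like the sketch below, assuming the listing step already yields (path, storage_class) pairs from S3 object metadata. The function and names are illustrative, not an existing Spark or Hadoop API:

```python
# Sketch: drop archived objects from a source listing before handing it to the
# reader. Storage classes "GLACIER"/"DEEP_ARCHIVE" correspond to the ARCHIVE
# storage type mentioned above; everything else is assumed directly readable.
ARCHIVE_CLASSES = {"GLACIER", "DEEP_ARCHIVE"}

def readable_files(listing):
    return [path for path, storage_class in listing
            if storage_class not in ARCHIVE_CLASSES]

listing = [
    ("s3://my-bucket/my-dataset/dt=2017-07-01/part-0.parquet", "GLACIER"),
    ("s3://my-bucket/my-dataset/dt=2017-07-10/part-0.parquet", "STANDARD"),
]
print(readable_files(listing))  # only the STANDARD object survives
```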




[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-29 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146193#comment-16146193
 ] 

Steve Loughran commented on SPARK-21797:


That's a shame. I only came across the option when I pasted the stack trace 
into the IDE and it suggested enabling it. Sorry, I'm not sure what other 
strategies there are. Sean, any ideas?




[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-29 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16145716#comment-16145716
 ] 

Boris Clémençon  commented on SPARK-21797:
--

FYI, the flag spark.sql.files.ignoreCorruptFiles=true does not seem to fix the 
problem.




[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-25 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141483#comment-16141483
 ] 

Steve Loughran commented on SPARK-21797:


bq. According to our test, it is 20% slower maximum to read parquet data from 
S3 than HDFS. Do you agree? 

You are testing the Amazon EMR client. Try the Hadoop 2.8 JARs and the s3a 
client, enable the columnar-store-optimised seek with 
spark.hadoop.fs.s3a.experimental.fadvise=random, and see how things compare 
then.

You will still be at a disadvantage with any directory scanning/walking, which 
can take seconds rather than millis, and seeks are still expensive as you have 
to issue new HTTP requests with different content ranges. And of course, AWS 
throttles your VMs and shard-specific access to subtrees of a single bucket. 
HDFS locally still wins.
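For reference, the S3A seek-policy setting mentioned here would be passed roughly like this (a sketch; the job JAR name is hypothetical and exact submit flags vary by deployment):

```shell
# Hedged sketch: enable the random-access (columnar-friendly) seek policy for
# S3A on Hadoop 2.8+ when submitting a Spark job. Flags/paths are illustrative.
spark-submit \
  --conf spark.hadoop.fs.s3a.experimental.fadvise=random \
  my-job.jar
```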




[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-25 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141358#comment-16141358
 ] 

Boris Clémençon  commented on SPARK-21797:
--

That's very good news indeed, and the easiest possible fix! A more meaningful 
error would be appreciated nonetheless. I will talk to the AWS SDK team so the 
issue can be tackled in a more orthodox way. Thanks!

*Additional points:*
You mentioned that if read() takes too long, other bits of the system will 
start to think the worker is hanging. According to our tests, reading Parquet 
data from S3 is at most 20% slower than from HDFS. Do you agree? And on price 
you are right: it can be expensive to read the same data from S3 again and 
again. In our case, besides ML, Spark is mostly used for ETL processes and we 
use Redshift for analytics, so there is just one S3 read per process every day. 
It is therefore more advantageous to read once from S3 than to copy to HDFS 
first and read from there (now that I have the option to read a partitioned 
dataset from S3).

Thanks again!




[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140408#comment-16140408
 ] 

Steve Loughran commented on SPARK-21797:


This is happening deep in the Amazon EMR team's closed-source 
{{EmrFileSystem}}, so it's nothing anyone here at the ASF can deal with 
directly; I'm confident S3A will handle it pretty similarly though, either in 
the open() call or shortly afterwards, in the first read(). All we could do 
there is convert it to a more meaningful error, or actually check whether the 
file is valid at open() time and, again, fail meaningfully.

At the Spark level, it happens because Parquet tries to read the footer of 
every file in parallel.

The good news: you can tell Spark to ignore files it can't read. I believe this 
might be a quick workaround:
{code}
spark.sql.files.ignoreCorruptFiles=true
{code}

Let us know what happens





[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread JIRA

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140261#comment-16140261
 ] 

Boris Clémençon  commented on SPARK-21797:
--

Steve, 

Here is the stack trace:


{noformat}
WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1, 
ip-172-31-42-242.eu-west-1.compute.internal, executor 1): java.io.IOException: 
com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
 The operation is not valid for the object's storage class (Service: Amazon S3; 
Status Code: 403; Error Code: InvalidObjectState; Request ID: 
5DD5BEBB8173977D), S3 Extended Request ID: 
K9bDwhm32CFHeg5zgVfW/T1A/vB4e8gqQ/p7E0Ze9ZG55UFoDP7hgnkQxLIwYX9i2LEcKwrR+lo=
at 
com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.handleAmazonServiceException(Jets3tNativeFileSystemStore.java:434)
at 
com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrievePair(Jets3tNativeFileSystemStore.java:461)
at 
com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrievePair(Jets3tNativeFileSystemStore.java:439)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at 
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at 
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
at 
org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
at com.sun.proxy.$Proxy30.retrievePair(Unknown Source)
at 
com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1201)
at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
at 
com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
at 
org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:65)
at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:443)
at 
org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:421)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:491)
at 
org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:485)
at 
scala.collection.parallel.AugmentedIterableIterator$class.flatmap2combiner(RemainsIterator.scala:132)
at 
scala.collection.parallel.immutable.ParVector$ParVectorIterator.flatmap2combiner(ParVector.scala:62)
at 
scala.collection.parallel.ParIterableLike$FlatMap.leaf(ParIterableLike.scala:1072)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
at 
scala.collection.parallel.ParIterableLike$FlatMap.tryLeaf(ParIterableLike.scala:1068)
at 
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
at 
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
at 
scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinTask.doJoin(ForkJoinTask.java:341)
at scala.concurrent.forkjoin.ForkJoinTask.join(ForkJoinTask.java:673)
at 
scala.collection.parallel.ForkJoinTasks$WrappedTask$class.sync(Tasks.scala:378)
at 
scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.sync(Tasks.scala:443)
at 
scala.collection.parallel.ForkJoinTasks$class.executeAndWaitResult(Tasks.scala:426)
at 
scala.collection.parallel.ForkJoinTaskSupport.executeAndWaitResult(TaskSupport.scala:56)
at 
scala.collection.parallel.ParIterableLike$ResultMapping.leaf(ParIterableLike.scala:958)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at 
scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
at 
scala.collection.parallel.ParIterableLike$ResultMapping.tryLeaf(ParIterableLike.scala:953)
at 
scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
at ... [remainder of stack trace truncated]
{noformat}
[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140147#comment-16140147
 ] 

Steve Loughran commented on SPARK-21797:


Note that if it happens just during Spark's partition calculation, it's 
probable that it is going down the directory tree and inspecting the files 
through HEAD requests, maybe looking at metadata entries too. So do attach the 
s3a & Spark trace so we can see what's going on: something may be 
over-enthusiastic about looking at files, or we could have something recognise 
the problem and recover from it.




[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140138#comment-16140138
 ] 

Steve Loughran commented on SPARK-21797:


I was talking about the cost and time of getting data from Glacier. If that's 
the only place where the data lives, then it's slow and expensive, and that's 
the bit I'm describing as niche. Given I've been working full time on S3A, I'm 
reasonably confident it gets used a lot.

If you talk to data in S3 that has been backed up to Glacier, you *will get a 
403*, according to Jeff Barr himself: 
https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/

bq. If you archive objects using the Glacier storage option, you must inspect 
the storage class of an object before you attempt to retrieve it. The customary 
GET request will work as expected if the object is stored in S3 Standard or 
Reduced Redundancy (RRS) storage. It will fail (with a 403 error) if the object 
is archived in Glacier. In this case, you must use the RESTORE operation 
(described below) to make your data available in S3.

bq. You use S3’s new RESTORE operation to access an object archived in Glacier. 
As part of the request, you need to specify a retention period in days. 
Restoring an object will generally take 3 to 5 hours. Your restored object will 
remain in both Glacier and S3’s Reduced Redundancy Storage (RRS) for the 
duration of the retention period. At the end of the retention period the 
object’s data will be removed from S3; the object will remain in Glacier.

Like I said, I'd be interested in getting the full stack trace if you try to 
read this with an S3A client. Not for fixing, but for better reporting. 
Probably point them at Jeff's blog entry. Or this JIRA :)





[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139986#comment-16139986
 ] 

Sean Owen commented on SPARK-21797:
---

Sure, but in any event this is an operation that is fine with Spark but not 
fine with something between the AWS SDK and AWS. It's not something Spark can 
fix.

If the source data is in S3, there's no way to avoid copying it from S3. 
Intermediate data produced by Spark can't live on S3 as it's too eventually 
consistent; some final result could. And yes, you pay to read/write S3, so in 
some use cases it might be more economical to keep intensely read/written data 
close to the compute workers for a time, rather than write/read to S3 between 
several closely related jobs.

> spark cannot read partitioned data in S3 that are partly in glacier
> ---
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 2.2.0
>Reporter: Boris Clémençon 
>  Labels: glacier, partitions, read, s3
>
> I have a dataset in parquet in S3 partitioned by date (dt) with oldest date 
> stored in AWS Glacier to save some money. For instance, we have...
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/[in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/[in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/[not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/[not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that are not yet 
> in glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I have the exception
> {noformat}
> java.io.IOException: 
> com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception:
>  The operation is not valid for the object's storage class (Service: Amazon 
> S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 
> C444D508B6042138)
> {noformat}
> It seems that Spark does not like a partitioned dataset when some partitions 
> are in Glacier. I could always read each date separately, add the dt column, 
> and reduce(_ union _) at the end, but that is not pretty and it should not 
> be necessary.
> Is there any tip to read available data in the datastore even with old data 
> in glacier?






[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread Boris Clémençon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139978#comment-16139978
 ] 

Boris Clémençon  commented on SPARK-21797:
--

Hi Steve,

to be sure we understand each other, *I don't want to read data from Glacier*. 
Concretely, I have a dataset in parquet partitioned by date in S3, with an 
automatic rule that freezes the oldest dates into Glacier (and, a few months 
later, deletes them altogether). I want to read only the most recent dates that 
are still in S3 (in a lazy way), not those in Glacier (see the example above), 
but even that I cannot do. Do we understand each other?

Besides, why do you say that it is a niche use case? And why do you say reading 
data from S3 is a "very, very expensive way to work with data"? According to 
our tests, reading from S3 is at most 20% slower than reading from HDFS, and we 
operate from within AWS with an EMR cluster, so we should not pay for data IO 
from S3. On the other hand, copying the dataset to HDFS has a time overhead, 
and you need a large enough cluster with enough disk to store the whole 
dataset, or at least the relevant dates (whereas you may want to process only a 
few columns, i.e. a fraction of the initial dataset). I would like your 
expertise on that.

In any case, I understand your and Sean's argument, which says that it is up to 
AWS to solve the problem.







[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-24 Thread Steve Loughran (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139919#comment-16139919
 ] 

Steve Loughran commented on SPARK-21797:


If you are using s3:// URLs then it's the AWS team's problem. If you were 
using s3a://, it'd be something you ask the Hadoop team to look at, but we'd 
say no because:

* it's a niche use case
* it's really slow, as in "read() takes so long that other parts of the system 
will start to think your worker is hanging". Which means that if you have 
speculative execution turned on, other workers get kicked off to read the same 
data.
* it's a very, very expensive way to work with data: $0.03/GB, which ramps up 
fast once multiple Spark workers start reading the same datasets in parallel.
* finally, it's been rejected on the server with a 403 response. That's Amazon 
S3 saying "no", not any of the clients.

You shouldn't be trying to process data directly from Glacier. Copy to S3 or a 
transient HDFS cluster, maybe as part of an Oozie or Airflow workflow.

I'd be curious about the full stack trace you see if you do try this with 
s3a://, even though it'll still be a WONTFIX. We could at least go for a more 
meaningful exception translation, and the retry logic needs to know that the 
error won't go away if you try again.







[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-22 Thread Boris Clémençon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136669#comment-16136669
 ] 

Boris Clémençon  commented on SPARK-21797:
--

OK, I see. I will have a look at the aws-sdk-java project. Thank you.
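Since listing goes through aws-sdk-java anyway, one approach is to enumerate 
the keys under the dataset prefix, drop the ones whose storage class is 
GLACIER, and hand only the surviving paths to Spark. A minimal sketch with 
aws-sdk-java v1 (the bucket and prefix are the example ones from the report; 
pagination and error handling are omitted):

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsV2Request
import scala.collection.JavaConverters._

// List objects under the dataset prefix and keep only those not in Glacier.
val s3 = AmazonS3ClientBuilder.defaultClient()
val req = new ListObjectsV2Request()
  .withBucketName("my-bucket")
  .withPrefix("my-dataset/")
val available = s3.listObjectsV2(req).getObjectSummaries.asScala
  .filterNot(_.getStorageClass == "GLACIER")
  .map(o => s"s3://my-bucket/${o.getKey}")
  .toSeq

// basePath keeps the dt partition column when reading explicit file paths.
val df = spark.read
  .option("basePath", "s3://my-bucket/my-dataset/")
  .parquet(available: _*)
```

This keeps the read lazy on the Spark side and simply never hands the Glacier 
keys to the parquet reader, at the cost of an extra listing pass.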







[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-21 Thread Sean Owen (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16135236#comment-16135236
 ] 

Sean Owen commented on SPARK-21797:
---

Sure, but this would not be a change in Spark but in the AWS SDK or the 
Glacier service. Spark can't do anything about it, so this isn't the right 
place.







[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier

2017-08-21 Thread Boris Clémençon (JIRA)

[ 
https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16135233#comment-16135233
 ] 

Boris Clémençon  commented on SPARK-21797:
--

Hi Sean,

Thanks for the quick answer. I understand your point of view. However, it is a 
very common use case (and a good practice) to partition by date and send the 
oldest data to Glacier to optimize the costs of the data warehouse. Today, 
Spark cannot be used properly with such common architectures, which is a pity. 
Could we reconsider this issue as a "new feature" instead of a "bug"?




