[jira] [Commented] (SPARK-21797) spark cannot read partitioned data in S3 that are partly in glacier
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16808388#comment-16808388 ] vaquar khan commented on SPARK-21797:

The issue is related to the AWS storage class, not to Apache Spark; even AWS Athena gives the same error when trying to read data from Glacier.
{quote}If you archive objects using the Glacier storage option, you must inspect the storage class of an object before you attempt to retrieve it. The customary GET request will work as expected if the object is stored in S3 Standard or Reduced Redundancy (RRS) storage. It will fail (with a 403 error) if the object is archived in Glacier. In this case, you must use the RESTORE operation (described below) to make your data available in S3.{quote}
* [https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/]

> spark cannot read partitioned data in S3 that are partly in glacier
> -------------------------------------------------------------------
>
> Key: SPARK-21797
> URL: https://issues.apache.org/jira/browse/SPARK-21797
> Project: Spark
> Issue Type: Bug
> Components: Spark Core
> Affects Versions: 2.2.0
> Environment: Amazon EMR
> Reporter: Boris Clémençon
> Priority: Major
> Labels: glacier, partitions, read, s3
>
> I have a dataset in parquet in S3, partitioned by date (dt), with the oldest dates stored in AWS Glacier to save some money. For instance, we have:
> {noformat}
> s3://my-bucket/my-dataset/dt=2017-07-01/ [in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-09/ [in glacier]
> s3://my-bucket/my-dataset/dt=2017-07-10/ [not in glacier]
> ...
> s3://my-bucket/my-dataset/dt=2017-07-24/ [not in glacier]
> {noformat}
> I want to read this dataset, but only the subset of dates that are not yet in Glacier, e.g.:
> {code:java}
> val from = "2017-07-15"
> val to = "2017-08-24"
> val path = "s3://my-bucket/my-dataset/"
> val X = spark.read.parquet(path).where(col("dt").between(from, to))
> {code}
> Unfortunately, I get the exception
> {noformat}
> java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: C444D508B6042138)
> {noformat}
> It seems that Spark does not like a partitioned dataset when some partitions are in Glacier. I could always read each date separately, add the column with the current date, and reduce(_ union _) at the end, but that is not pretty and it should not be necessary.
> Is there any tip for reading the available data in the datastore even with the old data in Glacier?

--
This message was sent by Atlassian JIRA (v7.6.3#76005)

To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
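The reduce(_ union _) workaround described above can be avoided by never letting Spark list the archived prefixes at all: enumerate the partition paths for the wanted date range explicitly and pass them to a single read. A minimal sketch, assuming the dt= layout above; the helper name is hypothetical, while the basePath option is the standard Spark mechanism for keeping partition-column discovery when reading explicit subdirectories:

```scala
import java.time.LocalDate

// Hypothetical helper: enumerate dt= partition paths for an inclusive date
// range, so Spark never touches the Glacier-archived prefixes.
def partitionPaths(base: String, from: LocalDate, to: LocalDate): Seq[String] =
  Iterator.iterate(from)(_.plusDays(1))
    .takeWhile(!_.isAfter(to))           // inclusive upper bound
    .map(d => s"$base/dt=$d")            // LocalDate prints as yyyy-MM-dd
    .toSeq

// Usage (assumes the given range is known to still be in S3, not Glacier):
// val paths = partitionPaths("s3://my-bucket/my-dataset",
//   LocalDate.parse("2017-07-15"), LocalDate.parse("2017-07-24"))
// val df = spark.read
//   .option("basePath", "s3://my-bucket/my-dataset")
//   .parquet(paths: _*)
```

This keeps dt as a partition column in the result while restricting the listing to readable prefixes.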
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16201917#comment-16201917 ] Steve Loughran commented on SPARK-21797:

Update: in HADOOP-14874 I've noted that we could use the existing {{FileSystem.getContentSummary(Path)}} API to return {{StorageType.ARCHIVE}} for glaciated data. You'd need a way of filtering the listing of source files to strip out everything of archive type, but then, yes, you could skip the data in Glacier.
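A client-side version of that filtering can be sketched today against the AWS SDK: list the prefix and drop keys whose storage class is GLACIER before handing the remaining paths to Spark. This is a hedged sketch using the AWS SDK for Java 1.x (default credentials; bucket and prefix are placeholders), not the HADOOP-14874 mechanism itself:

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.ListObjectsV2Request
import scala.collection.JavaConverters._

// Sketch: keep only the objects under a prefix that are NOT archived in
// Glacier. Requires AWS credentials; does not handle paginated listings.
val s3 = AmazonS3ClientBuilder.defaultClient()

def readablePaths(bucket: String, prefix: String): Seq[String] = {
  val req = new ListObjectsV2Request().withBucketName(bucket).withPrefix(prefix)
  s3.listObjectsV2(req).getObjectSummaries.asScala
    .filterNot(_.getStorageClass == "GLACIER") // ListObjectsV2 reports a per-object storage class
    .map(o => s"s3://$bucket/${o.getKey}")
    .toSeq
}
```

The resulting key list could then be fed to spark.read.parquet, at the cost of one extra listing pass before the job starts.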
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16146193#comment-16146193 ] Steve Loughran commented on SPARK-21797:

No? That's a shame. I only came across the option when I pasted the stack trace into the IDE and it said "enable this option". Sorry, I'm not sure what other strategies there are. Sean, any ideas?
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16145716#comment-16145716 ] Boris Clémençon commented on SPARK-21797:

FYI, the flag spark.sql.files.ignoreCorruptFiles=true does not seem to fix the problem.
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141483#comment-16141483 ] Steve Loughran commented on SPARK-21797:

bq. According to our test, it is 20% slower maximum to read parquet data from S3 than HDFS. Do you agree?

You are testing the Amazon EMR client. Try with the Hadoop 2.8 JARs and the s3a client, enable columnar-store-optimised seek with spark.hadoop.fs.s3a.experimental.fadvise=random, and see how things compare then. You will still be at a disadvantage with any directory scanning/walking, which can take seconds rather than milliseconds, and seeks are still expensive, as you have to issue new HTTP requests with different content ranges. And of course, AWS throttles your VMs, and access to subtrees of a single bucket is shard-specific. Local HDFS still wins.
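Switching to the s3a client with the random-access seek policy is purely a configuration change. A sketch of the session setup; the two property names are the real Spark/Hadoop ones, while the app name and path are illustrative:

```scala
import org.apache.spark.sql.SparkSession

// Sketch: use the Hadoop 2.8+ s3a client with the "random" fadvise policy,
// which avoids re-reading to the end of the object on every backwards seek
// (the pattern columnar formats like Parquet and ORC produce).
val spark = SparkSession.builder()
  .appName("s3a-columnar-read")
  .config("spark.hadoop.fs.s3a.experimental.fadvise", "random")
  .getOrCreate()

// Note the s3a:// scheme rather than EMR's s3:// EmrFileSystem binding:
// val df = spark.read.parquet("s3a://my-bucket/my-dataset/")
```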
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16141358#comment-16141358 ] Boris Clémençon commented on SPARK-21797:

That's very good news indeed, and the easiest way to fix it! A more meaningful error would be appreciated nonetheless. I will talk with the AWS SDK team to see whether the issue can be tackled in a more orthodox way. Thanks!

*Additional points:* you mentioned that "read() takes so long other bits of the system will start to think your worker is hanging". According to our tests, reading parquet data from S3 is at most 20% slower than reading from HDFS. Do you agree? And on price, you are right: it can be expensive to read the same data from S3 again and again. In our case, besides ML, Spark is mostly used for ETL processes and we use Redshift for analytics, so there is just one S3 read per process each day. It is therefore more advantageous to read once from S3 than to copy to HDFS first and then read (now that I have the option to read a partitioned dataset from S3). Thanks again!
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140408#comment-16140408 ] Steve Loughran commented on SPARK-21797:

This is happening deep in the Amazon EMR team's closed-source {{EmrFileSystem}}, so it is nothing anyone here at the ASF can deal with directly; I'm confident S3A will handle it pretty similarly though, either in the open() call or shortly afterwards, in the first read(). All we could do there is convert it to a more meaningful error, or actually check whether the file is valid at open() time and, again, fail meaningfully.

At the Spark level, it happens because Parquet is trying to read the footer of every file in parallel. The good news: you can tell Spark to ignore files it can't read. I believe this might be a quick workaround:
{code}
spark.sql.files.ignoreCorruptFiles=true
{code}
Let us know what happens.
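For completeness, that flag can be set on the session builder as well as on the command line. A sketch (the app name is illustrative; note that a later message in this thread reports the flag did not help for the Glacier case, since the failure is a 403 at open() time rather than a corrupt file):

```scala
import org.apache.spark.sql.SparkSession

// Sketch: ask Spark SQL to skip files whose footers cannot be read,
// instead of failing the whole job on the first bad file.
val spark = SparkSession.builder()
  .appName("ignore-corrupt-files-demo")
  .config("spark.sql.files.ignoreCorruptFiles", "true")
  .getOrCreate()

// Equivalent at submit time:
//   spark-submit --conf spark.sql.files.ignoreCorruptFiles=true ...
```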
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140261#comment-16140261 ] Boris Clémençon commented on SPARK-21797:

Steve, this is the stack trace:
{noformat}
WARN TaskSetManager: Lost task 0.1 in stage 0.0 (TID 1, ip-172-31-42-242.eu-west-1.compute.internal, executor 1): java.io.IOException: com.amazon.ws.emr.hadoop.fs.shaded.com.amazonaws.services.s3.model.AmazonS3Exception: The operation is not valid for the object's storage class (Service: Amazon S3; Status Code: 403; Error Code: InvalidObjectState; Request ID: 5DD5BEBB8173977D), S3 Extended Request ID: K9bDwhm32CFHeg5zgVfW/T1A/vB4e8gqQ/p7E0Ze9ZG55UFoDP7hgnkQxLIwYX9i2LEcKwrR+lo=
  at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.handleAmazonServiceException(Jets3tNativeFileSystemStore.java:434)
  at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrievePair(Jets3tNativeFileSystemStore.java:461)
  at com.amazon.ws.emr.hadoop.fs.s3n.Jets3tNativeFileSystemStore.retrievePair(Jets3tNativeFileSystemStore.java:439)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invokeMethod(RetryInvocationHandler.java:191)
  at org.apache.hadoop.io.retry.RetryInvocationHandler.invoke(RetryInvocationHandler.java:102)
  at com.sun.proxy.$Proxy30.retrievePair(Unknown Source)
  at com.amazon.ws.emr.hadoop.fs.s3n.S3NativeFileSystem.open(S3NativeFileSystem.java:1201)
  at org.apache.hadoop.fs.FileSystem.open(FileSystem.java:773)
  at com.amazon.ws.emr.hadoop.fs.EmrFileSystem.open(EmrFileSystem.java:166)
  at org.apache.parquet.hadoop.util.HadoopInputFile.newStream(HadoopInputFile.java:65)
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:443)
  at org.apache.parquet.hadoop.ParquetFileReader.readFooter(ParquetFileReader.java:421)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:491)
  at org.apache.spark.sql.execution.datasources.parquet.ParquetFileFormat$$anonfun$readParquetFootersInParallel$1.apply(ParquetFileFormat.scala:485)
  at scala.collection.parallel.AugmentedIterableIterator$class.flatmap2combiner(RemainsIterator.scala:132)
  at scala.collection.parallel.immutable.ParVector$ParVectorIterator.flatmap2combiner(ParVector.scala:62)
  at scala.collection.parallel.ParIterableLike$FlatMap.leaf(ParIterableLike.scala:1072)
  at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
  at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
  at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
  at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
  at scala.collection.parallel.ParIterableLike$FlatMap.tryLeaf(ParIterableLike.scala:1068)
  at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
  at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.compute(Tasks.scala:443)
  at scala.concurrent.forkjoin.RecursiveAction.exec(RecursiveAction.java:160)
  at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
  at scala.concurrent.forkjoin.ForkJoinTask.doJoin(ForkJoinTask.java:341)
  at scala.concurrent.forkjoin.ForkJoinTask.join(ForkJoinTask.java:673)
  at scala.collection.parallel.ForkJoinTasks$WrappedTask$class.sync(Tasks.scala:378)
  at scala.collection.parallel.AdaptiveWorkStealingForkJoinTasks$WrappedTask.sync(Tasks.scala:443)
  at scala.collection.parallel.ForkJoinTasks$class.executeAndWaitResult(Tasks.scala:426)
  at scala.collection.parallel.ForkJoinTaskSupport.executeAndWaitResult(TaskSupport.scala:56)
  at scala.collection.parallel.ParIterableLike$ResultMapping.leaf(ParIterableLike.scala:958)
  at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply$mcV$sp(Tasks.scala:49)
  at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
  at scala.collection.parallel.Task$$anonfun$tryLeaf$1.apply(Tasks.scala:48)
  at scala.collection.parallel.Task$class.tryLeaf(Tasks.scala:51)
  at scala.collection.parallel.ParIterableLike$ResultMapping.tryLeaf(ParIterableLike.scala:953)
  at scala.collection.parallel.AdaptiveWorkStealingTasks$WrappedTask$class.compute(Tasks.scala:152)
  at
{noformat}
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140147#comment-16140147 ] Steve Loughran commented on SPARK-21797:

Note that if it is failing just during Spark partition calculation, it is probably going down the directory tree and inspecting the files through HEAD requests, maybe looking at metadata entries too. So do attach the s3a and Spark stack traces so we can see what's going on: something may be over-enthusiastic about looking at files, or we could have something recognise the problem and recover from it.
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16140138#comment-16140138 ] Steve Loughran commented on SPARK-21797:

I was talking about the cost and time of getting data out of Glacier. If that's the only place where the data lives, then it is slow and expensive, and that is the bit I am describing as niche. Given I've been working full time on S3A, I'm reasonably confident it gets used a lot.

If you talk to data in S3 that has been backed up to Glacier, you *will get a 403*. According to Jeff Barr himself (https://aws.amazon.com/blogs/aws/archive-s3-to-glacier/):

bq. If you archive objects using the Glacier storage option, you must inspect the storage class of an object before you attempt to retrieve it. The customary GET request will work as expected if the object is stored in S3 Standard or Reduced Redundancy (RRS) storage. It will fail (with a 403 error) if the object is archived in Glacier. In this case, you must use the RESTORE operation (described below) to make your data available in S3.

bq. You use S3's new RESTORE operation to access an object archived in Glacier. As part of the request, you need to specify a retention period in days. Restoring an object will generally take 3 to 5 hours. Your restored object will remain in both Glacier and S3's Reduced Redundancy Storage (RRS) for the duration of the retention period. At the end of the retention period the object's data will be removed from S3; the object will remain in Glacier.

Like I said, I'd be interested in getting the full stack trace if you try to read this with an S3A client. Not for fixing, but for better reporting: probably pointing people at Jeff's blog entry. Or this JIRA :)
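For reference, the RESTORE operation Jeff Barr describes is exposed in the AWS SDK. A hedged sketch of restoring one archived partition object (AWS SDK for Java 1.x; the bucket, key, and retention period are placeholders, and as the quote above notes the object only becomes readable hours later):

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import com.amazonaws.services.s3.model.RestoreObjectRequest

// Sketch: ask S3 to restore a Glacier-archived object back into S3
// for 7 days. The restore itself typically completes in 3 to 5 hours;
// until then, GETs on the object keep failing with InvalidObjectState.
val s3 = AmazonS3ClientBuilder.defaultClient()
val req = new RestoreObjectRequest(
  "my-bucket",                                  // placeholder bucket
  "my-dataset/dt=2017-07-01/part-00000.parquet", // placeholder key
  7)                                             // retention, in days
s3.restoreObjectV2(req)
```

Restoring a whole dt= partition would mean issuing one such request per object under the prefix.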
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139986#comment-16139986 ] Sean Owen commented on SPARK-21797:

Sure, but in any event this is an operation that is fine with Spark but not fine with something between the AWS SDK and AWS. It's not something Spark can fix. If the source data is in S3, there's no way to avoid copying it from S3. Intermediate data produced by Spark can't live on S3, as S3 is too eventually consistent; some final results could. And yes, you pay to read/write S3, so in some use cases it might be more economical to keep intensely read/written data close to the compute workers for a time, rather than writing to and reading from S3 between several closely related jobs.
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139978#comment-16139978 ] Boris Clémençon commented on SPARK-21797: -- Hi Steve, to be sure we understand each other: *I don't want to read data from Glacier*. Concretely, I have a dataset in Parquet, partitioned by date in S3, with an automatic rule that freezes the oldest dates into Glacier (and, a few months later, deletes them altogether). I want to read only the most recent dates that are still in S3 (in a lazy way), not the ones in Glacier (see the example above), but even that I cannot do. Besides, why do you say that it is a niche use case? And why do you say reading data from S3 is a "very, very expensive way to work with data"? According to our tests, reading from S3 is at most 20% slower than reading from HDFS, and we operate from within AWS with an EMR cluster, so we should not pay for data IO from S3. On the other hand, copying the dataset to HDFS has a time overhead, and you need a cluster with enough disk to store the whole dataset, or at least the relevant dates (whereas you may want to process only a few columns, i.e. a fraction of the initial dataset). I would like your expertise on that. In any case, I understand your and Sean's argument that it is up to AWS to solve the problem.
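One way to get the lazy, single-read behaviour asked for here, without Spark ever touching the Glacier partitions, is to enumerate the wanted partition directories explicitly and pass them all to one parquet() call. A hedged sketch, assuming a live SparkSession named `spark` and that every date in the window is still in S3 Standard:

```scala
import java.time.LocalDate

// Illustrative only: build explicit partition paths for the window that is
// known to still be in S3, then read them in a single call. Partition
// discovery never lists the Glacier-archived dt directories.
val base = "s3://my-bucket/my-dataset"
val from = LocalDate.parse("2017-07-15")
val to   = LocalDate.parse("2017-08-24")

val paths = Iterator.iterate(from)(_.plusDays(1))
  .takeWhile(!_.isAfter(to))
  .map(d => s"$base/dt=$d")
  .toSeq

// "basePath" makes Spark recover dt as a partition column across the paths
val df = spark.read.option("basePath", base).parquet(paths: _*)
```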
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16139919#comment-16139919 ] Steve Loughran commented on SPARK-21797: If you are using s3:// URLs then it's the AWS team's problem. If you were using s3a://, it'd be something you ask the Hadoop team to look at, but we'd say no, as:
* it's a niche use case
* it's really slow, as in "read() takes so long that other bits of the system will start to think your worker is hanging", which means that if you have speculative execution turned on, other workers are kicked off to read the same data
* it's a very, very expensive way to work with data: $0.03/GB, which ramps up fast once multiple Spark workers start reading the same datasets in parallel
* finally, it's been rejected on the server with a 403 response. That's Amazon S3 saying "no", not any of the clients.

You shouldn't be trying to process data directly from Glacier. Copy it to S3 or a transient HDFS cluster, maybe as part of an Oozie or Airflow workflow. I'd be curious about the full stack trace you see if you do try this with s3a://, even though it'll still be a WONTFIX. We could at least go for a more meaningful exception translation, and the retry logic needs to know that the error won't go away if you try again.
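As the comments above note, the 403/InvalidObjectState comes from S3 itself, so any client-side mitigation has to inspect the storage class before reading. A hedged sketch using the AWS SDK for Java v1 (the SDK that the EMR filesystem shades); the helper name is hypothetical, and note that listObjects only returns the first page of keys, so a production version would paginate:

```scala
import com.amazonaws.services.s3.AmazonS3ClientBuilder
import scala.collection.JavaConverters._

// Sketch: treat a partition as readable only if none of its objects report
// the GLACIER storage class. Uses the first listing page only; a real
// implementation would paginate with listNextBatchOfObjects.
val s3 = AmazonS3ClientBuilder.defaultClient()

def partitionIsReadable(bucket: String, prefix: String): Boolean =
  s3.listObjects(bucket, prefix)
    .getObjectSummaries.asScala
    .forall(summary => summary.getStorageClass != "GLACIER")

// e.g. filter candidate partition prefixes before handing paths to
// spark.read.parquet (keyOf is a hypothetical path-to-key helper):
// val readable = paths.filter(p => partitionIsReadable("my-bucket", keyOf(p)))
```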
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16136669#comment-16136669 ] Boris Clémençon commented on SPARK-21797: -- OK, I see. I will have a look at the aws-sdk-java project. Thank you.
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16135236#comment-16135236 ] Sean Owen commented on SPARK-21797: --- Sure, but this would not be a change in Spark, but in the AWS SDK or the Glacier service. Spark can't do anything about it, so this isn't the right place.
[ https://issues.apache.org/jira/browse/SPARK-21797?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16135233#comment-16135233 ] Boris Clémençon commented on SPARK-21797: -- Hi Sean, thanks for the quick answer. I understand your point of view. However, it is a very common use case (and a good practice) to partition by date and send the oldest data to Glacier to optimize the costs of the data warehouse. Today, Spark cannot be used properly with such frequent architectures, which is a pity. Could we reconsider this issue as a "new feature" instead of a "bug"?
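For reference, the kind of automatic rule described in this thread (transitioning the oldest partitions to Glacier) is plain S3 bucket lifecycle configuration, not anything Spark controls. A hypothetical example of such a rule, of the shape accepted by `aws s3api put-bucket-lifecycle-configuration`; the prefix and day count are illustrative:

```json
{
  "Rules": [
    {
      "ID": "archive-old-partitions",
      "Filter": { "Prefix": "my-dataset/" },
      "Status": "Enabled",
      "Transitions": [
        { "Days": 90, "StorageClass": "GLACIER" }
      ]
    }
  ]
}
```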