[jira] [Commented] (SPARK-34298) SaveMode.Overwrite not usable when using s3a root paths

2021-06-03 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17356625#comment-17356625
 ] 

Steve Loughran commented on SPARK-34298:


Actually, you've just found a bug in Path.suffix, HADOOP-17744. Patches welcome 
there.

Whether that will fix your problem: who knows. But it will get you further.
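For anyone following along, here is a minimal sketch of how Path.suffix trips over a 
root path. It assumes an empty suffix argument, which is consistent with the "empty 
string" message in the stack trace below; the bucket name is just the reporter's example.

import org.apache.hadoop.fs.Path

object RootSuffixSketch {
  def main(args: Array[String]): Unit = {
    // Works: the last path component is "folder", so "folder" + "" is still a valid child name.
    println(new Path("s3a://peakhour-report/folder").suffix(""))
    // Fails: a root path has an empty final name, so suffix() ends up constructing a Path
    // from an empty string and checkPathArg throws IllegalArgumentException.
    println(new Path("s3a://peakhour-report/").suffix(""))
  }
}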

> SaveMode.Overwrite not usable when using s3a root paths 
> 
>
> Key: SPARK-34298
> URL: https://issues.apache.org/jira/browse/SPARK-34298
> Project: Spark
>  Issue Type: Bug
>  Components: Spark Core
>Affects Versions: 3.1.2
>Reporter: cornel creanga
>Priority: Minor
>
> SaveMode.Overwrite does not work when using paths containing just the root, e.g. 
> "s3a://peakhour-report". To reproduce the issue (an S3 bucket + credentials 
> are needed):
> import org.apache.spark.SparkContext
> import org.apache.spark.sql.{Row, SaveMode}
> import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}
> // spark (a SparkSession), accessKey and secretKey are assumed to be in scope
> val out = "s3a://peakhour-report"
> val sparkContext: SparkContext = SparkContext.getOrCreate()
> val someData = Seq(Row(24, "mouse"))
> val someSchema = List(StructField("age", IntegerType, true), StructField("word", StringType, true))
> val someDF = spark.createDataFrame(spark.sparkContext.parallelize(someData), StructType(someSchema))
> sparkContext.hadoopConfiguration.set("fs.s3a.access.key", accessKey)
> sparkContext.hadoopConfiguration.set("fs.s3a.secret.key", secretKey)
> sparkContext.hadoopConfiguration.set("fs.s3a.aws.credentials.provider", "org.apache.hadoop.fs.s3a.SimpleAWSCredentialsProvider")
> sparkContext.hadoopConfiguration.set("fs.s3a.impl", "org.apache.hadoop.fs.s3a.S3AFileSystem")
> someDF.write.format("parquet").partitionBy("age").mode(SaveMode.Overwrite).save(out)
> Error stacktrace:
> Exception in thread "main" java.lang.IllegalArgumentException: Can not create a Path from an empty string
>  at org.apache.hadoop.fs.Path.checkPathArg(Path.java:168)
>  at org.apache.hadoop.fs.Path.suffix(Path.java:446)
>  at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions(InsertIntoHadoopFsRelationCommand.scala:240)
> If you change out from val out = "s3a://peakhour-report" to val out = "s3a://peakhour-report/folder" the code works.
> There are two problems in the actual code of InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions:
> a) it uses the org.apache.hadoop.fs.Path.suffix method, which doesn't work on root paths
> b) it tries to delete the root folder directly (in our case the S3 bucket name), and this is prohibited (in the S3AFileSystem class)
> I think that there are two choices:
> a) throw an explicit error when using overwrite mode for root folders
> b) fix the actual issue: don't use the Path.suffix method, and change the clean-up code in InsertIntoHadoopFsRelationCommand.deleteMatchingPartitions to list the root folder contents and delete the entries one by one
> I can provide a patch for both choices, assuming that they make sense.
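A rough sketch of what choice b) above could look like, purely illustrative 
(clearDirectoryContents is a hypothetical helper name, not the existing code in 
InsertIntoHadoopFsRelationCommand), using only the standard Hadoop FileSystem API:

import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.{FileSystem, Path}

// Illustrative helper: clear the contents of an output directory without deleting the
// directory itself, so it also works when the output path is a bucket root that
// S3AFileSystem refuses to delete.
def clearDirectoryContents(outputPath: Path, hadoopConf: Configuration): Unit = {
  val fs: FileSystem = outputPath.getFileSystem(hadoopConf)
  if (fs.exists(outputPath)) {
    fs.listStatus(outputPath).foreach { status =>
      // Delete each child entry (file or partition directory) recursively.
      if (!fs.delete(status.getPath, true)) {
        throw new java.io.IOException(s"Unable to clear output entry ${status.getPath}")
      }
    }
  }
}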






[jira] [Commented] (SPARK-34298) SaveMode.Overwrite not usable when using s3a root paths

2021-02-05 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279798#comment-17279798
 ] 

Steve Loughran commented on SPARK-34298:


well, I'm sure a PR with tests will get reviewed...







[jira] [Commented] (SPARK-34298) SaveMode.Overwrite not usable when using s3a root paths

2021-02-04 Thread cornel creanga (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17279006#comment-17279006
 ] 

cornel creanga commented on SPARK-34298:


Thanks for the answer. In this case, wouldn't it be better to implement option a), 
i.e. throw an explicit error with a meaningful message (e.g. root dirs are not 
supported in overwrite mode) when trying to use a root dir? Right now one gets a 
java.lang.IndexOutOfBoundsException and has to dig into the Spark code in order to 
understand what the problem is (as happened to me).
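For illustration, a guard along the lines of option a) might look roughly like this 
(requireNonRootOutput is a hypothetical helper name, not existing Spark code):

import org.apache.hadoop.fs.Path

// Hypothetical check, illustrative only: fail fast with a clear message before any
// delete of the output location is attempted in overwrite mode.
def requireNonRootOutput(qualifiedOutputPath: Path): Unit = {
  if (qualifiedOutputPath.isRoot) {
    throw new IllegalArgumentException(
      s"SaveMode.Overwrite is not supported for root paths such as $qualifiedOutputPath; " +
        "write into a sub-directory of the bucket instead")
  }
}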





[jira] [Commented] (SPARK-34298) SaveMode.Overwrite not usable when using s3a root paths

2021-02-04 Thread Steve Loughran (Jira)


[ 
https://issues.apache.org/jira/browse/SPARK-34298?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17278846#comment-17278846
 ] 

Steve Loughran commented on SPARK-34298:


Root dirs are special in that they always exist. Normally apps like Spark and Hive 
don't notice this, as nobody ever runs jobs which write to the base of file:// or 
hdfs://; object stores are special there.

You might find the commit algorithms get a bit confused too. In which case:

* fixes for the s3a committers welcome;
* there is a serialized test phase in hadoop-aws where we do stuff against the 
root dir; 
* and a PR for real integration tests can go in 
https://github.com/hortonworks-spark/cloud-integration
* anything related to the classic MR committer will be rejected out of fear of 
going near it; not safe for s3 anyway

Otherwise: the workaround is to write into a subdir. Sorry.
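A sketch of that workaround, reusing the someDF write from the issue description 
(the "reports" sub-directory name is arbitrary):

// Workaround sketch: target a sub-directory of the bucket instead of the bucket root.
val out = "s3a://peakhour-report/reports" // any sub-directory works; "reports" is just an example
someDF.write.format("parquet").partitionBy("age").mode(SaveMode.Overwrite).save(out)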
