[ https://issues.apache.org/jira/browse/HADOOP-18856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756834#comment-17756834 ]
Steve Loughran edited comment on HADOOP-18856 at 8/21/23 11:09 AM:
-------------------------------------------------------------------
bq. Dataproc's latest 2.1 uses Hadoop 3.3.3 and this needs to be fixed in 3.3.3.

# That's Google's product, so their problem.
# Telling an open source project which version they need to fix misses a fundamental point: it is up to the project to decide which versions fixes go into. For Hadoop, 3.2.x gets security fixes only; for branch 3.3, fixes go into the latest release.
# This is HADOOP-18652, already fixed in Hadoop 3.3.6. Please upgrade.
# Or get Google to update theirs.

I'm going to warn that even with HADOOP-18652, you are unlikely to have a Spark job successfully write into the root path of an object store - any object store. Root directories are special, in that you can never delete them; most job/committer setup code assumes that an "rm -r $jobDest" will remove everything, including the actual directory of the job. It'll be up to you to work out whether this is the case for Google's code.

Assuming you are using FileOutputCommitter with the v2 algorithm, as Google recommends, it'll blow up as mkdirs() will fail (dest exists). We aren't going to fix that, as that committer is considered "stable; critical bug fixes only". Changes to the ManifestCommitter - which is designed to deliver correctness and performance on GCS and Azure storage - are welcome; a quick look at that code shows it can't handle / as a destination. Created MAPREDUCE-7452 for someone (you?) to handle.

To close then: unless you take the existing HADOOP-18652 patch and add support for / as a destination in the mapreduce/spark code, you are going to have to commit your work to a subdirectory. As noted, fixes to MAPREDUCE-7452 are welcome, targeting hadoop trunk and the 3.3.9 branch. Although the GCS connector isn't in our codebase, an integration test which targets Azure ABFS will suffice.

Sorry we can't be of more help; personally I'd write to a subdir.
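The failure mode described above - "rm -r $jobDest" followed by a mkdirs() on a destination directory that can never be deleted - can be sketched without Hadoop at all. The following is a hedged illustration using java.nio on a local temp directory standing in for the bucket root; the method names are invented for the sketch and this is not FileOutputCommitter's actual code.

```java
import java.io.IOException;
import java.nio.file.*;

public class RootDestSketch {
    // Roughly the setup pattern the comment describes: delete the old
    // destination tree, then recreate the destination directory.
    static void setupJobDest(Path dest, boolean destIsRoot) throws IOException {
        if (!destIsRoot) {
            // Normal case: "rm -r $jobDest" removes the directory itself...
            deleteRecursively(dest);
        }
        // ...so this exclusive create succeeds. For a root, the delete
        // cannot remove the directory itself, so the create fails: dest exists.
        Files.createDirectory(dest);
    }

    static void deleteRecursively(Path p) throws IOException {
        if (Files.isDirectory(p)) {
            try (DirectoryStream<Path> kids = Files.newDirectoryStream(p)) {
                for (Path kid : kids) deleteRecursively(kid);
            }
        }
        Files.deleteIfExists(p);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("root-dest-sketch");

        // Ordinary subdirectory destination: setup works.
        Path subdir = base.resolve("output");
        Files.createDirectory(subdir);
        setupJobDest(subdir, false);
        System.out.println("subdir setup ok: " + Files.isDirectory(subdir));

        // "Root" destination (stand-in: a directory we cannot delete,
        // just as a store root can never be deleted): setup blows up.
        try {
            setupJobDest(base, true);
        } catch (FileAlreadyExistsException e) {
            System.out.println("root setup failed: dest exists");
        }
    }
}
```

This is why the advice below is to commit to a subdirectory: a subdirectory can be deleted and recreated, the root cannot.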
/ is special
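The NPE this issue reports comes from Path.suffix on a root path. A minimal model of the mechanism, inferred from the stack trace in the report: suffix() builds a new path from getParent() and getName(), and getParent() is null for a root such as gs://test_dd123/, so the Path(parent, child) constructor dereferences null. The MiniPath class below is an illustrative stand-in, not the real org.apache.hadoop.fs.Path.

```java
import java.net.URI;

// Illustrative stand-in for org.apache.hadoop.fs.Path; names and
// behaviour are a sketch inferred from the stack trace, not the
// real implementation.
public class MiniPath {
    private final URI uri;

    public MiniPath(String s) { this.uri = URI.create(s); }

    public MiniPath(MiniPath parent, String child) {
        // Dereferencing a null parent here is the analogue of the
        // NPE at Path.<init> in the report.
        String base = parent.uri.toString();
        this.uri = URI.create(base.endsWith("/") ? base + child : base + "/" + child);
    }

    /** A root path has no parent: returns null, as Hadoop's Path does. */
    public MiniPath getParent() {
        String path = uri.getPath();
        if (path == null || path.isEmpty() || path.equals("/")) {
            return null; // root: gs://bucket/
        }
        int slash = path.lastIndexOf('/');
        String parentPath = slash <= 0 ? "/" : path.substring(0, slash);
        return new MiniPath(uri.getScheme() + "://" + uri.getAuthority() + parentPath);
    }

    public String getName() {
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** suffix(s) is roughly: new Path(getParent(), getName() + s). */
    public MiniPath suffix(String s) {
        return new MiniPath(getParent(), getName() + s);
    }

    @Override public String toString() { return uri.toString(); }

    public static void main(String[] args) {
        // Works for a non-root path:
        System.out.println(new MiniPath("gs://test_dd123/a").suffix("_1"));
        // NPEs for a bucket root, matching the report:
        try {
            new MiniPath("gs://test_dd123/").suffix("/num=123");
        } catch (NullPointerException e) {
            System.out.println("NPE on root, as in HADOOP-18856");
        }
    }
}
```

Upgrading to 3.3.6 picks up the HADOOP-18652 fix for this NPE, but as the comment above warns, a root destination still hits committer-level problems.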
> Spark insertInto with location GCS bucket root causes NPE
> ---------------------------------------------------------
>
>                 Key: HADOOP-18856
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18856
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>    Affects Versions: 3.3.3
>            Reporter: Dipayan Dev
>            Priority: Minor
>
> {noformat}
> scala> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.fs.Path
>
> scala> val path: Path = new Path("gs://test_dd123/")
> path: org.apache.hadoop.fs.Path = gs://test_dd123/
>
> scala> path.suffix("/num=123")
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.Path.<init>(Path.java:150)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:129)
>   at org.apache.hadoop.fs.Path.suffix(Path.java:450)
> {noformat}
>
> Path.suffix throws an NPE when writing into a GCS bucket root.
>
> In our organisation, we are using the GCS bucket root location to point to our Hive table. Dataproc's latest 2.1 uses *Hadoop* *3.3.3* and this needs to be fixed in 3.3.3.
> Spark Scala code to reproduce this issue:
> {noformat}
> val DF = Seq(("test1", 123)).toDF("name", "num")
> DF.write.option("path", "gs://test_dd123/").mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("schema_name.table_name")
>
> val DF1 = Seq(("test2", 125)).toDF("name", "num")
> DF1.write.mode(SaveMode.Overwrite).format("orc").insertInto("schema_name.table_name")
>
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.Path.<init>(Path.java:141)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:120)
>   at org.apache.hadoop.fs.Path.suffix(Path.java:441)
>   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:254)
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
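The comment's bottom line is to commit to a subdirectory rather than the bucket root. Composing such a destination avoids the root-path special cases entirely; a minimal sketch with plain java.net.URI, where the subdirectory name "output" is an arbitrary illustration:

```java
import java.net.URI;

public class SubdirDest {
    public static void main(String[] args) {
        URI bucketRoot = URI.create("gs://test_dd123/");
        // Resolve a job-specific subdirectory under the bucket instead
        // of writing to the root itself; "output" is an example name.
        URI jobDest = bucketRoot.resolve("output");
        System.out.println(jobDest); // gs://test_dd123/output
    }
}
```

The resulting path has a deletable parent directory, so both Path.suffix and committer setup/cleanup behave normally.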