[ https://issues.apache.org/jira/browse/HADOOP-18856?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17756834#comment-17756834 ]
Steve Loughran edited comment on HADOOP-18856 at 8/21/23 11:09 AM:
-------------------------------------------------------------------
bq. Dataproc's latest 2.1 uses Hadoop 3.3.3 and this needs to be fixed in 3.3.3.

# That's Google's product, so their problem.
# Telling an open source project which version they need to fix misses a fundamental point: it is up to the project to decide which versions fixes go into. For Hadoop, 3.2.x gets security fixes only; for branch 3.3, fixes go into the latest release.
# This is HADOOP-18652, already fixed in Hadoop 3.3.6. Please upgrade.
# Or get Google to update theirs.

I'm going to warn that even with HADOOP-18652, you are unlikely to have a Spark job successfully write into the root path of an object store - any object store. Root directories are special, in that you can never delete them; most job/committer setup code assumes that an "rm -r $jobDest" will remove everything, including the actual directory of the job. It'll be up to you to work out whether this is the case for Google's code.

Assuming you are using FileOutputCommitter with the v2 algorithm, as Google recommends, it'll blow up as mkdirs() will fail (dest exists). We aren't going to fix that, as that committer is considered "stable; critical bug fixes only". Changes to the ManifestCommitter - which is designed to deliver correctness and performance on GCS and Azure storage - are welcome; a quick look at that code shows it can't handle / as a destination. Created MAPREDUCE-7452 for someone (you?) to handle.

To close then: unless you take the existing HADOOP-18652 patch and add support for / as a destination in the mapreduce/spark code, you are going to have to commit your work to a subdirectory. As noted, fixes to MAPREDUCE-7452 are welcome, targeting hadoop trunk and the 3.3.9 branch. Although the GCS connector isn't in our codebase, an integration test which targets Azure ABFS will suffice.

Sorry we can't be of more help; personally I'd write to a subdir.
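The failure mode described above - "rm -r $jobDest" followed by a mkdirs() on a destination directory that can never be deleted - can be sketched without Hadoop at all. The following is a hedged illustration using java.nio on a local temp directory standing in for the bucket root; the method names are invented for the sketch and this is not FileOutputCommitter's actual code.

```java
import java.io.IOException;
import java.nio.file.*;

public class RootDestSketch {
    // Roughly the setup pattern the comment describes: delete the old
    // destination tree, then recreate the destination directory.
    static void setupJobDest(Path dest, boolean destIsRoot) throws IOException {
        if (!destIsRoot) {
            // Normal case: "rm -r $jobDest" removes the directory itself...
            deleteRecursively(dest);
        }
        // ...so this exclusive create succeeds. For a root, the delete
        // cannot remove the directory itself, so the create fails: dest exists.
        Files.createDirectory(dest);
    }

    static void deleteRecursively(Path p) throws IOException {
        if (Files.isDirectory(p)) {
            try (DirectoryStream<Path> kids = Files.newDirectoryStream(p)) {
                for (Path kid : kids) deleteRecursively(kid);
            }
        }
        Files.deleteIfExists(p);
    }

    public static void main(String[] args) throws IOException {
        Path base = Files.createTempDirectory("root-dest-sketch");

        // Ordinary subdirectory destination: setup works.
        Path subdir = base.resolve("output");
        Files.createDirectory(subdir);
        setupJobDest(subdir, false);
        System.out.println("subdir setup ok: " + Files.isDirectory(subdir));

        // "Root" destination (stand-in: a directory we cannot delete,
        // just as a store root can never be deleted): setup blows up.
        try {
            setupJobDest(base, true);
        } catch (FileAlreadyExistsException e) {
            System.out.println("root setup failed: dest exists");
        }
    }
}
```

This is why the advice below is to commit to a subdirectory: a subdirectory can be deleted and recreated, the root cannot.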
/ is special
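The NPE this issue reports comes from Path.suffix on a root path. A minimal model of the mechanism, inferred from the stack trace in the report: suffix() builds a new path from getParent() and getName(), and getParent() is null for a root such as gs://test_dd123/, so the Path(parent, child) constructor dereferences null. The MiniPath class below is an illustrative stand-in, not the real org.apache.hadoop.fs.Path.

```java
import java.net.URI;

// Illustrative stand-in for org.apache.hadoop.fs.Path; names and
// behaviour are a sketch inferred from the stack trace, not the
// real implementation.
public class MiniPath {
    private final URI uri;

    public MiniPath(String s) { this.uri = URI.create(s); }

    public MiniPath(MiniPath parent, String child) {
        // Dereferencing a null parent here is the analogue of the
        // NPE at Path.<init> in the report.
        String base = parent.uri.toString();
        this.uri = URI.create(base.endsWith("/") ? base + child : base + "/" + child);
    }

    /** A root path has no parent: returns null, as Hadoop's Path does. */
    public MiniPath getParent() {
        String path = uri.getPath();
        if (path == null || path.isEmpty() || path.equals("/")) {
            return null; // root: gs://bucket/
        }
        int slash = path.lastIndexOf('/');
        String parentPath = slash <= 0 ? "/" : path.substring(0, slash);
        return new MiniPath(uri.getScheme() + "://" + uri.getAuthority() + parentPath);
    }

    public String getName() {
        String path = uri.getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    /** suffix(s) is roughly: new Path(getParent(), getName() + s). */
    public MiniPath suffix(String s) {
        return new MiniPath(getParent(), getName() + s);
    }

    @Override public String toString() { return uri.toString(); }

    public static void main(String[] args) {
        // Works for a non-root path:
        System.out.println(new MiniPath("gs://test_dd123/a").suffix("_1"));
        // NPEs for a bucket root, matching the report:
        try {
            new MiniPath("gs://test_dd123/").suffix("/num=123");
        } catch (NullPointerException e) {
            System.out.println("NPE on root, as in HADOOP-18856");
        }
    }
}
```

Upgrading to 3.3.6 picks up the HADOOP-18652 fix for this NPE, but as the comment above warns, a root destination still hits committer-level problems.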
> Spark insertInto with location GCS bucket root causes NPE
> ---------------------------------------------------------
>
>                 Key: HADOOP-18856
>                 URL: https://issues.apache.org/jira/browse/HADOOP-18856
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: common
>    Affects Versions: 3.3.3
>            Reporter: Dipayan Dev
>            Priority: Minor
>
> {noformat}
> scala> import org.apache.hadoop.fs.Path
> import org.apache.hadoop.fs.Path
>
> scala> val path: Path = new Path("gs://test_dd123/")
> path: org.apache.hadoop.fs.Path = gs://test_dd123/
>
> scala> path.suffix("/num=123")
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.Path.<init>(Path.java:150)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:129)
>   at org.apache.hadoop.fs.Path.suffix(Path.java:450)
> {noformat}
>
> Path.suffix throws an NPE when writing into a GCS bucket root.
>
> In our organisation, we are using the GCS bucket root location to point to our Hive table. Dataproc's latest 2.1 uses *Hadoop* *3.3.3* and this needs to be fixed in 3.3.3.
> Spark Scala code to reproduce this issue:
> {noformat}
> val DF = Seq(("test1", 123)).toDF("name", "num")
> DF.write.option("path", "gs://test_dd123/").mode(SaveMode.Overwrite).partitionBy("num").format("orc").saveAsTable("schema_name.table_name")
>
> val DF1 = Seq(("test2", 125)).toDF("name", "num")
> DF1.write.mode(SaveMode.Overwrite).format("orc").insertInto("schema_name.table_name")
>
> java.lang.NullPointerException
>   at org.apache.hadoop.fs.Path.<init>(Path.java:141)
>   at org.apache.hadoop.fs.Path.<init>(Path.java:120)
>   at org.apache.hadoop.fs.Path.suffix(Path.java:441)
>   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:254)
> {noformat}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
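The comment's bottom line is to commit to a subdirectory rather than the bucket root. Composing such a destination avoids the root-path special cases entirely; a minimal sketch with plain java.net.URI, where the subdirectory name "output" is an arbitrary illustration:

```java
import java.net.URI;

public class SubdirDest {
    public static void main(String[] args) {
        URI bucketRoot = URI.create("gs://test_dd123/");
        // Resolve a job-specific subdirectory under the bucket instead
        // of writing to the root itself; "output" is an example name.
        URI jobDest = bucketRoot.resolve("output");
        System.out.println(jobDest); // gs://test_dd123/output
    }
}
```

The resulting path has a deletable parent directory, so both Path.suffix and committer setup/cleanup behave normally.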