Hi Mich,

It's not specific to ORC; it looks like a bug in the Hadoop Common project.
I have raised a JIRA and am happy to contribute a fix against Hadoop 3.3.0.
Do you know who could help me set the Assignee?
https://issues.apache.org/jira/browse/HADOOP-18856
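
For reference, a minimal sketch of what I believe is going on (the partition
fragment "/num=123" below is just illustrative; the frames match the stack
trace quoted further down the thread):

import org.apache.hadoop.fs.Path

// A bucket-root path such as gs://test_dd1/ has no parent, so getParent()
// returns null. Path.suffix() builds new Path(getParent(), getName() + suffix),
// and the Path constructor then throws on the null parent.
val root = new Path("gs://test_dd1/")
root.suffix("/num=123")  // java.lang.NullPointerException at Path.<init>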


With Best Regards,

Dipayan Dev



On Sun, Aug 20, 2023 at 2:47 AM Mich Talebzadeh <mich.talebza...@gmail.com>
wrote:

> Under the GCS directory
>
> "gs://test_dd1/abc/"
>
> What do you see?
>
> gsutil ls gs://test_dd1/abc
>
> and the same for
>
> gs://test_dd1/
>
> gsutil ls gs://test_dd1
>
> I suspect you need a folder for multiple ORC slices!
>
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>    LinkedIn: https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 19 Aug 2023 at 21:36, Dipayan Dev <dev.dipaya...@gmail.com> wrote:
>
>> Hi Everyone,
>>
>> I'm stuck on a problem where I need to provide a custom GCS location
>> for a Hive table from Spark. The code fails on an *'insert
>> into'* whenever the Hive table has a flat GCS location like
>> gs://<bucket_name>, but it works for nested locations like
>> gs://<bucket_name>/<blob_name>.
>>
>> Is anyone aware whether this is an issue on the Spark side, or is there
>> any config I need to pass to make it work?
>>
>> *The issue happens in both Spark 2.x and 3.x.*
>>
>> Config used:
>>
>> spark.conf.set("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict")
>> spark.conf.set("spark.hadoop.hive.exec.dynamic.partition", true)
>> spark.conf.set("hive.exec.dynamic.partition.mode", "nonstrict")
>> spark.conf.set("hive.exec.dynamic.partition", true)
>>
>>
>> *Case 1: FAILS*
>>
>> val DF = Seq(("test1", 123)).toDF("name", "num")
>> val partKey = List("num")
>>
>> // Table location is the bucket root: gs://test_dd1/
>> DF.write.option("path", "gs://test_dd1/")
>>   .mode(SaveMode.Overwrite)
>>   .partitionBy(partKey: _*)
>>   .format("orc")
>>   .saveAsTable("us_wm_supply_chain_otif_stg.test_tb1")
>>
>> val DF1 = Seq(("test2", 125)).toDF("name", "num")
>> DF1.write.mode(SaveMode.Overwrite)
>>   .format("orc")
>>   .insertInto("us_wm_supply_chain_otif_stg.test_tb1")
>>
>> java.lang.NullPointerException
>>   at org.apache.hadoop.fs.Path.<init>(Path.java:141)
>>   at org.apache.hadoop.fs.Path.<init>(Path.java:120)
>>   at org.apache.hadoop.fs.Path.suffix(Path.java:441)
>>   at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:254)
>>
>>
>> *Case 2: SUCCEEDS*
>>
>> val DF = Seq(("test1", 123)).toDF("name", "num")
>> val partKey = List("num")
>>
>> // Table location is one level below the bucket root: gs://test_dd1/abc/
>> DF.write.option("path", "gs://test_dd1/abc/")
>>   .mode(SaveMode.Overwrite)
>>   .partitionBy(partKey: _*)
>>   .format("orc")
>>   .saveAsTable("us_wm_supply_chain_otif_stg.test_tb2")
>>
>> val DF1 = Seq(("test2", 125)).toDF("name", "num")
>> DF1.write.mode(SaveMode.Overwrite)
>>   .format("orc")
>>   .insertInto("us_wm_supply_chain_otif_stg.test_tb2")
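>>
>> As a sanity check (just a sketch; table names as above), the resolved
>> table location can be inspected with:
>>
>> spark.sql("DESCRIBE FORMATTED us_wm_supply_chain_otif_stg.test_tb2")
>>   .filter($"col_name" === "Location")
>>   .show(false)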
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>>
>
