Re: Probable Spark Bug while inserting into flat GCS bucket?

2023-08-20 Thread Dipayan Dev
Hi Mich,

It's not specific to ORC; it looks like a bug in the Hadoop Common project.
I have raised a JIRA and am happy to contribute a fix against Hadoop 3.3.0. Do
you know who could help me set the Assignee?
https://issues.apache.org/jira/browse/HADOOP-18856
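
For anyone following along, here is a minimal sketch of what the JIRA describes,
runnable from a spark-shell (this assumes hadoop-common 3.x on the classpath; the
gs:// paths are just the illustrative ones from this thread). Path.suffix()
rebuilds the path from getParent(), and getParent() is null for a bucket-root
path, so the Path(parent, child) constructor hits the NPE:

import org.apache.hadoop.fs.Path

// Nested path: getParent() is gs://test_dd1/, so suffix() resolves normally.
val nested = new Path("gs://test_dd1/abc")
println(nested.suffix("/num=123"))   // gs://test_dd1/abc/num=123

// Bucket-root path: getParent() returns null, so suffix() fails inside the
// Path(parent, child) constructor with java.lang.NullPointerException,
// matching the Path.<init> frames in the stack trace below.
val root = new Path("gs://test_dd1/")
println(root.suffix("/num=123"))

That also lines up with your suggestion: pointing the table at a nested prefix
(Case 2) avoids the root-path parent lookup entirely.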


With Best Regards,

Dipayan Dev



On Sun, Aug 20, 2023 at 2:47 AM Mich Talebzadeh wrote:

> Under gs directory
>
> "gs://test_dd1/abc/"
>
> What do you see?
>
> gsutil ls gs://test_dd1/abc
>
> and the same
>
> gs://test_dd1/
>
> gsutil ls gs://test_dd1
>
> I suspect you need a folder for multiple ORC slices!
>
>
>
> Mich Talebzadeh,
> Solutions Architect/Engineering Lead
> London
> United Kingdom
>
>
>
>
>  https://en.everybodywiki.com/Mich_Talebzadeh
>
>
>
> *Disclaimer:* Use it at your own risk. Any and all responsibility for any
> loss, damage or destruction of data or any other property which may arise
> from relying on this email's technical content is explicitly disclaimed.
> The author will in no case be liable for any monetary damages arising from
> such loss, damage or destruction.
>
>
>
>
> On Sat, 19 Aug 2023 at 21:36, Dipayan Dev wrote:
>
>> Hi Everyone,
>>
>> I'm stuck with one problem, where I need to provide a custom GCS location
>> for the Hive table from Spark. The code fails while doing an *'insert
>> into'* whenever my Hive table has a flat GCS location like
>> gs://bucket_name, but works for nested locations like
>> gs://bucket_name/blob_name.
>>
>> Is anyone aware whether this is an issue on the Spark side, or is there a
>> config I need to pass for it?
>>
>> *The issue happens in both Spark 2.x and 3.x.*
>>
>> Config using:
>>
>> spark.conf.set("spark.hadoop.hive.exec.dynamic.partition.mode", "nonstrict")
>> spark.conf.set("spark.hadoop.hive.exec.dynamic.partition", true)
>> spark.conf.set("hive.exec.dynamic.partition.mode","nonstrict")
>> spark.conf.set("hive.exec.dynamic.partition", true)
>>
>>
>> *Case 1: FAILS*
>>
>> val DF = Seq(("test1", 123)).toDF("name", "num")
>>  val partKey = List("num").map(x => x)
>>
>> DF.write.option("path", 
>> "gs://test_dd1/").mode(SaveMode.Overwrite).partitionBy(partKey: 
>> _*).format("orc").saveAsTable("us_wm_supply_chain_otif_stg.test_tb1")
>>
>> val DF1 = Seq(("test2", 125)).toDF("name", "num")
>> DF.write.mode(SaveMode.Overwrite).format("orc").insertInto("us_wm_supply_chain_otif_stg.test_tb1")
>>
>>
>>
>>
>>
>> *java.lang.NullPointerException  at 
>> org.apache.hadoop.fs.Path.<init>(Path.java:141)  at 
>> org.apache.hadoop.fs.Path.<init>(Path.java:120)  at 
>> org.apache.hadoop.fs.Path.suffix(Path.java:441)  at 
>> org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelationCommand.$anonfun$getCustomPartitionLocations$1(InsertIntoHadoopFsRelationCommand.scala:254)*
>>
>>
>> *Case 2: Succeeds*
>>
>> val DF = Seq(("test1", 123)).toDF("name", "num")
>>  val partKey = List("num").map(x => x)
>>
>> DF.write.option("path", 
>> "gs://test_dd1/abc/").mode(SaveMode.Overwrite).partitionBy(partKey: 
>> _*).format("orc").saveAsTable("us_wm_supply_chain_otif_stg.test_tb2")
>>
>> val DF1 = Seq(("test2", 125)).toDF("name", "num")
>>
>> DF1.write.mode(SaveMode.Overwrite).format("orc").insertInto("us_wm_supply_chain_otif_stg.test_tb2")
>>
>>
>> With Best Regards,
>>
>> Dipayan Dev
>>
>

