RE: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Agrawal, Sanket
Hi, I tried replacing just this JAR, but I'm still getting errors.

From: Nagatomi Yasukazu 
Sent: Friday, September 8, 2023 9:35 AM
To: Agrawal, Sanket 
Cc: Chao Sun; Yeachan Park; user@spark.apache.org
Subject: [EXT] Re: Spark 3.4.1 and Hive 3.1.3

Hi Sanket,

While migrating to Hive 3.1.3 may resolve many issues, the link below suggests 
that there might still be some vulnerabilities present.
Do you think the specific vulnerability you're concerned about can be addressed 
with Hive 3.1.3?

https://mvnrepository.com/artifact/org.apache.hive/hive-exec/3.1.3

Regards,
Yasukazu

On Fri, Sep 8, 2023 at 12:36, Agrawal, Sanket <sankeagra...@deloitte.com.invalid> wrote:
Hi Chao,

The reason to migrate to Hive 3.1.3 is to remove a vulnerability from 
hive-exec-2.3.9.jar.

Thanks
Sanket

From: Chao Sun <sunc...@apache.org>
Sent: Thursday, September 7, 2023 10:23 PM
To: Agrawal, Sanket <sankeagra...@deloitte.com.invalid>
Cc: Yeachan Park <yeachan...@gmail.com>; user@spark.apache.org
Subject: [EXT] Re: Spark 3.4.1 and Hive 3.1.3

Hi Sanket,

Spark 3.4.1 currently only works with Hive 2.3.9, and it would require a lot of 
work to upgrade the Hive version to 3.x and up.

Normally though, you only need the Hive client in Spark to talk to 
HiveMetastore (HMS) for things like table or partition metadata information. In 
this case, Hive 2.3.9 used by Spark is already capable of communicating with 
HMS of other versions like Hive 3.x. So, could you share a bit of context why 
you want to use Hive 3.1.3 with Spark?
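
To illustrate that split, here is a minimal sketch (the metastore URI is a
placeholder, not from the thread) of Spark's bundled client pointed at an
external HMS:

from pyspark.sql import SparkSession

# Spark's built-in Hive 2.3.9 client can fetch metadata from an HMS that
# itself runs Hive 3.x; only metadata calls go through this client.
spark = (
    SparkSession.builder
    .appName("hms-metadata-check")
    .config("hive.metastore.uris", "thrift://metastore-host:9083")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()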

Chao


On Thu, Sep 7, 2023 at 6:22 AM Agrawal, Sanket <sankeagra...@deloitte.com.invalid> wrote:
Hi

I tried using the Maven option and it's working. But we are not allowed to
download jars at runtime from Maven because of some security restrictions.

So, I tried again by downloading Hive 3.1.3 and giving the location of its jars,
and it worked this time. But now our Docker image has 40 new critical
vulnerabilities due to Hive (scanned by AWS Inspector).

So, the only solution I see here is to build Spark 3.4.1 with Hive 3.1.3. But
when I do so, the build fails while compiling the files in /spark/sql/hive,
whereas building Spark 3.4.1 with Hive 2.3.9 completes successfully.

Has anyone tried building Spark 3.4.1 with Hive 3.1.3 or higher?

Thanks,
Sanket A.

From: Yeachan Park <yeachan...@gmail.com>
Sent: Tuesday, September 5, 2023 8:52 PM
To: Agrawal, Sanket <sankeagra...@deloitte.com>
Cc: user@spark.apache.org
Subject: [EXT] Re: Spark 3.4.1 and Hive 3.1.3

What's the full traceback when you run the same thing via spark-shell? So 
something like:

$SPARK_HOME/bin/spark-shell \
   --conf "spark.sql.hive.metastore.version=3.1.3" \
   --conf "spark.sql.hive.metastore.jars=path" \
   --conf "spark.sql.hive.metastore.jars.path=/opt/hive/lib/*.jar"

W.r.t. building Hive, there's no need - either download it from
https://downloads.apache.org/hive/hive-3.1.3/ or use the Maven option like
Yasukazu suggested. If you do want to build it, make sure you are using Java 8
to do so.

On Tue, Sep 5, 2023 at 12:00 PM Agrawal, Sanket <sankeagra...@deloitte.com> wrote:
Hi,

I tried pointing to Hive 3.1.3 using the command below, but I'm still getting an
error. I see that spark-hive-thriftserver_2.12/3.4.1 and spark-hive_2.12/3.4.1
depend on Hive 2.3.9.

Command: pyspark --conf "spark.sql.hive.metastore.version=3.1.3" --conf 
"spark.sql.hive.metastore.jars=path" --conf 
"spark.sql.hive.metastore.jars.path=file://opt/hive/lib/*.jar"

Error:


Also, when I try to build Spark with Hive 3.1.3 I get the following error.

If anyone can give me some direction then it would be of great help.

Thanks,
Sanket

From: Yeachan Park <yeachan...@gmail.com>
Sent: Tuesday, September 5, 2023 1:32 AM
To: Agrawal, Sanket <sankeagra...@deloitte.com.invalid>
Cc: user@spark.apache.org
Subject: [EXT] Re: Spark 3.4.1 and Hive 3.1.3

Hi,

Why not download/build the Hive 3.1.3 bundle and tell Spark to use that? See
https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html

Basically, set:
spark.sql.hive.metastore.version 3.1.3
spark.sql.hive.metastore.jars path
spark.sql.hive.metastore.jars.path

On Mon, Sep 4, 2023 at 7:42 PM Agrawal, Sanket <sankeagra...@deloitte.com.invalid> wrote:
Hi,

Has anyone tried building Spark 3.4.1 with Hive 3.1.3? I tried by making the
below changes in the Spark pom.xml but it's failing.

Pom.xml

Error:

Can anyone help me with the required configurations?

Thanks,
SA
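
As a minimal sketch, Yeachan's three settings can also be passed from a PySpark
session builder; the /opt/hive/lib location mirrors the commands above and is
otherwise an assumption (note the three slashes in file:/// for a local path):

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("hive-313-metastore")
    .config("spark.sql.hive.metastore.version", "3.1.3")
    .config("spark.sql.hive.metastore.jars", "path")
    # The glob must expand to the Hive 3.1.3 jars on the local filesystem.
    .config("spark.sql.hive.metastore.jars.path", "file:///opt/hive/lib/*.jar")
    .enableHiveSupport()
    .getOrCreate()
)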

Re: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Nagatomi Yasukazu
Hi Sanket,

While migrating to Hive 3.1.3 may resolve many issues, the link below
suggests that there might still be some vulnerabilities present.
Do you think the specific vulnerability you're concerned about can be
addressed with Hive 3.1.3?

https://mvnrepository.com/artifact/org.apache.hive/hive-exec/3.1.3

Regards,
Yasukazu

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
I mean, have you checked if this is in your jar? Are you building an
assembly? Where do you expect elastic classes to be and are they there?
Need some basic debugging here.
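
A quick way to do that check (a sketch; the jar path is an assumption) is to
list the assembly's entries, since a jar is just a zip archive:

import zipfile

# Print every Elasticsearch-related entry inside the application jar.
with zipfile.ZipFile("target/my-app-assembly.jar") as jar:
    hits = [n for n in jar.namelist() if "elasticsearch" in n.lower()]
    print("\n".join(hits) if hits else "no elasticsearch classes found")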

On Thu, Sep 7, 2023, 8:49 PM Dipayan Dev  wrote:

> Hi Sean,
>
> Removed the provided thing, but still the same issue.
>
> 
> <dependency>
>     <groupId>org.elasticsearch</groupId>
>     <artifactId>elasticsearch-spark-30_${scala.compat.version}</artifactId>
>     <version>7.12.1</version>
> </dependency>
> 
>
>
> On Fri, Sep 8, 2023 at 4:41 AM Sean Owen  wrote:
>
>> By marking it provided, you are not including this dependency with your
>> app. If it is also not somehow already provided by your spark cluster (this
>> is what it means), then yeah this is not anywhere on the class path at
>> runtime. Remove the provided scope.
>>
>> On Thu, Sep 7, 2023, 4:09 PM Dipayan Dev  wrote:
>>
>>> Hi,
>>>
>>> Can you please elaborate on your last response? I don’t have any external
>>> dependencies added, and just updated the Spark version as mentioned below.
>>>
>>> Can someone help me with this?
>>>
>>> On Fri, 1 Sep 2023 at 5:58 PM, Koert Kuipers  wrote:
>>>
 could the provided scope be the issue?

 On Sun, Aug 27, 2023 at 2:58 PM Dipayan Dev 
 wrote:

> Using the following dependency for Spark 3 in POM file (My Scala
> version is 2.12.14)
>
>
>
>
>
>
> <dependency>
>     <groupId>org.elasticsearch</groupId>
>     <artifactId>elasticsearch-spark-30_2.12</artifactId>
>     <version>7.12.0</version>
>     <scope>provided</scope>
> </dependency>
>
>
> The code throws an error at this line:
> df.write.format("es").mode("overwrite").options(elasticOptions).save("index_name")
> The same code is working with Spark 2.4.0 and the following dependency
>
>
>
>
>
> <dependency>
>     <groupId>org.elasticsearch</groupId>
>     <artifactId>elasticsearch-spark-20_2.12</artifactId>
>     <version>7.12.0</version>
> </dependency>
>
>
> On Mon, 28 Aug 2023 at 12:17 AM, Holden Karau 
> wrote:
>
>> What’s the version of the ES connector you are using?
>>
>> On Sat, Aug 26, 2023 at 10:17 AM Dipayan Dev 
>> wrote:
>>
>>> Hi All,
>>>
>>> We're using Spark 2.4.x to write dataframe into the Elasticsearch
>>> index.
>>> As we're upgrading to Spark 3.3.0, it's throwing this error:
>>> Caused by: java.lang.ClassNotFoundException: es.DefaultSource
>>> at
>>> java.base/java.net.URLClassLoader.findClass(URLClassLoader.java:476)
>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:589)
>>> at java.base/java.lang.ClassLoader.loadClass(ClassLoader.java:522)
>>>
>>> Looking at a few responses from Stack Overflow, it seems this is not yet
>>> supported by elasticsearch-hadoop.
>>>
>>> Does anyone have experience with this? Or faced/resolved this issue
>>> in Spark 3?
>>>
>>> Thanks in advance!
>>>
>>> Regards
>>> Dipayan
>>>
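
One way to narrow this down (a sketch, not advice given in the thread): the
short name "es" resolves through the connector's META-INF/services
registration, so trying the fully qualified source name helps separate a
missing jar from a lost registration (service files can be dropped when a
shaded assembly is built without merging them):

# If this succeeds while format("es") fails, the jar is present but its
# DataSourceRegister entry was lost; if both fail, the jar is missing.
(df.write
    .format("org.elasticsearch.spark.sql")  # the package behind the "es" alias
    .mode("overwrite")
    .options(**elasticOptions)              # same options dict as above
    .save("index_name"))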
>> --
>> Twitter: https://twitter.com/holdenkarau
>> Books (Learning Spark, High Performance Spark, etc.):
>> https://amzn.to/2MaRAG9  
>> YouTube Live Streams: https://www.youtube.com/user/holdenkarau
>>
>


Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
Hi Sean,

Removed the provided thing, but still the same issue.


<dependency>
    <groupId>org.elasticsearch</groupId>
    <artifactId>elasticsearch-spark-30_${scala.compat.version}</artifactId>
    <version>7.12.1</version>
</dependency>





RE: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Agrawal, Sanket
Hi Chao,

The reason to migrate to Hive 3.1.3 is to remove a vulnerability from 
hive-exec-2.3.9.jar.

Thanks
Sanket


Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Sean Owen
By marking it provided, you are not including this dependency with your
app. If it is also not somehow already provided by your spark cluster (this
is what it means), then yeah this is not anywhere on the class path at
runtime. Remove the provided scope.



Re: Change default timestamp offset on data load

2023-09-07 Thread Jack Goodson
Thanks Mich, figured that might be the case. Regardless, appreciate the help :)

On Thu, Sep 7, 2023 at 8:36 PM Mich Talebzadeh 
wrote:

> Hi,
>
> As far as I am aware there is no Spark or JVM setting that can make Spark
> assume a different timezone during the initial load from Parquet as Parquet
> files store timestamps in UTC. The timezone conversion can be done (as I
> described before) after the load.
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
> On Thu, 7 Sept 2023 at 01:42, Jack Goodson  wrote:
>
>> Thanks Mich, sorry, I might have been a bit unclear in my original email.
>> The timestamps are getting loaded as 2003-11-24T09:02:32+0000, for example,
>> but I want them loaded as 2003-11-24T09:02:32+1300. I know how to do this
>> with various transformations; however, I'm wondering if there are any Spark
>> or JVM settings I can change so it assumes +1300 (as the time in the column
>> is relative to NZ local time, not UTC) on load instead of +0000. I inspected
>> the parquet column with my created date with pyarrow, with the below results.
>>
>> I had a look in here
>> https://github.com/apache/parquet-format/blob/master/LogicalTypes.md and
>> it looks like I need isAdjustedUTC=false (maybe?) but am at a loss on how
>> to set it.
>>
>> 
>>
>>   file_offset: 6019
>>
>>   file_path:
>>
>>   physical_type: INT96
>>
>>   num_values: 4
>>
>>   path_in_schema: created
>>
>>   is_stats_set: False
>>
>>   statistics:
>>
>> None
>>
>>   compression: SNAPPY
>>
>>   encodings: ('BIT_PACKED', 'PLAIN', 'RLE')
>>
>>   has_dictionary_page: False
>>
>>   dictionary_page_offset: None
>>
>>   data_page_offset: 6019
>>
>>   total_compressed_size: 90
>>
>>   total_uncompressed_size: 103
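
For reference, a metadata dump like the block above can be produced with
pyarrow along these lines (a sketch; the file path is a placeholder and the
column is looked up by name):

import pyarrow.parquet as pq

# Print the column-chunk metadata for the "created" column in each row group.
meta = pq.ParquetFile("parquet_file_path").metadata
idx = meta.schema.names.index("created")
for rg in range(meta.num_row_groups):
    print(meta.row_group(rg).column(idx))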
>>
>> On Wed, Sep 6, 2023 at 8:14 PM Mich Talebzadeh 
>> wrote:
>>
>>> Hi Jack,
>>>
>>> You may use from_utc_timestamp and to_utc_timestamp to see if they help.
>>>
>>> from pyspark.sql.functions import from_utc_timestamp
>>>
>>> You can read your Parquet file into DF
>>>
>>> df = spark.read.parquet('parquet_file_path')
>>>
>>> # Convert timestamps (assuming your column name) from UTC to the
>>> # Pacific/Auckland timezone
>>>
>>> df_with_local_timezone = df.withColumn(
>>>     'timestamp', from_utc_timestamp(df['timestamp'], 'Pacific/Auckland'))
>>>
>>> HTH
>>>
>>> Mich Talebzadeh,
>>> Solutions Architect & Engineer
>>> London
>>> United Kingdom
>>>
>>>
>>>
>>>
>>> On Wed, 6 Sept 2023 at 04:19, Jack Goodson 
>>> wrote:
>>>
 Hi,

>>> I've got a number of tables that I'm loading in from a SQL server. The
>>> timestamp in SQL server is stored like 2003-11-24T09:02:32. I get these
>>> as parquet files in our raw storage location and pick them up in
>>> Databricks. When I load the data in Databricks, the dataframe/Spark assumes
>>> UTC or +0000 on the timestamp, like 2003-11-24T09:02:32+0000. The time
>>> and date are the same as what's in SQL server, but the offset is incorrect.

>>> I've tried various methods, like the below code, to set the JVM timezone
>>> to my local timezone, but when viewing the data it seems to just subtract
>>> the offset from the timestamp and add it to the offset part, like
>>> 2003-11-24T09:02:32+0000 -> 2003-11-23T20:02:32+1300 (NZ has a +13 offset
>>> in summer).

>>> spark = pyspark.sql.SparkSession \
>>>     .Builder() \
>>>     .appName('test') \
>>>     .master('local') \
>>>     .config('spark.driver.extraJavaOptions',
>>>             '-Duser.timezone=Pacific/Auckland') \
>>>     .config('spark.executor.extraJavaOptions',
>>>             '-Duser.timezone=Pacific/Auckland') \
>>>     .config('spark.sql.session.timeZone', 'Pacific/Auckland') \
>>>     .getOrCreate()
>>>
>>> I understand that in Parquet these are stored as UNIX time and aren't
>>> timezone aware; however, are there any settings that would make Spark
>>> assume a different timezone when loading them?

Re: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Chao Sun
Hi Sanket,

Spark 3.4.1 currently only works with Hive 2.3.9, and it would require a
lot of work to upgrade the Hive version to 3.x and up.

Normally though, you only need the Hive client in Spark to talk to
HiveMetastore (HMS) for things like table or partition metadata
information. In this case, Hive 2.3.9 used by Spark is already capable of
communicating with HMS of other versions like Hive 3.x. So, could you share
a bit of context why you want to use Hive 3.1.3 with Spark?

Chao


On Thu, Sep 7, 2023 at 6:22 AM Agrawal, Sanket
 wrote:

> Hi
>
>
>
> I Tried using the maven option and it’s working. But we are not allowed to
> download jars at runtime from maven because of some security restrictions.
>
>
>
> So, I tried again with downloading hive 3.1.3 and giving the location of
> jars and it worked this time. But now in our docker image we have 40 new
> Critical vulnerabilities due to Hive (scanned by AWS Inspector).
>
>
>
> So, The only solution I see here is to build *Spark 3.4.1* *with Hive
> 3.1.3*. But when I do so the build is failing while compiling the files
> in /spark/sql/hive. But when I am trying to build *Spark 3.4.1* *with
> Hive 2.3.9* the build is completed successfully.
>
>
>
> Has anyone tried building Spark 3.4.1 with Hive 3.1.3 or higher?
>
>
>
> Thanks,
>
> Sanket A.
>
>
>
> *From:* Yeachan Park 
> *Sent:* Tuesday, September 5, 2023 8:52 PM
> *To:* Agrawal, Sanket 
> *Cc:* user@spark.apache.org
> *Subject:* [EXT] Re: Spark 3.4.1 and Hive 3.1.3
>
>
>
> What's the full traceback when you run the same thing via spark-shell? So
> something like:
>
>
>
> $SPARK_HOME/bin/spark-shell \
>--conf "spark.sql.hive.metastore.version=3.1.3" \
>--conf "spark.sql.hive.metastore.jars=path" \
>--conf "spark.sql.hive.metastore.jars.path=/opt/hive/lib/*.jar"
>
>
>
> W.r.t building hive, there's no need - either download it from
> https://downloads.apache.org/hive/hive-3.1.3/
> 
> or use the maven option like Yasukazu suggested. If you do want to build it
> make sure you are using Java 8 to do so.
>
>
>
> On Tue, Sep 5, 2023 at 12:00 PM Agrawal, Sanket 
> wrote:
>
> Hi,
>
>
>
> I tried pointing to hive 3.1.3 using the below command. But still getting
> error. I see that the spark-hive-thriftserver_2.12/3.4.1 and
> spark-hive_2.12/3.4.1 have dependency on hive 2.3.9
>
>
>
> Command: pyspark --conf "spark.sql.hive.metastore.version=3.1.3" --conf
> "spark.sql.hive.metastore.jars=path" --conf
> "spark.sql.hive.metastore.jars.path=file://opt/hive/lib/*.jar"
>
>
>
> Error:
>
>
>
>
>
> Also, when I am trying to build spark with Hive 3.1.3 I am getting
> following error.
>
>
>
> If anyone can give me some direction then it would of great help.
>
>
>
> Thanks,
>
> Sanket
>
>
>
> *From:* Yeachan Park 
> *Sent:* Tuesday, September 5, 2023 1:32 AM
> *To:* Agrawal, Sanket 
> *Cc:* user@spark.apache.org
> *Subject:* [EXT] Re: Spark 3.4.1 and Hive 3.1.3
>
>
>
> Hi,
>
>
>
> Why not download/build the hive 3.1.3 bundle and tell Spark to use that?
> See https://spark.apache.org/docs/latest/sql-data-sources-hive-tables.html
> 
>
>
>
> Basically, set:
>
> spark.sql.hive.metastore.version 3.1.3
>
> spark.sql.hive.metastore.jars path
>
> spark.sql.hive.metastore.jars.path 
>
>
>
> On Mon, Sep 4, 2023 at 7:42 PM Agrawal, Sanket <
> sankeagra...@deloitte.com.invalid> wrote:
>
> Hi,
>
>
>
> Has anyone tried building Spark 3.4.1 with Hive 3.1.3. I tried by making
> below changes in spark pom.xml but it’s failing.
>
>
>
> Pom.xml
>
>
>
> Error:
>
>
>
> Can anyone help me with the required configurations?
>
>
>
> Thanks,
>
> SA
>
> This message (including any attachments) contains confidential information
> intended for a specific individual and purpose, and is 

Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
Hi,

Can you please elaborate on your last response? I don’t have any external
dependencies added, and just updated the Spark version as mentioned below.

Can someone help me with this?



Re: Elasticsearch support for Spark 3.x

2023-09-07 Thread Dipayan Dev
++ Dev



Re: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Yeachan Park
Hi,

The Maven option is good for testing, but I wouldn't recommend running it in
production from a security perspective; also, depending on your setup, you
might be downloading jars at the start of every Spark session.

By the way, Spark definitely does not require all the jars from Hive, since you
are only trying to connect to the metastore. Can you try pointing
spark.sql.hive.metastore.jars.path to just the following jars from Hive 3.1.3:
- hive-common-3.1.3.jar
- hive-metastore-3.1.3.jar
- hive-shims-common-3.1.3.jar
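
A sketch of wiring that minimal set into a session (only the jar names come
from the list above; the /opt/hive/lib location is an assumption):

from pyspark.sql import SparkSession

jars = ",".join(
    "file:///opt/hive/lib/" + j
    for j in (
        "hive-common-3.1.3.jar",
        "hive-metastore-3.1.3.jar",
        "hive-shims-common-3.1.3.jar",
    )
)
spark = (
    SparkSession.builder
    .config("spark.sql.hive.metastore.version", "3.1.3")
    .config("spark.sql.hive.metastore.jars", "path")
    .config("spark.sql.hive.metastore.jars.path", jars)
    .enableHiveSupport()
    .getOrCreate()
)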


Re: Seeking Professional Advice on Career and Personal Growth in the Apache Spark Community

2023-09-07 Thread Mich Talebzadeh
Hi Varun,

With all that said, I forgot one worthy sentence.

"It doesn't really matter what background you come from or your wealth,
everything is possible. Use every negative source in your life as a
positive and you will never ever fail!"

Cheers

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom






On Wed, 6 Sept 2023 at 18:33, Mich Talebzadeh 
wrote:

> Hi Varun,
>
> In answer to your questions, these are my views. However, they are just
> views and cannot be taken as facts, so to speak.
>
> 1. *Focus and Time Management:* I often struggle with maintaining focus
>    and effectively managing my time. This leads to productivity issues and
>    affects my ability to undertake and complete projects efficiently.
>
>    - Set clear goals.
>    - Prioritize tasks.
>    - Create a to-do list.
>    - Avoid multitasking.
>    - Eliminate distractions.
>    - Take regular breaks.
>    - Go to the gym and try to rest your mind and refresh yourself.
>
> 2. *Graduate Studies Dilemma:*
>
>    - Your mileage varies and it all depends on what you are trying to
>      achieve. Graduate studies will help you to think independently and
>      out of the box. They will also lead you on "how to go about solving
>      the problem", so they will give you that experience.
>
> 3. *Long-Term Project Building:* I am interested in working on long-term
>    projects, but I am uncertain about the right approach and how to stay
>    committed throughout the project's lifecycle.
>
>    - I assume you have a degree. That means you had the discipline to
>      wake up in the morning, go to lectures and not miss them (hopefully
>      you did not!). In other words, it proves that you have already been
>      through a structured discipline and you have the will to do it.
>
> 4. *Overcoming Fear of Failure and Procrastination:* I often find myself
>    in a constant state of fear of failure, which leads to abandoning pet
>    projects shortly after starting them or procrastinating over initiating
>    new ones.
>
>    - Failure is natural and does happen. However, the important point is
>      that you learn from your failures; just call them experience. You
>      need to overcome the fear of failure and embrace the challenges.
>
> 5. *Risk Aversion:* With no inherited wealth or financial security, I am
>    often apprehensive about taking risks, even when they may potentially
>    lead to significant personal or professional growth.
>
>    - Welcome to the club! In 2020, it was estimated that in the UK the
>      richest 10% of households held 43% of all wealth, while the poorest
>      50% owned just 9%. Risk is part of life. When crossing the street,
>      you are taking a calculated view of the cars coming and going. In
>      short, risk assessment is a fundamental aspect of life!
>
> HTH
>
> Mich Talebzadeh,
> Distinguished Technologist, Solutions Architect & Engineer
> London
> United Kingdom
>
>
>
>
>
>
> On Tue, 5 Sept 2023 at 22:17, Varun Shah 
> wrote:
>
>> Dear Apache Spark Community,
>>
>> I hope this email finds you well. I am writing to seek your valuable
>> insights and advice on some challenges I've been facing in my career and
>> personal development journey, particularly in the context of Apache Spark
>> and the broader big data ecosystem.
>>
>> A little background about myself: I graduated in 2019 and have since been
>> working in the field of AWS cloud and big data tools such as Spark,
>> Airflow, AWS services, Databricks, and Snowflake. My interest in the world
>> of big data tools dates back to 2016-17, when I initially began exploring
>> concepts like big data with Spark using Scala, and the Scala ecosystem,
>> including technologies like Akka. 

RE: Spark 3.4.1 and Hive 3.1.3

2023-09-07 Thread Agrawal, Sanket
Hi

I tried using the Maven option and it's working. But we are not allowed to
download jars at runtime from Maven because of some security restrictions.

So, I tried again by downloading Hive 3.1.3 and giving the location of its
jars, and it worked this time. But now our Docker image has 40 new critical
vulnerabilities due to Hive (scanned by AWS Inspector).

So, the only solution I see here is to build Spark 3.4.1 with Hive 3.1.3. But
when I do so, the build fails while compiling the files in /spark/sql/hive,
whereas building Spark 3.4.1 with Hive 2.3.9 completes successfully.

Has anyone tried building Spark 3.4.1 with Hive 3.1.3 or higher?

Thanks,
Sanket A.



Re: Change default timestamp offset on data load

2023-09-07 Thread Mich Talebzadeh
Hi,

As far as I am aware there is no Spark or JVM setting that can make Spark
assume a different timezone during the initial load from Parquet as Parquet
files store timestamps in UTC. The timezone conversion can be done (as I
described before) after the load.

HTH

Mich Talebzadeh,
Distinguished Technologist, Solutions Architect & Engineer
London
United Kingdom





