Re: [EXTERNAL] Spark Thrift Server - Autoscaling on K8

2023-03-09 Thread Saurabh Gulati
Hey Jayabindu,
We use thriftserver on K8S. May I ask why you are not going for Trino instead? I know it didn't support autoscaling when we tested it in the past, but I'm not sure if it does now.
Autoscaling also means that users might have to wait for the cluster to scale up, but that usually doesn't take long, and once it's done, subsequent queries have the new nodes available.
Also, the workload on our thriftserver is not that large, so this setup serves the purpose for now.
You can also take a look at Apache Kyuubi.

I can put in some details below and attach the config we use for spark 
thriftserver, you can pick whatever is relevant for you:

  *   We run thriftserver on default (stable) nodes and its executors on preemptible (spot) nodes
  *   We use driver and executor pod templates to make the above possible via node selectors
  *   We use fair scheduling to manage the workload (a rough config sketch follows below)
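
A rough sketch of how the above could look in spark-defaults.conf (the file paths below are placeholders, not our actual attached config; the pod template YAMLs are what carry the nodeSelector entries for the stable vs. spot node pools):

spark.kubernetes.driver.podTemplateFile      /opt/spark/conf/driver-template.yaml
spark.kubernetes.executor.podTemplateFile    /opt/spark/conf/executor-template.yaml
spark.scheduler.mode                         FAIR
spark.scheduler.allocation.file              /opt/spark/conf/fairscheduler.xml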


Mvg/Regards
Saurabh

From: Jayabindu Singh 
Sent: 09 March 2023 06:31
To: u...@spark.incubator.apache.org 
Subject: [EXTERNAL] Spark Thrift Server - Autoscaling on K8


Hi All,

We are in the process of moving our workloads to K8 and looking for some 
guidance to run Spark Thrift Server on K8.
We need the executor pods to autoscale based on the workload vs running it with 
a static number of executors.

If anyone has done it and can share the details, it would be really appreciated.

Regards
Jayabindu Singh




spark-defaults.conf
Description: spark-defaults.conf


Re: [EXTERNAL] Re: Online classes for spark topics

2023-03-09 Thread Saurabh Gulati
Hey guys,
It's a nice idea and I appreciate the effort you guys are taking.
I can add to the list of topics which might be of interest:

  *   Spark UI
  *   Dynamic allocation
  *   Tuning of jobs
  *   Collecting spark metrics for monitoring and alerting

HTH

From: Mich Talebzadeh 
Sent: 09 March 2023 09:00
To: Deepak Sharma 
Cc: Denny Lee ; Sofia’s World ; 
User ; Winston Lai ; 
ashok34...@yahoo.com ; asma zgolli 
; karan alang 
Subject: [EXTERNAL] Re: Online classes for spark topics


Hi Deepak,

The priority list of topics is a very good point. The thread owner mentioned Spark on k8s, Data Science and Spark Structured Streaming. Which other topics need to be included depends on demand, I guess. I suggest we wait a couple of days to see the demand.

We just need to create a draft list of topics of interest and share them in the 
forum to get the priority order.

Well, those are my thoughts.

Cheers






 




On Thu, 9 Mar 2023 at 06:13, Deepak Sharma 
mailto:deepakmc...@gmail.com>> wrote:
I can prepare some topics and present as well, if we have a prioritised list of topics already.

On Thu, 9 Mar 2023 at 11:42 AM, Denny Lee 
mailto:denny.g@gmail.com>> wrote:
We used to run Spark webinars on the Apache Spark LinkedIn group, but honestly the turnout was pretty low. We had dived into various features. If there are particular topics that you would like to discuss during a live session, please let me know and we can try to restart them.  HTH!

On Wed, Mar 8, 2023 at 9:45 PM Sofia’s World 
mailto:mmistr...@gmail.com>> wrote:
+1

On Wed, Mar 8, 2023 at 10:40 PM Winston Lai 
mailto:weiruanl...@gmail.com>> wrote:
+1, any webinar on a Spark-related topic is appreciated.

Thank You & Best Regards
Winston Lai

From: asma zgolli mailto:zgollia...@gmail.com>>
Sent: Thursday, March 9, 2023 5:43:06 AM
To: karan alang mailto:karan.al...@gmail.com>>
Cc: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>; 
ashok34...@yahoo.com 
mailto:ashok34...@yahoo.com>>; User 
mailto:user@spark.apache.org>>
Subject: Re: Online classes for spark topics

+1

Le mer. 8 mars 2023 à 21:32, karan alang 
mailto:karan.al...@gmail.com>> a écrit :
+1 .. I'm happy to be part of these discussions as well !




On Wed, Mar 8, 2023 at 12:27 PM Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>> wrote:
Hi,

I guess I can schedule this work over a period of time. I myself can contribute, plus learn from others.

So +1 for me.

Let us see if anyone else is interested.

HTH



 




On Wed, 8 Mar 2023 at 17:48, ashok34...@yahoo.com 
mailto:ashok34...@yahoo.com>> wrote:

Hello Mich.

Greetings. Would you be able to arrange a Spark Structured Streaming learning webinar?


Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-05 Thread Saurabh Gulati
and 2 single quotes together ('') look like a single double quote (").

Mvg/Regards
Saurabh Gulati

From: Saurabh Gulati 
Sent: 05 January 2023 12:24
To: Sean Owen 
Cc: User 
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

It's the same input, except that the headers are also being read with the csv reader.

Mvg/Regards
Saurabh Gulati

From: Sean Owen 
Sent: 04 January 2023 15:12
To: Saurabh Gulati 
Cc: User 
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That does not appear to be the same input you used in your example. What is the 
contents of test.csv?

On Wed, Jan 4, 2023 at 7:45 AM Saurabh Gulati 
mailto:saurabh.gul...@fedex.com>> wrote:
Hi @Sean Owen<mailto:sro...@gmail.com>
Probably the data is incorrect, and the source needs to fix it.
But using python's csv parser returns the correct results.

import csv

with open("/tmp/test.csv") as c_file:
    csv_reader = csv.reader(c_file, delimiter=",")
    for row in csv_reader:
        print(row)

['a', 'b', 'c']
['1', '', ',see what "I did",\ni am still writing']
['2', '', 'abc']
Also, I don't understand why there is a difference in output between df.show() and df.select("c").show().

Mvg/Regards
Saurabh Gulati
Data Platform

From: Sean Owen mailto:sro...@gmail.com>>
Sent: 04 January 2023 14:25
To: Saurabh Gulati mailto:saurabh.gul...@fedex.com>>
Cc: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>; User 
mailto:user@spark.apache.org>>
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That input is just invalid as CSV for any parser. You end a quoted col without 
following with a col separator. What would the intended parsing be and how 
would it work?

On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati 
mailto:saurabh.gul...@fedex.com>> wrote:

@Sean Owen<mailto:sro...@gmail.com> Also see the example below with quotes 
feedback:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"


Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-05 Thread Saurabh Gulati
It's the same input, except that the headers are also being read with the csv reader.

Mvg/Regards
Saurabh Gulati

From: Sean Owen 
Sent: 04 January 2023 15:12
To: Saurabh Gulati 
Cc: User 
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That does not appear to be the same input you used in your example. What is the 
contents of test.csv?

On Wed, Jan 4, 2023 at 7:45 AM Saurabh Gulati 
mailto:saurabh.gul...@fedex.com>> wrote:
Hi @Sean Owen<mailto:sro...@gmail.com>
Probably the data is incorrect, and the source needs to fix it.
But using python's csv parser returns the correct results.

import csv

with open("/tmp/test.csv") as c_file:
    csv_reader = csv.reader(c_file, delimiter=",")
    for row in csv_reader:
        print(row)

['a', 'b', 'c']
['1', '', ',see what "I did",\ni am still writing']
['2', '', 'abc']
Also, I don't understand why there is a difference in output between df.show() and df.select("c").show().

Mvg/Regards
Saurabh Gulati
Data Platform

From: Sean Owen mailto:sro...@gmail.com>>
Sent: 04 January 2023 14:25
To: Saurabh Gulati mailto:saurabh.gul...@fedex.com>>
Cc: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>; User 
mailto:user@spark.apache.org>>
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That input is just invalid as CSV for any parser. You end a quoted col without 
following with a col separator. What would the intended parsing be and how 
would it work?

On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati 
mailto:saurabh.gul...@fedex.com>> wrote:

@Sean Owen<mailto:sro...@gmail.com> Also see the example below with quotes 
feedback:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"


Re: [EXTERNAL] Re: Re: Incorrect csv parsing when delimiter used within the data

2023-01-05 Thread Saurabh Gulati
Yes, there are other ways to solve this, but I am trying to understand why there is a difference in behaviour between df.show() and df.select("c").show().

Mvg/Regards
Saurabh Gulati

From: Shay Elbaz 
Sent: 04 January 2023 14:54
To: Saurabh Gulati ; Sean Owen 

Cc: Mich Talebzadeh ; User 
Subject: Re: [EXTERNAL] Re: Re: Incorrect csv parsing when delimiter used 
within the data

If you have found a parser that works, simply read the data as text files, apply the parser manually, and convert to DataFrame (if needed at all).
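
A minimal sketch of that idea, assuming the file is small enough to parse on the driver (the path and column handling follow the earlier snippets in this thread):

import csv
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("manual-csv-parse").getOrCreate()

# Parse with Python's csv module, which handled this input correctly earlier in the thread.
with open("/tmp/test.csv", newline="") as c_file:
    rows = list(csv.reader(c_file, delimiter=","))

# First row is the header; hand the remaining rows to Spark.
header, data = rows[0], rows[1:]
df = spark.createDataFrame(data, schema=header)
df.select("c").show(truncate=False)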
____
From: Saurabh Gulati 
Sent: Wednesday, January 4, 2023 3:45 PM
To: Sean Owen 
Cc: Mich Talebzadeh ; User 
Subject: [EXTERNAL] Re: Re: Incorrect csv parsing when delimiter used within 
the data




Hi @Sean Owen<mailto:sro...@gmail.com>
Probably the data is incorrect, and the source needs to fix it.
But using python's csv parser returns the correct results.

import csv

with open("/tmp/test.csv") as c_file:
    csv_reader = csv.reader(c_file, delimiter=",")
    for row in csv_reader:
        print(row)

['a', 'b', 'c']
['1', '', ',see what "I did",\ni am still writing']
['2', '', 'abc']
Also, I don't understand why there is a difference in output between df.show() and df.select("c").show().

Mvg/Regards
Saurabh Gulati
Data Platform
____
From: Sean Owen 
Sent: 04 January 2023 14:25
To: Saurabh Gulati 
Cc: Mich Talebzadeh ; User 
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That input is just invalid as CSV for any parser. You end a quoted col without 
following with a col separator. What would the intended parsing be and how 
would it work?

On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati 
mailto:saurabh.gul...@fedex.com>> wrote:

@Sean Owen<mailto:sro...@gmail.com> Also see the example below with quotes 
feedback:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"


Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Saurabh Gulati
Hi @Sean Owen<mailto:sro...@gmail.com>
Probably the data is incorrect, and the source needs to fix it.
But using python's csv parser returns the correct results.

import csv

with open("/tmp/test.csv") as c_file:
    csv_reader = csv.reader(c_file, delimiter=",")
    for row in csv_reader:
        print(row)

['a', 'b', 'c']
['1', '', ',see what "I did",\ni am still writing']
['2', '', 'abc']
Also, I don't understand why there is a difference in output between df.show() and df.select("c").show().

Mvg/Regards
Saurabh Gulati
Data Platform

From: Sean Owen 
Sent: 04 January 2023 14:25
To: Saurabh Gulati 
Cc: Mich Talebzadeh ; User 
Subject: Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within 
the data

That input is just invalid as CSV for any parser. You end a quoted col without 
following with a col separator. What would the intended parsing be and how 
would it work?

On Wed, Jan 4, 2023 at 4:30 AM Saurabh Gulati 
mailto:saurabh.gul...@fedex.com>> wrote:

@Sean Owen<mailto:sro...@gmail.com> Also see the example below with quotes 
feedback:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"


Re: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the data

2023-01-04 Thread Saurabh Gulati
Hey guys, much appreciate your quick responses.

To answer your questions,
@Mich Talebzadeh<mailto:mich.talebza...@gmail.com> We get data from multiple 
sources, and we don't have any control over what they put in. In this case the 
column is supposed to contain some feedback and it can also contain quoted 
strings.

@Sean Owen<mailto:sro...@gmail.com> Also see the example below with quotes 
feedback:
"a","b","c"
"1","",",see what ""I did"","
"2","","abc"
Here, if we don't escape with ":

df = spark.read.option("multiLine", True).option("enforceSchema", 
False).option("header", True).csv(f"/tmp/test.csv")

df.show(100, False)

+---+++
|a  |b   |c   |
+---+++
|1  |null|",see what ""I did""|

+---+++

df.count()

1

So, we put in " as the escape character, and then it's parsed fine but the count is wrong.


df = spark.read.option("escape", '"').option("multiLine", 
True).option("enforceSchema", False).option("header", 
True).csv(f"/tmp/test.csv")

df.show(100, False)

+---++--+
|a  |b   |c |
+---++--+
|1  |null|,see what "I did",|
|2  |null|abc   |
+---++--+

df.count()
1

I understand it's a complex case, or maybe an edge case, which makes it difficult for Spark to understand where a column ends, even though we have enabled multiLine=True.

See another example below, which even has a multiline value for column c.

"a","b","c"
"1","",",see what ""I did"",
i am still writing"
"2","","abc"

# with escape

df = spark.read.option("escape", '"').option("multiLine", True).option("enforceSchema", False).option("header", True).csv("/tmp/test.csv")

df.show(10, False)

+---+----+--------------------------------------+
|a  |b   |c                                     |
+---+----+--------------------------------------+
|1  |null|,see what "I did",\ni am still writing|
|2  |null|abc                                   |
+---+----+--------------------------------------+

df.count()
1

df.select("c").show(10, False)
+------------------+
|c                 |
+------------------+
|see what ""I did""|
|null              |
|abc               |
+------------------+

# without escape "

df.show(10, False)

+-------------------+----+--------------------+
|a                  |b   |c                   |
+-------------------+----+--------------------+
|1                  |null|",see what ""I did""|
|i am still writing"|null|null                |
|2                  |null|abc                 |
+-------------------+----+--------------------+

df.select("c").show(10, False)

+--------------------+
|c                   |
+--------------------+
|",see what ""I did""|
|null                |
|abc                 |
+--------------------+


The issue is that it can print the complete dataframe correctly with escape enabled, but when you select a column or ask for a count, it gives the wrong output.


Regards
Saurabh

From: Mich Talebzadeh 
Sent: 04 January 2023 10:14
To: Sean Owen 
Cc: Saurabh Gulati ; User 
Subject: [EXTERNAL] Re: Incorrect csv parsing when delimiter used within the 
data


What is the point of having "," as a column value? From a business point of view it does not signify anything IMO.




 




On Tue, 3 Jan 2023 at 20:39, Sean Owen 
mailto:sro...@g

Re: How to set a config for a single query?

2023-01-04 Thread Saurabh Gulati
Hey Felipe,
Since you are collecting the dataframes, you might as well run them separately with the desired configs and store the results in your storage.
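
For completeness, a hedged sketch of another possible approach (an assumption, not something discussed in this thread): a separate SparkSession created with newSession() shares the SparkContext but keeps its own SQL conf, so the override stays scoped to queries built on that session:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("per-query-conf").getOrCreate()

# Isolated session for query A only; its SQL conf does not leak into `spark`.
session_a = spark.newSession()
session_a.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")

query_a_df = session_a.range(0, 1000000).selectExpr("id % 10 AS k").groupBy("k").count()
query_a_df.collect()

# Queries built on the original session keep the original setting.
query_b_df = spark.range(0, 1000000).selectExpr("id * 2 AS doubled")
query_b_df.collect()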

Regards
Saurabh

From: Felipe Pessoto 
Sent: 04 January 2023 01:14
To: user@spark.apache.org 
Subject: [EXTERNAL] How to set a config for a single query?



Hi,



In Scala is it possible to set a config value to a single query?



I could set/unset the value, but it won’t work for multithreading scenarios.



Example:



spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", "false")
queryA_df.collect()
spark.conf.set("spark.sql.adaptive.coalescePartitions.enabled", originalValue)
queryB_df.collect()
queryC_df.collect()
queryD_df.collect()





If I execute that block of code multiple times using multiple threads, I can end up executing Query A with coalescePartitions.enabled=true, and Queries B, C and D with the config set to false, because another thread could set it between the executions.



Is there any good alternative to this?



Thanks.


Incorrect csv parsing when delimiter used within the data

2023-01-03 Thread Saurabh Gulati
Hello,
We are seeing a case where Spark parses csv data incorrectly.
The issue can be replicated using the csv data below:

"a","b","c"
"1","",","
"2","","abc"
and using the spark csv read command.
df = spark.read.format("csv")\
.option("multiLine", True)\
.option("escape", '"')\
.option("enforceSchema", False) \
.option("header", True)\
.load(f"/tmp/test.csv")

df.show(100, False) # prints both rows
+---+----+---+
|a  |b   |c  |
+---+----+---+
|1  |null|,  |
|2  |null|abc|
+---+----+---+

df.select("c").show() # merges last column of first row and first column of second row
+------+
|     c|
+------+
|"\n"2"|
+------+

print(df.count()) # prints 1, should be 2

It feels like a bug, and I thought of asking the community before creating a ticket on Jira.

Mvg/Regards
Saurabh



Re: spark-submit fails in kubernetes 1.24.x cluster

2022-12-27 Thread Saurabh Gulati
Hello Thimme,
Your issue is related to 
https://kubernetes.io/docs/reference/using-api/deprecation-guide/#ingress-v122

You will need to upgrade to the new generally available (GA) ingress API endpoint.

Regards
Saurabh Gulati

From: Thimme Gowda TP (Nokia) 
Sent: 23 December 2022 11:31
To: user@spark.apache.org 
Subject: [EXTERNAL] spark-submit fails in kubernetes 1.24.x cluster



Hello,



We are facing an issue with ingress during spark-submit with a kubernetes 1.24.x cluster.

We are using spark 3.3.0 to do spark-submit.



# kubectl version

WARNING: This version information is deprecated and will be replaced with the 
output from kubectl version --short.  Use --output=yaml|json to get the full 
version.

Client Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.9", 
GitCommit:"9710807c82740b9799453677c977758becf0acbb", GitTreeState:"clean", 
BuildDate:"2022-12-08T10:15:09Z", GoVersion:"go1.18.9", Compiler:"gc", 
Platform:"linux/amd64"}

Kustomize Version: v4.5.4

Server Version: version.Info{Major:"1", Minor:"24", GitVersion:"v1.24.9", 
GitCommit:"9710807c82740b9799453677c977758becf0acbb", GitTreeState:"clean", 
BuildDate:"2022-12-08T10:08:06Z", GoVersion:"go1.18.9", Compiler:"gc", 
Platform:"linux/amd64"}



Error:

{"type":"log", "level":"WARN", "time":"2022-12-23T10:04:57.536Z", 
"timezone":"UTC", "log":"The client is using resource type 'ingresses' with 
unstable version 'v1beta1'"}

Exception in thread "main" 
io.fabric8.kubernetes.client.KubernetesClientException: Failure executing: POST 
at: https://xx.xx.xx.xx:6443/apis/extensions/v1beta1/namespaces/test/ingresses. Message: Not Found.



>From spark code we see that spark 3.3.0 is using 
>“5.12.2”



We also tried changing this to 6.x as in PR: https://github.com/apache/spark/pull/37990/files

We are still facing the same issue.



Let us know if spark is tested/working with a k8s 1.24.x cluster and what the solution to resolve this could be.

Thanks.



Regards

Thimme Gowda




Re: [EXTERNAL] Re: Spark streaming

2022-08-19 Thread Saurabh Gulati
You can also try out 
https://debezium.io/documentation/reference/0.10/connectors/mysql.html

From: Ajit Kumar Amit 
Sent: 19 August 2022 14:30
To: sandra sukumaran 
Cc: user@spark.apache.org 
Subject: [EXTERNAL] Re: Spark streaming


https://github.com/allwefantasy/spark-binlog

Sent from my iPhone

On 19 Aug 2022, at 5:45 PM, sandra sukumaran  
wrote:


Dear Sir,



Is there any possible method to fetch the MySQL database binlog with the help of Spark streaming?
Kafka streaming is not applicable in this case.



Thanks and regards
Sandra


Re: [EXTERNAL] Re: Spark streaming - Data Ingestion

2022-08-17 Thread Saurabh Gulati
Another take:

  *   Debezium to read the write-ahead logs (WAL) and send them to Kafka
  *   Kafka Connect to write to cloud storage -> Hive

OR

  *   Spark streaming to parse the WAL -> Storage -> Hive

Regards

From: Gibson 
Sent: 17 August 2022 16:53
To: Akash Vellukai 
Cc: user@spark.apache.org 
Subject: [EXTERNAL] Re: Spark streaming - Data Ingestion


If you have room for a message log (like Kafka), then you should try:

MySQL -> Kafka (via CDC) -> Spark (Structured Streaming) -> HDFS/S3/ADLS -> Hive
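
A rough sketch of the Spark leg of that pipeline (broker, topic, schema and paths are assumed placeholders; the job also needs the spark-sql-kafka package on the classpath):

from pyspark.sql import SparkSession
from pyspark.sql.functions import col, from_json
from pyspark.sql.types import StructType, StructField, StringType, LongType

spark = SparkSession.builder.appName("cdc-ingest").getOrCreate()

# Hypothetical flattened CDC payload; a real Debezium event carries more fields.
payload = StructType([
    StructField("id", LongType()),
    StructField("name", StringType()),
    StructField("op", StringType()),
])

events = (spark.readStream
    .format("kafka")
    .option("kafka.bootstrap.servers", "kafka:9092")   # assumed broker
    .option("subscribe", "mysql.mydb.customers")       # assumed CDC topic
    .load()
    .select(from_json(col("value").cast("string"), payload).alias("e"))
    .select("e.*"))

# Land the stream on cloud storage; a Hive external table can then point at this path.
query = (events.writeStream
    .format("parquet")
    .option("path", "s3a://my-bucket/lake/customers")           # assumed path
    .option("checkpointLocation", "s3a://my-bucket/chk/customers")
    .start())
query.awaitTermination()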

On Wed, Aug 17, 2022 at 5:40 PM Akash Vellukai 
mailto:akashvellukai...@gmail.com>> wrote:
Dear sir

I have tried a lot on this could you help me with this?

Data ingestion from MySql to Hive with spark- streaming?

Could you give me an overview.


Thanks and regards
Akash P


[Spark Core]: Unexpectedly exiting executor while gracefully decommissioning

2022-04-25 Thread Saurabh Gulati
Hey guys,
My colleague tried to post a question twice, but somehow it doesn't show up in our emails, although it does exist in the archive. So, I will post the question here again.


We are running into some issues while attempting graceful
decommissioning of executors. We are running spark-thriftserver (3.2.0) on
Kubernetes (GKE 1.20.15-gke.2500). We enabled:

   - spark.decommission.enabled
   - spark.storage.decommission.rddBlocks.enabled
   - spark.storage.decommission.shuffleBlocks.enabled
   - spark.storage.decommission.enabled

and set spark.storage.decommission.fallbackStorage.path to a path in our
bucket.
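
For reference, a sketch of those settings as spark-defaults.conf entries (the fallback path is an assumed placeholder for our bucket):

spark.decommission.enabled                        true
spark.storage.decommission.enabled                true
spark.storage.decommission.rddBlocks.enabled      true
spark.storage.decommission.shuffleBlocks.enabled  true
spark.storage.decommission.fallbackStorage.path   gs://our-bucket/spark-fallback/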

The logs from the driver seem to suggest the decommissioning process started but then unexpectedly exited and failed, while the executor logs seem to suggest that decommissioning was successful.

Attached are the error logs:

https://gist.github.com/yeachan153/9bfb2f0ab9ac7f292fb626186b014bbf


Thanks in advance.




Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

2022-03-10 Thread Saurabh Gulati
Hi Gourav,
We use auto-scaling containers in GKE for running the Spark thriftserver.

From: Gourav Sengupta 
Sent: 07 March 2022 14:36
To: Saurabh Gulati 
Cc: Mich Talebzadeh ; Kidong Lee 
; user@spark.apache.org 
Subject: Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Hi,

are all users using the same cluster of data proc?

Regards,
Gourav

On Mon, Mar 7, 2022 at 9:28 AM Saurabh Gulati 
mailto:saurabh.gul...@fedex.com>> wrote:
Thanks for the response, Gourav.

Queries range from simple to large joins. We expose the data to our analytics 
users so that they can develop their models and they use superset as the SQL 
interface for testing.

Hive-metastore will not do a full scan if we specify the partitioning column.
But that's something users might/do forget, so we were thinking of enforcing a way to make sure people do specify the partitioning column in their queries.

The only way we see for now is to parse the query in Superset to check if the partition column is being used. But we are not sure of a way that will work for all types of queries.

For example, we can parse the SQL and see if count(where) == count(partition_column), but this may not work for complex queries.


Regards
Saurabh

From: Gourav Sengupta 
mailto:gourav.sengu...@gmail.com>>
Sent: 05 March 2022 11:06
To: Saurabh Gulati 
Cc: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>; Kidong Lee 
mailto:mykid...@gmail.com>>; 
user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>
Subject: Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Hi,

I completely agree with Saurabh, the use of BQ with SPARK does not make sense 
at all, if you are trying to cut down your costs. I think that costs do matter 
to a few people at the end.

Saurabh, is there any chance you can see what actual queries are hitting the 
thrift server? Using hive metastore is something that I have been doing in AWS 
EMR for the last 5 years and for sure it does not cause full table scan.

Hi Sean,
for some reason, I am not able to receive any emails from the spark user group. 
My account should be a very old one, is there any chance you can kindly have a 
look into it and kindly let me know if there is something blocking me? I will 
be sincerely obliged.

Regards,
Gourav Sengupta


On Tue, Feb 22, 2022 at 3:58 PM Saurabh Gulati 
 wrote:
Hey Mich,
We use spark 3.2 now. We are using BQ but migrating away because:

  *   It's not reflective of our current lake structure with all the deltas/history tables/model outputs etc.
  *   It's pretty expensive to load everything into BQ, and essentially it would be a copy of all the data in gcs. External tables in BQ didn't work for us. Currently we store only the latest snapshots in BQ. This breaks the idempotency of models which need to time travel and run in the past.
  *   We might move to a different cloud provider in the future, so we want to be cloud agnostic.

So we need an execution engine which has the same overview of the data as we have in gcs.
We tried Presto, but performance was similar and Presto didn't support autoscaling.

TIA
Saurabh

From: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Sent: 22 February 2022 16:49
To: Kidong Lee mailto:mykid...@gmail.com>>; Saurabh Gulati 
mailto:saurabh.gul...@fedex.com>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>
Subject: Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Ok interesting.

I am surprised that you are not using BigQuery and are using Hive instead. My assumption is that your Spark is version 3.1.1 with standard GKE on the auto-scaler. What benefits are you getting from using Hive here? As you have your hive tables on gs buckets, you can easily load your hive tables into BigQuery and run spark on BigQuery?

HTH

On Tue, 22 Feb 2022 at 15:34, Saurabh Gulati 
mailto:saurabh.gul...@fedex.com>> wrote:
Thanks Sean for your response.

@Mich Talebzadeh<mailto:mich.talebza...@gmail.com> We run all workloads on GKE 
as docker containers. So to answer your questions, Hive is running in a 
container as K8S service and spark thrift-server in another container as a 
service and Superset in a third container.

We use Spark on GKE setup to run thrift-server which spawns workers depending 
on the load. For buckets we use gcs.


TIA
Saurabh

From: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Sent: 22 February 2022 16:05
To: Saurabh Gulati 
Cc: user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>
Subject: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL


Is your hive

Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

2022-03-07 Thread Saurabh Gulati
Thanks for the response, Gourav.

Queries range from simple to large joins. We expose the data to our analytics 
users so that they can develop their models and they use superset as the SQL 
interface for testing.

Hive-metastore will not do a full scan if we specify the partitioning column.
But that's something users might/do forget, so we were thinking of enforcing a way to make sure people do specify the partitioning column in their queries.

The only way we see for now is to parse the query in Superset to check if the partition column is being used. But we are not sure of a way that will work for all types of queries.

For example, we can parse the SQL and see if count(where) == count(partition_column), but this may not work for complex queries.
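
A toy sketch of that heuristic (the partition column name is an assumed example); it only counts occurrences, so nested or complex queries can easily fool it:

import re

def references_partition_column(sql: str, partition_col: str = "event_date") -> bool:
    # Rough check: every WHERE should come with at least one mention of the partition column.
    where_count = len(re.findall(r"\bwhere\b", sql, flags=re.IGNORECASE))
    col_count = len(re.findall(rf"\b{re.escape(partition_col)}\b", sql, flags=re.IGNORECASE))
    return where_count > 0 and col_count >= where_count

print(references_partition_column("SELECT * FROM sales WHERE event_date = '2022-02-22'"))  # True
print(references_partition_column("SELECT * FROM sales"))                                  # False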


Regards
Saurabh

From: Gourav Sengupta 
Sent: 05 March 2022 11:06
To: Saurabh Gulati 
Cc: Mich Talebzadeh ; Kidong Lee 
; user@spark.apache.org 
Subject: Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Hi,

I completely agree with Saurabh, the use of BQ with SPARK does not make sense 
at all, if you are trying to cut down your costs. I think that costs do matter 
to a few people at the end.

Saurabh, is there any chance you can see what actual queries are hitting the 
thrift server? Using hive metastore is something that I have been doing in AWS 
EMR for the last 5 years and for sure it does not cause full table scan.

Hi Sean,
for some reason, I am not able to receive any emails from the spark user group. 
My account should be a very old one, is there any chance you can kindly have a 
look into it and kindly let me know if there is something blocking me? I will 
be sincerely obliged.

Regards,
Gourav Sengupta


On Tue, Feb 22, 2022 at 3:58 PM Saurabh Gulati 
 wrote:
Hey Mich,
We use spark 3.2 now. We are using BQ but migrating away because:

  *   It's not reflective of our current lake structure with all the deltas/history tables/model outputs etc.
  *   It's pretty expensive to load everything into BQ, and essentially it would be a copy of all the data in gcs. External tables in BQ didn't work for us. Currently we store only the latest snapshots in BQ. This breaks the idempotency of models which need to time travel and run in the past.
  *   We might move to a different cloud provider in the future, so we want to be cloud agnostic.

So we need an execution engine which has the same overview of the data as we have in gcs.
We tried Presto, but performance was similar and Presto didn't support autoscaling.

TIA
Saurabh

From: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Sent: 22 February 2022 16:49
To: Kidong Lee mailto:mykid...@gmail.com>>; Saurabh Gulati 
mailto:saurabh.gul...@fedex.com>>
Cc: user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>
Subject: Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Ok interesting.

I am surprised that you are not using BigQuery and are using Hive instead. My assumption is that your Spark is version 3.1.1 with standard GKE on the auto-scaler. What benefits are you getting from using Hive here? As you have your hive tables on gs buckets, you can easily load your hive tables into BigQuery and run spark on BigQuery?

HTH

On Tue, 22 Feb 2022 at 15:34, Saurabh Gulati 
mailto:saurabh.gul...@fedex.com>> wrote:
Thanks Sean for your response.

@Mich Talebzadeh<mailto:mich.talebza...@gmail.com> We run all workloads on GKE 
as docker containers. So to answer your questions, Hive is running in a 
container as K8S service and spark thrift-server in another container as a 
service and Superset in a third container.

We use Spark on GKE setup to run thrift-server which spawns workers depending 
on the load. For buckets we use gcs.


TIA
Saurabh

From: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Sent: 22 February 2022 16:05
To: Saurabh Gulati 
Cc: user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>
Subject: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL


Is your hive on prem with external tables in cloud storage?

Where is your spark running from and what cloud buckets are you using?

HTH

On Tue, 22 Feb 2022 at 12:36, Saurabh Gulati  
wrote:
Hello,
We are trying to set up Spark as the execution engine for exposing our data stored in the lake. We have hive metastore running along with Spark thrift server and are using Superset as the UI.

We save all tables as External tables in hive metastore, with storage being on Cloud.

We see that right now, when users run a query in Superset SQL Lab, it scans the whole table. What we want is to limit the data scan by setting something like hive.mapred.mode=strict in spark, so that the user gets an ex

Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

2022-02-22 Thread Saurabh Gulati
Hey Mich,
We use spark 3.2 now. We are using BQ but migrating away because:

  *   It's not reflective of our current lake structure with all the deltas/history tables/model outputs etc.
  *   It's pretty expensive to load everything into BQ, and essentially it would be a copy of all the data in gcs. External tables in BQ didn't work for us. Currently we store only the latest snapshots in BQ. This breaks the idempotency of models which need to time travel and run in the past.
  *   We might move to a different cloud provider in the future, so we want to be cloud agnostic.

So we need an execution engine which has the same overview of the data as we have in gcs.
We tried Presto, but performance was similar and Presto didn't support autoscaling.

TIA
Saurabh

From: Mich Talebzadeh 
Sent: 22 February 2022 16:49
To: Kidong Lee ; Saurabh Gulati 
Cc: user@spark.apache.org 
Subject: Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Ok interesting.

I am surprised that you are not using BigQuery and are using Hive instead. My assumption is that your Spark is version 3.1.1 with standard GKE on the auto-scaler. What benefits are you getting from using Hive here? As you have your hive tables on gs buckets, you can easily load your hive tables into BigQuery and run spark on BigQuery?

HTH

On Tue, 22 Feb 2022 at 15:34, Saurabh Gulati 
mailto:saurabh.gul...@fedex.com>> wrote:
Thanks Sean for your response.

@Mich Talebzadeh<mailto:mich.talebza...@gmail.com> We run all workloads on GKE 
as docker containers. So to answer your questions, Hive is running in a 
container as K8S service and spark thrift-server in another container as a 
service and Superset in a third container.

We use Spark on GKE setup to run thrift-server which spawns workers depending 
on the load. For buckets we use gcs.


TIA
Saurabh

From: Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>>
Sent: 22 February 2022 16:05
To: Saurabh Gulati 
Cc: user@spark.apache.org<mailto:user@spark.apache.org> 
mailto:user@spark.apache.org>>
Subject: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL


Is your hive on prem with external tables in cloud storage?

Where is your spark running from and what cloud buckets are you using?

HTH

On Tue, 22 Feb 2022 at 12:36, Saurabh Gulati  
wrote:
Hello,
We are trying to set up Spark as the execution engine for exposing our data stored in the lake. We have hive metastore running along with Spark thrift server and are using Superset as the UI.

We save all tables as External tables in hive metastore, with storage being on Cloud.

We see that right now, when users run a query in Superset SQL Lab, it scans the whole table. What we want is to limit the data scan by setting something like hive.mapred.mode=strict in spark, so that the user gets an exception if they don't specify a partition column.

We tried setting spark.hadoop.hive.mapred.mode=strict in spark-defaults.conf in the thrift server, but it still scans the whole table.
We also tried setting hive.mapred.mode=strict in hive-defaults.conf for the metastore container.

We use Spark 3.2 with hive-metastore version 3.1.2.

Is there a way in spark settings to make this happen?


TIA
Saurabh

Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

2022-02-22 Thread Saurabh Gulati
To correct my last message, it's hive-metastore running as a service in a container, and not hive. We use Spark-thriftserver for query execution.

From: Saurabh Gulati 
Sent: 22 February 2022 16:33
To: Mich Talebzadeh 
Cc: user@spark.apache.org 
Subject: Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

Thanks Sean for your response.

@Mich Talebzadeh<mailto:mich.talebza...@gmail.com> We run all workloads on GKE 
as docker containers. So to answer your questions, Hive is running in a 
container as K8S service and spark thrift-server in another container as a 
service and Superset in a third container.

We use Spark on GKE setup to run thrift-server which spawns workers depending 
on the load. For buckets we use gcs.


TIA
Saurabh

From: Mich Talebzadeh 
Sent: 22 February 2022 16:05
To: Saurabh Gulati 
Cc: user@spark.apache.org 
Subject: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL


Is your hive on prem with external tables in cloud storage?

Where is your spark running from and what cloud buckets are you using?

HTH

On Tue, 22 Feb 2022 at 12:36, Saurabh Gulati  
wrote:
Hello,
We are trying to set up Spark as the execution engine for exposing our data stored in the lake. We have hive metastore running along with Spark thrift server and are using Superset as the UI.

We save all tables as External tables in hive metastore, with storage being on Cloud.

We see that right now, when users run a query in Superset SQL Lab, it scans the whole table. What we want is to limit the data scan by setting something like hive.mapred.mode=strict in spark, so that the user gets an exception if they don't specify a partition column.

We tried setting spark.hadoop.hive.mapred.mode=strict in spark-defaults.conf in the thrift server, but it still scans the whole table.
We also tried setting hive.mapred.mode=strict in hive-defaults.conf for the metastore container.

We use Spark 3.2 with hive-metastore version 3.1.2.

Is there a way in spark settings to make this happen?


TIA
Saurabh




Re: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL

2022-02-22 Thread Saurabh Gulati
Thanks Sean for your response.

@Mich Talebzadeh<mailto:mich.talebza...@gmail.com> We run all workloads on GKE 
as docker containers. So to answer your questions, Hive is running in a 
container as K8S service and spark thrift-server in another container as a 
service and Superset in a third container.

We use Spark on GKE setup to run thrift-server which spawns workers depending 
on the load. For buckets we use gcs.


TIA
Saurabh

From: Mich Talebzadeh 
Sent: 22 February 2022 16:05
To: Saurabh Gulati 
Cc: user@spark.apache.org 
Subject: [EXTERNAL] Re: Need to make WHERE clause compulsory in Spark SQL


Is your hive on prem with external tables in cloud storage?

Where is your spark running from and what cloud buckets are you using?

HTH

On Tue, 22 Feb 2022 at 12:36, Saurabh Gulati  
wrote:
Hello,
We are trying to set up Spark as the execution engine for exposing our data stored in the lake. We have hive metastore running along with Spark thrift server and are using Superset as the UI.

We save all tables as External tables in hive metastore, with storage being on Cloud.

We see that right now, when users run a query in Superset SQL Lab, it scans the whole table. What we want is to limit the data scan by setting something like hive.mapred.mode=strict in spark, so that the user gets an exception if they don't specify a partition column.

We tried setting spark.hadoop.hive.mapred.mode=strict in spark-defaults.conf in the thrift server, but it still scans the whole table.
We also tried setting hive.mapred.mode=strict in hive-defaults.conf for the metastore container.

We use Spark 3.2 with hive-metastore version 3.1.2.

Is there a way in spark settings to make this happen?


TIA
Saurabh




Need to make WHERE clause compulsory in Spark SQL

2022-02-22 Thread Saurabh Gulati
Hello,
We are trying to set up Spark as the execution engine for exposing our data stored in the lake. We have hive metastore running along with Spark thrift server and are using Superset as the UI.

We save all tables as External tables in hive metastore, with storage being on Cloud.

We see that right now, when users run a query in Superset SQL Lab, it scans the whole table. What we want is to limit the data scan by setting something like hive.mapred.mode=strict in spark, so that the user gets an exception if they don't specify a partition column.

We tried setting spark.hadoop.hive.mapred.mode=strict in spark-defaults.conf in the thrift server, but it still scans the whole table.
We also tried setting hive.mapred.mode=strict in hive-defaults.conf for the metastore container.

We use Spark 3.2 with hive-metastore version 3.1.2.

Is there a way in spark settings to make this happen?


TIA
Saurabh


Re: [EXTERNAL] Re: Unable to access Google buckets using spark-submit

2022-02-14 Thread Saurabh Gulati
Hey Karan,
you can get the jar from 
here

From: karan alang 
Sent: 13 February 2022 20:08
To: Gourav Sengupta 
Cc: Holden Karau ; Mich Talebzadeh 
; user @spark 
Subject: [EXTERNAL] Re: Unable to access Google buckets using spark-submit


Hi Gaurav, All,
I'm doing a spark-submit from my local system to a GCP Dataproc cluster. This is more for dev/testing.
I can run a 'gcloud dataproc jobs submit' command as well, which is what will be done in Production.

Hope that clarifies.

regds,
Karan Alang


On Sat, Feb 12, 2022 at 10:31 PM Gourav Sengupta 
mailto:gourav.sengu...@gmail.com>> wrote:
Hi,

agree with Holden, have faced quite a few issues with FUSE.

Also trying to understand "spark-submit from local". Are you submitting your SPARK jobs from a local laptop, or in local mode from a GCP dataproc system?

If you are submitting the job from your local laptop, there will be performance 
bottlenecks I guess based on the internet bandwidth and volume of data.

Regards,
Gourav


On Sat, Feb 12, 2022 at 7:12 PM Holden Karau 
mailto:hol...@pigscanfly.ca>> wrote:
You can also put the GS access jar with your Spark jars — that’s what the class 
not found exception is pointing you towards.
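
A hedged sketch of one way to ship the connector jar explicitly (the jar location and script name are assumptions, not from this thread):

spark-submit \
  --jars /opt/jars/gcs-connector-hadoop3-latest.jar \
  --conf spark.hadoop.fs.gs.impl=com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem \
  my_job.py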

On Fri, Feb 11, 2022 at 11:58 PM Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>> wrote:
BTW I also answered you on Stack Overflow:

https://stackoverflow.com/questions/71088934/unable-to-access-google-buckets-using-spark-submit


HTH


 




On Sat, 12 Feb 2022 at 08:24, Mich Talebzadeh 
mailto:mich.talebza...@gmail.com>> wrote:
You are trying to access a Google storage bucket gs:// from your local host.

It does not see it because spark-submit assumes that it is a local file system on the host, which it is not.

You need to mount the gs:// bucket as a local file system.

You can use the tool called gcsfuse (https://cloud.google.com/storage/docs/gcs-fuse). Cloud Storage FUSE is an open source FUSE adapter that allows you to mount Cloud Storage buckets as file systems on Linux or macOS systems. You can download gcsfuse from here.


Pretty simple.


It will be installed as /usr/bin/gcsfuse, and you can mount it by creating a local mount point like /mnt/gs as root and giving permission to others to use it.

As a normal user that needs to access the gs:// bucket (not as root), use gcsfuse to mount it. For example, I am mounting a gcs bucket called spark-jars-karan here


Just use the bucket name itself


gcsfuse spark-jars-karan /mnt/gs


Then you can refer to it as /mnt/gs in spark-submit from on-premise host

spark-submit --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.2.0 --jars 
/mnt/gs/spark-bigquery-with-dependencies_2.12-0.23.2.jar

HTH

 

Re: [EXTERNAL] [Marketing Mail] Re: [Spark] Optimize spark join on different keys for same data frame

2021-10-05 Thread Saurabh Gulati
Hi Amit,
The only approach I can think of is to create 2 copies of schema_df1, one partitioned on key1 and the other on key2, and then use these for the joins.
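
A minimal sketch of that idea (toy data; "partitioned" is interpreted here as repartition plus persist, so each layout's shuffle is paid once):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("two-key-join").getOrCreate()

df1 = spark.createDataFrame([("a", "x", 1.0), ("b", "y", 2.0)], ["key1", "key2", "val"])
df2 = spark.createDataFrame([("a", 10.0)], ["key1", "val2"])
df3 = spark.createDataFrame([("y", 20.0)], ["key2", "val3"])

# One copy of df1 per join key.
df1_by_key1 = df1.repartition("key1").persist()
df1_by_key2 = df1.repartition("key2").persist()

join1 = df1_by_key1.join(df2, "key1")
join2 = df1_by_key2.join(df3, "key2")
join1.show()
join2.show()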

From: Amit Joshi 
Sent: 04 October 2021 19:13
To: spark-user 
Subject: [EXTERNAL] [Marketing Mail] Re: [Spark] Optimize spark join on 
different keys for same data frame


Hi spark users,

Can anyone please provide any views on the topic.


Regards
Amit Joshi

On Sunday, October 3, 2021, Amit Joshi 
mailto:mailtojoshia...@gmail.com>> wrote:
Hi Spark-Users,

Hope you are doing good.

I have been working on cases where a dataframe is joined with more than one other dataframe separately, on different columns, and frequently at that.
I was wondering how to optimize the joins to make them faster.
We can consider the datasets to be big in size, so broadcast joins are not an option.

For eg:

val schema_df1 = new StructType()
  .add(StructField("key1", StringType, true))
  .add(StructField("key2", StringType, true))
  .add(StructField("val", DoubleType, true))

val schema_df2 = new StructType()
  .add(StructField("key1", StringType, true))
  .add(StructField("val", DoubleType, true))

val schema_df3 = new StructType()
  .add(StructField("key2", StringType, true))
  .add(StructField("val", DoubleType, true))

Now if we want to join:
val join1 = df1.join(df2, "key1")
val join2 = df1.join(df3, "key2")

I was thinking of bucketing as a solution to speed up the joins. But if I bucket df1 on key1, then join2 may not benefit, and vice versa (if I bucket df1 on key2).

Or should we bucket df1 twice, once on key1 and once on key2?
Is there a strategy to make both joins faster?


Regards
Amit Joshi





Re: [EXTERNAL] [Marketing Mail] Reading SPARK 3.1.x generated parquet in SPARK 2.4.x

2021-08-12 Thread Saurabh Gulati
We had issues with this migration mainly because of changes in spark date 
calendars. 
See
We got this working by setting the below params:

("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY"),
("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED"),
("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY"),
("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED")


But otherwise, it's a change for good. Performance seems better.
Also, there were bugs in 3.0.1 which have been addressed in 3.1.1.
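
A sketch of applying those settings on an existing session (these are SQL confs, so they can also be set at runtime with spark.conf.set; the app name is a placeholder):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("calendar-rebase").getOrCreate()

# The rebase-mode settings from the thread, applied one by one.
for key, value in [
    ("spark.sql.legacy.parquet.datetimeRebaseModeInRead", "LEGACY"),
    ("spark.sql.legacy.parquet.datetimeRebaseModeInWrite", "CORRECTED"),
    ("spark.sql.legacy.parquet.int96RebaseModeInRead", "LEGACY"),
    ("spark.sql.legacy.parquet.int96RebaseModeInWrite", "CORRECTED"),
]:
    spark.conf.set(key, value)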

From: Gourav Sengupta 
Sent: 05 August 2021 10:17
To: user @spark 
Subject: [EXTERNAL] [Marketing Mail] Reading SPARK 3.1.x generated parquet in 
SPARK 2.4.x


Hi,

we are trying to migrate some of the data lake pipelines to run in SPARK 3.x, whereas the dependent pipelines using those tables will still be running in SPARK 2.4.x for some time to come.

Does anyone know of any issues that can happen:
1. when reading Parquet files written in 3.1.x in SPARK 2.4
2. when some partitions in the data lake have parquet files written in SPARK 2.4.x and some in SPARK 3.1.x.

Please note that there are no changes in schema, but later on we might end up 
adding or removing some columns.

I will be really grateful for your kind help on this.

Regards,
Gourav Sengupta


Spark 3.0.1 new Proleptic Gregorian calendar

2020-11-19 Thread Saurabh Gulati
Hello,
First of all, Thanks to you guys for maintaining and improving Spark.

We just updated to Spark 3.0.1 and are facing some issues with the new 
Proleptic Gregorian calendar.

We have data from different sources in our platform and we saw there were some 
date/timestamp columns that go back to years before 1500.

According to this post, data written with spark 2.4 and read with 3.0 should result in some differences in dates/timestamps, but we are not able to replicate this issue. We only encounter an exception that suggests we set the spark.sql.legacy.parquet.datetimeRebaseModeInRead/Write config options to make it work.

So, our main concern is:

  *   How can we test/replicate this behavior? Since it's not very clear to us, nor do we see any docs for this change, we can't decide with certainty which parameters to set and why.
  *   What config options should we set,
      *   if we are always going to read old data written with Spark 2.4 using Spark 3.0
      *   and will always be writing newer data with Spark 3.0?

We couldn't make a deterministic/informed choice, so we thought it better to ask the community which scenarios will be impacted and which will still work fine.

Thanks
Saurabh