spark3.0 read kudu data

2020-12-07 Thread 冯宝利
Hi: Recently we have been upgrading Spark from 2.4 to 3.0. While doing performance testing we found some performance problems. A comparative test showed that Spark 3.0 reads Kudu data much more slowly than 2.4. Normally, Spark 2.4 takes 0.1-1s to read the same amount of data, but

Is there a better way to read kerberized impala tables by spark jdbc?

2020-12-07 Thread eab...@163.com
Hi: I want to use Spark JDBC to read kerberized Impala tables, like: ``` val impalaUrl = "jdbc:impala://:21050;AuthMech=1;KrbRealm=REALM.COM;KrbHostFQDN=;KrbServiceName=impala" spark.read.jdbc(impalaUrl) ``` As we know, Spark reads Impala data on the executors rather than the driver, so it throws
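For reference, a minimal sketch of the read described above. The host names, table name, and driver class below are placeholders/assumptions, not from the thread; the actual driver class depends on the Impala JDBC driver in use.

```
// Sketch only: URL, table, and driver class are placeholders.
import java.util.Properties
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("impala-jdbc-read").getOrCreate()

val impalaUrl = "jdbc:impala://impala-host:21050;AuthMech=1;KrbRealm=REALM.COM;" +
  "KrbHostFQDN=impala-host.example.com;KrbServiceName=impala"

val props = new Properties()
props.setProperty("driver", "com.cloudera.impala.jdbc41.Driver") // hypothetical driver class

// spark.read.jdbc(url, table, properties) runs the actual reads on executors,
// which is why each executor JVM also needs a valid Kerberos context.
val df = spark.read.jdbc(impalaUrl, "my_database.my_table", props)
df.show()
```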

Spark in hybrid cloud in AWS & GCP

2020-12-07 Thread Bin Fan
Dear Spark users, Are you interested in running Spark in a hybrid cloud? Check out talks from AWS & GCP at the virtual Data Orchestration Summit on Dec. 8-9, 2020; register for free.

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-07 Thread Amit Joshi
Hi Gabor, Please find the logs attached. These are truncated logs. Command used: spark-submit --verbose --packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,com.typesafe:config:1.4.0 --master yarn --deploy-mode cluster --class com.stream.Main --num-executors 2 --driver-memory 2g

Re: Spark UI Storage Memory

2020-12-07 Thread Amit Sharma
Any suggestion please. Thanks Amit On Fri, Dec 4, 2020 at 2:27 PM Amit Sharma wrote: > Is there any memory leak in Spark 2.3.3 as mentioned in the Jira below? > https://issues.apache.org/jira/browse/SPARK-29055. > > Please let me know how to solve it. > > Thanks > Amit > > On Fri, Dec 4,

Re: Caching

2020-12-07 Thread Lalwani, Jayesh
* Jayesh, but during the logical plan Spark would know that the same DF is used twice, so it will optimize the query. No. That would mean that Spark would need to cache DF1. Spark won’t cache dataframes unless you ask it to, even if it knows that the same dataframe is being used twice. This is

Re: Caching

2020-12-07 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Are you using the same csv twice? Sent from iPhone > On 7 Dec 2020, at 18:32, Amit Sharma wrote: > > > Hi All, I am using caching in my code. I have a DF like > val DF1 = read csv. > val DF2 = DF1.groupBy().agg().select(.) > > Val DF3 = read csv .join(DF1).join(DF2) > DF3 .save.

Re: Caching

2020-12-07 Thread Amit Sharma
Jayesh, but during the logical plan Spark would know that the same DF is used twice, so it will optimize the query. Thanks Amit On Mon, Dec 7, 2020 at 1:16 PM Lalwani, Jayesh wrote: > Since DF2 is dependent on DF1, and DF3 is dependent on both DF1 and DF2, > without caching, Spark will read the

Re: Caching

2020-12-07 Thread Amit Sharma
Sean, you mean if a DF is used more than once in transformations then use cache. But frankly that is also not always true, because in many places, even if a DF is used once, it gives the same result with caching and without caching. How do we decide whether we should use cache or not? Thanks Amit On Mon, Dec 7, 2020

Re: Caching

2020-12-07 Thread Lalwani, Jayesh
Since DF2 is dependent on DF1, and DF3 is dependent on both DF1 and DF2, without caching, Spark will read the CSV twice: Once to load it for DF1, and once to load it for DF2. When you add a cache on DF1 or DF2, it reads from CSV only once. You might want to look at doing a windowed query on

Re: Caching

2020-12-07 Thread Sean Owen
No, it's not true that one action means every DF is evaluated once. This is a good counterexample. On Mon, Dec 7, 2020 at 11:47 AM Amit Sharma wrote: > Thanks for the information. I am using spark 2.3.3 There are few more > questions > > 1. Yes I am using DF1 two times but at the end action is

Re: Caching

2020-12-07 Thread Amit Sharma
Thanks for the information. I am using Spark 2.3.3. There are a few more questions: 1. Yes, I am using DF1 two times, but in the end there is one action, on DF3. In that case should the action count on DF1 be just 1, or does it depend on how many times this dataframe is used in transformations? I believe even if we use a

RE: Caching

2020-12-07 Thread Theodoros Gkountouvas
Hi Amit, One action might use the same DataFrame more than once. You can look at your logical plan by executing DF3.explain (arguments differ depending on the version of Spark you are using) and see how many times you need to compute DF2 or DF1. Given the information you have provided I suspect
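For reference, a minimal sketch of inspecting the plan as suggested (in Spark 2.3.x, passing true produces the extended output with the parsed, analyzed, and optimized logical plans plus the physical plan):

```
// Print extended plan information for DF3; look for how many times the CSV scan
// of DF1 appears in the physical plan.
DF3.explain(true)
```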

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-07 Thread Gabor Somogyi
Well, I can't do miracles without cluster and logs access. What I don't understand is why you need a fat jar. Spark libraries normally need provided scope because they must exist on all machines... I would take a look at the driver and executor logs which contain the consumer configs + I would take a
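A sketch of what "provided scope" looks like in an sbt build, assuming sbt is used; the version numbers are illustrative:

```
// build.sbt (sketch): Spark SQL is supplied by the cluster, so mark it "provided";
// the Kafka connector is not on the cluster by default, so ship it (or use --packages).
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql"            % "3.0.1" % "provided",
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.0.1"
)
```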

Caching

2020-12-07 Thread Amit Sharma
Hi All, I am using caching in my code. I have a DF like val DF1 = read csv. val DF2 = DF1.groupBy().agg().select(.) Val DF3 = read csv .join(DF1).join(DF2) DF3 .save. If I do not cache DF2 or DF1 it takes longer. But I am doing only 1 action, so why do I need to cache? Thanks
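A runnable sketch of the scenario being discussed, with hypothetical file paths and column names; the cache() on DF1 is what the replies in this thread recommend, since DF1 feeds both DF2 and DF3:

```
// Sketch with hypothetical paths/columns: without cache(), the first CSV is
// scanned once per branch of the plan that references DF1.
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._

val spark = SparkSession.builder().appName("caching-example").getOrCreate()

val DF1 = spark.read.option("header", "true").csv("/data/input1.csv").cache()
val DF2 = DF1.groupBy("key").agg(sum("value").as("total")).select("key", "total")

val DF3 = spark.read.option("header", "true").csv("/data/input2.csv")
  .join(DF1, "key")
  .join(DF2, "key")

// A single action, but DF1 is referenced twice in the plan behind it.
DF3.write.mode("overwrite").parquet("/data/output")
```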

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-07 Thread Amit Joshi
Hi Gabor, The code is very simple Kafka consumption of data. I guess it may be the cluster. Can you please point out the possible problems to look for in the cluster? Regards Amit On Monday, December 7, 2020, Gabor Somogyi wrote: > + Adding back user list. > > I've had a look at the Spark code

Re: substitution invocator for a variable in PyCharm sql

2020-12-07 Thread Mich Talebzadeh
Thanks Russell, f-string interpolation helped. Replace Scala's 's' with Python's 'f'! Mich

Re: substitution invocator for a variable in PyCharm sql

2020-12-07 Thread Russell Spitzer
The feature you are looking for is called "String Interpolation" and has been available since Python 3.6. It uses a different syntax than Scala's: https://www.programiz.com/python-programming/string-interpolation On Mon, Dec 7, 2020 at 7:05 AM Mich Talebzadeh wrote: > In Spark/Scala you can use 's'

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-07 Thread Gabor Somogyi
+ Adding back user list. I've had a look at the Spark code and it's not modifying "partition.assignment.strategy" so the problem must be either in your application or in your cluster setup. G On Mon, Dec 7, 2020 at 12:31 PM Gabor Somogyi wrote: > It's super interesting because that field has

substitution invocator for a variable in PyCharm sql

2020-12-07 Thread Mich Talebzadeh
In Spark/Scala you can use the 's' string interpolator to substitute a variable in a sql call, for example var sqltext = s""" INSERT INTO TABLE ${broadcastStagingConfig.broadcastTable} PARTITION (broadcastId = ${broadcastStagingConfig.broadcastValue},brand) SELECT
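A compact sketch of the pattern, assuming an active SparkSession `spark`; the case class, table names, and SELECT body below are placeholders standing in for the broadcastStagingConfig object and query referenced above:

```
// Sketch: `config` stands in for broadcastStagingConfig; table names are hypothetical.
case class StagingConfig(broadcastTable: String, broadcastValue: Int)
val config = StagingConfig("staging_db.broadcast_table", 42)

// The s"""...""" interpolator splices Scala values into the SQL string.
val sqltext =
  s"""
     |INSERT INTO TABLE ${config.broadcastTable}
     |PARTITION (broadcastId = ${config.broadcastValue}, brand)
     |SELECT * FROM staging_db.source_table
     |""".stripMargin

spark.sql(sqltext)
```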

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-07 Thread Amit Joshi
Hi All, Thanks for the reply. I did try removing the client version, but got the same exception. Though, one point: there are some dependent artifacts I am using which contain a reference to the Kafka client version. I am trying to make an uber jar, which will choose the closest version.

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-07 Thread Gabor Somogyi
+1 on the mentioned change, Spark uses the following kafka-clients library: 2.4.1 G On Mon, Dec 7, 2020 at 9:30 AM German Schiavon wrote: > Hi, > > I think the issue is that you are overriding the kafka-clients that comes > with spark-sql-kafka-0-10_2.12 > > > I'd try removing the

Re: Missing required configuration "partition.assignment.strategy" [ Kafka + Spark Structured Streaming ]

2020-12-07 Thread German Schiavon
Hi, I think the issue is that you are overriding the kafka-clients that comes with spark-sql-kafka-0-10_2.12. I'd try removing the kafka-clients dependency and see if it works. On Sun, 6 Dec 2020 at 08:01, Amit Joshi wrote: > Hi All, > > I am running the Spark Structured Streaming along with Kafka. >
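A sketch of one way to express that in sbt, assuming sbt is the build tool; the helper dependency shown is hypothetical. The idea is to let spark-sql-kafka-0-10 pull in its own kafka-clients (2.4.1 for Spark 3.0.1, per the reply above) rather than pinning a different version:

```
// build.sbt (sketch): do not declare kafka-clients directly; if another dependency
// drags in a conflicting version, exclude it so the Spark connector's
// transitive kafka-clients wins.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-sql-kafka-0-10" % "3.0.1",
  ("com.example"     %% "some-kafka-helper"    % "1.0.0") // hypothetical dependency
    .exclude("org.apache.kafka", "kafka-clients")
)
```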