Re: Spark3.2 on K8s with proxy-user

2022-04-21 Thread Pralabh Kumar
Further information: I have a kerberized cluster and am also doing the kinit. The problem only occurs when the proxy user is being used. On Fri, Apr 22, 2022 at 10:21 AM Pralabh Kumar wrote: > Hi > > Running Spark 3.2 on K8s with --proxy-user and getting the below error, and > then the job fails.
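For reference, a minimal sketch of this kind of submission, assuming standard Kerberos and K8s options; the master URL, container image, principal, keytab, and application names below are placeholders, not details from the original post:

    # Hypothetical invocation; all names and paths are placeholders.
    kinit -kt /etc/security/user.keytab user@EXAMPLE.COM
    spark-submit \
      --master k8s://https://kubernetes.example.com:6443 \
      --deploy-mode cluster \
      --proxy-user analytics_user \
      --conf spark.kubernetes.container.image=example/spark:3.2.0 \
      --class com.example.Main \
      local:///opt/app/app.jar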

Spark3.2 on K8s with proxy-user

2022-04-21 Thread Pralabh Kumar
Hi, Running Spark 3.2 on K8s with --proxy-user, I am getting the below error and then the job fails. However, when running without a proxy user, the job runs fine. Can anyone please help me with this? 22/04/21 17:50:30 WARN Client: Exception encountered while connecting to the server :

[ANNOUNCE] Apache Kyuubi (Incubating) released 1.5.1-incubating

2022-04-21 Thread Fu Chen
Hi all, The Apache Kyuubi (Incubating) community is pleased to announce that Apache Kyuubi (Incubating) 1.5.1-incubating has been released! Apache Kyuubi (Incubating) is a distributed multi-tenant JDBC server for large-scale data processing and analytics, built on top of Apache Spark and
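For context on what such a server looks like from a client: Kyuubi speaks the HiveServer2-compatible JDBC/Thrift protocol, so a minimal connection sketch in Python might use PyHive against Kyuubi's default port 10009 (the client library, host, and user below are assumptions, not details from the announcement):

    # Hypothetical client sketch; host and username are placeholders.
    from pyhive import hive

    conn = hive.connect(host="kyuubi.example.com", port=10009, username="alice")
    cur = conn.cursor()
    cur.execute("SELECT 1")    # statements run on the Spark engine behind Kyuubi
    print(cur.fetchall())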

Re: Why is spark running multiple stages with the same code line?

2022-04-21 Thread Sean Owen
The line of code triggers a job, the job triggers stages. You should see they are different operations, all supporting execution of the action on that line. On Thu, Apr 21, 2022 at 9:24 AM Joe wrote: > Hi Sean, > Thanks for replying but my question was about multiple stages running > the same
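To make that concrete, a small sketch with illustrative names: one action line yields one job, and each shuffle upstream of the action becomes its own stage, so the UI attributes several stages to that single line:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(1_000_000).withColumn("k", F.col("id") % 100)

    # This one line is the job; the groupBy shuffle means the job runs as
    # multiple stages, all reported against this line in the UI.
    df.groupBy("k").count().count()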

Re: Why is spark running multiple stages with the same code line?

2022-04-21 Thread Russell Spitzer
There are a few things going on here. 1. Spark is lazy, so nothing happens until a result is collected back to the driver or data is written to a sink. So the 1 line you see is most likely just that trigger. Once triggered, all of the work required to make that final result happen occurs. If
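A minimal sketch of that laziness, with illustrative names: the transformation lines return immediately, and all of the actual work lands on the single action line:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(10_000_000)

    shaped = df.withColumn("x", F.col("id") * 2).filter("x % 3 = 0")  # no work yet
    shaped.count()  # the action: every upstream transformation executes here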

Re: Why is spark running multiple stages with the same code line?

2022-04-21 Thread Joe
Hi Sean, Thanks for replying, but my question was about multiple stages running the same line of code, not about multiple stages in general. Yes, a single job can have multiple stages, but they should not be repeated, as far as I know, if you're caching/persisting your intermediate outputs. My
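A sketch of the behaviour being described, with hypothetical names: without a cache, each action re-runs the upstream stages (so the same code line appears across many stages), while with one, later jobs mark those stages as skipped:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    df = spark.range(100_000).withColumn("k", F.col("id") % 10)
    other = spark.range(100_000)

    expensive = df.join(other, "id").groupBy("k").count()
    expensive.cache()    # mark for reuse
    expensive.count()    # first action materializes the cache
    # The second action reuses the cached data; its upstream stages show as skipped.
    expensive.write.mode("overwrite").parquet("/tmp/out")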

Re: Why is spark running multiple stages with the same code line?

2022-04-21 Thread Sean Owen
A job can have multiple stages for sure. One action triggers a job. This seems normal. On Thu, Apr 21, 2022, 9:10 AM Joe wrote: > Hi, > When looking at the application UI (in Amazon EMR) I'm seeing one job for > my particular line of code, for example: > 64 Running count at MySparkJob.scala:540 > >

Why is spark running multiple stages with the same code line?

2022-04-21 Thread Joe
Hi, When looking at the application UI (in Amazon EMR) I'm seeing one job for my particular line of code, for example: 64 Running count at MySparkJob.scala:540 When I click into the job and go to its stages I can see over 100 stages running the same line of code (stages are active, pending or

Re: When should we cache / persist ? After or Before Actions?

2022-04-21 Thread Yuri Oleynikov (‫יורי אולייניקוב‬‎)
Hi Sean, Persisting/caching is useful when you’re going to reuse a dataframe, so in your case no persisting/caching is required. That is the “when”. The “where” usually belongs at the closest point of reusing the calculations/transformations. Btw, I’m not sure if caching is useful when you
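A sketch of that placement, with hypothetical names: persist at the closest point where the computed result branches into more than one downstream use:

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()
    raw = spark.range(1_000_000).withColumn("v", F.col("id") % 7)

    shared = raw.groupBy("v").count().persist()  # persist where the reuse starts

    shared.filter("v < 3").write.mode("overwrite").parquet("/tmp/small")   # reuse 1
    shared.filter("v >= 3").write.mode("overwrite").parquet("/tmp/large")  # reuse 2

    shared.unpersist()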

Re: When should we cache / persist ? After or Before Actions?

2022-04-21 Thread Sean Owen
You persist before actions, not after, if you want the action's outputs to be persisted. If anything, swap lines 2 and 3. However, there's no point in the count() here, and because the write is then the only action that follows, no caching is useful in that example. On Thu, Apr 21, 2022 at

[Spark Core]: Unexpectedly exiting executor while gracefully decommissioning

2022-04-21 Thread Yeachan Park
Hello all, we are running into some issues while attempting graceful decommissioning of executors. We are running spark-thriftserver (3.2.0) on Kubernetes (GKE 1.20.15-gke.2500). We enabled:
- spark.decommission.enabled
- spark.storage.decommission.rddBlocks.enabled
-
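For reference, a sketch of what such a configuration typically looks like in spark-defaults.conf form; the message's list is cut off above, so the last two keys here are an assumption of the usual companion settings in Spark 3.2, not settings confirmed by the original post:

    # Standard Spark 3.2 graceful-decommissioning keys (last two assumed).
    spark.decommission.enabled                        true
    spark.storage.decommission.rddBlocks.enabled      true
    spark.storage.decommission.enabled                true
    spark.storage.decommission.shuffleBlocks.enabled  true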

Re: How is union() implemented? Need to implement column bind

2022-04-21 Thread Sean Owen
Not a max - all values are needed. pivot() if anything is much closer, but see the rest of this thread. On Thu, Apr 21, 2022 at 1:19 AM Sonal Goyal wrote: > Seems like an interesting problem to solve! > > If I have understood it correctly, you have 10114 files each with the > structure > >
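A sketch of the pivot() route, assuming each per-file frame is tagged with a column id and stacked into long form first (the data and names below are tiny stand-ins; with ~10114 distinct ids the pivot yields ~10114 columns, which may strain Spark's pivot limits at real scale):

    from pyspark.sql import SparkSession, functions as F

    spark = SparkSession.builder.getOrCreate()

    # Tiny stand-ins for two of the per-file (rowid, value) frames.
    f1 = spark.createDataFrame([("r1", "a"), ("r2", "b")], ["rowid", "value"]) \
              .withColumn("col", F.lit("c0001"))
    f2 = spark.createDataFrame([("r1", "d"), ("r2", "e")], ["rowid", "value"]) \
              .withColumn("col", F.lit("c0002"))

    long_form = f1.unionByName(f2)               # rowid, value, col
    wide = long_form.groupBy("rowid").pivot("col").agg(F.first("value"))
    wide.show()                                  # rowid | c0001 | c0002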

When should we cache / persist ? After or Before Actions?

2022-04-21 Thread Sid
Hi Folks, I am working with the Spark Dataframe API, where I am doing the following: 1) df = spark.sql("some sql on huge dataset").persist() 2) df1 = df.count() 3) df.repartition().write.mode().parquet("") AFAIK, persist should be used after the count statement, if it is needed at all, since
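Per Sean Owen's reply above, a sketch of the corrected shape; since the write ends up as the only real action, both the count() and the persist() can go (the SQL, repartition argument, and output path are placeholders for the elided originals):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Stand-in for "some sql on huge dataset".
    df = spark.sql("SELECT id, id % 5 AS k FROM range(1000000)")

    # With a single action, persist() buys nothing and count() just adds a
    # second full job; the write alone triggers the computation.
    df.repartition(8).write.mode("overwrite").parquet("/tmp/out")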

Re: How is union() implemented? Need to implement column bind

2022-04-21 Thread Sonal Goyal
Seems like an interesting problem to solve! If I have understood it correctly, you have 10114 files, each with the structure

rowid, colA
r1, a
r2, b
r3, c
...5 million rows

If you union them, you will have

rowid, colA, colB
r1, a, null
r2, b, null
r3, c, null
r1, null, d
r2, null, e
r3,
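For comparison, a sketch of a join-based column bind on rowid, with tiny illustrative frames; it avoids the null-padded shape above, though chaining ~10114 joins is likely impractical at that scale (the pivot sketch earlier in this digest is the long-form alternative):

    from functools import reduce
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.getOrCreate()

    # Stand-ins for two of the 10114 per-file frames.
    f1 = spark.createDataFrame([("r1", "a"), ("r2", "b"), ("r3", "c")], ["rowid", "colA"])
    f2 = spark.createDataFrame([("r1", "d"), ("r2", "e"), ("r3", "f")], ["rowid", "colB"])

    frames = [f1, f2]
    bound = reduce(lambda left, right: left.join(right, "rowid"), frames)
    bound.show()   # rowid | colA | colB -- one row per rowid, columns side by side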