Hi experts,
I have just started using Spark and Scala, and I am confused about how to read
local files.
I run a Spark cluster using docker-compose, with one master and 2 worker
nodes. I believe this is a so-called standalone cluster.
I am trying to submit a simple task to this cluster by
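For example, a minimal PySpark sketch of what I mean (the master URL and file
path are placeholders for my docker-compose setup):

from pyspark.sql import SparkSession

# Standalone master URL as exposed by docker-compose (placeholder).
spark = (SparkSession.builder
         .master("spark://spark-master:7077")
         .appName("read-local-file")
         .getOrCreate())

# On a standalone cluster a "local" path must exist on the driver
# and on every worker container; the file:// scheme makes the
# intent explicit.
df = spark.read.text("file:///opt/data/input.txt")
df.show()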
I am using PySpark SQL to run a SQL script with a window function that uses
lead() to pull data from the next row into the current row. It works reliably
in Jupyter in VS Code with Anaconda PySpark 3.0.0, but it produces different
results every time on AWS EMR with Spark 2.4.4. Why? Is there any
known
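The pattern is roughly this sketch (column names are placeholders). Could an
ORDER BY that leaves ties explain it, since lead() is only deterministic under
a total ordering?

from pyspark.sql import SparkSession, Window
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("lead-example").getOrCreate()

df = spark.createDataFrame(
    [("a", 1, 10), ("a", 1, 20), ("a", 2, 30)],
    ["key", "ts", "value"])

# If the ORDER BY leaves ties (two rows with ts=1 here), lead() may
# return either row's value, so results can differ between runs and
# Spark versions; a tie-breaker column makes the ordering total.
w = Window.partitionBy("key").orderBy("ts", "value")
df.withColumn("next_value", F.lead("value").over(w)).show()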
OK, so it boils down to how Spark creates the toPandas() DataFrame under the
bonnet, and to how many executors are involved in the k8s cluster. In this
model Spark will create executors = number of nodes - 1.
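For reference, a quick sketch of pinning the executor count explicitly instead
of relying on the node count (standard Spark config keys; values are examples):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .config("spark.executor.instances", "2")            # fixed executor count
         .config("spark.dynamicAllocation.enabled", "false") # no scaling
         .getOrCreate())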
On Thu, 4 Nov 2021 at 17:42, Sergey Ivanychev wrote:
> > Just to confirm: with collect() alone, this is all on
> did you get to read the excerpts from the book of Dr. Zaharia?
I read what you have shared but didn’t manage to get your point.
Best regards,
Sergey Ivanychev
> On 4 Nov 2021, at 20:38, Gourav Sengupta wrote:
>
> did you get to read the excerpts from the book of Dr. Zaharia?
> Just to confirm: with collect() alone, this is all on the driver?
I shared the screenshot with the plan in the first email. In the collect() case
the data gets fetched to the driver without problems.
Best regards,
Sergey Ivanychev
> On 4 Nov 2021, at 20:37, Mich Talebzadeh wrote:
Hi,
did you get to read the excerpts from the book of Dr. Zaharia?
Regards,
Gourav
On Thu, Nov 4, 2021 at 4:11 PM Sergey Ivanychev wrote:
> I’m sure that it’s running in client mode. I don’t want to have the same
> amount of RAM on drivers and executors since there’s no point in giving 64G
>
Well, evidently the indication is that this is happening on the executor and
not on the driver node as assumed. Just to confirm: with collect() alone,
this is all on the driver?
HTH
On Thu, 4 Nov 2021 at 16:10, Sergey Ivanychev wrote:
> I’m sure that it’s running in client mode. I don’t want
I’m sure that it’s running in client mode. I don’t want to have the same amount
of RAM on drivers and executors, since there’s no point in giving 64G of RAM to
executors in my case.
My question is why the collect and toPandas actions produce such different
plans, which causes toPandas to fail on
From your notes: “.. IIUC, in the `toPandas` case all the data gets shuffled
to a single executor that fails with OOM, which doesn’t happen in the `collect`
case. Does it work like that? How do I collect a large dataset that fits into
the memory of the driver?”
The acid test would be to use pandas
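For example, something like this sketch (the Arrow setting uses the Spark 3.x
config key; the DataFrame is just a stand-in):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("collect-vs-toPandas").getOrCreate()

# Arrow-based conversion changes how toPandas() transfers data to the
# driver (Spark 3.x key; in 2.x it is spark.sql.execution.arrow.enabled).
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

df = spark.range(10_000_000)

rows = df.collect()   # list of Row objects, built on the driver
pdf = df.toPandas()   # pandas DataFrame, also built on the driver

# Compare the SQL tab in the Spark UI for the two actions to see
# where any extra shuffle/exchange appears.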
My basic test is here - https://github.com/rohitkapoor1/sparkPushDownAggregate
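In essence the test looks roughly like this (connection details and the
catalog name are placeholders; pushDownAggregate is the JDBC option added for
this feature):

from pyspark.sql import SparkSession

spark = (SparkSession.builder
         .appName("jdbc-agg-pushdown")
         # DSv2 JDBC catalog for PostgreSQL; the catalog name "pg" is arbitrary.
         .config("spark.sql.catalog.pg",
                 "org.apache.spark.sql.execution.datasources.v2.jdbc.JDBCTableCatalog")
         .config("spark.sql.catalog.pg.url", "jdbc:postgresql://localhost:5432/testdb")
         .config("spark.sql.catalog.pg.driver", "org.postgresql.Driver")
         .config("spark.sql.catalog.pg.pushDownAggregate", "true")
         .getOrCreate())

# If the push down works, MAX/MIN show up inside the relation scan
# in the physical plan instead of a Spark-side aggregate.
spark.sql("SELECT max(salary), min(salary) FROM pg.public.emp").explain()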
From: German Schiavon
Date: Thursday, 4 November 2021 at 2:17 AM
To: huaxin gao
Cc: Kapoor, Rohit, user@spark.apache.org
Subject: Re: [Spark SQL]: Aggregate Push Down / Spark 3.2
I will follow up with the output, but I suppose Jupyter runs in client mode
since the session is created via getOrCreate with a K8s API server as master.
Also note that I tried both “collect” and “toPandas” under the same conditions
(Jupyter, client mode), so IIUC your theory doesn’t explain that difference.
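For context, the session is created roughly like this (the API server address,
image and namespace are placeholders):

from pyspark.sql import SparkSession

# Client mode: the driver is the Jupyter kernel itself; executors
# are launched as pods via the K8s API server (placeholder URL).
spark = (SparkSession.builder
         .master("k8s://https://kubernetes.default.svc:443")
         .appName("jupyter")
         .config("spark.kubernetes.container.image", "spark-py:3.1")
         .config("spark.kubernetes.namespace", "spark")
         .config("spark.executor.instances", "4")
         .getOrCreate())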
Do you have the output for the executors from the Spark GUI, specifically the
one that eventually ends up with OOM?
Also, what do the following show?
kubectl get pods -n $NAMESPACE
DRIVER_POD_NAME=`kubectl get pods -n $NAMESPACE | grep driver | awk '{print $1}'`
kubectl logs $DRIVER_POD_NAME -n $NAMESPACE
kubectl logs $EXECUTOR_WITH_OOM
On Mon, Nov 1, 2021 at 6:57 PM Kapoor, Rohit wrote:
> Hi,
>
> I am testing the aggregate push down for JDBC after going through the JIRA
> - https://issues.apache.org/jira/browse/SPARK-34952
>
> I have the latest Spark 3.2 setup in local mode (laptop).
>
> I have PostgreSQL