Read file from local

2021-11-04 Thread Lynx Du
Hi experts, I have just gotten started with Spark and Scala, and I am confused about how to read local files. I run a Spark cluster using docker-compose, with one master and two worker nodes; I believe this is a so-called standalone cluster. I am trying to submit a simple task to this cluster by

Pyspark 2.4.4 window functions inconsistent

2021-11-04 Thread van wilson
I am using PySpark SQL to run a SQL script with a window function that pulls in (lead) data from the next row to populate the current row. It works reliably in Jupyter in VS Code using Anaconda PySpark 3.0.0, but it produces different results every time on AWS EMR using Spark 2.4.4. Why? Is there any known

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Mich Talebzadeh
OK, so it boils down to how Spark creates the toPandas() DataFrame under the bonnet, and how many executors are involved in the k8s cluster. In this model Spark will create executors = number of nodes - 1. On Thu, 4 Nov 2021 at 17:42, Sergey Ivanychev wrote: > > Just to confirm with Collect() alone, this is all on

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
> did you get to read the excerpts from the book of Dr. Zaharia? I read what you have shared but didn’t manage to get your point. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 20:38, Gourav Sengupta > wrote: > > did you get to read the excerpts from the book of Dr. Zaharia?

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
> Just to confirm with Collect() alone, this is all on the driver? I shared the screenshot with the plan in the first email. In the collect() case the data gets fetched to the driver without problems. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 20:37, Mich Talebzadeh > wrote:

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Gourav Sengupta
Hi, did you get to read the excerpts from the book of Dr. Zaharia? Regards, Gourav On Thu, Nov 4, 2021 at 4:11 PM Sergey Ivanychev wrote: > I’m sure that it's running in client mode. I don’t want to have the same > amount of RAM on drivers and executors since there’s no point in giving 64G >

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Mich Talebzadeh
Well, evidently the indication is that this is happening on the executor and not on the driver node as assumed. Just to confirm: with collect() alone, this is all on the driver? HTH On Thu, 4 Nov 2021 at 16:10, Sergey Ivanychev wrote: > I’m sure that it's running in client mode. I don’t want

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
I’m sure that it's running in client mode. I don’t want to have the same amount of RAM on drivers and executors since there’s no point in giving 64G of RAM to executors in my case. My question is why the collect and toPandas actions produce such different plans, which causes toPandas to fail on

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Mich Talebzadeh
From your notes: ".. IIUC, in the `toPandas` case all the data gets shuffled to a single executor that fails with OOM, which doesn’t happen in the `collect` case. Does it work like that? How do I collect a large dataset that fits into the memory of the driver?". The acid test would be to use pandas

Re: [Spark SQL]: Aggregate Push Down / Spark 3.2

2021-11-04 Thread Kapoor, Rohit
My basic test is here - https://github.com/rohitkapoor1/sparkPushDownAggregate From: German Schiavon Date: Thursday, 4 November 2021 at 2:17 AM To: huaxin gao Cc: Kapoor, Rohit , user@spark.apache.org Subject: Re: [Spark SQL]: Aggregate Push Down / Spark 3.2 EXTERNAL MAIL: USE CAUTION BEFORE

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
I will follow up with the output, but I suppose Jupyter runs in client mode since it’s created via getOrCreate with a K8s api server as master. Also note that I tried both “collect” and “toPandas” in the same conditions (Jupyter client mode), so IIUC your theory doesn’t explain that difference

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Mich Talebzadeh
Do you have the output for executors from spark GUI, the one that eventually ends up with OOM? Also what does kubectl get pods -n $NAMESPACE DRIVER_POD_NAME=`kubectl get pods -n $NAMESPACE |grep driver|awk '{print $1}'` kubectl logs $DRIVER_POD_NAME -n $NAMESPACE kubectl logs $EXECUTOR_WITH_OOM

Re: [Spark SQL]: Aggregate Push Down / Spark 3.2

2021-11-04 Thread Sunil Prabhakara
Unsubscribe. On Mon, Nov 1, 2021 at 6:57 PM Kapoor, Rohit wrote: > Hi, > > > > I am testing the aggregate push down for JDBC after going through the JIRA > - https://issues.apache.org/jira/browse/SPARK-34952 > > I have the latest Spark 3.2 setup in local mode (laptop). > > > > I have PostgreSQL