Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-03 Thread Gourav Sengupta
Hi, I am copying Dr. Zaharia in this email as I am quoting from his book (once again I may be wrong): Chapter 5: Basic Structured Operations >> Creating Rows You can create rows by manually instantiating a Row object with the values that belong in each column. It’s important to note that only
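[A minimal sketch of the Row instantiation the quoted passage describes (values illustrative; the Row itself carries no schema, so positional values must match the column order of the DataFrame it is used with):

    from pyspark.sql import Row

    # Instantiate a Row with one value per column, in column order.
    my_row = Row("Hello", None, 1, False)
    print(my_row[0])  # 'Hello' -- fields are accessed by position
]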

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-03 Thread Sergey Ivanychev
I want to further clarify the use case I have: an ML engineer collects data so as to use it for training an ML model. The driver is created within a Jupyter notebook and has 64 GB of RAM for fetching the training set and feeding it to the model. Naturally, in this case executors shouldn’t be as big
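[For this kind of fetch-to-driver workload, a hedged sketch (config name per the Spark 3.x docs; the source table is hypothetical) of pulling the training set into pandas with Arrow enabled:

    # Assumes Spark 3.x; Arrow reduces serialization cost when
    # materializing a DataFrame on the driver.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")

    train_df = spark.table("training_data")   # hypothetical source table
    train_pdf = train_df.toPandas()           # materializes on the 64 GB driver

    # train_pdf can then be fed to the ML framework of choice.
]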

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-03 Thread Mich Talebzadeh
Thanks for the clarification on the koalas case. The thread owner states, and I quote: .. IIUC, in the `toPandas` case all the data gets shuffled to a single executor that fails with OOM. I still believe that this may be related to the way k8s handles shuffling. In a balanced k8s cluster this

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-03 Thread Sean Owen
I think you're talking about koalas, which is in Spark 3.2, but that is unrelated to toPandas() and to the question of how it differs from collect(). Shuffle is also unrelated. On Wed, Nov 3, 2021 at 3:45 PM Mich Talebzadeh wrote: > Hi, > > As I understood in the previous versions of Spark the
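[For readers following the thread, a quick sketch of the difference in question — both actions bring the full dataset to the driver; only the result type differs:

    rows = df.collect()    # list of pyspark.sql.Row objects on the driver
    pdf  = df.toPandas()   # pandas.DataFrame on the driver

    # Both are driver-side materializations; neither distributes the result.
]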

Re: [Spark SQL]: Aggregate Push Down / Spark 3.2

2021-11-03 Thread German Schiavon
Hi, Rohit, can you share how it looks using DSv2? Thanks! On Wed, 3 Nov 2021 at 19:35, huaxin gao wrote: > Great to hear. Thanks for testing this! > > On Wed, Nov 3, 2021 at 4:03 AM Kapoor, Rohit > wrote: > >> Thanks for your guidance Huaxin. I have been able to test the push down >>
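[While waiting on Rohit's snippet, a hedged sketch of what an aggregate-push-down test against Postgres might look like in Spark 3.2 (connection details hypothetical; `pushDownAggregate` is the Spark 3.2 JDBC option):

    # Assumes Spark 3.2+ with the PostgreSQL JDBC driver on the classpath.
    df = (spark.read.format("jdbc")
          .option("url", "jdbc:postgresql://db-host:5432/testdb")  # hypothetical
          .option("dbtable", "sales")                              # hypothetical
          .option("user", "spark").option("password", "...")
          .option("pushDownAggregate", "true")   # enable aggregate push down
          .load())

    # MIN/MAX/COUNT/SUM/AVG over the scan can be pushed to Postgres;
    # explain() shows the pushed aggregates in the scan node when it works.
    df.groupBy("region").agg({"amount": "sum"}).explain()
]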

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-03 Thread Sergey Ivanychev
I’m pretty sure it fails with:

    WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 20) (10.20.167.28 executor 2): java.lang.OutOfMemoryError
        at java.base/java.io.ByteArrayOutputStream.hugeCapacity(Unknown Source)

If you look at the «toPandas» plan you can see the exchange stage that doesn’t occur in the
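[To see the graph difference Sergey describes, one sketch (Spark 3.x API) is to print the formatted plan and compare the per-query DAGs in the Spark UI's SQL tab:

    df.explain(mode="formatted")   # physical plan for df itself

    df.collect()     # then inspect this query's DAG in the Spark UI SQL tab
    df.toPandas()    # compare: the extra Exchange reported above shows up here
]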

Re: [Spark SQL]: Aggregate Push Down / Spark 3.2

2021-11-03 Thread huaxin gao
Great to hear. Thanks for testing this! On Wed, Nov 3, 2021 at 4:03 AM Kapoor, Rohit wrote: > Thanks for your guidance Huaxin. I have been able to test the push down > operators successfully against Postgresql using DS v2. > > > > > > *From: *huaxin gao > *Date: *Tuesday, 2 November 2021 at

Re: [Spark SQL]: Aggregate Push Down / Spark 3.2

2021-11-03 Thread Kapoor, Rohit
Thanks for your guidance Huaxin. I have been able to test the push down operators successfully against Postgresql using DS v2. From: huaxin gao Date: Tuesday, 2 November 2021 at 12:35 AM To: Kapoor, Rohit Subject: Re: [Spark SQL]: Aggregate Push Down / Spark 3.2