Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-03 Thread Sergey Ivanychev
ct works > at RDD. Also toPandas() converts to Python objects in memory I do not think > that collect does it. > > Regards, > Gourav > > On Wed, Nov 3, 2021 at 2:24 PM Sergey Ivanychev <mailto:sergeyivanyc...@gmail.com>> wrote: > Hi, > > Spark

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-03 Thread Sergey Ivanychev
as the driver. Currently, the best solution I found is to write the dataframe to S3, and then read it via pd.read_parquet. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 00:18, Mich Talebzadeh wrote: > > Thanks for clarification on the koalas case. >
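The workaround mentioned in this message (write the dataframe out as Parquet, then load it with pd.read_parquet) could look roughly like the sketch below. The helper name `to_pandas_via_parquet`, its `read` parameter, and the example path are hypothetical and not from the thread; reading an `s3://` path with pandas is assumed to need an S3 filesystem package such as s3fs plus a Parquet engine.

```python
import pandas as pd


def to_pandas_via_parquet(spark_df, path, read=pd.read_parquet):
    """Materialise a Spark DataFrame as pandas by going through Parquet files.

    Hypothetical helper: `spark_df` is assumed to be a pyspark.sql.DataFrame,
    `path` a location both Spark and pandas can reach (for example an S3 URI).
    `read` is injectable so the loading step can be swapped or configured.
    """
    # The executors write the Parquet files in parallel; no rows pass
    # through the driver during this step.
    spark_df.write.mode("overwrite").parquet(path)
    # pandas then reads the written files on the driver side.
    return read(path)
```

Usage would be something like `pdf = to_pandas_via_parquet(df, "s3://some-bucket/some-prefix/")`, where the bucket and prefix are placeholders.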

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
in execution plans. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 13:12, Mich Talebzadeh wrote: > > Do you have the output for executors from spark GUI, the one that eventually > ends up with OOM? > > Also what does > > kubectl get pods -n

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
executors. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 15:17, Mich Talebzadeh wrote: > > From your notes ".. IIUC, in the `toPandas` case all the data gets shuffled > to a single executor that fails with OOM, which doesn’t happen in `collect` >

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
> Just to confirm with Collect() alone, this is all on the driver? I shared the screenshot with the plan in the first email. In the collect() case the data gets fetched to the driver without problems. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 20:37, Mich Talebzadeh >

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
> did you get to read the excerpts from the book of Dr. Zaharia? I read what you have shared but didn’t manage to get your point. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 20:38, Gourav Sengupta wrote: > > did you get to read the excerpts from the book of Dr. Zaharia?

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-11 Thread Sergey Ivanychev
Yes, in fact those are the settings that cause this behaviour. If set to false, everything goes fine, since the implementation in the Spark sources in this case is: pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) Best regards, Sergey Ivanychev > On 11 Nov 2021, at 13
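The non-Arrow code path quoted in this message can be reproduced outside of toPandas() as a small standalone sketch. The helper name `to_pandas_via_collect` is hypothetical. The message does not name the settings; the assumption here is that they are the Arrow options, `spark.sql.execution.arrow.pyspark.enabled` in Spark 3.x.

```python
import pandas as pd


def to_pandas_via_collect(df):
    """Build a pandas DataFrame from rows collected to the driver.

    Mirrors the line quoted in the message above. `df` is assumed to be a
    pyspark.sql.DataFrame, i.e. an object with `collect()` and `columns`.
    """
    # collect() brings every row to the driver as Python objects, so the
    # driver needs enough memory to hold the whole result.
    rows = df.collect()
    return pd.DataFrame.from_records(rows, columns=df.columns)


# Assumed way to make toPandas() itself take this path (Spark 3.x option name,
# not stated in the thread):
#   spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
```

Either form keeps the execution plan the same as a plain collect(), which per the thread is the case that completes without problems.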

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-11 Thread Sergey Ivanychev
deserialization of Row objects. Best regards, Sergey Ivanychev > On 12 Nov 2021, at 05:05, Gourav Sengupta wrote: > > Hi Sergey, > > Please read the excerpts from the book of Dr. Zaharia that I had sent, they > explain these fundamentals clearly. > > Regards

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Sergey Ivanychev
etary damages arising from such > loss, damage or destruction. > > > > On Thu, 11 Nov 2021 at 21:39, Sergey Ivanychev <mailto:sergeyivanyc...@gmail.com>> wrote: > Yes, in fact those are the settings that cause this behaviour. If set to > false

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Sergey Ivanychev
sing your question. > I do not think it's a bug necessarily; do you end up with one partition in > your execution somewhere? > > On Fri, Nov 12, 2021 at 3:38 AM Sergey Ivanychev <mailto:sergeyivanyc...@gmail.com>> wrote: > Of course if I give 64G of ram to each executor t