Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Sergey Ivanychev
sing your question. > I do not think it's a bug necessarily; do you end up with one partition in > your execution somewhere? > > On Fri, Nov 12, 2021 at 3:38 AM Sergey Ivanychev <sergeyivanyc...@gmail.com> wrote: > Of course if I give 64G of RAM to each executor they w

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Sergey Ivanychev
damages arising from such > loss, damage or destruction. > > > > On Thu, 11 Nov 2021 at 21:39, Sergey Ivanychev <sergeyivanyc...@gmail.com> wrote: > Yes, in fact those are the settings that cause this behaviour. If set to > false, eve

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-11 Thread Sergey Ivanychev
deserialization of Row objects. Best regards, Sergey Ivanychev > On 12 Nov 2021, at 05:05, Gourav Sengupta wrote: > > Hi Sergey, > > Please read the excerpts from the book of Dr. Zaharia that I had sent, they > explain these fundamentals clearly. > >

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-11 Thread Sergey Ivanychev
Yes, in fact those are the settings that cause this behaviour. If set to false, everything goes fine, since the implementation in the Spark sources in this case is: pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns) Best regards, Sergey Ivanychev > On 11 Nov 2021, at 13
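The fallback path quoted in this message can be illustrated without a cluster. The following is only a minimal sketch: the `Row` namedtuple is a hypothetical stand-in for the rows that `collect()` returns, `rows_to_pandas` is an illustrative helper name, and the column names and values are invented for the example.

```python
from collections import namedtuple

import pandas as pd

# Hypothetical stand-in for the Row objects returned by collect();
# the field names "id" and "name" are invented for illustration.
Row = namedtuple("Row", ["id", "name"])


def rows_to_pandas(rows, columns):
    """Build a pandas DataFrame from rows already fetched to the driver."""
    # Same call shape as the line quoted in the message above.
    return pd.DataFrame.from_records(rows, columns=columns)


# Pretend these rows came back from collect().
collected = [Row(1, "a"), Row(2, "b"), Row(3, "c")]
pdf = rows_to_pandas(collected, columns=["id", "name"])
```

In this sketch all rows are materialised as Python objects on the driver before pandas sees them, which matches the message's point that the non-Arrow path goes through `collect()`.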

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
> did you get to read the excerpts from the book of Dr. Zaharia? I read what you have shared but didn’t manage to get your point. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 20:38, Gourav Sengupta wrote: > > did you get to read the excerpts from the book of Dr. Zaharia?

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
> Just to confirm with Collect() alone, this is all on the driver? I shared the screenshot with the plan in the first email. In the collect() case the data gets fetched to the driver without problems. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 20:37, Mich Talebzadeh

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
on executors. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 15:17, Mich Talebzadeh wrote: > > From your notes ".. IIUC, in the `toPandas` case all the data gets shuffled > to a single executor that fails with OOM, which doesn’t happen in `collect` >

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-04 Thread Sergey Ivanychev
in execution plans. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 13:12, Mich Talebzadeh wrote: > > Do you have the output for executors from Spark GUI, the one that eventually > ends up with OOM? > > Also what does > > kubectl get pods -n

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-03 Thread Sergey Ivanychev
as the driver. Currently, the best solution I found is to write the dataframe to S3, and then read it via pd.read_parquet. Best regards, Sergey Ivanychev > On 4 Nov 2021, at 00:18, Mich Talebzadeh wrote: > > Thanks for clarification on the koalas case. >

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-03 Thread Sergey Ivanychev
at RDD. Also toPandas() converts to Python objects in memory; I do not think > that collect does it. > > Regards, > Gourav > > On Wed, Nov 3, 2021 at 2:24 PM Sergey Ivanychev <sergeyivanyc...@gmail.com> wrote: > Hi, > > Spark 3.1.2 K8s. >