Hi,

Sorry.
Regards,
Gourav Sengupta

On Fri, Nov 12, 2021 at 6:48 AM Sergey Ivanychev <sergeyivanyc...@gmail.com> wrote:

> Hi Gourav,
>
> Please read my question thoroughly. My problem is with the execution plan
> and with the fact that toPandas() collects all the data not on the driver
> but on an executor, not with the fact that there is some memory overhead.
>
> I don't understand how your excerpts answer my question. The chapters you
> shared describe that serialization is costly, that workers can fail due to
> memory constraints, and that inter-language serialization is expensive.
>
> This is irrelevant to my question: building a pandas DataFrame using
> Spark's collect() works fine, and that operation itself involves plenty of
> deserialization of Row objects.
>
> Best regards,
>
> Sergey Ivanychev
>
> On 12 Nov 2021, at 05:05, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
> Hi Sergey,
>
> Please read the excerpts from Dr. Zaharia's book that I sent; they explain
> these fundamentals clearly.
>
> Regards,
> Gourav Sengupta
>
> On Thu, Nov 11, 2021 at 9:40 PM Sergey Ivanychev
> <sergeyivanyc...@gmail.com> wrote:
>
>> Yes, in fact those are the settings that cause this behaviour. If they
>> are set to false, everything works fine, since in that case the
>> implementation in the Spark sources is
>>
>> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>>
>> Best regards,
>>
>> Sergey Ivanychev
>>
>> On 11 Nov 2021, at 13:58, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>> Have you tried the following settings:
>>
>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
>> spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")
>>
>> HTH
>>
>> On Thu, 4 Nov 2021 at 18:06, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> OK, so it boils down to how Spark creates the toPandas() DataFrame under
>>> the bonnet, and to how many executors are involved in the k8s cluster.
>>> In this model Spark will create executors = number of nodes - 1.
>>>
>>> On Thu, 4 Nov 2021 at 17:42, Sergey Ivanychev
>>> <sergeyivanyc...@gmail.com> wrote:
>>>
>>>> > Just to confirm, with collect() alone, this is all on the driver?
>>>>
>>>> I shared the screenshot with the plan in the first email. In the
>>>> collect() case the data gets fetched to the driver without problems.
>>>>
>>>> Best regards,
>>>>
>>>> Sergey Ivanychev
>>>>
>>>> On 4 Nov 2021, at 20:37, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>> Just to confirm, with collect() alone, this is all on the driver?
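For reference, here is a minimal, self-contained sketch of the two toPandas()
code paths discussed above. The two Arrow config keys and the from_records()
line come from this thread; the session setup, app name, and the example
DataFrame are illustrative placeholders, not from the original question.

    from pyspark.sql import SparkSession
    import pandas as pd

    # Illustrative session; the app name is a placeholder.
    spark = SparkSession.builder.appName("toPandas-arrow-sketch").getOrCreate()

    # A small example DataFrame standing in for the real query.
    df = spark.range(1_000_000).toDF("id")

    # Path 1: Arrow-based conversion, using the settings quoted above.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
    spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")
    pdf_arrow = df.toPandas()

    # Path 2: with Arrow disabled, toPandas() collects Row objects on the
    # driver and builds the pandas DataFrame there.
    spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
    pdf_rows = df.toPandas()

    # Equivalent manual form of the non-Arrow path, i.e. the snippet quoted
    # from the Spark sources:
    pdf_manual = pd.DataFrame.from_records(df.collect(), columns=df.columns)

Comparing the query plans of the two paths in the Spark UI's SQL tab should
reproduce the difference described above: the collect()-based path fetches the
rows to the driver, while the Arrow-enabled path is the one showing the
behaviour in question.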