Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Hyukjin Kwon
To add some more, if the number of rows to collect is large, DataFrame.collect can be slower because it launches multiple Spark jobs sequentially. Given that DataFrame.toPandas does not take a number of rows to collect, it's debatable whether the same optimization used in DataFrame.collect should be applied there.
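
A minimal sketch of the comparison being discussed, assuming a Spark 3.x session; the example DataFrame is hypothetical, and the job counts you see will depend on the Spark version, so check the Jobs tab of the Spark UI rather than treating the comments as guarantees.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Hypothetical toy data, only for illustration.
df = spark.range(0, 1_000_000, numPartitions=100)

# DataFrame.collect(): rows come back to the driver as Row objects.
# Per the message above, collecting a large result this way may launch
# several Spark jobs one after another -- watch the Jobs tab in the UI.
rows = df.collect()

# DataFrame.toPandas(): no row count is passed in, so the optimization
# discussed for collect() does not apply here.
pdf = df.toPandas()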

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Hyukjin Kwon
Thanks for pinging me, Sean. Yes, there's an optimization on DataFrame.collect which tries to collect the first few partitions, checks whether the requested number of rows has been found, and repeats if not. DataFrame.toPandas does not have such an optimization. I suspect that the shuffle isn't an actual shuffle but just collects
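
For readers unfamiliar with the pattern being referenced, here is a conceptual sketch of an incremental "collect a few partitions, check, scale up" loop written against the RDD API. It illustrates the idea only and is not Spark's internal code; the function name, the scale-up factor of 4, and the use of SparkContext.runJob are choices made for this sketch.

# Conceptual illustration only -- not Spark's actual implementation.
def incremental_collect(rdd, num_rows_wanted, scale_up_factor=4):
    collected = []
    num_partitions = rdd.getNumPartitions()
    start, to_scan = 0, 1
    while len(collected) < num_rows_wanted and start < num_partitions:
        partition_ids = list(range(start, min(start + to_scan, num_partitions)))
        # runJob evaluates only the selected partitions and returns their rows.
        rows = rdd.context.runJob(rdd, lambda it: list(it), partition_ids)
        collected.extend(rows)
        start += to_scan
        to_scan *= scale_up_factor
    return collected[:num_rows_wanted]

# Example use against a DataFrame's underlying RDD:
# first_rows = incremental_collect(df.rdd, 1000)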

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Sean Owen
Hyukjin, can you weigh in? Is this exchange due to something in your operations, or clearly unique to the toPandas operation? I didn't think it worked that way, but maybe there is some good reason it does. On Fri, Nov 12, 2021 at 7:34 AM Sergey Ivanychev wrote: > Hi Sean, > > According to the

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Sergey Ivanychev
Hi Sean, According to the plan I’m observing, this is indeed what happens. There’s an exchange operation that sends data to a single partition/task in the toPandas() + PyArrow enabled case. > On 12 Nov 2021, at 16:31, Sean Owen wrote: > > Yes, none of the responses are addressing your
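
A minimal sketch, assuming Spark 3.x, of how one might reproduce the comparison described above: run toPandas() with Arrow enabled and then disabled, and compare the executed plans in the SQL tab of the Spark UI (DataFrame.explain() shows the DataFrame's own plan, which may differ from the plan actually executed by toPandas()). The example DataFrame is hypothetical.

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
df = spark.range(1_000_000)  # hypothetical example data

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf_arrow = df.toPandas()    # check the executed plan in the SQL tab for an Exchange

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
pdf_plain = df.toPandas()    # compare with the non-Arrow plan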

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Sean Owen
Yes, none of the responses are addressing your question. I do not think it's a bug necessarily; do you end up with one partition in your execution somewhere? On Fri, Nov 12, 2021 at 3:38 AM Sergey Ivanychev wrote: > Of course if I give 64G of ram to each executor they will work. But what’s >

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Sergey Ivanychev
Of course, if I give 64G of RAM to each executor they will work. But what’s the point? Collecting results in the driver should cause high RAM usage in the driver, and that’s what happens in the collect() case. In the case where PyArrow serialization is enabled, all the data is being collected
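
If driver-side memory is the concern, these are the usual knobs for sizing the driver for a large collect. This is only a sketch: the values are placeholders, and spark.driver.memory generally has to be set before the driver JVM starts (e.g. via spark-submit or spark-defaults) rather than on an already-running session.

from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.driver.memory", "16g")        # placeholder value
    .config("spark.driver.maxResultSize", "8g")  # placeholder; 0 removes the cap
    .getOrCreate()
)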

Re: PySpark: toPandas() vs collect() execution graph differences

2021-11-12 Thread Mich Talebzadeh
OK, your findings do not imply those settings are incorrect. Those settings will work if you set up your k8s cluster in peer-to-peer mode with equal amounts of RAM for each node, which is common practice. HTH