Re: PySpark: toPandas() vs collect() execution graph differences

Sergey Ivanychev Thu, 11 Nov 2021 22:48:42 -0800

Hi Gourav,

Please, read my question thoroughly. My problem is with the plan of the 
execution and with the fact that toPandas collects all the data not on the 
driver but on an executor, not with the fact that there’s some memory overhead.


I don’t understand how your excerpts answer my question. The chapters you’ve 
shared describe that serialization is costly, that workers can fail due to the 
memory constraints and inter-language serialization.

This is irrelevant to my question — building pandas DataFrame using Spark’s 
collect() works fine and this operation itself involves much deserialization of 
Row objects.

Best regards,


Sergey Ivanychev

> 12 нояб. 2021 г., в 05:05, Gourav Sengupta <[email protected]> 
> написал(а):
> 
> Hi Sergey,
> 
> Please read the excerpts from the book of Dr. Zaharia that I had sent, they 
> explain these fundamentals clearly.
> 
> Regards,
> Gourav Sengupta
> 
> On Thu, Nov 11, 2021 at 9:40 PM Sergey Ivanychev <[email protected]> 
> wrote:
>> Yes, in fact those are the settings that cause this behaviour. If set to 
>> false, everything goes fine since the implementation in spark sources in 
>> this case is
>> 
>> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>> 
>> Best regards,
>> 
>> 
>> Sergey Ivanychev
>> 
>>> 11 нояб. 2021 г., в 13:58, Mich Talebzadeh <[email protected]> 
>>> написал(а):
>>> 
>>> Have you tried the following settings:
>>> 
>>> spark.conf.set("spark.sql.execution.arrow.pysppark.enabled", "true")
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled","true")
>>> 
>>> HTH
>>> 
>>>    view my Linkedin profile
>>> 
>>>  
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>> loss, damage or destruction of data or any other property which may arise 
>>> from relying on this email's technical content is explicitly disclaimed. 
>>> The author will in no case be liable for any monetary damages arising from 
>>> such loss, damage or destruction.
>>>  
>>> 
>>> 
>>> On Thu, 4 Nov 2021 at 18:06, Mich Talebzadeh <[email protected]> 
>>> wrote:
>>>> Ok so it boils down on how spark does create toPandas() DF under the 
>>>> bonnet. How many executors are involved in k8s cluster. In this model 
>>>> spark will create executors = no of nodes - 1
>>>> 
>>>> On Thu, 4 Nov 2021 at 17:42, Sergey Ivanychev <[email protected]> 
>>>> wrote:
>>>>> > Just to confirm with Collect() alone, this is all on the driver?
>>>>> 
>>>>> I shared the screenshot with the plan in the first email. In the 
>>>>> collect() case the data gets fetched to the driver without problems.
>>>>> 
>>>>> Best regards,
>>>>> 
>>>>> 
>>>>> Sergey Ivanychev
>>>>> 
>>>>>> 4 нояб. 2021 г., в 20:37, Mich Talebzadeh <[email protected]> 
>>>>>> написал(а):
>>>>> 
>>>>>> Just to confirm with Collect() alone, this is all on the driver?
>>>> -- 
>>>> 
>>>> 
>>>>    view my Linkedin profile
>>>> 
>>>>  
>>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>>> loss, damage or destruction of data or any other property which may arise 
>>>> from relying on this email's technical content is explicitly disclaimed. 
>>>> The author will in no case be liable for any monetary damages arising from 
>>>> such loss, damage or destruction.

Re: PySpark: toPandas() vs collect() execution graph differences

Reply via email to