Re: PySpark: toPandas() vs collect() execution graph differences

Sergey Ivanychev Fri, 12 Nov 2021 05:34:31 -0800

Hi Sean,

According to the plan I’m observing, this is what happens indeed. There’s 
exchange operation that sends data to a single partition/task in toPandas() + 
PyArrow enabled case.


> 12 нояб. 2021 г., в 16:31, Sean Owen <sro...@gmail.com> написал(а):
> 
> Yes, none of the responses are addressing your question.
> I do not think it's a bug necessarily; do you end up with one partition in 
> your execution somewhere?
> 
> On Fri, Nov 12, 2021 at 3:38 AM Sergey Ivanychev <sergeyivanyc...@gmail.com 
> <mailto:sergeyivanyc...@gmail.com>> wrote:
> Of course if I give 64G of ram to each executor they will work. But what’s 
> the point? Collecting results in the driver should cause a high RAM usage in 
> the driver and that’s what is happening in collect() case. In the case where 
> pyarrow serialization is enabled all the data is being collected on a single 
> executor, which is clearly a wrong way to collect the result on the driver.
> 
> I guess I’ll open an issue about it in Spark Jira. It clearly looks like a 
> bug.
> 
>> 12 нояб. 2021 г., в 11:59, Mich Talebzadeh <mich.talebza...@gmail.com 
>> <mailto:mich.talebza...@gmail.com>> написал(а):
>> 
>> OK, your findings do not imply those settings are incorrect. Those settings 
>> will work if you set-up your k8s cluster in peer-to-peer mode with equal 
>> amounts of RAM for each node which is common practice.
>> 
>> HTH
>> 
>> 
>>    view my Linkedin profile 
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>  
>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>> loss, damage or destruction of data or any other property which may arise 
>> from relying on this email's technical content is explicitly disclaimed. The 
>> author will in no case be liable for any monetary damages arising from such 
>> loss, damage or destruction.
>>  
>> 
>> 
>> On Thu, 11 Nov 2021 at 21:39, Sergey Ivanychev <sergeyivanyc...@gmail.com 
>> <mailto:sergeyivanyc...@gmail.com>> wrote:
>> Yes, in fact those are the settings that cause this behaviour. If set to 
>> false, everything goes fine since the implementation in spark sources in 
>> this case is
>> 
>> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>> 
>> Best regards,
>> 
>> 
>> Sergey Ivanychev
>> 
>>> 11 нояб. 2021 г., в 13:58, Mich Talebzadeh <mich.talebza...@gmail.com 
>>> <mailto:mich.talebza...@gmail.com>> написал(а):
>>> 
>>> 
>>> Have you tried the following settings:
>>> 
>>> spark.conf.set("spark.sql.execution.arrow.pysppark.enabled", "true")
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled","true")
>>> 
>>> HTH
>>> 
>>>    view my Linkedin profile 
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>  
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>> loss, damage or destruction of data or any other property which may arise 
>>> from relying on this email's technical content is explicitly disclaimed. 
>>> The author will in no case be liable for any monetary damages arising from 
>>> such loss, damage or destruction.
>>>  
>>> 
>>> 
>>> On Thu, 4 Nov 2021 at 18:06, Mich Talebzadeh <mich.talebza...@gmail.com 
>>> <mailto:mich.talebza...@gmail.com>> wrote:
>>> Ok so it boils down on how spark does create toPandas() DF under the 
>>> bonnet. How many executors are involved in k8s cluster. In this model spark 
>>> will create executors = no of nodes - 1
>>> 
>>> On Thu, 4 Nov 2021 at 17:42, Sergey Ivanychev <sergeyivanyc...@gmail.com 
>>> <mailto:sergeyivanyc...@gmail.com>> wrote:
>>> > Just to confirm with Collect() alone, this is all on the driver?
>>> 
>>> I shared the screenshot with the plan in the first email. In the collect() 
>>> case the data gets fetched to the driver without problems.
>>> 
>>> Best regards,
>>> 
>>> 
>>> Sergey Ivanychev
>>> 
>>>> 4 нояб. 2021 г., в 20:37, Mich Talebzadeh <mich.talebza...@gmail.com 
>>>> <mailto:mich.talebza...@gmail.com>> написал(а):
>>>> 
>>> 
>>>> Just to confirm with Collect() alone, this is all on the driver?
>>> -- 
>>> 
>>> 
>>>    view my Linkedin profile 
>>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>>  
>>> Disclaimer: Use it at your own risk. Any and all responsibility for any 
>>> loss, damage or destruction of data or any other property which may arise 
>>> from relying on this email's technical content is explicitly disclaimed. 
>>> The author will in no case be liable for any monetary damages arising from 
>>> such loss, damage or destruction.
>>>  
>

Re: PySpark: toPandas() vs collect() execution graph differences

Reply via email to