This might be related:
https://stackoverflow.com/questions/46832394/spark-access-first-n-rows-take-vs-limit
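
For reference, a minimal sketch of the take-vs-limit difference the linked
question discusses (assuming a DataFrame df):

    first_rows = df.take(100)  # action: returns up to 100 Row objects to the driver
    limited = df.limit(100)    # transformation: lazy, returns a new DataFrame
    pdf = limited.toPandas()   # the conversion runs only when this action is called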

Best,
Georg

On Fri, 12 Nov 2021 at 07:48, Sergey Ivanychev <
sergeyivanyc...@gmail.com> wrote:

> Hi Gourav,
>
> Please read my question thoroughly. My problem is with the execution plan
> and with the fact that toPandas collects all the data on an executor rather
> than on the driver, not with the fact that there is some memory overhead.
>
> I don’t understand how your excerpts answer my question. The chapters you
> shared describe that serialization is costly and that workers can fail due
> to memory constraints and inter-language serialization.
>
> This is irrelevant to my question: building a pandas DataFrame via Spark’s
> collect() works fine, and that operation itself involves plenty of
> deserialization of Row objects.
>
> Best regards,
>
>
> Sergey Ivanychev
>
> On 12 Nov 2021, at 05:05, Gourav Sengupta <gourav.sengu...@gmail.com>
> wrote:
>
>
> Hi Sergey,
>
> Please read the excerpts from Dr. Zaharia’s book that I sent; they explain
> these fundamentals clearly.
>
> Regards,
> Gourav Sengupta
>
> On Thu, Nov 11, 2021 at 9:40 PM Sergey Ivanychev <
> sergeyivanyc...@gmail.com> wrote:
>
>> Yes, in fact those are the settings that cause this behaviour. If they are
>> set to false, everything works fine, since in that case the implementation
>> in the Spark sources is
>>
>> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
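>>
>> That is, with Arrow disabled, toPandas() is roughly equivalent to this
>> sketch (df stands for any DataFrame):
>>
>> import pandas as pd
>>
>> rows = df.collect()  # all rows are fetched to the driver as Row objects
>> pdf = pd.DataFrame.from_records(rows, columns=df.columns)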
>>
>> Best regards,
>>
>>
>> Sergey Ivanychev
>>
>> On 11 Nov 2021, at 13:58, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>
>> Have you tried the following settings:
>>
>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
>>
>> spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")
>>
>> HTH
>>
>>
>>    view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>>
>>
>>
>> On Thu, 4 Nov 2021 at 18:06, Mich Talebzadeh <mich.talebza...@gmail.com>
>> wrote:
>>
>>> OK, so it boils down to how Spark creates the toPandas() DataFrame under
>>> the bonnet. How many executors are involved in the k8s cluster? In this
>>> model Spark will create executors = number of nodes - 1.
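>>>
>>> Rather than relying on that default, the executor count can be pinned
>>> explicitly when building the session. A minimal sketch (the k8s master
>>> URL is a placeholder):
>>>
>>> from pyspark.sql import SparkSession
>>>
>>> spark = (
>>>     SparkSession.builder
>>>     .master("k8s://https://<k8s-api-server>:443")  # placeholder API server address
>>>     .config("spark.executor.instances", "4")       # request a fixed number of executors
>>>     .getOrCreate()
>>> )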
>>>
>>> On Thu, 4 Nov 2021 at 17:42, Sergey Ivanychev <sergeyivanyc...@gmail.com>
>>> wrote:
>>>
>>>> > Just to confirm, with collect() alone, this is all on the driver?
>>>>
>>>> I shared a screenshot of the plan in the first email. In the collect()
>>>> case the data gets fetched to the driver without problems.
>>>>
>>>> Best regards,
>>>>
>>>>
>>>> Sergey Ivanychev
>>>>
>>>> On 4 Nov 2021, at 20:37, Mich Talebzadeh <mich.talebza...@gmail.com>
>>>> wrote:
>>>>
>>>> Just to confirm, with collect() alone, this is all on the driver?
>>>>
>>