Hyukjin, can you weigh in? Is this exchange due to something in your operations, or is it clearly unique to the toPandas() operation? I didn't think it worked that way, but maybe there is some good reason it does.
On Fri, Nov 12, 2021 at 7:34 AM Sergey Ivanychev <sergeyivanyc...@gmail.com> wrote:
> Hi Sean,
>
> According to the plan I'm observing, this is indeed what happens. There's an exchange operation that sends data to a single partition/task in the toPandas() + PyArrow-enabled case.
>
> On Nov 12, 2021, at 16:31, Sean Owen <sro...@gmail.com> wrote:
>
> Yes, none of the responses are addressing your question.
> I do not think it's necessarily a bug; do you end up with one partition somewhere in your execution?
>
> On Fri, Nov 12, 2021 at 3:38 AM Sergey Ivanychev <sergeyivanyc...@gmail.com> wrote:
>
>> Of course if I give 64G of RAM to each executor they will work. But what's the point? Collecting results on the driver should cause high RAM usage in the driver, and that's what happens in the collect() case. In the case where PyArrow serialization is enabled, all the data is collected on a single executor, which is clearly the wrong way to collect the result on the driver.
>>
>> I guess I'll open an issue about it in the Spark Jira. It clearly looks like a bug.
>>
>> On Nov 12, 2021, at 11:59, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>
>> OK, your findings do not imply those settings are incorrect. Those settings will work if you set up your k8s cluster in peer-to-peer mode with equal amounts of RAM for each node, which is common practice.
>>
>> HTH
>>
>> view my Linkedin profile <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for any loss, damage or destruction of data or any other property which may arise from relying on this email's technical content is explicitly disclaimed. The author will in no case be liable for any monetary damages arising from such loss, damage or destruction.
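[For context, the Arrow code path under discussion is gated by a session config. A minimal sketch of toggling it, assuming an existing SparkSession named `spark` and a DataFrame `df` (both illustrative, not from the thread):]

```python
# Illustrative sketch; assumes a running SparkSession `spark` and DataFrame `df`.
# With Arrow disabled, toPandas() collects rows to the driver and builds the
# pandas frame there; with it enabled, data is transferred as Arrow batches.
spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "false")
pdf_plain = df.toPandas()

spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
pdf_arrow = df.toPandas()

# Inspect the physical plan for an unexpected single-partition exchange:
df.explain()
```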
>> On Thu, 11 Nov 2021 at 21:39, Sergey Ivanychev <sergeyivanyc...@gmail.com> wrote:
>>
>>> Yes, in fact those are the settings that cause this behaviour. If set to false, everything works fine, since the implementation in the Spark sources in this case is
>>>
>>> pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
>>>
>>> Best regards,
>>> Sergey Ivanychev
>>>
>>> On Nov 11, 2021, at 13:58, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>> Have you tried the following settings:
>>>
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
>>> spark.conf.set("spark.sql.execution.arrow.pyspark.fallback.enabled", "true")
>>>
>>> HTH
>>>
>>> On Thu, 4 Nov 2021 at 18:06, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>
>>>> OK, so it boils down to how Spark creates the toPandas() DataFrame under the bonnet, and how many executors are involved in the k8s cluster. In this model Spark will create executors = number of nodes - 1.
>>>>
>>>> On Thu, 4 Nov 2021 at 17:42, Sergey Ivanychev <sergeyivanyc...@gmail.com> wrote:
>>>>
>>>>> > Just to confirm, with collect() alone, this is all on the driver?
>>>>>
>>>>> I shared the screenshot with the plan in the first email. In the collect() case the data gets fetched to the driver without problems.
>>>>>
>>>>> Best regards,
>>>>> Sergey Ivanychev
>>>>>
>>>>> On Nov 4, 2021, at 20:37, Mich Talebzadeh <mich.talebza...@gmail.com> wrote:
>>>>>
>>>>> Just to confirm, with collect() alone, this is all on the driver?
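[For reference, the non-Arrow branch that Sergey quotes builds the pandas DataFrame on the driver from the collected rows. A standalone sketch of that conversion, using plain tuples in place of Spark Row objects (an assumption for illustration, so the snippet runs without Spark; the column names are invented):]

```python
import pandas as pd

# Stand-ins for the result of DataFrame.collect(): each tuple plays the role
# of a pyspark.sql.Row, so this sketch runs without a Spark cluster.
collected_rows = [(1, "a"), (2, "b"), (3, "c")]
columns = ["id", "value"]

# Mirrors the non-Arrow branch of toPandas() quoted above:
#   pdf = pd.DataFrame.from_records(self.collect(), columns=self.columns)
pdf = pd.DataFrame.from_records(collected_rows, columns=columns)
```

[Note that this path materializes every row on the driver, which is why collect()-style memory pressure lands on the driver rather than on any single executor.]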