Hi Spark devs,

I'm experiencing a Union performance degradation as well. Since this email
thread is closely related, I'm posting here to see if anyone has any insights.

*Background*:
After upgrading a Spark job from Spark 2.4 to Spark 3.1 without any code
change, we saw a *big performance degradation* (5-8 times slower). After
enabling DEBUG logging, we found that the Spark 3.1 job takes significantly
longer to fetch file metadata from S3: e.g., Spark 2.4 fetches all the
metadata in 5 minutes, while Spark 3.1 needs 40 minutes.

*Findings*:
After closely monitoring *thread dumps* for both Spark versions, we found
that the reason for the slowness is that Spark 2.4 uses *a pool of 8
threads* (spawned from the driver main thread) to fetch the metadata,
whereas Spark 3.1 only has the dag-scheduler-event-loop thread doing the
work.
More details: Spark 2.4 calls 2.4/core/rdd/UnionRDD.scala#L76
<https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala#L76>
to get partition metadata, which triggers the thread pool here
<https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala#L62>,
whereas Spark 3.1 calls 3.1/core/rdd/UnionRDD.scala#L94
<https://github.com/apache/spark/blob/branch-3.1/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala#L94>,
and the thread pool is not triggered.
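
For reference, here is a condensed sketch of the 2.4-style parallel listing,
paraphrased from the linked UnionRDD source (identifiers simplified; treat
this as a sketch, not the verbatim Spark code):

    import java.util.concurrent.ForkJoinPool
    import scala.collection.parallel.ForkJoinTaskSupport
    import org.apache.spark.rdd.RDD

    // When more RDDs are unioned than spark.rdd.parallelListingThreshold
    // (default 10), partition listing runs on a ForkJoinPool of 8 threads,
    // so each child RDD's getPartitions (and its S3 listing) runs
    // concurrently; otherwise the RDDs are walked sequentially.
    def listPartitionCounts(rdds: Seq[RDD[_]], threshold: Int = 10): Seq[Int] =
      if (rdds.length > threshold) {
        val par = rdds.par  // Scala 2.12 parallel collection
        par.tasksupport = new ForkJoinTaskSupport(new ForkJoinPool(8))
        par.map(_.partitions.length).seq
      } else {
        rdds.map(_.partitions.length)
      }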

Any insight into why this happens, or how to fix the performance issue, is
welcome. In the meantime, I will keep investigating to see if I can fix the
root cause.
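
One workaround I am considering (an untested sketch, not a confirmed fix):
RDD.partitions caches its result, so forcing it on our own thread pool
before the union should let the dag-scheduler-event-loop thread find the
partition metadata already computed. Here `inputRdds` stands for the RDDs
being unioned:

    import java.util.concurrent.Executors
    import scala.concurrent.{Await, ExecutionContext, Future}
    import scala.concurrent.duration.Duration
    import org.apache.spark.rdd.RDD

    // Pre-compute each input RDD's partitions in parallel on the driver,
    // so the S3 metadata fetch no longer serializes on the scheduler thread.
    def warmPartitions(inputRdds: Seq[RDD[_]]): Unit = {
      val pool = Executors.newFixedThreadPool(8)
      implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
      try {
        val futures = inputRdds.map(rdd => Future { rdd.partitions.length })
        Await.result(Future.sequence(futures), Duration.Inf)
      } finally {
        pool.shutdown()
      }
    }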

Thanks,
Zoe

On Wed, Feb 22, 2023 at 12:24 PM Enrico Minack <i...@enrico.minack.dev>
wrote:

> So you union two tables, union the result with another one, and finally
> with a last one?
>
> How many columns do all these tables have?
>
> Are you sure creating the plan depends on the number of rows?
>
> Enrico
>
>
> On 22.02.23 at 19:08, Prem Sahoo wrote:
>
> here is the missing information:
> 1. Spark 3.2.0
> 2. it is Scala based
> 3. the size of the tables is ~60 GB
> 4. the Catalyst explain plan shows that a lot of time is spent in
> creating the plan
> 5. the number of unioned tables is 2, then another 2, and finally 2
>
> The slowness in producing the result grows as the data size and the
> number of columns increase; a sketch of the shape follows below.
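>
> For clarity, one possible reading of that shape as a minimal sketch
> (df1..df6 are hypothetical stand-ins for the six tables):
>
>     val pair1 = df1.union(df2)
>     val pair2 = df3.union(df4)
>     val pair3 = df5.union(df6)
>     val result = pair1.union(pair2).union(pair3)
>     result.explain(true) // reportedly where most of the time goes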
>
> On Wed, Feb 22, 2023 at 11:07 AM Enrico Minack <i...@enrico.minack.dev>
> wrote:
>
>> Plus number of unioned tables would be helpful, as well as which
>> downstream operations are performed on the unioned tables.
>>
>> And what "performance issues" exactly do you measure?
>>
>> Enrico
>>
>>
>>
>> On 22.02.23 at 16:50, Mich Talebzadeh wrote:
>>
>> Hi,
>>
>> A few details will help:
>>
>>    1. Spark version
>>    2. Spark SQL, Scala or PySpark
>>    3. size of tables in join.
>>    4. What does explain() or the joining operation show?
>>
>>
>> HTH
>>
>> On Wed, 22 Feb 2023 at 15:42, Prem Sahoo <prem.re...@gmail.com> wrote:
>>
>>> Hello Team,
>>> We are observing Spark Union performance issues when unioning big tables
>>> with lots of rows. Do we have any option apart from Union?
>>>
>>
>>
>
