Hi Spark devs,

I'm experiencing a Union performance degradation as well. Since this email thread is closely related, I'm posting here to see if anyone has any insights.
*Background*: After upgrading a Spark job from Spark 2.4 to Spark 3.1 without any code change, we saw a *big performance degradation* (5-8 times slower). After enabling DEBUG logging, we found that the Spark 3.1 job spends significantly longer fetching file metadata from S3: Spark 2.4 takes 5 minutes to fetch all metadata, while Spark 3.1 needs 40 minutes.

*Findings*: After closely monitoring *thread dumps* for both Spark versions, we found the cause of the slowness: Spark 2.4 uses *a pool of 8 threads* (spawned from the driver main thread) to do the metadata fetch, whereas Spark 3.1 only has the dag-scheduler-event-loop thread doing the work.

More details: Spark 2.4 calls 2.4/core/rdd/UnionRDD.scala#L76 <https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala#L76> to get partition metadata, which triggers the thread pool here <https://github.com/apache/spark/blob/branch-2.4/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala#L62>, whereas Spark 3.1 calls 3.1/core/rdd/UnionRDD.scala#L94 <https://github.com/apache/spark/blob/branch-3.1/core/src/main/scala/org/apache/spark/rdd/UnionRDD.scala#L94>, and the thread pool is not triggered.

Any insights into why this happens, or how to fix the performance issue, are welcome. In the meantime, I will investigate further to see if I can fix the root cause.

Thanks,
Zoe

On Wed, Feb 22, 2023 at 12:24 PM Enrico Minack <i...@enrico.minack.dev> wrote:

> So you union two tables, union the result with another one, and finally
> with a last one?
>
> How many columns do all these tables have?
>
> Are you sure creating the plan depends on the number of rows?
>
> Enrico
>
>
> On 22.02.23 at 19:08, Prem Sahoo wrote:
>
> Here is the missing information:
> 1. Spark 3.2.0
> 2. It is Scala based
> 3. The size of the tables will be ~60G
> 4. The explain plan for Catalyst shows lots of time being spent in
> creating the plan
> 5. The number of unioned tables is 2, and another 2, then finally 2
>
> The slowness in producing the result grows as the data size and number
> of columns increase.
>
> On Wed, Feb 22, 2023 at 11:07 AM Enrico Minack <i...@enrico.minack.dev>
> wrote:
>
>> Plus the number of unioned tables would be helpful, as well as which
>> downstream operations are performed on the unioned tables.
>>
>> And what "performance issues" do you exactly measure?
>>
>> Enrico
>>
>>
>> On 22.02.23 at 16:50, Mich Talebzadeh wrote:
>>
>> Hi,
>>
>> A few details will help:
>>
>> 1. Spark version
>> 2. Spark SQL, Scala or PySpark
>> 3. Size of the tables in the join
>> 4. What does explain() on the joining operation show?
>>
>> HTH
>>
>> view my Linkedin profile
>> <https://www.linkedin.com/in/mich-talebzadeh-ph-d-5205b2/>
>>
>> https://en.everybodywiki.com/Mich_Talebzadeh
>>
>> *Disclaimer:* Use it at your own risk. Any and all responsibility for
>> any loss, damage or destruction of data or any other property which may
>> arise from relying on this email's technical content is explicitly
>> disclaimed. The author will in no case be liable for any monetary damages
>> arising from such loss, damage or destruction.
>>
>> On Wed, 22 Feb 2023 at 15:42, Prem Sahoo <prem.re...@gmail.com> wrote:
>>
>>> Hello Team,
>>> We are observing Spark Union performance issues when unioning big tables
>>> with lots of rows. Do we have any option apart from Union?
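P.S. For anyone hitting the same thing: if I'm reading the 2.4 code right, UnionRDD only switches to the parallel path when the number of child RDDs exceeds `spark.rdd.parallelListingThreshold` (default 10), and it then evaluates `getPartitions` on a small fixed ForkJoinPool (the 8 threads we saw in the thread dumps). Here is a Spark-free sketch of that pattern, just to illustrate the mechanism; the names `fetchPartitionCount` and `listAll` are mine, and the stand-in work is trivial, whereas in our job it is the expensive S3 metadata fetch:

```scala
import java.util.concurrent.Executors
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object ParallelListingSketch {
  // Hypothetical stand-in for the per-RDD S3 metadata fetch.
  def fetchPartitionCount(path: String): Int = path.length

  // Mirrors the UnionRDD pattern: serial below the threshold, otherwise
  // fan the per-child work out over a small fixed thread pool.
  def listAll(paths: Seq[String], threshold: Int = 10, threads: Int = 8): Seq[Int] = {
    if (paths.length <= threshold) {
      paths.map(fetchPartitionCount)                   // serial branch
    } else {
      val pool = Executors.newFixedThreadPool(threads) // parallel branch
      implicit val ec: ExecutionContext = ExecutionContext.fromExecutor(pool)
      try {
        val futures = paths.map(p => Future(fetchPartitionCount(p)))
        Await.result(Future.sequence(futures), 10.minutes)
      } finally pool.shutdown()
    }
  }
}
```

If that reading is correct, the interesting question is why the parallel branch is no longer taken in 3.1 when `getPartitions` is driven from the dag-scheduler-event-loop thread.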