Re: Spark Union performance issue

2023-02-22 Thread Prem Sahoo
Please see inline comments
So you union two tables, union the result with another one, and finally
with a last one?
First union of 2 tables = Result1
Second union of another 2 tables = Result2
Third union: Result1 union Result2 = finalResult
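The shape described above can be sketched as follows; the table names and the SparkSession setup are hypothetical placeholders, not from the actual job:

```scala
import org.apache.spark.sql.SparkSession

// Hypothetical stand-ins for the real ~700-column input tables.
val spark = SparkSession.builder().appName("union-shape").getOrCreate()
val t1 = spark.table("t1")
val t2 = spark.table("t2")
val t3 = spark.table("t3")
val t4 = spark.table("t4")

val result1 = t1.union(t2)               // first union of 2 tables
val result2 = t3.union(t4)               // second union of another 2 tables
val finalResult = result1.union(result2) // third union of the two results
```

Note that Catalyst normally flattens nested unions into a single Union node (the CombineUnions optimizer rule), so the nesting itself is usually not the expensive part; analyzing very wide schemas repeatedly can be.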

How many columns do all these tables have?

Each has around 700 columns.

Are you sure creating the plan depends on the number of rows?
Yes, the explain plan keeps growing along with the metadata ...



Re: Spark Union performance issue

2023-02-22 Thread Zhiyuan Lin
Hi Spark devs,

I'm experiencing a Union performance degradation as well. Since this email
thread is closely related, I'm posting here to see if anyone has insights.

*Background*:
After upgrading a Spark job from Spark 2.4 to Spark 3.1 without any code
change, we saw a *big performance degradation* (5-8 times slower). After
enabling DEBUG logging, we found the Spark 3.1 job spends significantly
longer fetching file metadata from S3: Spark 2.4 fetches all metadata in
about 5 minutes, while Spark 3.1 needs about 40 minutes.
*Findings*:
After closely monitoring *thread dumps* for both Spark versions, we found
the reason for the slowness: Spark 2.4 has *a pool of 8 threads*
(spawned from the driver main thread) to do the metadata fetch, whereas
Spark 3.1 only has the dag-scheduler-event-loop thread doing the work.
More details: Spark 2.4 calls 2.4/core/rdd/UnionRDD.scala#L76 to get
partition metadata, which triggers the thread pool, whereas Spark 3.1
calls 3.1/core/rdd/UnionRDD.scala#L94, and the thread pool is not
triggered.

Any insights into why this happens, or how to fix the performance issue,
are welcome. In the meantime, I will investigate further to see if I can
fix the root cause.
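For context: UnionRDD only uses its parallel partition-listing thread pool when the number of unioned RDDs exceeds `spark.rdd.parallelListingThreshold` (default 10). Purely as an assumption on my part, one thing worth checking is whether the Spark 3.x job still reaches that code path, and whether lowering the threshold re-enables the pool:

```scala
import org.apache.spark.sql.SparkSession

// Sketch only: lower the threshold so UnionRDD's getPartitions uses its
// thread pool even for a small number of unioned RDDs. Whether the
// Spark 3.x job actually goes through UnionRDD at all is exactly the
// open question in this thread.
val spark = SparkSession.builder()
  .appName("union-parallel-listing")
  .config("spark.rdd.parallelListingThreshold", "2")
  .getOrCreate()
```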

Thanks,
Zoe



Re: Spark Union performance issue

2023-02-22 Thread Enrico Minack
So you union two tables, union the result with another one, and finally 
with a last one?


How many columns do all these tables have?

Are you sure creating the plan depends on the number of rows?

Enrico



Re: Spark Union performance issue

2023-02-22 Thread Prem Sahoo
Here is the missing information:
1. Spark 3.2.0
2. It is Scala-based.
3. The size of the tables will be ~60 GB.
4. The Catalyst explain plan shows a lot of time being spent creating the plan.
5. The number of unioned tables is 2, then another 2, then finally the 2 results.

The slowness in producing results grows with both the data size and the column count.
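If plan construction itself is the bottleneck (as point 4 suggests), one mitigation sometimes used, offered here only as a hedged sketch, is to truncate the logical plan lineage before the final union, e.g. with `localCheckpoint`, so Catalyst does not re-analyze the full 700-column upstream plans at every step:

```scala
import org.apache.spark.sql.DataFrame

// Sketch: truncate lineage so the final union starts from short plans.
// localCheckpoint(eager = true) materializes the data on the executors
// and replaces the logical plan with a leaf node; it trades storage for
// analysis time, so it only helps when planning dominates the runtime.
def truncated(df: DataFrame): DataFrame = df.localCheckpoint(true)

// e.g. val finalResult = truncated(result1).union(truncated(result2))
```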



Re: Spark Union performance issue

2023-02-22 Thread Enrico Minack
Plus, the number of unioned tables would be helpful, as well as which
downstream operations are performed on the unioned result.

And what "performance issues" do you measure, exactly?

Enrico





Re: Spark Union performance issue

2023-02-22 Thread Mich Talebzadeh
Hi,

A few details will help:

   1. Spark version
   2. Spark SQL, Scala or PySpark
   3. Size of the tables in the join
   4. What does explain() on the join operation show?


HTH


 https://en.everybodywiki.com/Mich_Talebzadeh



*Disclaimer:* Use it at your own risk. Any and all responsibility for any
loss, damage or destruction of data or any other property which may arise
from relying on this email's technical content is explicitly disclaimed.
The author will in no case be liable for any monetary damages arising from
such loss, damage or destruction.






Spark Union performance issue

2023-02-22 Thread Prem Sahoo
Hello Team,
We are observing Spark Union performance issues when unioning big tables
with lots of rows. Do we have any alternative to Union?
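One alternative sometimes suggested, sketched here under the assumption that all inputs share an identical schema, is to combine many inputs at once rather than step by step, either at the DataFrame level or at the RDD level:

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// (1) DataFrame level: reduce over union; Catalyst's CombineUnions rule
//     normally flattens the nested unions into one Union node.
def unionAll(dfs: Seq[DataFrame]): DataFrame = dfs.reduce(_ union _)

// (2) RDD level: SparkContext.union builds a single UnionRDD directly,
//     bypassing Catalyst analysis entirely (but also its optimizations).
def unionRdds(spark: SparkSession, dfs: Seq[DataFrame]): DataFrame = {
  val schema = dfs.head.schema
  spark.createDataFrame(spark.sparkContext.union(dfs.map(_.rdd)), schema)
}
```

Whether (2) actually helps depends on where the time goes; it avoids repeated analysis of wide plans, at the cost of losing columnar optimizations.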