[jira] [Commented] (SPARK-37391) SIGNIFICANT bottleneck introduced by fix for SPARK-32001

Danny Guinther (Jira) Mon, 22 Nov 2021 08:20:06 -0800


    [ 
https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447512#comment-17447512
 ]


Danny Guinther commented on SPARK-37391:
----------------------------------------

[~hyukjin.kwon], sorry, I got confused when identifying the source of the 
regression. I have updated the title and description to reflect the true 
source of the issue. I'm inclined to blame this change: 
https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58

 

I'm sorry, but I don't have the capacity to provide a self-contained 
reproduction of the issue. Hopefully the problem is obvious enough that you 
will be able to see what is going on from the anecdotal evidence I can provide.

The introduction of SecurityConfigurationLock.synchronized prevents a given 
JDBC Driver from establishing more than one connection at a time (or at least 
severely limits the concurrency). This is a significant bottleneck for 
applications that use a single JDBC driver to establish many database 
connections.
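
To illustrate the effect described above, here is a minimal simulation. It is 
not the Spark code: it is a Python sketch that assumes each connection 
handshake takes a fixed amount of time, and the names 
security_configuration_lock, open_connection, and time_connections are 
hypothetical stand-ins. It only shows how a single process-wide lock turns N 
concurrent connection attempts into N sequential ones.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a single process-wide lock that every
# connection attempt must hold (an assumption, not the Spark source).
security_configuration_lock = threading.Lock()


def open_connection(delay, use_global_lock):
    """Simulate one connection handshake that takes `delay` seconds."""
    if use_global_lock:
        # Only one thread at a time can be inside this block.
        with security_configuration_lock:
            time.sleep(delay)
    else:
        time.sleep(delay)


def time_connections(n, delay, use_global_lock):
    """Return wall-clock seconds needed to open `n` connections from `n` threads."""
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n) as pool:
        futures = [
            pool.submit(open_connection, delay, use_global_lock)
            for _ in range(n)
        ]
        for future in futures:
            future.result()  # re-raise any worker exception
    return time.perf_counter() - start


if __name__ == "__main__":
    n, delay = 8, 0.05
    print(f"with global lock:    {time_connections(n, delay, True):.2f}s")
    print(f"without global lock: {time_connections(n, delay, False):.2f}s")
```

With the lock held for the whole handshake, the total time grows roughly as 
n * delay; without it, the attempts overlap and the total stays close to a 
single delay. Real handshakes vary in duration, so actual numbers will differ.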

The anecdotal evidence I can offer to support this claim:

1. I've attached a screenshot of some dashboards we use to monitor the QA 
deployment of the application in question. The graphs cover a 4.5-hour 
window during which I had Spark 3.1.2 deployed to QA. On the left side of the 
graphs we were running Spark 2.4.5; in the middle, Spark 3.1.2; and on the 
right side, Spark 3.0.1.
 # The "Success Rate", "CountActiveTasks", "CountActiveJobs", 
"CountTableTenantJobStart", and "CountTableTenantJobEnd" graphs all show that 
the application's throughput dropped significantly across the board once 
Spark 3.1.2 was deployed.
 # The "Overall Active Thread Count", "Count Active Executors", and 
"CountDeadExecutors" graphs show that there was no change in the resources 
allocated to do the work.
 # The "Max MinsSinceLastAttempt" graph is normally a flat line unless 
the application is falling behind on its scheduled work. During the Spark 
3.1.2 deployment the application falls behind at a linear rate, and it begins 
to recover once Spark 3.0.1 is deployed.

!spark-regression-dashes.jpg!

 

2. I've attached a screenshot of the thread dump from the Spark driver process. 
It shows many threads blocked waiting for SecurityConfigurationLock. The 
screenshot only shows a handful of threads, but there are 98 threads in total 
blocked waiting for the SecurityConfigurationLock.

!so-much-blocking.jpg!

 

It's worth noting that our QA deployment does significantly less work than our 
production deployment; if the QA deployment can't keep up, then the production 
deployment has no chance. On the bright side, I had success updating the 
production deployment to Spark 3.0.1, and that seems to be stable. 
Unfortunately, we use Databricks as our Spark vendor, and their LTS release 
that supports Spark 3.0.1 is only scheduled to be maintained until 
September 2022, so we can't avoid this regression forever.

 

If I can answer any questions or provide any more info, please let me know. 
Thanks in advance!

 

> SIGNIFICANT bottleneck introduced by fix for SPARK-32001
> --------------------------------------------------------
>
>                 Key: SPARK-37391
>                 URL: https://issues.apache.org/jira/browse/SPARK-37391
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
>         Environment: N/A
>            Reporter: Danny Guinther
>            Priority: Major
>         Attachments: so-much-blocking.jpg, spark-regression-dashes.jpg
>
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 ( 
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
>  ) does not seem to have considered that some apps may rely on 
> being able to establish many JDBC connections simultaneously for performance 
> reasons.
> The fix forces concurrency to 1 when establishing database connections and 
> that strikes me as a *significant* user impacting change and a *significant* 
> bottleneck.
> Can anyone propose a workaround for this? I have an app that makes 
> connections to thousands of databases and I can't upgrade to any version 
> >= 3.1.0 because of this significant bottleneck.
>  
> Thanks in advance for your help!



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
