[ https://issues.apache.org/jira/browse/SPARK-37391?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17447512#comment-17447512 ]
Danny Guinther commented on SPARK-37391:
----------------------------------------

[~hyukjin.kwon], sorry, I seem to have gotten confused when identifying the source of the regression. I have updated the title and description to reflect the true source of the issue. I'm inclined to blame this change: https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58

I'm sorry, but I don't have the capacity to provide a self-contained reproduction of the issue. Hopefully the problem is obvious enough that you will be able to see what is going on from the anecdotal evidence I can provide.

The introduction of SecurityConfigurationLock.synchronized prevents a given JDBC driver from establishing more than one connection at a time (or at least severely limits the concurrency). This is a significant bottleneck for applications that use a single JDBC driver to establish many database connections.

The anecdotal evidence I can offer to support this claim:

1. I've attached a screenshot of some dashboards we use to monitor the QA deployment of the application in question. These graphs come from a 4.5 hour window where I had Spark 3.1.2 deployed to QA. On the left side of the graph we were running Spark 2.4.5; in the middle we were running Spark 3.1.2; and on the right side of the graph we are running Spark 3.0.1.
# The "Success Rate", "CountActiveTasks", "CountActiveJobs", "CountTableTenantJobStart", and "CountTableTenantJobEnd" graphs all aim to demonstrate that with the deployment of Spark 3.1.2 the throughput of the application was significantly reduced across the board.
# The "Overall Active Thread Count", "Count Active Executors", and "CountDeadExecutors" graphs all aim to show that there was no change in the number of resources allocated to do work.
# The "Max MinsSinceLastAttempt" graph should normally be a flat line unless the application is falling behind on the work that it is scheduled to do.
It can be seen that during the period of the Spark 3.1.2 deployment the application is falling behind at a linear rate and begins to recover once Spark 3.0.1 is deployed.

!spark-regression-dashes.jpg!

2. I've attached a screenshot of the thread dump from the Spark driver process. It can be seen that many, many threads are blocked waiting for SecurityConfigurationLock. The screenshot only shows a handful of threads, but there are 98 threads in total blocked waiting for the SecurityConfigurationLock.

!so-much-blocking.jpg!

It's worth noting that our QA deployment does significantly less work than our production deployment; if the QA deployment can't keep up then the production deployment has no chance.

On the bright side, I had success updating the production deployment to Spark 3.0.1 and that seems to be stable. Unfortunately, we use Databricks as our Spark vendor and the LTS release they have that supports Spark 3.0.1 is only scheduled to be maintained until September 2022, so we can't avoid this regression forever.

If I can answer any questions or provide any more info, please let me know. Thanks in advance!

> SIGNIFICANT bottleneck introduced by fix for SPARK-32001
> --------------------------------------------------------
>
>                 Key: SPARK-37391
>                 URL: https://issues.apache.org/jira/browse/SPARK-37391
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 3.1.0, 3.1.1, 3.1.2, 3.2.0
>         Environment: N/A
>            Reporter: Danny Guinther
>            Priority: Major
>         Attachments: so-much-blocking.jpg, spark-regression-dashes.jpg
>
>
> The fix for https://issues.apache.org/jira/browse/SPARK-32001 (
> [https://github.com/apache/spark/pull/29024/files#diff-345beef18081272d77d91eeca2d9b5534ff6e642245352f40f4e9c9b8922b085R58]
> ) does not seem to have considered the reality that some apps may rely on
> being able to establish many JDBC connections simultaneously for performance
> reasons.
> The fix forces concurrency to 1 when establishing database connections and
> that strikes me as a *significant* user-impacting change and a *significant*
> bottleneck.
>
> Can anyone propose a workaround for this? I have an app that makes
> connections to thousands of databases and I can't upgrade to Spark 3.1.x
> or later because of this significant bottleneck.
>
> Thanks in advance for your help!
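
For readers who want to see the claimed mechanism in isolation, here is a minimal Python sketch of how a single process-wide lock around connection setup caps concurrency at 1 regardless of how many worker threads exist. It is not Spark's code and is not a reproduction of this issue: `GLOBAL_LOCK` is only a hypothetical stand-in for something like SecurityConfigurationLock, `fake_connect` simulates connection latency with a sleep, and `ConcurrencyProbe` and `run` are illustrative helpers invented for this example.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

# Hypothetical stand-in for a process-wide lock such as SecurityConfigurationLock.
GLOBAL_LOCK = threading.Lock()


class ConcurrencyProbe:
    """Records the peak number of simulated connection attempts in flight."""

    def __init__(self):
        self._guard = threading.Lock()
        self.in_flight = 0
        self.peak = 0

    def __enter__(self):
        with self._guard:
            self.in_flight += 1
            self.peak = max(self.peak, self.in_flight)
        return self

    def __exit__(self, *exc):
        with self._guard:
            self.in_flight -= 1
        return False


def fake_connect(probe, use_global_lock, latency_s=0.05):
    """Simulate opening one connection; the sleep stands in for network latency."""

    def handshake():
        with probe:
            time.sleep(latency_s)

    if use_global_lock:
        # Every caller queues on the same lock, so handshakes run one at a time.
        with GLOBAL_LOCK:
            handshake()
    else:
        handshake()


def run(n_connections, n_threads, use_global_lock):
    """Open n_connections from a thread pool; return (peak concurrency, seconds)."""
    probe = ConcurrencyProbe()
    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=n_threads) as pool:
        futures = [
            pool.submit(fake_connect, probe, use_global_lock)
            for _ in range(n_connections)
        ]
        for future in futures:
            future.result()  # re-raise any worker exception
    return probe.peak, time.perf_counter() - start


if __name__ == "__main__":
    print("with global lock:   ", run(8, 8, use_global_lock=True))
    print("without global lock:", run(8, 8, use_global_lock=False))
```

With the lock enabled, the peak concurrency is 1 by construction and the elapsed time grows roughly linearly with the number of connections. Without it, the peak is limited only by the pool size. This is the same shape of behavior the thread dump suggests, with many threads parked on one monitor, though the actual cost in Spark depends on how long each real connection attempt holds the lock.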