[jira] [Updated] (SOLR-15093) Heavy lock contention during collection creation

Mike Drob (Jira) Wed, 20 Jan 2021 14:49:04 -0800


     [ https://issues.apache.org/jira/browse/SOLR-15093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Mike Drob updated SOLR-15093:
-----------------------------
    Description: 
I was doing some lock analysis and found quite a bit of contention on 
{{ZkStateReader$LazyCollectionRef.get(boolean)}} during heavy collection 
creation. I ran a sample workload creating as many collections as I could in 10 
minutes, and this method was blocked for about 1:30 of that (roughly 15% of the 
run), which is a significant portion.

A few representative stack traces:

{noformat}
org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
org.apache.solr.cloud.ZkController.checkIfCoreNodeNameAlreadyExists(CoreDescriptor)
org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
{noformat}

And another:

{noformat}
org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
org.apache.solr.common.cloud.ZkStateReader.getCollection(String)
org.apache.solr.cloud.ZkController.publish(CoreDescriptor, Replica$State, boolean, boolean)
org.apache.solr.cloud.ZkController.preRegister(CoreDescriptor, boolean)
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreDescriptor, boolean, boolean)
org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
{noformat}

And one more:

{noformat}
org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
org.apache.solr.common.cloud.ZkStateReader.registerDocCollectionWatcher(String, DocCollectionWatcher)
org.apache.solr.common.cloud.ZkStateReader.waitForState(String, long, TimeUnit, Predicate)
org.apache.solr.cloud.ZkController.checkStateInZk(CoreDescriptor)
org.apache.solr.cloud.ZkController.preRegister(CoreDescriptor, boolean)
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreDescriptor, boolean, boolean)
org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
{noformat}

It looks like part of the problem is that these call paths never allow 
themselves to use the cache, so each call ends up being a full fetch out to ZK. 
We have optimizations there to compare the stat and the version, but it still 
appears to be relatively heavyweight.
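
To make the pattern concrete, here is a minimal sketch of a lazily loaded reference whose {{get(boolean)}} is synchronized. It is not Solr's actual code: {{LazyRefSketch}}, {{FakeStore}}, {{Snapshot}} and the TTL parameter are hypothetical names, and {{FakeStore}} only stands in for ZK by counting round trips. It assumes that a caller passing {{false}} always pays at least a version check while holding the lock.

```java
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicInteger;

/** Simplified model of a lazily loaded reference. Hypothetical, not Solr's implementation. */
class LazyRefSketch {

    /** Immutable value plus the version it was written at. */
    static final class Snapshot {
        final int version;
        final String data;
        Snapshot(int version, String data) { this.version = version; this.data = data; }
    }

    /** Hypothetical stand-in for ZK: one versioned value, with counters for remote calls. */
    static final class FakeStore {
        private volatile Snapshot current = new Snapshot(0, "state-v0");
        final AtomicInteger versionChecks = new AtomicInteger();
        final AtomicInteger fullFetches = new AtomicInteger();

        /** Cheap call: returns only the version, like a stat lookup. */
        int statVersion() { versionChecks.incrementAndGet(); return current.version; }

        /** Expensive call: returns the full payload. */
        Snapshot fetch() { fullFetches.incrementAndGet(); return current; }

        void update(String newData) { current = new Snapshot(current.version + 1, newData); }
    }

    private final FakeStore store;
    private final long cacheTtlNanos;
    private Snapshot cached;            // guarded by this
    private long lastUpdateNanos = -1;  // guarded by this

    LazyRefSketch(FakeStore store, long cacheTtlNanos) {
        this.store = store;
        this.cacheTtlNanos = cacheTtlNanos;
    }

    /** Every caller serializes on this monitor, including the ones that go remote. */
    synchronized String get(boolean allowCached) {
        boolean fresh = lastUpdateNanos >= 0
                && System.nanoTime() - lastUpdateNanos <= cacheTtlNanos;
        if (allowCached && fresh) {
            return cached.data; // cache hit: no remote call while holding the lock
        }
        // Remote work happens while the lock is held, so other callers block here.
        if (cached == null || store.statVersion() != cached.version) {
            cached = store.fetch(); // version changed or nothing cached: full fetch
        }
        lastUpdateNanos = System.nanoTime();
        return cached.data;
    }

    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        LazyRefSketch ref = new LazyRefSketch(store, TimeUnit.SECONDS.toNanos(60));
        ref.get(false); // nothing cached yet: full fetch
        ref.get(false); // version check only, payload reused
        ref.get(true);  // served from cache, no remote call
        System.out.println("versionChecks=" + store.versionChecks.get()
                + " fullFetches=" + store.fullFetches.get());
    }
}
```

In this model, a caller that passes {{allowCached=false}} holds the monitor for at least one remote round trip, even when the version check lets it skip the full fetch. With many threads creating cores at once, those round trips queue up behind each other, which would be consistent with the blocked frames in the traces above.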

cc: [~noble.paul], you might find this interesting. 


> Heavy lock contention during collection creation
> ------------------------------------------------
>
>                 Key: SOLR-15093
>                 URL: https://issues.apache.org/jira/browse/SOLR-15093
>             Project: Solr
>          Issue Type: Task
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Mike Drob
>            Priority: Major



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
