[ https://issues.apache.org/jira/browse/SOLR-15093?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Mike Drob updated SOLR-15093:
-----------------------------
    Description:
I was doing some lock analysis and found that we have quite a bit of contention on {{ZkStateReader$LazyCollectionRef.get(boolean)}} during heavy collection creation. I ran a sample workload creating as many collections as I could in 10 minutes, and this method was blocked for about 1:30 of that, which is a pretty significant portion.

A few representative stack traces:
{noformat}
org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
org.apache.solr.cloud.ZkController.checkIfCoreNodeNameAlreadyExists(CoreDescriptor)
org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
{noformat}
And another:
{noformat}
org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
org.apache.solr.common.cloud.ZkStateReader.getCollection(String)
org.apache.solr.cloud.ZkController.publish(CoreDescriptor, Replica$State, boolean, boolean)
org.apache.solr.cloud.ZkController.preRegister(CoreDescriptor, boolean)
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreDescriptor, boolean, boolean)
org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
{noformat}
And one more:
{noformat}
org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
org.apache.solr.common.cloud.ZkStateReader.registerDocCollectionWatcher(String, DocCollectionWatcher)
org.apache.solr.common.cloud.ZkStateReader.waitForState(String, long, TimeUnit, Predicate)
org.apache.solr.cloud.ZkController.checkStateInZk(CoreDescriptor)
org.apache.solr.cloud.ZkController.preRegister(CoreDescriptor, boolean)
org.apache.solr.core.CoreContainer.createFromDescriptor(CoreDescriptor, boolean, boolean)
org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
{noformat}
It looks like part of the problem is that we never allow ourselves to use the cache, so each call ends up being a full fetch out to ZK. We have the optimizations there to compare the stat and the version, but it still appears to be relatively heavyweight.

cc: [~noble.paul], you might find this interesting.

> Heavy lock contention during collection creation
> ------------------------------------------------
>
>                 Key: SOLR-15093
>                 URL: https://issues.apache.org/jira/browse/SOLR-15093
>             Project: Solr
>          Issue Type: Task
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Mike Drob
>            Priority: Major
>
> I was doing some lock analysis and found that we have quite a bit of
> contention on {{ZkStateReader$LazyCollectionRef.get(boolean)}} during heavy
> collection creation. I ran a sample workload creating as many collections as
> I could in 10 minutes, and this method was blocked for about 1:30 of that,
> which is a pretty significant portion.
> A few representative stack traces:
> {noformat}
> org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
> org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
> org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
> org.apache.solr.cloud.ZkController.checkIfCoreNodeNameAlreadyExists(CoreDescriptor)
> org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
> {noformat}
> And another:
> {noformat}
> org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
> org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
> org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
> org.apache.solr.common.cloud.ZkStateReader.getCollection(String)
> org.apache.solr.cloud.ZkController.publish(CoreDescriptor, Replica$State, boolean, boolean)
> org.apache.solr.cloud.ZkController.preRegister(CoreDescriptor, boolean)
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreDescriptor, boolean, boolean)
> org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
> {noformat}
> And one more:
> {noformat}
> org.apache.solr.common.cloud.ZkStateReader$LazyCollectionRef.get(boolean)
> org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String, boolean)
> org.apache.solr.common.cloud.ClusterState.getCollectionOrNull(String)
> org.apache.solr.common.cloud.ZkStateReader.registerDocCollectionWatcher(String, DocCollectionWatcher)
> org.apache.solr.common.cloud.ZkStateReader.waitForState(String, long, TimeUnit, Predicate)
> org.apache.solr.cloud.ZkController.checkStateInZk(CoreDescriptor)
> org.apache.solr.cloud.ZkController.preRegister(CoreDescriptor, boolean)
> org.apache.solr.core.CoreContainer.createFromDescriptor(CoreDescriptor, boolean, boolean)
> org.apache.solr.core.CoreContainer.create(String, Path, Map, boolean)
> {noformat}
> It looks like part of the problem is that we never allow ourselves to use the
> cache, so each call ends up being a full fetch out to ZK. We have the
> optimizations there to compare the stat and the version, but it still
> appears to be relatively heavyweight.
> cc: [~noble.paul], you might find this interesting.
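
To illustrate the pattern described in the issue (a synchronized lazy reference whose callers never permit a cached read), here is a minimal sketch. It is not Solr's code: `FakeStore`, `LazyRef`, the one-hour TTL, and `countRoundTrips` are hypothetical names invented for this example, and the store is an in-memory stand-in that only counts round trips. It assumes the behaviour the description reports: with the cache disallowed, every call pays at least a version check against the remote store while holding the lock.

```java
import java.util.concurrent.TimeUnit;

public class LazyRefSketch {

    /** Hypothetical stand-in for a remote store such as ZooKeeper; it only counts round trips. */
    static final class FakeStore {
        private int version = 0;
        private String data = "state-v0";
        private int roundTrips = 0;

        synchronized int readVersion() { roundTrips++; return version; }
        synchronized String readData() { roundTrips++; return data; }
        synchronized void update(String newData) { data = newData; version++; }
        synchronized int roundTrips() { return roundTrips; }
    }

    /** Hypothetical lazy reference: one lock guards both the cache and the remote calls. */
    static final class LazyRef {
        private final FakeStore store;
        private final long ttlNanos;
        private String cached;        // null until the first fetch
        private int cachedVersion;
        private long lastCheckNanos;

        LazyRef(FakeStore store, long ttlNanos) {
            this.store = store;
            this.ttlNanos = ttlNanos;
        }

        synchronized String get(boolean allowCached) {
            boolean fresh = cached != null && System.nanoTime() - lastCheckNanos < ttlNanos;
            if (allowCached && fresh) {
                return cached;        // no round trip; the lock is held only briefly
            }
            // Cache not allowed (or expired): at least one round trip while holding the lock.
            int remoteVersion = store.readVersion();
            if (cached == null || remoteVersion != cachedVersion) {
                cached = store.readData();   // second round trip only when the version changed
                cachedVersion = remoteVersion;
            }
            lastCheckNanos = System.nanoTime();
            return cached;
        }
    }

    /** Calls get() repeatedly on a fresh reference and reports how many round trips resulted. */
    static int countRoundTrips(boolean allowCached, int calls) {
        FakeStore store = new FakeStore();
        LazyRef ref = new LazyRef(store, TimeUnit.HOURS.toNanos(1));
        for (int i = 0; i < calls; i++) {
            ref.get(allowCached);
        }
        return store.roundTrips();
    }

    public static void main(String[] args) {
        // prints "allowCached=false: 101 round trips"
        System.out.println("allowCached=false: " + countRoundTrips(false, 100) + " round trips");
        // prints "allowCached=true:  2 round trips"
        System.out.println("allowCached=true:  " + countRoundTrips(true, 100) + " round trips");
    }
}
```

In this sketch the version comparison saves the data fetch but not the round trip itself, so callers that pass `false` still serialize behind one remote call each. Allowing cached reads removes that cost at the price of possibly returning state that is stale for up to the TTL.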