[jira] [Commented] (SOLR-13445) Preferred replicas on nodes with same system properties as the query master
[ https://issues.apache.org/jira/browse/SOLR-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16838635#comment-16838635 ]

Tomás Fernández Löbbe commented on SOLR-13445:

Nice! Thanks for adding this, Dat!

> Preferred replicas on nodes with same system properties as the query master
>
> Key: SOLR-13445
> URL: https://issues.apache.org/jira/browse/SOLR-13445
> Project: Solr
> Issue Type: Improvement
> Security Level: Public(Default Security Level. Issues are Public)
> Reporter: Cao Manh Dat
> Assignee: Cao Manh Dat
> Priority: Major
> Fix For: master (9.0), 8.2
>
> Attachments: SOLR-13445.patch, SOLR-13445.patch, SOLR-13445.patch
>
> Currently, Solr chooses a random replica for each shard to fan out the query
> request. However, this presents a problem when running Solr in multiple
> availability zones.
> If one availability zone fails, it affects all Solr nodes, because they
> will keep trying to connect to Solr nodes in the failed availability zone until
> the request times out. This can lead to a buildup of threads on each Solr node
> until the node runs out of memory, which results in a cascading failure.
> This issue tries to solve this problem by adding:
> * another shardPreference param named {{node.sysprop}}, so the query will be
> routed to nodes with the same defined system properties as the current one.
> * default shardPreferences on the whole cluster, which will be stored in
> {{/clusterprops.json}}.
> * a cacher for fetching other nodes' system properties whenever /live_nodes
> changes.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
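A minimal Java sketch of two of the pieces the issue description proposes. It assumes a simplified model in which each replica is identified only by its node name. The first piece is a cache of each live node's system properties that is refreshed when the live-node set changes. The second is an ordering step that puts nodes sharing the local node's value for a given property first. The class and method names (NodeSysPropsCache, onLiveNodesChanged, SysPropPreference, preferSameSysProp) and the injected fetcher function are hypothetical. This is not Solr's actual NodesSysPropsCacher or HttpShardHandlerFactory code. A real implementation would fetch properties from remote nodes and watch /live_nodes in ZooKeeper.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.function.Function;

// Hypothetical cache of each live node's system properties, keyed by node name.
class NodeSysPropsCache {
  private final Map<String, Map<String, String>> propsByNode = new HashMap<>();
  // Assumed hook that fetches one node's system properties (a remote call in a real cluster).
  private final Function<String, Map<String, String>> fetcher;

  NodeSysPropsCache(Function<String, Map<String, String>> fetcher) {
    this.fetcher = fetcher;
  }

  // Call whenever the live-node set changes: forget departed nodes, fetch only newly seen ones.
  synchronized void onLiveNodesChanged(Set<String> liveNodes) {
    propsByNode.keySet().retainAll(liveNodes);
    for (String node : liveNodes) {
      if (!propsByNode.containsKey(node)) {
        Map<String, String> props = fetcher.apply(node);
        propsByNode.put(node, props == null ? Map.of() : props);
      }
    }
  }

  // Returns an empty map for nodes that are unknown or no longer live.
  synchronized Map<String, String> get(String node) {
    return propsByNode.getOrDefault(node, Map.of());
  }
}

class SysPropPreference {
  // Reorders candidates so that nodes whose value of sysProp equals the local node's value
  // come first. List.sort is stable, so any existing (e.g. shuffled) order is kept within
  // each group, and non-matching nodes stay in the list as fallbacks.
  static List<String> preferSameSysProp(List<String> candidateNodes, String localNode,
                                        String sysProp, NodeSysPropsCache cache) {
    List<String> ordered = new ArrayList<>(candidateNodes);
    String localValue = cache.get(localNode).get(sysProp);
    if (localValue == null) {
      return ordered; // local node lacks the property: apply no preference
    }
    ordered.sort(Comparator.comparingInt(
        (String node) -> localValue.equals(cache.get(node).get(sysProp)) ? 0 : 1));
    return ordered;
  }
}
```

Keeping non-matching nodes at the back of the list, rather than filtering them out, means a query can still be served when no replica of a shard lives in the local zone.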
[jira] [Commented] (SOLR-13445) Preferred replicas on nodes with same system properties as the query master
[ https://issues.apache.org/jira/browse/SOLR-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837327#comment-16837327 ]

ASF subversion and git services commented on SOLR-13445:

Commit 3e300e1a94d74dcdea0496ca9c908deb7313db0e in lucene-solr's branch refs/heads/branch_8x from Cao Manh Dat
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=3e300e1 ]

SOLR-13445: Hardness the test
[jira] [Commented] (SOLR-13445) Preferred replicas on nodes with same system properties as the query master
[ https://issues.apache.org/jira/browse/SOLR-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837326#comment-16837326 ]

ASF subversion and git services commented on SOLR-13445:

Commit 6a06bcd47007870d4148d1f131bbdbd8f0924a31 in lucene-solr's branch refs/heads/master from Cao Manh Dat
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=6a06bcd ]

SOLR-13445: Hardness the test
[jira] [Commented] (SOLR-13445) Preferred replicas on nodes with same system properties as the query master
[ https://issues.apache.org/jira/browse/SOLR-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16837323#comment-16837323 ]

Cao Manh Dat commented on SOLR-13445:

Hi [~hossman], thanks a lot for your comment. It seems that with -Dtests.dups, static instances are reused, which makes the code send queries to nodes that were shut down in previous test runs.
[jira] [Commented] (SOLR-13445) Preferred replicas on nodes with same system properties as the query master
[ https://issues.apache.org/jira/browse/SOLR-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835796#comment-16835796 ]

ASF subversion and git services commented on SOLR-13445:

Commit 2ec14f0323e0ed28d36ac984f79c00890b26271d in lucene-solr's branch refs/heads/branch_8x from Cao Manh Dat
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2ec14f0 ]

SOLR-13445: Fix precommit
[jira] [Commented] (SOLR-13445) Preferred replicas on nodes with same system properties as the query master
[ https://issues.apache.org/jira/browse/SOLR-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835795#comment-16835795 ]

ASF subversion and git services commented on SOLR-13445:

Commit 81cfbcd0096b85d98c38dec038e2934bfaa271ca in lucene-solr's branch refs/heads/master from Cao Manh Dat
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=81cfbcd ]

SOLR-13445: Fix precommit
[jira] [Commented] (SOLR-13445) Preferred replicas on nodes with same system properties as the query master
[ https://issues.apache.org/jira/browse/SOLR-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835756#comment-16835756 ]

ASF subversion and git services commented on SOLR-13445:

Commit 8a1b966165339f017e0f1afb736b0afb939a0510 in lucene-solr's branch refs/heads/branch_8x from Cao Manh Dat
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=8a1b966 ]

SOLR-13445: Preferred replicas on nodes with same system properties as the query master
[jira] [Commented] (SOLR-13445) Preferred replicas on nodes with same system properties as the query master
[ https://issues.apache.org/jira/browse/SOLR-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835754#comment-16835754 ]

ASF subversion and git services commented on SOLR-13445:

Commit 6b5b74bc9c9576913a5124eec138938e09037dad in lucene-solr's branch refs/heads/master from Cao Manh Dat
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=6b5b74b ]

SOLR-13445: Preferred replicas on nodes with same system properties as the query master
[jira] [Commented] (SOLR-13445) Preferred replicas on nodes with same system properties as the query master
[ https://issues.apache.org/jira/browse/SOLR-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16835486#comment-16835486 ]

Cao Manh Dat commented on SOLR-13445:

Added a minor change to the patch to fix a TestHttpShardHandlerFactory failure.
[jira] [Commented] (SOLR-13445) Preferred replicas on nodes with same system properties as the query master
[ https://issues.apache.org/jira/browse/SOLR-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16834405#comment-16834405 ]

Shalin Shekhar Mangar commented on SOLR-13445:

Thanks Dat. A few comments:
# Minor nit: rename HttpShardHandlerFactory#sameMetric to hasSameMetric.
# Can you do an exponential backoff in NodesSysPropsCacher#fetchRemoteProps?
# The RoutingToNodesWithPropertiesTest needs a better check than comparing shardAddress, because shardAddress is set by the GET_TOP_IDS phase but not by other phases such as GET_FIELDS. Use TrackingShardHandlerFactory instead.
# I agree with Andrzej that the fix to SolrClientNodeStateProvider should go to a separate issue so that it can be backported to 7_7 if needed.
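Shalin's second point asks for exponential backoff in NodesSysPropsCacher#fetchRemoteProps. The following is a generic Java sketch of that pattern, not the code that was committed. The names fetchWithBackoff and sleepMillis, their parameters, and the injected sleeper are all hypothetical. The helper returns as soon as one attempt succeeds, so no further requests are sent after a success. Otherwise it doubles the wait between failed attempts, up to a cap. Passing the sleeper in lets the delay schedule be checked without actually waiting.

```java
import java.util.Optional;
import java.util.function.LongConsumer;
import java.util.function.Supplier;

// Hypothetical retry helper illustrating capped exponential backoff.
class BackoffFetcher {
  // Calls attempt up to maxAttempts times and returns the first non-empty result.
  // Between failed attempts it waits baseDelayMs, 2*baseDelayMs, 4*baseDelayMs, ...,
  // never more than maxDelayMs. Returns Optional.empty() if every attempt fails.
  static <T> Optional<T> fetchWithBackoff(Supplier<Optional<T>> attempt, int maxAttempts,
                                          long baseDelayMs, long maxDelayMs,
                                          LongConsumer sleeper) {
    long delay = baseDelayMs;
    for (int i = 1; i <= maxAttempts; i++) {
      Optional<T> result = attempt.get();
      if (result.isPresent()) {
        return result; // stop immediately on success
      }
      if (i < maxAttempts) { // no pointless wait after the final attempt
        sleeper.accept(delay);
        delay = Math.min(delay * 2, maxDelayMs);
      }
    }
    return Optional.empty();
  }

  // Real-time sleeper; restores the interrupt flag instead of swallowing it.
  static void sleepMillis(long ms) {
    try {
      Thread.sleep(ms);
    } catch (InterruptedException e) {
      Thread.currentThread().interrupt();
    }
  }
}
```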
[jira] [Commented] (SOLR-13445) Preferred replicas on nodes with same system properties as the query master
[ https://issues.apache.org/jira/browse/SOLR-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832610#comment-16832610 ]

Andrzej Bialecki commented on SOLR-13445:

The bug in {{SolrClientNodeStateProvider}} should be fixed in other active branches too, regardless of this improvement.
[jira] [Commented] (SOLR-13445) Preferred replicas on nodes with same system properties as the query master
[ https://issues.apache.org/jira/browse/SOLR-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16832436#comment-16832436 ]

Cao Manh Dat commented on SOLR-13445:

I had several private conversations with [~shalinmangar] about how to deal with this issue, and he helped a lot. Thanks [~shalinmangar]. Besides implementing the features mentioned in the description, the attached patch also fixes an issue in {{SolrClientNodeStateProvider}}: it always retried querying metrics from other nodes, even right after a successful attempt.