[ https://issues.apache.org/jira/browse/SOLR-13445?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Hoss Man reopened SOLR-13445:
-----------------------------

Jenkins has found at least 2 problems with the new RoutingToNodesWithPropertiesTest class...

[https://jenkins.thetaphi.de/view/Lucene-Solr/job/Lucene-Solr-8.x-Linux/536/]

----

First: a reproducing failing seed (on branch_8x)...

{noformat}
   [junit4]   2> NOTE: reproduce with: ant test  -Dtestcase=RoutingToNodesWithPropertiesTest -Dtests.method=test -Dtests.seed=13525A4073A0EB3F -Dtests.multiplier=3 -Dtests.slow=true -Dtests.locale=zh-HK -Dtests.timezone=Brazil/Acre -Dtests.asserts=true -Dtests.file.encoding=UTF-8
   [junit4] FAILURE 0.45s J1 | RoutingToNodesWithPropertiesTest.test <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: Hitting same zone after 10 queries
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([13525A4073A0EB3F:9B06659ADD5C86C7]:0)
   [junit4]    >        at org.apache.solr.cloud.RoutingToNodesWithPropertiesTest.test(RoutingToNodesWithPropertiesTest.java:251)
   [junit4]    >        at java.lang.Thread.run(Thread.java:748)
{noformat}

At a glance, the problem seems to be that the test assumes that if it tries a query 10 times, at least one of those queries will hit 2 nodes in different "zones" – but there's no guarantee of that, it's pure dumb luck – it's like having a test that calls {{random().nextInt(2)}} in a loop 10 times and asserts that it got a value of "0" in at least one iteration ... it's statistically going to fail some fixed percentage of the time.

----

Second: when jenkins tries to reproduce the seed, it runs with {{-Dtests.dups=5}}, but this causes an initialization failure in the BeforeClass method ... I'm not certain, but at a glance I'm guessing this is because of static variables that aren't being cleaned up in the AfterClass method?
{noformat}
   [junit4] ERROR   0.00s J2 | RoutingToNodesWithPropertiesTest (suite) <<<
   [junit4]    > Throwable #1: java.lang.AssertionError: expected:<us-west1> but was:<null>
   [junit4]    >        at __randomizedtesting.SeedInfo.seed([13525A4073A0EB3F]:0)
   [junit4]    >        at org.apache.solr.cloud.RoutingToNodesWithPropertiesTest.setupCluster(RoutingToNodesWithPropertiesTest.java:115)
   [junit4]    >        at java.lang.Thread.run(Thread.java:748)
{noformat}

> Preferred replicas on nodes with same system properties as the query master
> ---------------------------------------------------------------------------
>
>                 Key: SOLR-13445
>                 URL: https://issues.apache.org/jira/browse/SOLR-13445
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public)
>            Reporter: Cao Manh Dat
>            Assignee: Cao Manh Dat
>            Priority: Major
>             Fix For: master (9.0), 8.2
>
>         Attachments: SOLR-13445.patch, SOLR-13445.patch, SOLR-13445.patch
>
>
> Currently, Solr chooses a random replica for each shard to fan out the query
> request. However, this presents a problem when running Solr in multiple
> availability zones.
> If one availability zone fails, it affects all Solr nodes, because they
> will try to connect to Solr nodes in the failed availability zone until the
> request times out. This can lead to a build-up of threads on each Solr node
> until the node runs out of memory. This results in a cascading failure.
> This issue tries to solve this problem by adding
> * another shardPreference param named {{node.sysprop}}, so the query will be
> routed to nodes with the same defined system properties as the current one.
> * default shardPreferences on the whole cluster, which will be stored in
> {{/clusterprops.json}}.
> * a cacher for fetching other nodes' system properties whenever /live_nodes
> changes.
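
To put a number on the first problem reported above (the "10 queries" assertion): if each attempt independently has probability p of producing the desired outcome, then all n attempts miss with probability (1 - p)^n. For the {{random().nextInt(2)}} analogy, p = 0.5 and n = 10, which gives 1/1024, or roughly one failing run in a thousand. The sketch below is only an illustration and is not code from the Solr test; the class name {{RetryFlakiness}}, its two methods, the seed, and the assumption that attempts are independent with p = 0.5 are all hypothetical.

```java
import java.util.Random;

/** Hypothetical illustration only; not part of the Solr code base. */
public class RetryFlakiness {

  /**
   * Probability that none of `attempts` independent tries succeeds,
   * given a per-try success probability `pHit` (assumed, not measured).
   */
  public static double failureProbability(int attempts, double pHit) {
    return Math.pow(1.0 - pHit, attempts);
  }

  /**
   * Empirical counterpart of the analogy: the fraction of trials in which
   * nextInt(2) never returns 0 across `attempts` draws.
   */
  public static double simulatedFailureRate(int attempts, int trials, long seed) {
    Random random = new Random(seed);
    int failures = 0;
    for (int t = 0; t < trials; t++) {
      boolean sawZero = false;
      for (int i = 0; i < attempts && !sawZero; i++) {
        sawZero = random.nextInt(2) == 0;
      }
      if (!sawZero) {
        failures++; // this "test run" would have tripped the assertion
      }
    }
    return (double) failures / trials;
  }

  public static void main(String[] args) {
    System.out.println("analytic:  " + failureProbability(10, 0.5));
    System.out.println("simulated: " + simulatedFailureRate(10, 1_000_000, 42L));
  }
}
```

Under these assumptions, raising the retry count only shrinks the failure rate and never removes it; the real per-query probability in the test depends on replica placement and may differ from 0.5.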