[ 
https://issues.apache.org/jira/browse/SOLR-17088?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17842810#comment-17842810
 ] 

Houston Putman edited comment on SOLR-17088 at 5/2/24 12:50 AM:
----------------------------------------------------------------

Wow, this was a rough one to find!

Luckily it's only breaking on main, and the git history gave us some help since 
very few commits aren't backported to 9x.

It turned out to be a seemingly innocuous change that removed the solr.xml 
capabilities from ZK:  SOLR-16975

Jan did a great job; however, there is one option for starting nodes after 
configuring a cluster that does not take a solr.xml, so it uses the default 
solr.xml instead. For clusters that were initially set up with a solr.xml, 
these new nodes will therefore have a different solr.xml from the original 
nodes. (This was not a problem before, since the solr.xml file was in ZK.)

The reason we were seeing these errors is that TestPrepRecovery does not use 
the default cloud solr.xml. The default has options for choosing between 
distributedClusterStateUpdate and the overseer, and these flags are randomized 
when setting up the cluster. Since TestPrepRecovery was using its own solr.xml 
that didn't support these flags, it was using the overseer. However, when doing 
a {{cluster.startJettySolrRunner()}}, the new jetty instances used the default 
cloud solr.xml (as stated above), which does support the flags. So these 
instances thought they were supposed to be using the distributed state 
processing, hence the cluster state issues.
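
The mismatch described above can be modeled with a small, self-contained 
sketch. Everything in it is hypothetical: the SolrXml stand-in, the Mode enum, 
and modeFor are not Solr APIs. It only illustrates how nodes given the same 
randomized flag can disagree when their solr.xml files differ.

```java
// Hypothetical model only: none of these types are part of Solr.
final class StateUpdateModeSketch {

    enum Mode { OVERSEER, DISTRIBUTED }

    /** Stand-in for a solr.xml; it records only whether the file reads the randomized flag. */
    static final class SolrXml {
        final boolean supportsDistributedFlag;

        SolrXml(boolean supportsDistributedFlag) {
            this.supportsDistributedFlag = supportsDistributedFlag;
        }
    }

    /** A node goes distributed only if its solr.xml reads the flag and the flag is on. */
    static Mode modeFor(SolrXml solrXml, boolean distributedFlag) {
        return (solrXml.supportsDistributedFlag && distributedFlag)
            ? Mode.DISTRIBUTED
            : Mode.OVERSEER;
    }

    public static void main(String[] args) {
        SolrXml testSpecific = new SolrXml(false); // like the test's own solr.xml
        SolrXml defaultCloud = new SolrXml(true);  // like the default cloud solr.xml
        boolean randomizedFlag = true;             // assume randomization picked "distributed"

        // Same flag, different solr.xml, different conclusion about who owns cluster state.
        System.out.println("original nodes: " + modeFor(testSpecific, randomizedFlag));
        System.out.println("late-started node: " + modeFor(defaultCloud, randomizedFlag));
    }
}
```

In this model the two kinds of node only disagree when the randomized flag is 
on, which would be consistent with a failure that shows up intermittently.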

To make matters even more complicated, this error cannot be seen when running 
only the failing test, "testLeaderNotResponding", because this test does not 
add any new jetty runners. The test above it, "testLeaderUnloaded", does, so the 
entire test class needs to be run to see any errors. Amazingly, the test that 
causes the problems, "testLeaderUnloaded", does not itself fail, because its 
collection creation happens before the new jetty runners are started.

Overall an easy fix: save the solr.xml after creating a cluster and use it for 
all new jetty runners.
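
As a rough illustration of that idea, here is a minimal sketch. It is not the 
actual patch, and the Cluster and Node classes and their method names are 
invented for the example rather than taken from Solr's test framework. The 
cluster keeps the solr.xml it was created with and hands it to every node it 
starts later.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical model only: this is not MiniSolrCloudCluster or any other Solr class.
final class SavedSolrXmlSketch {

    static final String DEFAULT_SOLR_XML = "<solr><!-- default cloud settings --></solr>";

    /** A node only remembers which solr.xml it was started with. */
    static final class Node {
        final String solrXml;

        Node(String solrXml) {
            this.solrXml = solrXml;
        }
    }

    static final class Cluster {
        private final String savedSolrXml; // saved once, at cluster creation
        private final List<Node> nodes = new ArrayList<>();

        Cluster(int initialNodes, String solrXml) {
            this.savedSolrXml = solrXml;
            for (int i = 0; i < initialNodes; i++) {
                nodes.add(new Node(solrXml));
            }
        }

        /** Old behavior in this model: a late node silently gets the default solr.xml. */
        Node startNodeWithDefault() {
            Node node = new Node(DEFAULT_SOLR_XML);
            nodes.add(node);
            return node;
        }

        /** Fixed behavior in this model: a late node reuses the saved solr.xml. */
        Node startNode() {
            Node node = new Node(savedSolrXml);
            nodes.add(node);
            return node;
        }

        boolean allNodesShareSolrXml() {
            return nodes.stream().allMatch(n -> n.solrXml.equals(savedSolrXml));
        }
    }

    public static void main(String[] args) {
        String custom = "<solr><!-- test-specific settings --></solr>";

        Cluster fixed = new Cluster(2, custom);
        fixed.startNode();
        System.out.println("fixed consistent: " + fixed.allNodesShareSolrXml());

        Cluster buggy = new Cluster(2, custom);
        buggy.startNodeWithDefault();
        System.out.println("buggy consistent: " + buggy.allNodesShareSolrXml());
    }
}
```

Keeping the saved solr.xml on the cluster object means callers that start a 
node without passing a config still get one that matches the original nodes.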



> TestPrepRecovery.testLeaderNotResponding fails much more lately
> ---------------------------------------------------------------
>
>                 Key: SOLR-17088
>                 URL: https://issues.apache.org/jira/browse/SOLR-17088
>             Project: Solr
>          Issue Type: Test
>            Reporter: David Smiley
>            Priority: Minor
>         Attachments: 2023-11-27 fail.log.txt
>
>
> I'll attach logs.  I didn't try and root cause.  [Increased in test frequency 
> lately|http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.cloud.TestPrepRecovery.testLeaderNotResponding].
>   All recent failures happen on main, not 9x.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
