I have fixed the issue. A PR is out:
https://github.com/apache/lucene-solr/pull/2410/files
Most of the work was documenting what stats are actually returned. Now
OverseerStatusCmd has more comment lines than code lines.
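
A minimal sketch of the behavior discussed in this thread, not the code in the PR: in distributed cluster state updates mode only the stats tied to cluster state updates are omitted, while Collection API stats are still returned. The class name, method name, and the grouping of stats into two maps are assumptions made for illustration.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class OverseerStatsSketch {

    /**
     * Hypothetical helper: chooses which stats to return.
     * Collection API stats are always returned; cluster state update stats
     * are returned only when Overseer handles cluster state updates.
     */
    static Map<String, Long> selectStats(Map<String, Long> collectionApiStats,
                                         Map<String, Long> clusterStateUpdateStats,
                                         boolean useDistributedStateUpdate) {
        // Start from the Collection API stats, which are valid in both modes.
        Map<String, Long> result = new LinkedHashMap<>(collectionApiStats);
        if (!useDistributedStateUpdate) {
            // Overseer handles cluster state updates, so its stats about them are meaningful.
            result.putAll(clusterStateUpdateStats);
        }
        return result;
    }

    public static void main(String[] args) {
        // Illustrative stat names and counts, not real Solr output.
        Map<String, Long> api = Map.of("collection_op", 3L);
        Map<String, Long> updates = Map.of("state_update_op", 7L);
        System.out.println(selectStats(api, updates, true).keySet());
        System.out.println(selectStats(api, updates, false).size());
    }
}
```

A test that runs in either mode would then assert on the cluster state update stats only when useDistributedStateUpdate is false.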

Will merge it shortly.

Ilan



On Sun, Feb 21, 2021 at 6:05 PM Ilan Ginzburg <ilans...@gmail.com> wrote:

> Searching in my jenkins folder for failures of this test (label:jenkins
> "FAILED:  org.apache.solr.cloud.OverseerStatusTest.test") 26 emails match.
> Searching for all jenkins master builds emails since the first failure
> email found above (2 days ago), I see 40 messages.
> 26 out of 40 is not far from the expected 50% failure rate.
> I believe the ratio in the graph you sent, David (currently at 5.7%), is
> averaged over a week and includes failures from all branches (I did some
> other stats on jenkins emails that tend to confirm this assumption).
>
> On Sun, Feb 21, 2021 at 10:53 AM Ilan Ginzburg <ilans...@gmail.com> wrote:
>
>> Yes Marcus, this is the commit.
>>
>> David, I would have expected 50% failures, as 50% of the runs use
>> distributed updates. I’ll try to understand better as I fix the issue.
>>
>> Ilan
>>
>> On Sun 21 Feb 2021 at 06:17, David Smiley <dsmi...@apache.org> wrote:
>>
>>> Interesting.  Do you have a guess as to why the failures there are ~5%
>>> and not 100% reproducible?
>>>
>>> ~ David Smiley
>>> Apache Lucene/Solr Search Developer
>>> http://www.linkedin.com/in/davidwsmiley
>>>
>>>
>>> On Sat, Feb 20, 2021 at 6:41 PM Ilan Ginzburg <ilans...@gmail.com>
>>> wrote:
>>>
>>>> Indeed the issue is due to my changes.
>>>>
>>>> In OverseerStatusCmd I've skipped some stat collection when running in
>>>> distributed cluster state updates mode because I thought these were only
>>>> stats related to cluster state updates.
>>>> Obviously that was too aggressive and some of the stats are related to
>>>> the Collection API.
>>>>
>>>> When running in distributed cluster state updates mode, I will make sure
>>>> to skip only the stats related to the cluster state updater and restore
>>>> the Collection API stats (in the other mode, all stats are still
>>>> returned).
>>>>
>>>> Tomorrow...
>>>>
>>>> Ilan
>>>>
>>>> On Sun, Feb 21, 2021 at 12:22 AM Ilan Ginzburg <ilans...@gmail.com>
>>>> wrote:
>>>>
>>>>> Thank you David for reporting this.
>>>>>
>>>>> It seems to be due to my recent changes. I can reproduce the failure
>>>>> locally and will look at this tomorrow.
>>>>>
>>>>> With the distributed cluster state updates, I've introduced a
>>>>> randomization for using either Overseer based cluster state updates or
>>>>> distributed cluster state updates in tests. This failure seems to happen
>>>>> in the distributed state update case. I suspect it is due to Overseer
>>>>> returning fewer stats than expected by the test (which is expected:
>>>>> Overseer cannot return stats about cluster state updates if it does not
>>>>> handle cluster state updates).
>>>>>
>>>>> The following line in the logs shows that the run is using distributed
>>>>> cluster state updates:
>>>>> 972874 INFO  (jetty-launcher-8973-thread-2) [     ]
>>>>> o.a.s.c.DistributedClusterStateUpdater Creating
>>>>> DistributedClusterStateUpdater with useDistributedStateUpdate=true. Solr
>>>>> will be using distributed cluster state updates.
>>>>>
>>>>> Ilan
>>>>>
>>>>>
>>>>> On Sat, Feb 20, 2021 at 3:00 PM David Smiley <dsmi...@apache.org>
>>>>> wrote:
>>>>>
>>>>>> I encountered a failure from OverseerStatusTest locally.  According
>>>>>> to our test failure trends, this guy only just recently started failing
>>>>>> ~4-5% of the time, but previously was fine.  Only master branch.
>>>>>>
>>>>>>
>>>>>> http://fucit.org/solr-jenkins-reports/history-trend-of-recent-failures.html#series/org.apache.solr.cloud.OverseerStatusTest.test
>>>>>>
>>>>>> ~ David Smiley
>>>>>> Apache Lucene/Solr Search Developer
>>>>>> http://www.linkedin.com/in/davidwsmiley
>>>>>>
>>>>>
