Erick,

Serializing isn't really an option here, we want to test that execution
order is *not* submission order...

I believe if we wanted to verify the property that this test seems to
assert: "when there are more tasks than the number of executor threads and
all are blocked on a single lock, then a task enqueued afterwards and
requiring another available lock will be run right away", then maybe a unit
test of OverseerTaskProcessor/OverseerCollectionConfigSetProcessor would be
appropriate. But doing so would require refactoring: the run() method of
OverseerTaskProcessor is 200 lines.

I'll likely go with an "overkill" option. It will still not be free from
timing issues, but hopefully less vulnerable. And I'll try to identify when
test preconditions do not hold, in order to fail with an informative
message such as "slow task finished execution too soon, cannot test running
tasks in parallel".

Will create a new Jira.

Thanks for your week-end feedback,
Ilan

On Sat, May 30, 2020 at 11:26 PM Erick Erickson <erickerick...@gmail.com>
wrote:

> Ilan:
>
> Please raise a new JIRA and attach any fixes there, a Git PR or patch
> whichever you prefer. Your analysis will get lost in the noise for
> SOLR-12801. And thanks for digging! We usually title them something like
> “harden MultiThreadedOCPTest” or “fix failing MultiThreadedOCPTest” or
> similar.
>
> I haven’t really looked at the code you’re analyzing, it’s the weekend
> after all ;). So ignore this question if it makes no sense. Anyway, I’m
> always suspicious of using delays for just this kind of reason; depending
> on the hardware something may fail sometimes. Is there any way to make the
> sequencing more of a positive lock? Unfortunately that can be tricky, I’ve
> wound up doing things like serializing all tasks which…er…isn’t a good fix
> ;). But it sounds like your “overkill” section is along those lines?
>
> Up to you.
>
> Meanwhile, I’ve started beasting that test on my spare machine. If you’re
> curious about what that is:
> https://gist.github.com/markrmiller/dbdb792216dc98b018ad
>
> But don’t worry about that part. The point is I can run the test multiple
> times, with N running in parallel, hopefully making my machine exhibit the
> problem sometime over the night. If I can get it to fail before but not
> after your fix, it provides some reassurance that your fix is working. Not
> totally certain of course. Otherwise, we’ll just commit your fixes and see
> if Hoss’ rollups stop showing it.
>
> Thanks again!
> Erick
>
>
>
> > On May 30, 2020, at 1:42 PM, Ilan Ginzburg <ilans...@gmail.com> wrote:
> >
> > MultiThreadedOCPTest
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

Reply via email to