Hello Shad, Re my previous e-mail: I stand corrected. With the ValueHashSet correction, Brzozowski passes. For some reason, the Lucene.net compilation didn't occur.
Do I get a cookie now? Or (even better) let's pretend we're on 4.9 <g> Vincent -----Original Message----- From: Shad Storhaug [mailto:[email protected]] Sent: Friday, March 24, 2017 2:04 PM To: Van Den Berghe, Vincent <[email protected]> Cc: [email protected] Subject: RE: TestSearcherManager_Mem Vincent, That's great news! Looks like it is passing always now. But, for some reason in .NET core it is taking MUCH longer to finish. I am getting about 6 seconds in .NET Framework, and over 1 minute in .NET Core. Maybe using a stopwatch is the right solution to stabilize this behavior? To run the tests on .NET Core, open the Lucene.Net.Portable.sln. You may need to run "dotnet restore" from the command line at the root of the repository in order to get it to compile (sometimes you have to close and reopen Visual Studio to get the command to take). I have taken a stab at that IndicNormalizer. It was failing when trying to get a character out of the BitArray that it previously put in there. But it was designed to work with a Java BitSet, not a .NET BitArray. Perhaps there is some difference in the way it works that is causing this (like null character not being stored, or something silly like that). It's a shot in the dark, but since I cannot get the test to fail under controlled conditions, I have replaced it with the OpenBitSet (which is basically a Java BitSet with access to its underlying storage). At the very least, it won't hurt. I'll also take a closer look at the random "file not closed" failures coming from TestSearcherManager_Mem(). I think you fixed the underlying cause for the main failure. But this is a sign that there is an unexpected exception being thrown that triggers Dispose() too early. Perhaps there is still a broken Codec that is causing this failure, which would explain its randomness. Is there any hope for Brzozowski? I'll make a compromise - if you can solve that one, I will pretend we are on version 4.9 for TestEarlyTerminationDifferentSorter() so we can put it to bed - it's probably not worth the effort anyway (I have already spent several days chasing after that one). Thanks, Shad Storhaug (NightOwl888) -----Original Message----- From: Van Den Berghe, Vincent [mailto:[email protected]] Sent: Friday, March 24, 2017 4:40 PM To: Shad Storhaug Cc: [email protected] Subject: RE: TestSearcherManager_Mem Hello Shad, I have a theory about TestCRTReopen: if you look at the java code (https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.8.0/lucene/core/src/java/org/apache/lucene/search/ControlledRealTimeReopenThread.java), you see there's a relation between the reentrant lock and its condition variable: ReentrantLock reopenLock = new ReentrantLock(); Condition reopenCond = reopenLock.newCondition(); Maybe there's some subtlety in there that we miss. The lock is used only as a guard around the reopen condition, which maybe is how they rule in the Java Shire, but no such concepts exist as such in C#. The closest thing to a "real" ReentrantLock implementation I have ever seen in .NET (complete with condition variables, fair locking, and so on) is https://github.com/spring-projects/spring-net-threading/blob/master/src/Spring/Spring.Threading/Threading/Locks/ReentrantLock.cs But that's a gorilla. All we really want is a banana, without the gorilla attached to it. So that got me thinking: we know what ControlledRealTImeReopenThread does. Why don't we implement it in "pure" C# instead of trying to translate it from Java using synchronization primitives that are almost but not quite totally unlike those in .NET? You can find the result in the file attached. I restrained myself and didn't replace Environment.TickCount with Stopwatch (which would be more correct and guard against overflows that occur in TickCount every 24.9 days). In a fit of altruism and insight, I even let all the related unit tests run, and they all pass. And finish in time! But that's in my alternate universe, of course <g>. Vincent -----Original Message----- From: Shad Storhaug [mailto:[email protected]] Sent: Friday, March 24, 2017 3:49 AM To: Van Den Berghe, Vincent <[email protected]> Cc: [email protected] Subject: RE: TestSearcherManager_Mem Vincent, FYI - TestSearcherManager_Mem() succeeds much more frequently, but still randomly fails. Also, although I was able to make the error message change for TestCRTReopen(), I didn't manage to get it working. The problem is pretty obvious - the WaitForGeneration() method (https://github.com/apache/lucenenet/blob/api-work/src/Lucene.Net.Tests/Search/TestControlledRealTimeReopenThread.cs#L680) is WAY too slow. Even if I increase the wait period from 20 to 60 seconds it still doesn't finish in time. I played with a few of the variables in ControlledRealTimeReopenThread, but couldn't get the behavior to change. I verified that PulseAll() gets called, but it doesn't seem like it is having any effect on the Wait(). For TestEarlyTerminationDifferentSorter(), I reviewed all of the code under test in the Misc project line by line, but there was nothing significant to fix. So, still broken (sometimes). Thanks, Shad Storhaug (NightOwl888) -----Original Message----- From: Shad Storhaug [mailto:[email protected]] Sent: Thursday, March 23, 2017 8:25 PM To: Van Den Berghe, Vincent; [email protected] Cc: [email protected] Subject: RE: TestSearcherManager_Mem Parallel universe or not, I think you are making progress. I found a similar IncrementAndGet issue in the ThreadedIndexingAndSearchingTestCase that I have already corrected. However, it only mattered in one case, in all other cases the result of IncrementAndGet was not being utilized. It is like someone intentionally wanted to make it impossible to get all of the bugs out of this code... Anyway, stupid is as stupid does...I went through and scanned the entire codebase for IncrementAndGet and compared it against Lucene. Sure enough, the Core.Util.RefCount class was refactored from its original. I changed it back to the original code (backed by an AtomicInteger/AtomicInt32), and the TestCRTReopen() test no longer fails almost immediately - after a couple of minutes it now fails with the message "waited too long for commit generation". I don't know if I fixed it or broke it more, but it is definitely behaving differently now. Now, let me see if I can bring your other changes into my universe...perhaps the new failure has something to do with the reset event. Thanks, Shad Storhaug (NightOwl888) From: Van Den Berghe, Vincent [mailto:[email protected]] Sent: Thursday, March 23, 2017 7:04 PM To: [email protected] Cc: Shad Storhaug Subject: TestSearcherManager_Mem Even though I seem to live in a parallel universe where 42 isn't 42 and 4.8 isn't 4.8, I'll have a stab at resolving TestSearcherManager_Mem. First, there is a method in TrackingIndexWriter: public virtual long GetAndIncrementGeneration() { return indexingGen.IncrementAndGet(); } The implementation calls the wrong indexGen method: it should call GetAndIncrement(), which doesn't exist in the .NET version. You can add the method to the AtomicLong class. Too bad there's no Interlocked.PostIncrement, but it's easy enough: public long GetAndIncrement() { return Interlocked.Increment(ref value) - 1; } And adjust the call accordingly: public virtual long GetAndIncrementGeneration() { return indexingGen.GetAndIncrement(); } Next we turn our attention to ControlledRealTimeReopenThread<T>. There's an event defined as follows: private ManualResetEvent reopenCond = new ManualResetEvent(false); This is not correct, since the remainder of the implementation only Sets and Waits, but never resets. Once the event Set, the wait will never ... uh... wait the second time around. Change this as follows: private AutoResetEvent reopenCond = new AutoResetEvent(false); Next, for some reason, time is counted in nanoseconds, but since Environment.TickCount is in milliseconds, we need to convert it by multiplying by 1000000. Unfortunately, this is done by multiplication: Environment.TickCount * 1000000 Since Environment.TickCount is an int and 1000000 is an int, the result is negative unless you just rebooted your computer in a Tardis doing a polka backwards. Define: private const long MS_IN_NS = 1000000; ... and change all other references to 1000000 except one (see below) with MS_IN_NS: this should solve the overflow problem using C#'s promotion rules. Next, notice that 64-bit integers are sometimes read outside locks: searchingGen = refreshStartGen; if (targetGen > searchingGen) while (targetGen > searchingGen) This isn't guaranteed to be atomic, and I'm a curmudgeon when it comes to parallelism and atomicity. Change all these lines by: Interlocked.Exchange(ref searchingGen, refreshStartGen); if (targetGen > Interlocked.Read(ref searchingGen)) while (targetGen > Interlocked.Read(ref searchingGen)) In my own spacetime continuum, the test now passes. Added bonus points: dispose of the waitable event in ControlledRealTimeReopenThread<T> Dispose method by adding: reopenCond.Dispose(); (after the Monitor.PulseAll(this); statement) Extreme added bonus points: the following statement is incorrect, but it works anyway: reopenCond.WaitOne(new TimeSpan(sleepNS / 1000000));//Convert NS to Ticks (the 1000000 should not be replaced by MS_IN_NS in the new version) The reason why it's incorrect is because the argument to TimeSpan in the call accepts 100-nanoseconds units, and dividing nanoseconds by 1000000 yields milliseconds instead. So you will wait a delay of a factor 10000 shorter. It turns out that the correction (divide by 100) will cause timeouts in the tests, so I left it as-is. But all of this might be wrong, I may not even exists. Vincent
