NightOwl888 commented on issue #271:
URL: https://github.com/apache/lucenenet/issues/271#issuecomment-973005744
I started by trying to solve the two `TestRandomChains` failures (see
#269), but the work unraveled back to this issue again.
I isolated a specific case that fails in .NET:
```c#
[Test]
public void TestWordDelimiterFilterChain()
{
    // Repro chain: NGramTokenizer -> WordDelimiterFilter -> ShingleFilter
    Analyzer a = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
    {
        Tokenizer tokenizer = new NGramTokenizer(LuceneVersion.LUCENE_48, reader, 70, 79);
        TokenStream filter = tokenizer;
        filter = new WordDelimiterFilter(LuceneVersion.LUCENE_48, filter,
            WordDelimiterFlags.CATENATE_NUMBERS | WordDelimiterFlags.SPLIT_ON_CASE_CHANGE,
            new CharArraySet(LuceneVersion.LUCENE_48,
                new HashSet<string> { "htxodcrfoi", "zywcmh", "pkw", "urdvcqr", "gbuskh" },
                ignoreCase: true));
        filter = new ShingleFilter(filter, "EMAIL");
        return new TokenStreamComponents(tokenizer, filter);
    });
    string text = "ufsczdev -[({0, \u20d0\u20f4\u20f4\u20d4\u20e0\u20e4\u20d0\u20ef\u20e7\u20dd\u20f7\u20d2\u20e1 kdoxod \u175b\u1758\u174a pirsk --><scr \ufe5d\ufe51\ufe6c\ufe5d\ufe5e\ufe61 \ud802\udf57 \ua6fb\ua6cd \ufc78\uf998\u06e9\u1b3a \ud802\ude2c\ud802\ude59\ud802\ude22\ud802\ude3b lkpnukew rqniibv \ue990\ud8d0\udfe6\ue6a8[ lmev wzpts \u306a\u304a\u3093\u3080\u3092";
    bool useCharFilter = false;
    bool offsetsAreCorrect = true;
    long seed = 0x21773254b5ac8fa0L;
    Random random = new J2N.Randomizer(seed);
    try
    {
        CheckAnalysisConsistency(random, a, useCharFilter, text, offsetsAreCorrect);
    }
    catch (Exception e) when (e.IsThrowable())
    {
        Console.WriteLine("Exception from random analyzer: " + a);
        throw;
    }
}
```
And after translating it to Java, it still fails:
```java
public void testWordDelimiterFilterChain() throws Throwable {
  Analyzer a = new Analyzer() {
    @Override
    public TokenStreamComponents createComponents(String fieldName, Reader reader) {
      Tokenizer tokenizer = new org.apache.lucene.analysis.ngram.NGramTokenizer(Version.LUCENE_48, reader, 70, 79);
      TokenStream filter = tokenizer;
      filter = new WordDelimiterFilter(Version.LUCENE_48, filter,
          WordDelimiterFilter.CATENATE_NUMBERS | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
          new CharArraySet(Version.LUCENE_48,
              asSet("htxodcrfoi", "zywcmh", "pkw", "urdvcqr", "gbuskh"), /*ignoreCase:*/ true));
      filter = new org.apache.lucene.analysis.shingle.ShingleFilter(filter, "EMAIL");
      return new TokenStreamComponents(tokenizer, filter);
    }
  };
  String text = "ufsczdev -[({0, \u20d0\u20f4\u20f4\u20d4\u20e0\u20e4\u20d0\u20ef\u20e7\u20dd\u20f7\u20d2\u20e1 kdoxod \u175b\u1758\u174a pirsk --><scr \ufe5d\ufe51\ufe6c\ufe5d\ufe5e\ufe61 \ud802\udf57 \ua6fb\ua6cd \ufc78\uf998\u06e9\u1b3a \ud802\ude2c\ud802\ude59\ud802\ude22\ud802\ude3b lkpnukew rqniibv \ue990\ud8d0\udfe6\ue6a8[ lmev wzpts \u306a\u304a\u3093\u3080\u3092";
  boolean useCharFilter = false;
  boolean offsetsAreCorrect = false;
  long seed = 0x21773254b5ac8fa0L;
  java.util.Random random = new java.util.Random(seed);
  try {
    checkAnalysisConsistency(random, a, useCharFilter, text, offsetsAreCorrect);
  } catch (Throwable e) {
    System.out.println("Exception from random analyzer: " + a);
    throw e;
  }
}
```
I discovered this is primarily because some flag combinations are not
supported during indexing (see https://stackoverflow.com/a/66478863/), and confirmed
that removing the `WordDelimiterFlags.CATENATE_NUMBERS` flag makes the test
pass, as shown below.
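For reference, the only change to the repro above is dropping that flag from the `WordDelimiterFilter` construction:
```c#
// Same chain as above, minus CATENATE_NUMBERS; with this change
// CheckAnalysisConsistency passes for the same seed and input text.
filter = new WordDelimiterFilter(LuceneVersion.LUCENE_48, filter,
    WordDelimiterFlags.SPLIT_ON_CASE_CHANGE,
    new CharArraySet(LuceneVersion.LUCENE_48,
        new HashSet<string> { "htxodcrfoi", "zywcmh", "pkw", "urdvcqr", "gbuskh" },
        ignoreCase: true));
```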
In `TestRandomChains`, `WordDelimiterFilter` is added to the
`brokenOffsetsConstructors` dictionary. This causes [`offsetsAreCorrect` to be
set to
`false`](https://github.com/apache/lucenenet/blob/03bc03bb5bcb74d12524cc9c0d85e501473a4445/src/Lucene.Net.Tests.Analysis.Common/Analysis/Core/TestRandomChains.cs#L1028),
which
[`BaseTokenStreamTestCase.CheckAnalysisConsistency`](https://github.com/apache/lucenenet/blob/03bc03bb5bcb74d12524cc9c0d85e501473a4445/src/Lucene.Net.TestFramework/Analysis/BaseTokenStreamTestCase.cs#L1014)
is supposed to handle. `BaseTokenStreamTestCase` has been changed from its
original implementation to include try-finally and using blocks (both in
[`CheckAnalysisConsistency()`](https://github.com/apache/lucenenet/blob/03bc03bb5bcb74d12524cc9c0d85e501473a4445/src/Lucene.Net.TestFramework/Analysis/BaseTokenStreamTestCase.cs#L1038)
and in
[`AssertTokenStreamContents()`](https://github.com/apache/lucenenet/blob/03bc03bb5bcb74d12524cc9c0d85e501473a4445/src/Lucene.Net.TestFramework/Analysis/BaseTokenStreamTestCase.cs#L144-L148)),
and that is where we are running into trouble.
Long story short, this circles back to the fact that we converted Java's
`Closeable` interface to `IDisposable` in .NET on `TokenStream`. But
`TokenStream`'s `close()` method has preconditions that must be met, or it will
throw an exception. That exception doesn't play well with code that is trying to
close open `TextReader`s (the sketch below shows how a throwing `Dispose()`
swallows the original failure). Before these finally blocks were put in place to
call `Dispose()`, one failing analysis test would cause nearly all
of them to fail.
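To illustrate the masking effect, here is a minimal, self-contained sketch (the `BrokenStream` type is hypothetical, not Lucene.NET code). When `Dispose()` throws inside the implicit finally block of a `using` statement, it replaces whatever exception the test body was already propagating:
```c#
using System;

public sealed class BrokenStream : IDisposable
{
    // Hypothetical precondition check that fails because the
    // stream was not fully consumed before disposal.
    public void Dispose()
        => throw new InvalidOperationException("End() was not called before Dispose()");
}

public static class Program
{
    public static void Main()
    {
        try
        {
            using (var stream = new BrokenStream())
            {
                throw new Exception("original assertion failure");
            }
        }
        catch (Exception e)
        {
            // Prints the InvalidOperationException from Dispose();
            // the original assertion failure is lost.
            Console.WriteLine(e.Message);
        }
    }
}
```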
Since most of Lucene's tests do not call `close()` in a finally block, I
have long suspected that something automatic is calling these
methods, and after a bit of research, I discovered that
[`Closeable`](https://docs.oracle.com/javase/8/docs/api/java/io/Closeable.html)
extends
[`AutoCloseable`](https://docs.oracle.com/javase/8/docs/api/java/lang/AutoCloseable.html),
which allows `close()` to be invoked implicitly by a try-with-resources
statement. But it is clear this is going to take more research to work out how
Lucene relies on this and the best way to translate it to .NET in a reasonable way.
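For comparison, the closest .NET analogue to try-with-resources is the `using` statement over `IDisposable`. A minimal sketch of the documented Reset/IncrementToken/End/Dispose consume pattern (the `ConsumeExample`/`PrintTokens` names are mine, and `StandardAnalyzer` is just a stand-in):
```c#
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class ConsumeExample
{
    // The using block plays the role that Java's try-with-resources
    // plays for AutoCloseable: Dispose() is called on every exit path.
    public static void PrintTokens(string text)
    {
        using Analyzer analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
        using (TokenStream stream = analyzer.GetTokenStream("field", text))
        {
            ICharTermAttribute termAtt = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();
            while (stream.IncrementToken())
                Console.WriteLine(termAtt.ToString());
            stream.End();
        } // Dispose() is invoked implicitly here, even if an exception was thrown
    }
}
```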
### Some Options to Consider
1. Add a `Close()` method that performs the precondition checks, and use
`Dispose()` only for cleanup.
2. In `Dispose()`, only suppress the finalizer when the preconditions pass,
and rely on the finalizer as a fallback when they fail.
```c#
private bool isValid;

~TokenStream() // Only called if we don't meet our preconditions
{
    Dispose(false);
}

public void Dispose()
{
    Dispose(true);
    GC.SuppressFinalize(this);
}

protected virtual void Dispose(bool disposing)
{
    if (disposing)
    {
        isValid = Validate(out string reason); // Validate() stands in for the precondition checks
        if (!isValid)
            throw new InvalidOperationException(reason); // Exception short-circuits the call to GC.SuppressFinalize()
        DoDispose();
    }
    if (!isValid)
        DoDispose(); // Finalizer path: clean up even though the preconditions failed
}

internal void DoDispose()
{
    // Dispose file handles here...
}
```
Of course, that means we need to wait for the finalizer to execute before
beginning the next test. From the [finalizer
docs](https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/classes-and-structs/finalizers):

> The programmer has no control over when the finalizer is called; the
> garbage collector decides when to call it. The garbage collector checks for
> objects that are no longer being used by the application. If it considers an
> object eligible for finalization, it calls the finalizer (if any) and reclaims
> the memory used to store the object. It's possible to force garbage collection
> by calling Collect, but most of the time, this call should be avoided because
> it may create performance issues.
Options for fixing the tests:
1. Call `GC.Collect()` followed by `GC.WaitForPendingFinalizers()` in a catch
block to ensure the finalizer runs before the next test begins (see the sketch
after this list).
2. Call `DoDispose()` in a catch block to ensure we dispose everything
right away, without waiting for the GC.
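A rough sketch of what option 1 might look like around a test body (the `runAnalysisTest` delegate is a stand-in for the real test code):
```c#
using System;

public static class FinalizerFallbackExample
{
    public static void RunWithCleanup(Action runAnalysisTest)
    {
        try
        {
            runAnalysisTest();
        }
        catch
        {
            // A TokenStream that failed its precondition check never had
            // GC.SuppressFinalize() called, so force a collection and wait
            // for its finalizer to release any open TextReaders before
            // rethrowing and moving on to the next test.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            throw;
        }
    }
}
```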
Since the preconditions are only meant to prevent programming bugs that
aren't likely to make it into a production application, it seems reasonable to
rely on a finalizer as a fallback for this edge case.
### Further Consideration
Given that [Lucene may call `Close()` on a `TokenStream` and then
`Reset()`](https://lucenenet.apache.org/docs/4.8.0-beta00015/api/core/Lucene.Net.Analysis.html#using-the-tokenstream-api)
to start the process all over again, we should separate the concept of
`Close()` from `Dispose()`. Whether we actually need a `Dispose()` method
requires more analysis to figure out whether `TokenStream` is actually
responsible for cleaning up file handles and, if not, where that responsibility
lies. Since a `TokenStream` can effectively "reopen" after `Close()` is called,
it makes sense to at least have a `Close()` method to signal that we are not
actually disposing anything and can still use the `TokenStream` instance, along
the lines of the sketch below.
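A minimal sketch of what that separation might look like (`Validate` and `DoDispose` are the same hypothetical helpers as in option 2 above, and the class name marks this as a sketch, not the actual `TokenStream`):
```c#
using System;

public abstract class TokenStreamSketch : IDisposable
{
    private bool disposed;

    // Close() performs the precondition checks and signals end-of-use,
    // but leaves the instance in a state where Reset() can reopen it.
    public virtual void Close()
    {
        if (!Validate(out string reason))
            throw new InvalidOperationException(reason);
        // Release per-use state here; the stream may be Reset() and reused.
    }

    // Dispose() only does cleanup and never throws, so it is safe to
    // call from finally blocks and using statements.
    public void Dispose()
    {
        if (disposed) return;
        disposed = true;
        DoDispose(); // release file handles, etc.
        GC.SuppressFinalize(this);
    }

    protected abstract bool Validate(out string reason);
    protected abstract void DoDispose();
}
```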
But we definitely need to revert `BaseTokenStreamTestCase` to something closer
to its original form and remove the finally blocks that "correctly consume"
the `TokenStream`, since they are now causing other issues in the tests.