NightOwl888 commented on issue #271:
URL: https://github.com/apache/lucenenet/issues/271#issuecomment-973005744
I started by trying to solve the two `TestRandomChains` failures (see
#269), but the work unraveled back to this issue again.
I isolated a specific case that fails in .NET:
```c#
[Test]
public void TestWordDelimiterFilterChain()
{
    // Repro chain: NGramTokenizer -> WordDelimiterFilter -> ShingleFilter
    Analyzer a = Analyzer.NewAnonymous(createComponents: (fieldName, reader) =>
    {
        Tokenizer tokenizer = new NGramTokenizer(LuceneVersion.LUCENE_48, reader, 70, 79);
        TokenStream filter = tokenizer;
        filter = new WordDelimiterFilter(LuceneVersion.LUCENE_48, filter,
            WordDelimiterFlags.CATENATE_NUMBERS | WordDelimiterFlags.SPLIT_ON_CASE_CHANGE,
            new CharArraySet(LuceneVersion.LUCENE_48,
                new HashSet<string> { "htxodcrfoi", "zywcmh", "pkw", "urdvcqr", "gbuskh" },
                ignoreCase: true));
        filter = new ShingleFilter(filter, "EMAIL");
        return new TokenStreamComponents(tokenizer, filter);
    });
    string text = "ufsczdev -[({0, \u20d0\u20f4\u20f4\u20d4\u20e0\u20e4\u20d0\u20ef\u20e7\u20dd\u20f7\u20d2\u20e1 kdoxod \u175b\u1758\u174a pirsk --><scr \ufe5d\ufe51\ufe6c\ufe5d\ufe5e\ufe61 \ud802\udf57 \ua6fb\ua6cd \ufc78\uf998\u06e9\u1b3a \ud802\ude2c\ud802\ude59\ud802\ude22\ud802\ude3b lkpnukew rqniibv \ue990\ud8d0\udfe6\ue6a8[ lmev wzpts \u306a\u304a\u3093\u3080\u3092";
    bool useCharFilter = false;
    bool offsetsAreCorrect = true;
    long seed = 0x21773254b5ac8fa0L;
    Random random = new J2N.Randomizer(seed);
    try
    {
        CheckAnalysisConsistency(random, a, useCharFilter, text, offsetsAreCorrect);
    }
    catch (Exception e) when (e.IsThrowable())
    {
        Console.WriteLine("Exception from random analyzer: " + a);
        throw;
    }
}
```
And after translating it to Java, it still fails:
```java
public void testWordDelimiterFilterChain() throws Throwable {
  Analyzer a = new Analyzer() {
    @Override
    public TokenStreamComponents createComponents(String fieldName, Reader reader) {
      Tokenizer tokenizer = new org.apache.lucene.analysis.ngram.NGramTokenizer(Version.LUCENE_48, reader, 70, 79);
      TokenStream filter = tokenizer;
      filter = new WordDelimiterFilter(Version.LUCENE_48, filter,
          WordDelimiterFilter.CATENATE_NUMBERS | WordDelimiterFilter.SPLIT_ON_CASE_CHANGE,
          new CharArraySet(Version.LUCENE_48,
              asSet("htxodcrfoi", "zywcmh", "pkw", "urdvcqr", "gbuskh"), /*ignoreCase:*/ true));
      filter = new org.apache.lucene.analysis.shingle.ShingleFilter(filter, "EMAIL");
      return new TokenStreamComponents(tokenizer, filter);
    }
  };
  String text = "ufsczdev -[({0, \u20d0\u20f4\u20f4\u20d4\u20e0\u20e4\u20d0\u20ef\u20e7\u20dd\u20f7\u20d2\u20e1 kdoxod \u175b\u1758\u174a pirsk --><scr \ufe5d\ufe51\ufe6c\ufe5d\ufe5e\ufe61 \ud802\udf57 \ua6fb\ua6cd \ufc78\uf998\u06e9\u1b3a \ud802\ude2c\ud802\ude59\ud802\ude22\ud802\ude3b lkpnukew rqniibv \ue990\ud8d0\udfe6\ue6a8[ lmev wzpts \u306a\u304a\u3093\u3080\u3092";
  boolean useCharFilter = false;
  boolean offsetsAreCorrect = false;
  long seed = 0x21773254b5ac8fa0L;
  java.util.Random random = new java.util.Random(seed);
  try {
    checkAnalysisConsistency(random, a, useCharFilter, text, offsetsAreCorrect);
  } catch (Throwable e) {
    System.out.println("Exception from random analyzer: " + a);
    throw e;
  }
}
```
I discovered this is primarily because some flag combinations are not
supported during indexing (see https://stackoverflow.com/a/66478863/), and confirmed
that removing the `WordDelimiterFlags.CATENATE_NUMBERS` flag makes the test
pass, as shown below.
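For reference, the only change to the repro above is dropping that flag from the `WordDelimiterFilter` construction:
```c#
// Same chain as above, minus CATENATE_NUMBERS; with this change
// CheckAnalysisConsistency passes for the same seed and input text.
filter = new WordDelimiterFilter(LuceneVersion.LUCENE_48, filter,
    WordDelimiterFlags.SPLIT_ON_CASE_CHANGE,
    new CharArraySet(LuceneVersion.LUCENE_48,
        new HashSet<string> { "htxodcrfoi", "zywcmh", "pkw", "urdvcqr", "gbuskh" },
        ignoreCase: true));
```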
In `TestRandomChains`, `WordDelimiterFilter` is added to the
`brokenOffsetsConstructors` dictionary. This causes [`offsetsAreCorrect` to be
set to
`false`](https://github.com/apache/lucenenet/blob/03bc03bb5bcb74d12524cc9c0d85e501473a4445/src/Lucene.Net.Tests.Analysis.Common/Analysis/Core/TestRandomChains.cs#L1028),
which
[`BaseTokenStreamTestCase.CheckAnalysisConsistency`](https://github.com/apache/lucenenet/blob/03bc03bb5bcb74d12524cc9c0d85e501473a4445/src/Lucene.Net.TestFramework/Analysis/BaseTokenStreamTestCase.cs#L1014)
is supposed to handle. `BaseTokenStreamTestCase` has been changed from its
original implementation to include try-finally and using blocks (both in
[`CheckAnalysisConsistency()`](https://github.com/apache/lucenenet/blob/03bc03bb5bcb74d12524cc9c0d85e501473a4445/src/Lucene.Net.TestFramework/Analysis/BaseTokenStreamTestCase.cs#L1038)
and in
[`AssertTokenStreamContents()`](https://github.com/apache/lucenenet/blob/03bc03bb5bcb74d12524cc9c0d85e501473a4445/src/Lucene.Net.TestFramework/Analysis/BaseTokenStreamTestCase.cs#L144-L148)),
and that is where we are running into trouble.
Long story short, this circles back to the fact that we converted Java's
`Closeable` interface to `IDisposable` in .NET on `TokenStream`. But
`TokenStream`'s `close()` method has preconditions that must be met, or it will
throw an exception. That exception doesn't play well with code that is trying to
close open `TextReader`s (the sketch below shows how a throwing `Dispose()`
swallows the original failure). Before these finally blocks were put in place to
call `Dispose()`, one failing analysis test would cause nearly all
of them to fail.
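To illustrate the masking effect, here is a minimal, self-contained sketch (the `BrokenStream` type is hypothetical, not Lucene.NET code). When `Dispose()` throws inside the implicit finally block of a `using` statement, it replaces whatever exception the test body was already propagating:
```c#
using System;

public sealed class BrokenStream : IDisposable
{
    // Hypothetical precondition check that fails because the
    // stream was not fully consumed before disposal.
    public void Dispose()
        => throw new InvalidOperationException("End() was not called before Dispose()");
}

public static class Program
{
    public static void Main()
    {
        try
        {
            using (var stream = new BrokenStream())
            {
                throw new Exception("original assertion failure");
            }
        }
        catch (Exception e)
        {
            // Prints the InvalidOperationException from Dispose();
            // the original assertion failure is lost.
            Console.WriteLine(e.Message);
        }
    }
}
```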
Since most of Lucene's tests do not call `close()` in a finally block, I
have long suspected that something automatic is calling these
methods, and after a bit of research, I discovered that
[`Closeable`](https://docs.oracle.com/javase/8/docs/api/java/io/Closeable.html)
extends
[`AutoCloseable`](https://docs.oracle.com/javase/8/docs/api/java/lang/AutoCloseable.html),
which allows `close()` to be invoked implicitly by a try-with-resources
statement. But it is clear this is going to take more research to work out how
Lucene relies on this and the best way to translate it to .NET in a reasonable way.
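For comparison, the closest .NET analogue to try-with-resources is the `using` statement over `IDisposable`. A minimal sketch of the documented Reset/IncrementToken/End/Dispose consume pattern (the `ConsumeExample`/`PrintTokens` names are mine, and `StandardAnalyzer` is just a stand-in):
```c#
using System;
using Lucene.Net.Analysis;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Analysis.TokenAttributes;
using Lucene.Net.Util;

public static class ConsumeExample
{
    // The using block plays the role that Java's try-with-resources
    // plays for AutoCloseable: Dispose() is called on every exit path.
    public static void PrintTokens(string text)
    {
        using Analyzer analyzer = new StandardAnalyzer(LuceneVersion.LUCENE_48);
        using (TokenStream stream = analyzer.GetTokenStream("field", text))
        {
            ICharTermAttribute termAtt = stream.AddAttribute<ICharTermAttribute>();
            stream.Reset();
            while (stream.IncrementToken())
                Console.WriteLine(termAtt.ToString());
            stream.End();
        } // Dispose() is invoked implicitly here, even if an exception was thrown
    }
}
```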
### Some Options to Consider
1. Add a `Close()` method that performs the precondition checks, and use
`Dispose()` only for cleanup.
2. In `Dispose()`, only suppress the finalizer when the preconditions pass,
and rely on the finalizer as a fallback when they fail.
```c#
private bool isValid;

~TokenStream() // Only called if we don't meet our preconditions
{
    Dispose(false);
}

public void Dispose()
{
    Dispose(true);
    GC.SuppressFinalize(this);
}

protected virtual void Dispose(bool disposing)
{
    if (disposing)
    {
        isValid = Validate(out string reason); // Validate() stands in for the precondition checks
        if (!isValid)
            throw new InvalidOperationException(reason); // Exception short-circuits the call to GC.SuppressFinalize()
        DoDispose();
    }
    if (!isValid)
        DoDispose(); // Finalizer path: clean up even though the preconditions failed
}

internal void DoDispose()
{
    // Dispose file handles here...
}
```
Of course, that means we need to wait for the finalizer to execute before
beginning the next test. From the [finalizer
docs](https://docs.microsoft.com/en-us/dotnet/csharp/programming-guide/classes-and-structs/finalizers):

> The programmer has no control over when the finalizer is called; the
> garbage collector decides when to call it. The garbage collector checks for
> objects that are no longer being used by the application. If it considers an
> object eligible for finalization, it calls the finalizer (if any) and reclaims
> the memory used to store the object. It's possible to force garbage collection
> by calling Collect, but most of the time, this call should be avoided because
> it may create performance issues.
Options for fixing the tests:
1. Call `GC.Collect()` followed by `GC.WaitForPendingFinalizers()` in a catch
block to ensure the finalizer runs before the next test begins (see the sketch
after this list).
2. Call `DoDispose()` in a catch block to ensure we dispose everything
right away, without waiting for the GC.
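A rough sketch of what option 1 might look like around a test body (the `runAnalysisTest` delegate is a stand-in for the real test code):
```c#
using System;

public static class FinalizerFallbackExample
{
    public static void RunWithCleanup(Action runAnalysisTest)
    {
        try
        {
            runAnalysisTest();
        }
        catch
        {
            // A TokenStream that failed its precondition check never had
            // GC.SuppressFinalize() called, so force a collection and wait
            // for its finalizer to release any open TextReaders before
            // rethrowing and moving on to the next test.
            GC.Collect();
            GC.WaitForPendingFinalizers();
            throw;
        }
    }
}
```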
Since the preconditions are only meant to prevent programming bugs that
aren't likely to make it into a production application, it seems reasonable to
rely on a finalizer as a fallback for this edge case.
### Further Consideration
Given that [Lucene may call `Close()` on a `TokenStream` and then
`Reset()`](https://lucenenet.apache.org/docs/4.8.0-beta00015/api/core/Lucene.Net.Analysis.html#using-the-tokenstream-api)
to start the process all over again, we should separate the concept of
`Close()` from `Dispose()`. Whether we actually need a `Dispose()` method
requires more analysis to figure out whether `TokenStream` is actually
responsible for cleaning up file handles and, if not, where that responsibility
lies. Since a `TokenStream` can effectively "reopen" after `Close()` is called,
it makes sense to at least have a `Close()` method to signal that we are not
actually disposing anything and can still use the `TokenStream` instance, along
the lines of the sketch below.
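A minimal sketch of what that separation might look like (`Validate` and `DoDispose` are the same hypothetical helpers as in option 2 above, and the class name marks this as a sketch, not the actual `TokenStream`):
```c#
using System;

public abstract class TokenStreamSketch : IDisposable
{
    private bool disposed;

    // Close() performs the precondition checks and signals end-of-use,
    // but leaves the instance in a state where Reset() can reopen it.
    public virtual void Close()
    {
        if (!Validate(out string reason))
            throw new InvalidOperationException(reason);
        // Release per-use state here; the stream may be Reset() and reused.
    }

    // Dispose() only does cleanup and never throws, so it is safe to
    // call from finally blocks and using statements.
    public void Dispose()
    {
        if (disposed) return;
        disposed = true;
        DoDispose(); // release file handles, etc.
        GC.SuppressFinalize(this);
    }

    protected abstract bool Validate(out string reason);
    protected abstract void DoDispose();
}
```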
But we definitely need to revert `BaseTokenStreamTestCase` to something closer
to its original form and remove the finally blocks that "correctly consume"
the `TokenStream`, since they are now causing other issues in the tests.