NightOwl888 commented on issue #271:
URL: https://github.com/apache/lucenenet/issues/271#issuecomment-2509096553
Thanks so much for the detailed analysis, Paul.
So, what I gather, you are saying a disposable class would be injected like
this?
```c#
using var someDisposable = new SomeDisposable();
Analyzer analyzer = Analyzer.NewAnonymous(createComponents: (fieldName,
reader) =>
{
var src = new StandardTokenizer(m_matchVersion, reader);
src.MaxTokenLength = 150;
TokenStream tok = new StandardFilter(m_matchVersion, src);
tok = new LowerCaseFilter(m_matchVersion, tok);
tok = new StopFilter(m_matchVersion, tok, m_stopwords);
tok = new MyFilter(someDisposable, tok);
return new TokenStreamComponents(src, tok);
});
```
Alternatively, the user would need to implement a whole Analyzer subclass,
but they could pass the disposable instance into the constructor to accomplish
the same thing. I have to admit, doing it that way hadn't crossed my mind, but
I am not sure it covers everything we need to fix. It may address all usability
issues, though.
The main motivation for this issue is to fix the [`BaseTokenStreamTestCase`
class](https://github.com/apache/lucenenet/blob/abe75de14adb010fd4ac2b614cf5af2b8157e22b/src/Lucene.Net.TestFramework/Analysis/BaseTokenStreamTestCase.cs#L389-L400).
We have added hacks to prevent cascade failures from occurring with the
original design. But this has bled over into other areas, such as the two
`TestRandomChains` test failures. IIRC, both of these things require the state
of `ILLEGAL_STATE_READER` to be carefully managed.
There were 3 main issues we ran into with `BaseTokenStreamTestCase`:
1. If an assert happened, it could leave the `TokenStream` in an
inconsistent state because it would fail to fulfill the [TokenStream
contract](https://lucenenet.apache.org/docs/4.8.0-beta00017/api/core/Lucene.Net.Analysis.TokenStream.html).
This would in turn cause all of the remaining tests in the same test class to
fail because the `TokenStream` is in an invalid state each time it is reused.
2. If a `TokenStream` class owns a file handle and we skip the call
`Dispose()` on it, other tests that use the same file would get a "File in use
by another process" exception. This is probably not an issue in our tests since
we embedded the data files as resources into the assemblies (which means they
open using a managed stream instance), but I see no reason to impose a
limitation of strictly using embedded resources for our users.
3. If a `TokenStream` class owns a native resource handle (as was once the
case when we were using icu-dotnet), there is also no way to clean it up using
Close() since it needs to happen after all of the reuse is done.
It was the 3rd problem that made me open this issue, as using icu-dotnet
with Lucene.NET was completely impossible without having a `Dispose()` method
that was only called once at the end of the process. The above solution would
have been very ugly to hand our users - imagine having to [instruct users to
`Init()` and
`Cleanup()`](https://github.com/sillsdev/icu-dotnet?tab=readme-ov-file#usage)
in every piece of code that uses a particular analyzer implementation (at one
point, those were required). And what if a future implementation we own
requires us to call something disposable?
My thought was to separate the concepts of `Close()` from `Dispsose()` so
the test framework could clean up a reusable `TokenStream` instance without
modifying the `ILLEGAL_STATE_READER` so it doesn't cause all of the other tests
in the class to fail along with it. But, IIRC there has never been an attempt
to make a failure happen in Lucene to see if it had the same cascade failures
or if there is some other hook that is supposed to clean up this state. Having
those cascade failures made finding the test that was actually failing very
hard. If Lucene does not have cascade failures, I am curious as to why they
don't.
> Of course, we have changed a lot of things since those hacks were put in
place, such as fixing the error handling to work the same way it did in Lucene
and switching from external resource files to embedded resources.
A good proof of concept would be to ensure the `TestRandomChains` test
failures can be patched with the proposed fix. I would be interested to know
1. Can we make those pass without having disposable analysis components?
2. Can we make the tests fail independently without getting cascade
failures? Even on `TextReader` implementations that own file handles?
It is that last condition that made me think there was no way out except for
having disposable token stream components, since the original configuration
would always leave the file handle open upon test failure.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]