[GitHub] lucenenet issue #191: [WIP] Migrating Lucene.Net to .NET Core

NightOwl888 Wed, 02 Nov 2016 03:14:45 -0700

Github user NightOwl888 commented on the issue:

https://github.com/apache/lucenenet/pull/191

FYI - There are 2 branches that still need to be integrated before this is
up to date with master:

- https://github.com/NightOwl888/lucenenet/tree/core-tests
- https://github.com/NightOwl888/lucenenet/tree/analysis-stempel

But neither of those affects the `Lucene.Net.Analysis.Common` or the
`Lucene.Net.Expressions` tests, so you are doing the right thing by finding
those problems first.

The core-tests branch adds ~120 missing tests and has several dozen bug
fixes (including fixing the 2 failing tests in `Lucene.Net.Queries` and 32
failing tests in `Lucene.Net.QueryParser`). The tests in `Lucene.Net.Tests` are
now 100% complete (including parts of tests that were previously commented out,
but excluding tests that only apply to JRE/JUnit).

Test Context
====

The test context issues have been "fixed" by creating stub methods to put
the `[Test]` attribute on in the subclasses where the tests are supposed to be
run. You probably already know that part, but there have been some additions
and the new commit where these can be reverted (if needed) is:
4dbc3590361814d13fae64c8d030820eb4987489

That is, if xUnit properly pulls the base class tests into the right
context (if or when it is put in place), this commit can be reverted to
rollback those test method stubs.

LuceneNetSpecificAttribute
====

I added an attribute to apply to tests that were added as part of the port
so they can be filtered (and compared with) the tests from Java Lucene. The
filter can be applied in Visual Studio as `trait:LUCENENET`.

SuppressCodecsAttribute
====

Fixed this attribute to accept an array of strings and added the correct
suppressions to all of the tests in `Lucene.Net.Core`. We still need to fix the
Test Framework to randomize codecs and utilize this attribute to filter them.

ChangeNotes
====

I noticed you have begun [listing some of the questions you have and
unfinished
tasks](https://github.com/conniey/lucenenet/blob/netcoremigration/src/Lucene.Net.Core/ChangeNotes.txt)
about certain areas of the project and I know a few of the answers and have
some comments as well.

> TODO: Confirm HashMap emulates java properly

Although these already seem to be doing the job (tests are passing), we
could raise confidence about this by porting some of the tests over from the
JDK. I have already begun doing this for certain pieces that were ported from
or are meant to emulate parts of Java.

> TODO: Tests need to be written for WeakDictionary

Looks like we [have some tests
already](https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Tests/core/Support/TestWeakDictionary.cs)
from Lucene.Net 3.0.3 that exist in the repo, but haven't yet been added to
the project. I am not sure how well they match up against the JDK tests.

> TODO: Tests need to be written for IdentityDictionary -> Verify behavior

We should probably move this into the Support namespace. AFAIK, the Support
namespace is for pieces that need to be sourced from the JDK and/or close
approximations to what is in the JDK that don't exist in the .NET framework. We
can then port over the applicable tests for IdentityDictionary from the JDK to
verify it works as expected.

> ParallelReader - extra data types, using SortedDictionary in place of
TreeMap. Confirm compatibility. Looks okay, .NET uses a r/b tree just like
Java, and it seems to perform/behave just about the same.

In many cases `SortedDictionary` will function as a suitable replacement
for `TreeMap`. The primary difference is that `TreeMap` allows duplicate keys,
where `SortedDictionary` does not.

That said, it is unlikely any of the tests cover this scenario because they
would have all been written under the assumption that duplicates are allowed
(and therefore the functionality doesn't need to be tested per se, since it is
already covered under JDK tests).

In `Lucene.Net.Grouping` (WIP), `TreeMap` is required to make it work.
There is a suitable TreeMap in the [C5 library](https://github.com/sestoft/C5)
(see [The Working Programmer - .NET Collections: Getting Started with
C5](https://msdn.microsoft.com/en-us/magazine/jj883961.aspx)), which seems to
do the job. I have ported `TreeMap` over to our Support namespace along with
`TreeDictionary` and several other dependencies (as well as the tests).

Do note that C5 currently doesn't have .NET core support, but someone has
recently made a pull request for it. So whether to integrate it into the
Support namespace and get rid of the external dependency or reference it in
specific places is a matter of preference.

> FuzzyQuery - uses java.util.PriorityQueue<T>, which .net does not have.
Using SortedList<TKey, TVal> in it's place, which works, but a) isn't a perfect
replacement (a SortedList<TKey, TVal> doesn't allow duplicate keys, which is
what is sorted, where a PriorityQueue does) and b) it's likely slower than a
PriorityQueue<T>. I can't tell if the PriorityQueue that is defined in
Lucene.Net.Util would work in its place.

There are 2 different `PriorityQueue` types in Lucene.Net:

- Lucene.Net.Util.PriorityQueue
- Lucene.Net.Support.PriorityQueue

The one in the `Util` namespace is an exact match for the one in Lucene.
The PriorityQueue in the `Support` namespace is intended to be an approximate
match for `java.util.PriorityQueue` (it could use some tests to verify that,
though).

FuzzyQuery doesn't have a PriorityQueue, so you are likely referring to the
wrong type where this issue exists.

> Dispose needs to be implemented properly around the entire library. IMO,
that means that Close should be Obsoleted and the code in Close() moved to
Dispose().

AFAIK, `close()` is being universally replaced with `Dispose()` rather than
dealing with the (confusing) redundancy. There are cases where it is required,
though - for example, `TextWriter`. IMO, removing (where applicable) makes more
sense than obsoleting the `Close()` method.

> TODO: NamedThreadFactory.java - Is this needed? What is it for, just for
debugging?

AFAIK, no it is not needed. It is used as a parameter to generate a
concrete `ExecutorService` in Java.

```java
ExecutorService service = new ThreadPoolExecutor(4, 4, 0L,
TimeUnit.MILLISECONDS,
new LinkedBlockingQueue<Runnable>(),
new
NamedThreadFactory("TestIndexSearcher"));
```

It looks like someone has already worked out that `TaskScheduler` is the
rough equivalent in .NET (not confirmed). In all places where this is utilized
in the tests, `TaskScheduler.Default`, or the
[`Lucene.Net.Support.LimitedConcurrencyLevelTaskScheduler`](https://github.com/apache/lucenenet/blob/master/src/Lucene.Net.Core/Support/LimitedConcurrencyLevelTaskScheduler.cs)
(which I grabbed from MSDN) seems to work well enough to get the tests to pass.

While we *could* port the `ThreadPoolExecutor` from the JDK and make it a
concrete `TaskScheduler` so we would have a place to utilize the
`NamedThreadFactory`, it feels like that is a step away from .NETification
rather than just allowing people to use the `TaskScheduler` natively. IMO, it
makes more sense just to comment it out or remove the file from the project
(but not from the repo - it would be best to keep the file around with comments
explaining why it wasn't utilized).

> TODO: LockStressTest.java - Not yet ported.
> TODO: MMapDirectory.java - Port Issues
> TODO: NIOFSDirectory.java - Port Issues

These are now ported and most of the bugs they had were addressed.

Additional Items to Add to the List
-----

**`AlreadyClosedException`** - In .NET, we already have an
`ObjectDisposedException` so this is reinventing the wheel. The `IndexReader`
and `IndexWriter` classes depend heavily on the right exception type being
caught in order to provide the correct response. This may be an issue if a
nested .NET type throws an `ObjectDisposedException` rather than the expected
`AlreadyClosedException`. While we could partially address this by inheriting
`ObjectDisposedException` from `AlreadyClosedException`, and always catch
`ObjectDisposedException`, it still provides a window where consumers could
incorrectly catch `AlreadyClosedException` and later be bitten by an uncaught
`ObjectDisposedException`. We should probably just remove this type and use the
native `ObjectDisposedException`.

**`ICharSequence`** - Many types have replaced the use of `ICharSequence`
with `string`. Its too bad that .NET doesn't have an equivalent interface as
`ICharSequence` that is shared between `string` and `StringBuilder` that could
be utilized. But using only `string` has limited the API to a certain extent. A
few types within Lucene.Net implement `ICharSequence` (`CharTermAttribute`,
`CharBlockArray`, `OpenStringBuilder`, `StringCharSequenceWrapper`). It could
save a step to have overloads for `string`, `ICharSequence`, and
`StringBuilder` in all places that were `ICharSequence` in Lucene rather than
having to call `ToString()` on the latter two.

-----

Is there anything I can do to help you get this up to date?



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

[GitHub] lucenenet issue #191: [WIP] Migrating Lucene.Net to .NET Core

Reply via email to