Hello,

To clarify, this concerns the following GitHub PR and Lucene JIRA ticket, 
where you can read more:

https://github.com/apache/lucenenet/pull/645
https://issues.apache.org/jira/browse/LUCENE-8231

Three years ago, we attempted to port the Nori analysis package from Lucene 
8.2.0 to Lucene.NET 4.8.0 and got everything working except for 6 tests. Three 
of them involve the KoreanNumberFilter, and we are confident we can fix those 
without your assistance.

However, the other 3 tests fail because of changes to the FST implementation 
between Lucene 4.8.0 and 8.2.0:


  1.  TestKoreanTokenizer.testRandomHugeStringsMockGraphAfter()
  2.  TestKoreanTokenizerFactory.testUserDict()
  3.  UserDictionaryTest.testLookup()

We attempted:


  1.  Rebuilding the dictionaries with the Lucene.NET 4.8.0 FST from 
mecab-ko-dic-2.0.3-20170922.tar.gz 
(https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.3-20170922.tar.gz).
  2.  Porting the Lucene 8.2.0 FST into the Lucene.Net.Analysis.Nori project 
and rewiring Nori to be the only project that uses it.

Unfortunately, the former doesn't change the results of the tests, and the 
latter simply fails the check because we are below Lucene 6. With the check 
removed, the FST output is still invalid, which makes it clear that the two 
versions were never meant to be compatible.
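
For anyone who wants to see which FST format version a given dictionary file 
carries, here is a minimal sketch in Java. It assumes the file begins with the 
standard header that CodecUtil.writeHeader produces (a 4-byte magic number, the 
codec name as a length-prefixed UTF-8 string, then a 4-byte version), and that 
FST.save writes that header first. The class name FstHeaderPeek and the file 
path argument are hypothetical; nothing here comes from the Nori sources.

```java
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical diagnostic: reads a Lucene codec header without Lucene on the classpath.
public class FstHeaderPeek {
    // Magic number written by CodecUtil.writeHeader.
    static final int CODEC_MAGIC = 0x3fd76c17;

    static final class Header {
        final String codec;
        final int version;

        Header(String codec, int version) {
            this.codec = codec;
            this.version = version;
        }
    }

    // Reads magic, codec name, and version from the start of the stream.
    static Header readHeader(InputStream in) throws IOException {
        DataInputStream data = new DataInputStream(in);
        int magic = data.readInt(); // big-endian, high byte first
        if (magic != CODEC_MAGIC) {
            throw new IOException("Not a codec header, magic=0x" + Integer.toHexString(magic));
        }
        int length = readVInt(data); // byte length of the codec name
        byte[] name = new byte[length];
        data.readFully(name);
        int version = data.readInt();
        return new Header(new String(name, StandardCharsets.UTF_8), version);
    }

    // Variable-length int: 7 data bits per byte, low-order group first,
    // high bit set on every byte except the last.
    static int readVInt(DataInputStream in) throws IOException {
        int value = 0;
        for (int shift = 0; shift < 35; shift += 7) {
            int b = in.readUnsignedByte();
            value |= (b & 0x7F) << shift;
            if ((b & 0x80) == 0) {
                return value;
            }
        }
        throw new IOException("Malformed vInt");
    }

    public static void main(String[] args) throws IOException {
        if (args.length != 1) {
            System.err.println("usage: FstHeaderPeek <path-to-fst-file>");
            return;
        }
        try (InputStream in = new FileInputStream(args[0])) {
            Header h = readHeader(in);
            System.out.println("codec=" + h.codec + " version=" + h.version);
        }
    }
}
```

Running it against a dictionary file built with each FST implementation would 
show whether the two files report different format versions, which is the 
value the check compares against its supported range.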

Our most current attempt is at: 
https://github.com/NightOwl888/lucenenet/tree/feature/analysis-nori-2

It is far simpler to backport the Nori package to Lucene.NET 4.8.0 than it is 
to upgrade the entire project to at least 7.4.0, so we would appreciate your 
help in making the Nori package compatible with the FST in Lucene 4.8.0.

Some options (there may be more):


  1.  Backport the analyzers-nori package from either the latest Lucene version 
or 8.2.0 to Lucene 4.8.0 (in Java). Once it is functional, we can use it as a 
basis to both port and compare execution to find any bugs on our end.
  2.  Provide us with the high-level info on how the FST package has changed 
between 4.8.0 and 8.2.0 (or the latest version) so we can make the backport. We 
need some sort of map to follow to understand the changes at the binary level.

Note that we maintain a copy of Lucene 4.8.0 for debugging purposes because the 
Maven artifacts are now stale and we had to upgrade them to get the build to 
work: 
https://github.com/NightOwl888/lucene/tree/releases/lucene-solr/4.8.0/updated

The first option would work best for us, primarily because the FST is a bit of 
a puzzle that we haven't dealt with at a high level, but we are willing to 
learn if you can point the way.

Thanks,
Shad Storhaug (NightOwl888)
Project Chairperson - Apache Lucene.NET

From: Roy Hwang <r.hw...@criteo.com.INVALID>
Sent: Saturday, October 15, 2022 12:41 AM
To: dev@lucene.apache.org
Subject: Backporting of Nori

Hello Devs at Lucene, I'm Roy.
Currently we are using the Java implementation, as well as the dotnet 
implementation of Lucene.
We've greatly enjoyed your work in our efforts to expand our app to 
accommodate and analyze multiple languages.
However, I've encountered a roadblock while working with the Korean analyzer.
There is an adequate implementation of the Korean (Nori) analyzer in Java, but 
not for dotnet.
After communicating and working with the developers at Lucene dotnet, it seems 
there have been some changes to the dependencies which have made debugging 
almost impossible with the oldest versions of the Korean Analyzer.
The developers at Lucene dotnet are requesting a backport of the Korean 
Analyzer to Lucene 4.8.0 (Java) to serve as a basis for porting the Java 
implementation to dotnet.
If this version could be provided in a dedicated branch, it would be greatly 
appreciated so that this work can move forward, as many other developers are 
also waiting for this feature to become available in dotnet.

Thank you for your contributions!
Best regards,
Roy Hwang
