Hello,

To clarify, this is regarding the GitHub PR and Lucene JIRA ticket, where you can read more info:
https://github.com/apache/lucenenet/pull/645
https://issues.apache.org/jira/browse/LUCENE-8231

We attempted to port the Nori analysis package from Lucene 8.2.0 to Lucene.NET 4.8.0 three years ago and got it all to work except for 6 tests. Three of them deal with the KoreanNumberFilter, and we are certain we can fix those without your assistance. However, the other 3 tests fail due to changes to the FST implementation between Lucene 4.8.0 and 8.2.0:

1. TestKoreanTokenizer.testRandomHugeStringsMockGraphAfter()
2. TestKoreanTokenizerFactory.testUserDict()
3. UserDictionaryTest.TstLookup()

We attempted:

1. Rebuilding the dictionaries with the Lucene.NET 4.8.0 FST using mecab-ko-dic-2.0.3-20170922.tar.gz <https://bitbucket.org/eunjeon/mecab-ko-dic/downloads/mecab-ko-dic-2.0.3-20170922.tar.gz>.
2. Porting the Lucene 8.2.0 FST over to the Lucene.Net.Analysis.Nori project and rewiring Nori to be the only project that uses it.

Unfortunately, the former doesn't change the results of the tests, and the latter simply fails the check because we are below Lucene 6. With the check removed, the FST output is still invalid, but it is clear that it was never supposed to be compatible.

Our most current attempt is at:
https://github.com/NightOwl888/lucenenet/tree/feature/analysis-nori-2

It is far simpler to backport the Nori package to Lucene.NET 4.8.0 than it is to upgrade the entire project to at least 7.4.0. So, we could use your assistance in converting the Nori package to be compatible with the FST in Lucene 4.8.0.

Some options (there may be more):

1. Backport the analyzers-nori package from either the latest Lucene version or 8.2.0 to Lucene 4.8.0 (in Java). Once it is functional, we can use it as a basis both to port from and to compare execution against in order to find any bugs on our end.
2. Provide us with high-level info on how the FST package has changed between 4.8.0 and 8.2.0 (or the latest version) so we can make the backport.
We need some sort of map to follow to understand the changes at the binary level.

Note that we maintain a copy of Lucene 4.8.0 for debugging purposes because the Maven artifacts are now stale and we had to upgrade them to get the build to work:
https://github.com/NightOwl888/lucene/tree/releases/lucene-solr/4.8.0/updated

The first option would work best for us, primarily because FST is a bit of a puzzle that we haven't dealt with at a high level, but we are willing to learn if you can point the way.

Thanks,
Shad Storhaug (NightOwl888)
Project Chairperson - Apache Lucene.NET

From: Roy Hwang <r.hw...@criteo.com.INVALID>
Sent: Saturday, October 15, 2022 12:41 AM
To: dev@lucene.apache.org
Subject: Backporting of Nori

Hello Devs at Lucene,

I'm Roy. We currently use both the Java implementation and the dotnet implementation of Lucene. We've greatly enjoyed your work in our efforts to expand our app to accommodate and analyze multiple languages.

However, I've encountered a roadblock while working with the Korean analyzer. There is an adequate implementation of the Korean (Nori) analyzer in Java, but not for dotnet. From communicating and working with the developers at Lucene dotnet, it seems there have been some changes to the dependencies which have made debugging almost impossible with the oldest versions of the Korean Analyzer.

The developers at Lucene dotnet are requesting a backport of the Korean Analyzer to Java 4.8.0 to serve as a basis for porting the Java implementation to dotnet. If this version could be provided in a dedicated branch, it would be greatly appreciated so that this work can move forward, as many other developers are also anticipating this feature being made available in dotnet.

Thank you for your contributions!

Best regards,
Roy Hwang