Hi,
Actually, there is already an optimized Chinese word segmentation tool in the Lucene.Net.ICU project (https://lucene.apache.org/core/4_8_0/analyzers-icu/index.html), which is still a work in progress. We have Lucene.Net.ICU 100% ported with all tests passing (see https://github.com/NightOwl888/lucenenet/tree/icu4n-migration), but we could definitely use some help finishing the ICU functionality it depends on.
There are still many undecided issues regarding the ICU functionality. For example:
1. Should we use the newly ported ICU4N (https://github.com/NightOwl888/ICU4N) project, or try to add the functionality to the existing icu.net project (https://github.com/sillsdev/icu-dotnet)? Note that the latter has been attempted, but several issues (missing functionality, incompatibilities, problems loading data) make it very challenging to provide all of the Lucene.Net.ICU functionality that way. It was easier to get things working by porting from ICU4J, but that will require maintaining the ICU4N project.
2. If we use ICU4N, should we make it into a general library that benefits the whole .NET ecosystem, or should we limit it to primarily supporting Lucene.NET?
3. If we use ICU4N, how should we best allow the user to load a customized version of the ICU data?
If we make ICU4N into a general library, it would probably be best to contribute it back to the ICU project (http://site.icu-project.org/) so it is maintained and released on the same schedule and documented there, too. Do note that ICU releases very often to keep up with changes to the Unicode standard: we ported ICU4J from version 60.1 (released November 1, 2017), and they just released version 62.1 yesterday (June 20, 2018). So one of the first orders of business would be to upgrade the existing ICU4N features to version 62.1 if we go that route. Also note that we only have about 40% of ICU4J ported, which is just enough to support Lucene.Net.ICU. Several APIs still need to be refactored to fit the .NET paradigm, and there are some gaps in functionality to work out before proceeding with any more porting work.
My hope was to make ICU4N into a first-rate .NET component that adds complete Unicode support to the .NET framework with fully .NET-like APIs. However, we also have the option of limiting the scope of the project to just what is needed to support Lucene.Net.ICU in order to get the 4.8.0 release done quicker. Either way, there is still work to be done to make the APIs of the project consistent if we use ICU4N, and there is quite a bit of missing functionality to add to icu.net if we use that instead. Basically, there are 3 ways to complete this:
1. Add the required functionality to the icu.net project in order to support the Lucene.Net.ICU features, port the missing Lucene.Net.ICU features to the current master branch, and abandon work on ICU4N.
2. Finish up the API and fix 19 failing tests to make ICU4N good enough to support Lucene.Net.ICU, without making it into a first-rate component that supports all ICU features.
3. Contact the ICU team about contributing ICU4N to their repository and, if they agree, allow them to lead the direction of the API and features (with the added possibility of their help and Unicode expertise).
Option #1 would be the lowest-maintenance long-term solution, but I have doubts we can get more than about 50% of the Lucene.Net.ICU features to function if we go that route. Failing that, the preference is to go with option #3 so the whole .NET ecosystem benefits (and contributes) and we will be able to release 100% of the Lucene.Net.ICU functionality. Would you be interested in helping make the word segmentation functionality production-ready, and if so, with which of these options? Let me know, and I will start putting together a prioritized list of incomplete items to get you started.
Thanks,
Shad Storhaug (NightOwl888)

-----Original Message-----
From: 小康 [mailto:[email protected]]
Sent: Thursday, June 21, 2018 9:00 AM
To: [email protected]
Cc: [email protected]
Subject: Re: State / Future of the Lucene.Net Project

I want to add a Chinese word segmentation tool with good performance to lucenenet. I think this will be helpful to Chinese developers. Can I do this job?

2018-06-21 5:19 GMT+08:00 Shad Storhaug <[email protected]>:

> Hello. Thanks for the heads up. For code optimizations, you will need to
> locate the areas that need fixing, patch them, and then submit a separate
> pull request on GitHub for each one. Please provide a small standalone
> piece of code (a console app works great) we can run before and after the
> patch to demonstrate exactly how the fix affects performance.
>
> We will definitely welcome the help.
>
> -----Original Message-----
> From: 小康 [mailto:[email protected]]
> Sent: Wednesday, June 20, 2018 8:03 PM
> To: [email protected]
> Cc: [email protected]
> Subject: Re: State / Future of the Lucene.Net Project
>
> I am willing to contribute to lucene.net because I am creating a vertical
> search engine with lucene.net.
>
> I want to make lucene.net faster and better.
>
> I can make some contributions on weekends.
>
> Thank you.
>
> 2018-05-28 23:48 GMT+08:00 Stefan Bodewig <[email protected]>:
>
> > Hi all
> >
> > it is pretty difficult to write a message like this. I've been one of
> > Lucene.Net's mentors during Apache incubation and even though I never
> > contributed anything significant (at least code-wise) I really care for
> > the project and its community.
> >
> > For more than a year Shad has been the only committer who actually
> > committed to the code base but despite his herculean effort we haven't
> > been able to attract new contributors.
> >
> > Of the project management committee most people seem to be absent by now
> > and the project has rightfully raised concerns by the board [1][2]
> >
> > There really are only two options.
> >
> > * we create a credible plan how to get Lucene.Net back into a healthy
> >   state with multiple contributors and a more active PMC and execute on
> >   it
> >
> > * we start the process of sending the project to the Apache Attic
> >   http://attic.apache.org/ (which is not a one-way road, projects can be
> >   resurrected if a new community emerges).
> >
> > We probably should start with trying the first option. We have tried to
> > find new contributors in the past but haven't been successful, let's give
> > it one more try.
> >
> > What we need are people who are willing to contribute for more than a
> > single pull request or two and who are willing to become members of the
> > developer community here at Apache. If you think this description fits
> > you, please raise your hand :-)
> >
> > Stefan
> >
> > [1] https://lists.apache.org/thread.html/c44ef94020271b3823fe356a255d693a76287c1214743dfc074621de@%3Cdev.lucenenet.apache.org%3E
> > [2] https://lists.apache.org/thread.html/70a34c2cd3298afe02827c219e2dc2b66ae594aabcbaa33265301a44@%3Cdev.lucenenet.apache.org%3E
