[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323800#comment-15323800 ]
Andriy Rysin commented on LUCENE-7287:
--------------------------------------

OK, I've imported lucene-solr and the Ukrainian analyzer project from [~mr_gambal] into Eclipse and looked through the code.

Unfortunately we can't use the whole morfologik package as is: it's very specific to Polish. We could probably still use part of morfologik for the compact dictionary representation. The whole Ukrainian dictionary in this format with POS tags is ~1.6MB, compared to 98MB in CSV, and we could probably make it smaller if we strip the tags.

There are several things I'd like to note:

1) This dictionary covers inflections (not related words), so this stemming will produce lemmas rather than root words (this is probably OK, and in some cases even better?)
2) As this is dictionary-based stemming, it won't stem unknown words (but the dictionary contains ~200K lemmas, so it should give good output)
3) Ukrainian is highly inflected (nouns produce up to 7 forms, direct verbs up to 20, reverse verbs up to 30 forms) with many rules and exceptions, so developing quality rule-based stemming will not be trivial
4) I was planning to work on the Ukrainian analyzer in a separate project, but if it's better for the review process I can fork lucene-solr and work inside the fork
5) I am thinking of creating org.apache.lucene.analysis.uk classes based on [~mr_gambal]'s work and the CSV file we have, and once that's working, trying a more compact representation

The question: once we have it working, shall we include the dictionary in the Lucene project or make it an external dependency (as with morfologik-polish.jar)? The first option is simpler, but the second will allow easy updates to the dictionary (which I expect to be actively developed for another year or two) and will also keep the binary blob out of the project. I am leaning towards the second but am open to discussion.

> New lemma-tizer plugin for ukrainian language.
> ----------------------------------------------
>
>          Key: LUCENE-7287
>          URL: https://issues.apache.org/jira/browse/LUCENE-7287
>      Project: Lucene - Core
>   Issue Type: New Feature
>   Components: modules/analysis
>     Reporter: Dmytro Hambal
>     Priority: Minor
>       Labels: analysis, language, plugin
>
> Hi all,
> I wonder whether you are interested in supporting a plugin which provides a
> mapping between ukrainian word forms and their lemmas. Some tests and docs go
> out-of-the-box =) .
> https://github.com/mrgambal/elasticsearch-ukrainian-lemmatizer
> It's really simple but still works and generates some value for its users.
> More: https://github.com/elastic/elasticsearch/issues/18303
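
For readers following points 1 and 2 of the comment, here is a minimal sketch of what dictionary-based lemmatization means in practice: each known word form maps to its lemma, and unknown words pass through unchanged. It is not code from the plugin or from Lucene. The class name DictionaryLemmatizer and the two-column "wordForm,lemma" CSV layout are assumptions made for illustration only, since the actual CSV format is not described in the thread.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical illustration only: not part of Lucene, morfologik, or the plugin.
final class DictionaryLemmatizer {
    private final Map<String, String> formToLemma = new HashMap<>();

    // Assumes each line looks like "wordForm,lemma" (an assumed layout).
    DictionaryLemmatizer(List<String> csvLines) {
        for (String line : csvLines) {
            String[] parts = line.split(",", 2);
            if (parts.length == 2) {
                String form = parts[0].trim().toLowerCase(Locale.ROOT);
                formToLemma.put(form, parts[1].trim());
            }
        }
    }

    // Returns the lemma for a known form; unknown tokens are returned as-is.
    String lemmatize(String token) {
        return formToLemma.getOrDefault(token.toLowerCase(Locale.ROOT), token);
    }

    public static void main(String[] args) {
        DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer(Arrays.asList(
                "книги,книга",
                "книгою,книга",
                "читав,читати"));
        System.out.println(lemmatizer.lemmatize("книгою"));
        System.out.println(lemmatizer.lemmatize("невідомеслово"));
    }
}
```

A plain in-memory map like this is the simple but bulky option; the compact representation discussed in the comment would replace the map with a smaller on-disk structure while keeping the same lookup behaviour.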