[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15323800#comment-15323800
 ] 

Andriy Rysin commented on LUCENE-7287:
--------------------------------------

Ok, I've imported lucene-solr and the Ukrainian analyzer project from 
[~mr_gambal] into Eclipse and looked through the code.
Unfortunately we can't use the whole morfologik package as is: it's very 
specific to Polish. We could probably still use part of morfologik for a 
compact dictionary representation. The whole Ukrainian dictionary in this 
format with POS tags is ~1.6MB compared to 98MB in csv, and we could probably 
make it smaller if we strip the tags.
There are several things I'd like to note:
1) this dictionary covers inflections (not related words), so this stemming 
will produce lemmas rather than root words (this is probably ok and in some 
cases even better?)
2) as this is dictionary-based stemming, it won't stem unknown words (but the 
dictionary contains ~200K lemmas, so it should give good output)
3) as Ukrainian is highly inflected (nouns produce up to 7 forms, direct verbs 
up to 20, reflexive verbs up to 30 forms) with many rules and exceptions, 
developing quality rule-based stemming will not be trivial
4) I was planning to work on the Ukrainian analyzer in a separate project, but 
if it's better for the review process I can fork lucene-solr and work inside 
the fork
5) I am thinking of creating org.apache.lucene.analysis.uk classes based on 
[~mr_gambal]'s work and the csv file we have and, once that's working, trying 
a more compact representation
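
To make points 1) and 2) concrete, here is a minimal sketch of dictionary-based 
lemmatization in plain Java. It is only an illustration: the class name 
DictionaryLemmatizer, the fromCsv helper and the "form,lemma" line layout are 
all assumptions of mine, not the actual csv format and not code from Lucene, 
morfologik or [~mr_gambal]'s project. Known forms map to their lemma, and 
unknown words pass through unchanged.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

/** Hypothetical sketch; not part of Lucene or the plugin under discussion. */
final class DictionaryLemmatizer {
    // Inflected form -> lemma. A real dictionary would hold far more entries
    // and would likely use a compact automaton instead of a HashMap.
    private final Map<String, String> formToLemma = new HashMap<>();

    /** Builds the table from lines ASSUMED to look like "form,lemma". */
    static DictionaryLemmatizer fromCsv(List<String> lines) {
        DictionaryLemmatizer lemmatizer = new DictionaryLemmatizer();
        for (String line : lines) {
            String[] columns = line.split(",", 2);
            if (columns.length == 2) {
                lemmatizer.formToLemma.put(normalize(columns[0]), columns[1].trim());
            }
        }
        return lemmatizer;
    }

    /** Returns the lemma for a known form, or the token itself if unknown. */
    String lemmatize(String token) {
        return formToLemma.getOrDefault(normalize(token), token);
    }

    private static String normalize(String s) {
        return s.trim().toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        DictionaryLemmatizer lemmatizer = fromCsv(List.of(
                "книги,книга",
                "книгу,книга"));
        System.out.println(lemmatizer.lemmatize("Книги"));   // known form
        System.out.println(lemmatizer.lemmatize("гуглити")); // not in the dictionary
    }
}
```

The in-memory map is the simple but large variant; the compact representation 
mentioned above would replace the HashMap without changing the lookup contract.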

The question: once we have it working, shall we include the dictionary in the 
lucene project or make it an external dependency (like with 
morfologik-polish.jar)? The first is simpler, but the second will allow easy 
updates to the dictionary (which I can see being actively developed for 
another year or two) and will also keep the binary blob out of the project. I 
am leaning towards the second but am open to discussion.



> New lemma-tizer plugin for ukrainian language.
> ----------------------------------------------
>
>                 Key: LUCENE-7287
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7287
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Dmytro Hambal
>            Priority: Minor
>              Labels: analysis, language, plugin
>
> Hi all,
> I wonder whether you are interested in supporting a plugin which provides a 
> mapping between ukrainian word forms and their lemmas. Some tests and docs go 
> out-of-the-box =) .
> https://github.com/mrgambal/elasticsearch-ukrainian-lemmatizer
> It's really simple but still works and generates some value for its users.
> More: https://github.com/elastic/elasticsearch/issues/18303



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
