[ 
https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346878#comment-15346878
 ] 

Andriy Rysin commented on LUCENE-7287:
--------------------------------------

Hmm, that does not look right. Yes, we can either use 
RemoveDuplicatesTokenFilterFactory (we would have to add that to 
UkrainianMorfologikAnalyzer too), or I can rebuild the dictionary to remove 
the duplicates (probably the preferred way).
The problem is that the dictionary is currently a POS dictionary, so there may 
be duplicate lemma records as long as the POS tags are different.
I am thinking of filing a new JIRA issue for that and providing a pull request. 
Does that make sense?
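
To illustrate the duplication, here is a minimal sketch in plain Python. It 
is not the Lucene or Morfologik API: the record layout, the tag strings, and 
the helper names (lookup_lemmas, remove_duplicates, rebuild_lemma_dictionary) 
are assumptions made up for this example. It shows how a POS dictionary can 
return the same lemma twice for one word form, and the two fixes mentioned 
above: dropping duplicates at analysis time, or collapsing the records when 
the dictionary is built.

```python
# Hypothetical POS-dictionary records: (word form, lemma, POS tag).
# The tag strings are placeholders, not the real dictionary's tag set.
POS_DICTIONARY = [
    ("мати", "мати", "noun"),       # "mother"
    ("мати", "мати", "verb"),       # "to have"
    ("книги", "книга", "noun:gen"), # genitive singular of "book"
    ("книги", "книга", "noun:pl"),  # nominative plural of "book"
]


def lookup_lemmas(form, records=POS_DICTIONARY):
    """Return one lemma per matching record, duplicates included."""
    return [lemma for (f, lemma, _tag) in records if f == form]


def remove_duplicates(lemmas):
    """Analysis-time fix: drop repeated lemmas, keeping first-seen order."""
    return list(dict.fromkeys(lemmas))


def rebuild_lemma_dictionary(records):
    """Build-time fix: ignore POS tags and keep unique (form, lemma) pairs."""
    return list(dict.fromkeys((f, lemma) for (f, lemma, _tag) in records))


if __name__ == "__main__":
    raw = lookup_lemmas("мати")
    print(raw)                     # same lemma once per POS record
    print(remove_duplicates(raw))  # one lemma after filtering
    print(rebuild_lemma_dictionary(POS_DICTIONARY))
```

Rebuilding the dictionary moves the cost to build time and keeps the analyzer 
chain unchanged, which is why it looks like the better option to me; the 
filter approach needs no dictionary change but adds a step to every analysis.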

> New lemma-tizer plugin for ukrainian language.
> ----------------------------------------------
>
>                 Key: LUCENE-7287
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7287
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Dmytro Hambal
>            Priority: Minor
>              Labels: analysis, language, plugin
>             Fix For: master (7.0), 6.2
>
>         Attachments: LUCENE-7287.patch, Screen Shot 2016-06-23 at 8.23.01 
> PM.png, Screen Shot 2016-06-23 at 8.41.28 PM.png
>
>
> Hi all,
> I wonder whether you are interested in supporting a plugin which provides a 
> mapping between Ukrainian word forms and their lemmas. Some tests and docs 
> come out of the box =).
> https://github.com/mrgambal/elasticsearch-ukrainian-lemmatizer
> It's really simple but still works and generates some value for its users.
> More: https://github.com/elastic/elasticsearch/issues/18303



