[ https://issues.apache.org/jira/browse/LUCENE-7287?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15346878#comment-15346878 ]
Andriy Rysin commented on LUCENE-7287:
--------------------------------------

Hmm, that does not look right. Yes, we can either use RemoveDuplicatesTokenFilterFactory (we'll have to add that to the UkrainianMorfologikAnalyzer too), or I can rebuild the dictionary to remove the duplicates (probably the preferred way).
The problem is that the dictionary is currently a POS dictionary, so there may be duplicate lemma records as long as their POS tags differ.
I am thinking of filing a new JIRA issue for that and providing a pull request. Does that make sense?

> New lemma-tizer plugin for ukrainian language.
> ----------------------------------------------
>
>                 Key: LUCENE-7287
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7287
>             Project: Lucene - Core
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Dmytro Hambal
>            Priority: Minor
>              Labels: analysis, language, plugin
>             Fix For: master (7.0), 6.2
>
>         Attachments: LUCENE-7287.patch, Screen Shot 2016-06-23 at 8.23.01 PM.png, Screen Shot 2016-06-23 at 8.41.28 PM.png
>
>
> Hi all,
> I wonder whether you are interested in supporting a plugin which provides a mapping between ukrainian word forms and their lemmas. Some tests and docs go out-of-the-box =) .
> https://github.com/mrgambal/elasticsearch-ukrainian-lemmatizer
> It's really simple but still works and generates some value for its users.
> More: https://github.com/elastic/elasticsearch/issues/18303

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)