[ https://issues.apache.org/jira/browse/LUCENE-3883?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13233159#comment-13233159 ]
Robert Muir commented on LUCENE-3883:
-------------------------------------

Thanks for updating the patch, Jim!

One concern from some very rudimentary testing: we have special lowercasing for situations like nAthair -> n-athair, and the snowball rules then strip the prefix:

{noformat}
define initial_morph as (
  [substring] among (
    'h-' 'n-' 't-' //nAthair -> n-athair, but alone are problematic
    (delete)
{noformat}

The problem is that if the input already arrives as n-athair, the Unicode break rules split it on the hyphen into two tokens {n, athair}. You can visualize this at http://unicode.org/cldr/utility/breaks.jsp

This means we could add many spurious 'n' tokens to the index...

So we have two potential solutions:
# We can simply add 'n', 'h', 't', etc. to the stopwords list. This is the simplest solution. Would it be too aggressive?
# We can add a CharFilter for IrishAnalyzer to prevent this splitting from happening. This is more complex.

> Analysis for Irish
> ------------------
>
>                 Key: LUCENE-3883
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3883
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: modules/analysis
>            Reporter: Jim Regan
>            Priority: Trivial
>              Labels: analysis, newbie
>         Attachments: LUCENE-3883.patch, irish.sbl
>
>
> Adds analysis for Irish.
> The stemmer is generated from a snowball stemmer. I've sent it to Martin
> Porter, who says it will be added during the week.
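
To make the tokenization concern in the comment above concrete, here is a minimal, self-contained sketch in plain Java. It does not use any Lucene classes: `naiveTokenize` is only a rough stand-in for the Unicode break rules (it splits on every non-letter, hyphen included), and the class name, method names, and `PREFIX_STOPWORDS` set are all hypothetical, introduced for illustration. `dropPrefixTokens` approximates the stopword option, while `tokenizeKeepingPrefixes` is a loose analogue of the "prevent the split" option, done at the tokenizer level rather than as an actual CharFilter.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class IrishHyphenSketch {
    // Hypothetical stop set for option 1: the bare prefixes left behind by the split.
    static final Set<String> PREFIX_STOPWORDS = Set.of("h", "n", "t");

    // Option 2 analogue: an h-/n-/t- prefix at a word start stays attached
    // to the letters that follow; anything else is a plain run of letters.
    private static final Pattern KEEP_PREFIX =
            Pattern.compile("(?<!\\p{L})[hnt]-\\p{L}+|\\p{L}+");

    // Stand-in for the Unicode break rules (an assumption, not the real
    // algorithm): split on every run of non-letter characters.
    static List<String> naiveTokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String piece : text.split("[^\\p{L}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Option 1: treat the bare prefixes as stopwords and drop them.
    static List<String> dropPrefixTokens(List<String> tokens) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!PREFIX_STOPWORDS.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    // Option 2 analogue: tokenize so that "n-athair" survives as one token,
    // leaving the prefix for the stemmer to strip later.
    static List<String> tokenizeKeepingPrefixes(String text) {
        List<String> tokens = new ArrayList<>();
        Matcher m = KEEP_PREFIX.matcher(text);
        while (m.find()) {
            tokens.add(m.group());
        }
        return tokens;
    }

    public static void main(String[] args) {
        String input = "a n-athair";
        System.out.println(naiveTokenize(input));                   // [a, n, athair]
        System.out.println(dropPrefixTokens(naiveTokenize(input))); // [a, athair]
        System.out.println(tokenizeKeepingPrefixes(input));         // [a, n-athair]
    }
}
```

The sketch also hints at why the stopword route might be "too aggressive": `dropPrefixTokens` removes every standalone 'h', 'n', or 't' token, whether or not it came from a hyphenated prefix.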