[ https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199227#comment-13199227 ]
Jan Høydahl commented on SOLR-2764:
-----------------------------------

When looking at words ending in -het and -dom in dictionaries (such as Ooo nb_NO.dic), the base word has the same meaning in the vast majority of cases. But of course there will be exceptions. Take the word "brennhet" (het as in hot): it will be stemmed to "brenn" -> "bren", which is kind of wrong, but "bren" is not a valid word, so it won't cause errors.

There may be cases where the final stem clashes with another word, but no more often than with the base rules. For example, there is a Norwegian surname "Brenna" which will be stemmed to "brenn" by the "-a" rule, which takes it for a fem. definite ending, and then we get a clash with the verb "brenn" (burn). And the first name "Tore" (boy) or "Tora" (girl) will be stemmed to "Tor" (boy), which is another valid first name...

My hunch is that the -dom/-het rules do more good than harm, mainly because in the majority of cases they lead to the base word and the -het/-dom word getting the same stem, even in cases where the "-en/-et/-a/-e/-n" rules are applied wrongly. Example:

{noformat}
One pass                       Two passes
forlegen        forleg         forlegen        forleg
forlegenhet     forlegen       forlegenhet     forleg
forlegenheten   forlegen       forlegenheten   forleg
forlegenhetens  forlegen       forlegenhetens  forleg
firkantet       firkant        firkantet       firkant
firkantethet    firkantet      firkantethet    firkant
firkantetheten  firkantet      firkantetheten  firkant
{noformat}

But I think maybe the rules -dommer and -dommen should be removed, because the words dommer (judge) and dommen (the sentence) are both common words that also occur as the last part of compounds. So the word "linjedommer" (linesman) would be stemmed to "linje" (line), which is too aggressive.

I see that it soon gets complicated to try to be clever. Should we go back to one pass again for the light stemmer? Christian?
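
For illustration, here is a minimal Python sketch of the one-pass versus two-pass behaviour shown in the table above. The suffix lists, the minimum stem length, and the function names are all assumptions made for this example; they are not the rules from the attached patch, and a real stemmer would need many more endings.

```python
# Illustrative suffix lists only; NOT the rule set from the SOLR-2764 patch.
DERIVATIONAL = ("hetens", "heten", "het", "dommer", "dommen", "dom")
INFLECTIONAL = ("en", "et", "a", "e", "n")
MIN_STEM = 3  # assumed minimum number of characters left after stripping


def strip_longest(word, suffixes):
    """Remove the longest matching suffix, or return the word unchanged."""
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) - len(suffix) >= MIN_STEM:
            return word[: -len(suffix)]
    return word


def stem_one_pass(word):
    """Strip at most one suffix, chosen from all rules together."""
    return strip_longest(word.lower(), DERIVATIONAL + INFLECTIONAL)


def stem_two_passes(word):
    """Strip a -het/-dom style suffix first, then an inflectional ending."""
    base = strip_longest(word.lower(), DERIVATIONAL)
    return strip_longest(base, INFLECTIONAL)


if __name__ == "__main__":
    for w in ("forlegen", "forlegenhet", "firkantethet", "brennhet", "Brenna"):
        print(w, stem_one_pass(w), stem_two_passes(w))
```

With these toy rules, two passes bring "forlegenhet" and "forlegen" to the same stem, while "linjedommer" is cut down to "linje" by the hypothetical -dommer entry, which is the over-stemming concern raised above.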

> Create a NorwegianLightStemmer and NorwegianMinimalStemmer
> ----------------------------------------------------------
>
>                 Key: SOLR-2764
>                 URL: https://issues.apache.org/jira/browse/SOLR-2764
>             Project: Solr
>          Issue Type: New Feature
>          Components: Schema and Analysis
>            Reporter: Jan Høydahl
>             Fix For: 3.6, 4.0
>
>         Attachments: SOLR-2764.patch, SOLR-2764.patch, SOLR-2764.patch, SOLR-2764.patch
>
>
> We need a simple light-weight stemmer and a minimal stemmer for plural/singular only in Norwegian