[
https://issues.apache.org/jira/browse/SOLR-2764?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13199227#comment-13199227
]
Jan Høydahl commented on SOLR-2764:
-----------------------------------
Looking at words ending in -het and -dom in dictionaries (such as the OOo
nb_NO.dic), the base word has the same meaning in the vast majority of cases.
But of course there will be exceptions. Take the word "brennhet" (het as in
hot): it will be stemmed to "brenn" -> "bren", which is kind of wrong, but since
"bren" is not a valid word it won't cause errors. There may be cases
where the final stem clashes with another word, but no more often than with the
base rules. E.g. the Norwegian surname "Brenna" will be stemmed to
"brenn" by the "-a" rule, which takes it for a fem. definite ending, and then we
get a clash with the verb "brenn" (burn). And the first names "Tore" (boy) and
"Tora" (girl) will be stemmed to "Tor" (boy), which is another valid first name...
My hunch is that the -dom/-het rules do more good than harm, mainly because
in the majority of cases they lead to the base word and the -het/-dom word being
stemmed to the same stem in cases where the "-en/-et/-a/-e/-n" rules are applied
wrongly. Example:
{noformat}
One pass                   Two passes
forlegen        forleg     forlegen        forleg
forlegenhet     forlegen   forlegenhet     forleg
forlegenheten   forlegen   forlegenheten   forleg
forlegenhetens  forlegen   forlegenhetens  forleg
firkantet       firkant    firkantet       firkant
firkantethet    firkantet  firkantethet    firkant
firkantetheten  firkantet  firkantetheten  firkant
{noformat}
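To make the difference between the two strategies concrete, here is a minimal Python sketch. It is not the code from the attached patch: the suffix lists, the `min_stem` length guard, the genitive "-s" handling, and all function names are assumptions chosen only so that the toy reproduces the stems listed above.

```python
# Toy illustration only; NOT the stemmer from the SOLR-2764 patch.
# Suffix lists and length limits are assumptions made for this sketch.
DERIVATIONAL = ("heten", "het", "dom")       # -het/-dom family, longest first
INFLECTIONAL = ("en", "et", "a", "e", "n")   # general endings


def strip_genitive(word):
    """Drop a trailing genitive -s (forlegenhetens -> forlegenheten)."""
    return word[:-1] if len(word) > 4 and word.endswith("s") else word


def strip_first_match(word, suffixes, min_stem=3):
    """Remove the first suffix in `suffixes` that matches, at most once."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


def stem_one_pass(word):
    """One pass: a single lookup over the combined suffix list."""
    return strip_first_match(strip_genitive(word), DERIVATIONAL + INFLECTIONAL)


def stem_two_pass(word):
    """Two passes: strip -het/-dom forms first, then the general endings."""
    word = strip_first_match(strip_genitive(word), DERIVATIONAL)
    return strip_first_match(word, INFLECTIONAL)


if __name__ == "__main__":
    for w in ("forlegen", "forlegenhet", "forlegenheten", "forlegenhetens",
              "firkantet", "firkantethet", "firkantetheten"):
        print(f"{w:<16}{stem_one_pass(w):<11}{stem_two_pass(w)}")
```

With one pass, "forlegen" and "forlegenhet" end up with different stems because the "-en" rule fires only on the base word. With two passes, the "-en" rule fires on both, so the (arguably wrong) stem is at least the same for the whole family.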
But I think the rules -dommer and -dommen should maybe be removed, because the
words "dommer" (judge) and "dommen" (the sentence) are both common words that
legitimately occur as the last part of compounds. So the word "linjedommer"
(linesman) would be stemmed to "linje" (line), which is too aggressive.
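The effect of dropping those two rules can be sketched as follows. This is a hypothetical toy, not the patch: the two rule lists and the helper name are assumptions, and "barndom"/"barndommen" (childhood / the childhood) are extra words picked here only to show the other side of the trade-off.

```python
# Toy comparison of a rule list with and without -dommer/-dommen.
# Rule lists and helper name are assumptions for illustration only.
WITH_DOMMER_RULES = ("dommer", "dommen", "dom")
WITHOUT_DOMMER_RULES = ("dom",)


def strip_first_match(word, suffixes, min_stem=3):
    """Remove the first suffix in `suffixes` that matches, at most once."""
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ("linjedommer", "barndom", "barndommen"):
        print(w,
              strip_first_match(w, WITH_DOMMER_RULES),
              strip_first_match(w, WITHOUT_DOMMER_RULES))
```

In this toy, removing the rules keeps "linjedommer" intact, but "barndommen" is then no longer reduced to the same stem as "barndom" by the -dom rule alone, so the removal is not free.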
I see that it soon gets complicated when trying to be clever. Should we go back
to one pass for the light stemmer? Christian?
> Create a NorwegianLightStemmer and NorwegianMinimalStemmer
> ----------------------------------------------------------
>
> Key: SOLR-2764
> URL: https://issues.apache.org/jira/browse/SOLR-2764
> Project: Solr
> Issue Type: New Feature
> Components: Schema and Analysis
> Reporter: Jan Høydahl
> Fix For: 3.6, 4.0
>
> Attachments: SOLR-2764.patch, SOLR-2764.patch, SOLR-2764.patch,
> SOLR-2764.patch
>
>
> We need a simple light-weight stemmer and a minimal stemmer for
> plural/singular only in Norwegian