[jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length
[ https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261470#comment-14261470 ]

Massimo Pasquini commented on LUCENE-6138:
------------------------------------------

I take the point about "it's not your work"; after all, that is one of the good things about open source software. I also fully agree with your second point, "these kind of changes need real evaluation work too": that is exactly what I was thinking before opening this issue.

I will try to fix the ItalianLightStemmer myself and see what the results are, but given the wide range of tests needed in this specific domain, it will take a while before I can find enough time for it.

As for this issue, I suppose it may be closed as "won't fix", then.

> ItalianLightStemmer doesn't apply on words shorter then 6 chars in length
> --------------------------------------------------------------------------
>
>                 Key: LUCENE-6138
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6138
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.10.2
>            Reporter: Massimo Pasquini
>            Priority: Minor
>
> I expect a stemmer to transform nouns in their singular and plural forms into
> a shorter common form. The implementation of the ItalianLightStemmer doesn't
> apply any stemming to words shorter than 6 characters in length. This leads
> to some annoying results:
>
>   singular form    | plural form
>
> 4|5 chars in length (no stemming)
>   alga -> alga     | alghe -> alghe
>   fuga -> fuga     | fughe -> fughe
>   lega -> lega     | leghe -> leghe
>
> 5|6 chars in length (stemming only on plural form)
>   vanga -> vanga   | vanghe -> vang
>   verga -> verga   | verghe -> verg
>
> I suppose that this limit on word length is there to avoid side effects
> on other short words not in the set above, but I think something must be
> reviewed in the code for better results.
-- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
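
The behaviour in the issue description quoted above can be reproduced with a small, self-contained sketch. This is not the Lucene source: the class name `LengthGuardSketch`, the constant `MIN_LENGTH`, and the single suffix rule (drop a final vowel and a preceding "h") are assumptions made only to illustrate how a fixed length limit produces the reported table.

```java
// Hypothetical sketch, NOT the actual ItalianLightStemmer code.
// It only illustrates the effect of a fixed minimum-length guard.
class LengthGuardSketch {

    // The issue report states that words shorter than 6 chars are skipped.
    static final int MIN_LENGTH = 6;

    static String stem(String word) {
        // Guard: words below the limit are returned untouched.
        if (word.length() < MIN_LENGTH) {
            return word;
        }
        int end = word.length();
        char last = word.charAt(end - 1);
        // Assumed suffix rule: strip a final a/e/i/o ...
        if (last == 'a' || last == 'e' || last == 'i' || last == 'o') {
            end--;
            // ... and an "h" directly before it (as in "vanghe").
            if (word.charAt(end - 1) == 'h') {
                end--;
            }
        }
        return word.substring(0, end);
    }
}
```

Under these assumptions, `LengthGuardSketch.stem("vanghe")` gives "vang" because the word reaches the limit, while "vanga", "alga" and "alghe" come back unchanged, so singular and plural never meet on a common stem.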
[jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length
[ https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261247#comment-14261247 ]

Erick Erickson commented on LUCENE-6138:
----------------------------------------

Unfortunately I don't know; I've only ever used the stemmers "as is". Sorry I can't help...
[jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length
[ https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261242#comment-14261242 ]

Robert Muir commented on LUCENE-6138:
-------------------------------------

One nice thing about the Porter stemmer is that it does some basic syllable counting instead of using arbitrary character limits like the light stemmers do. This is more complicated than checking a limit, but it is probably the best solution for this problem. It's probably easier to implement this logic in the Snowball language itself (for this language, the logic already exists there), but it could be done in Java too.

However, these kind of changes need real evaluation work too. Currently we don't invent these light stemmers; we just use other people's work.
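
To make the contrast with a fixed character limit concrete, here is a rough Java sketch of a region-based guard in the spirit of the Porter/Snowball R1 region (the part of the word after the first non-vowel that follows a vowel). It is not the Snowball Italian algorithm, which uses further regions and many more rules; the class name `RegionGuardSketch`, the ASCII-only vowel set, and the single suffix rule are assumptions for illustration, and nothing here has been evaluated on real data.

```java
// Hypothetical sketch of a region-based guard; NOT the Snowball Italian stemmer.
class RegionGuardSketch {

    // Simplification: accented vowels are ignored here.
    static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0;
    }

    // Start index of R1: just past the first non-vowel that follows a vowel,
    // or word.length() (an empty region) if there is no such non-vowel.
    static int r1Start(String word) {
        for (int i = 1; i < word.length(); i++) {
            if (!isVowel(word.charAt(i)) && isVowel(word.charAt(i - 1))) {
                return i + 1;
            }
        }
        return word.length();
    }

    static String stem(String word) {
        int limit = r1Start(word);
        int end = word.length();
        // Assumed rule: strip a final a/e/i/o only if it lies inside R1 ...
        if (end > limit && "aeio".indexOf(word.charAt(end - 1)) >= 0) {
            end--;
            // ... and a preceding "h" if that also lies inside R1.
            if (end > limit && word.charAt(end - 1) == 'h') {
                end--;
            }
        }
        return word.substring(0, end);
    }
}
```

With this guard, "alga" and "alghe" both reduce to "alg", and "vanga" and "vanghe" both reduce to "vang", while a very short word such as "re" has an empty R1 and is left alone. Whether such a rule helps or hurts retrieval overall is exactly the kind of question that needs the evaluation work mentioned above.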
[jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length
[ https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261227#comment-14261227 ]

Massimo Pasquini commented on LUCENE-6138:
------------------------------------------

Could you please give me the link to the right place to post issues about the stemmers? I cannot find any link to the project. Thanks
[jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length
[ https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259684#comment-14259684 ]

Erick Erickson commented on LUCENE-6138:
----------------------------------------

Right, the important part of that discussion (I should have pointed it out) was that the stemmers are not part of the Solr code base. They are another project, and that project would be the place to raise possible bugs or submit patches. Quoting from there: "Can you propose your changes to http://members.unine.ch/jacques.savoy/clef/index.html?"

Sorry for the confusion.
[jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length
[ https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259655#comment-14259655 ]

Massimo Pasquini commented on LUCENE-6138:
------------------------------------------

The issue you pointed to concerns a different stemmer, the one for the Russian language. I see no connection to the Italian light stemmer. Given the rules of Italian grammar, I think the bug I found can be fixed (which, from what I read in the other issue, may not be possible for the Russian stemmer). So I suppose the ItalianLightStemmer can evolve into a better implementation: it should be possible to find some simple rules on word suffixes to decide whether to apply stemming to short words (shorter than 6 characters). Note that my thoughts are not based on a deep and accurate study of the problem, but I think it is worth investigating.

I would suggest submitting this issue to the author of the code to see whether he has a better solution (I saw he works in the field of computational linguistics). According to the notes in the source, the algorithm was written in 2005 as the result of some experiments. We don't know whether they have put further effort into investigating the problem; they may have written a better algorithm that they would agree to publish under Lucene's license.

I don't expect the stemmer to be 100% successful, but the issue I pointed out affects an important range of words.
[jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length
[ https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258906#comment-14258906 ]

Erick Erickson commented on LUCENE-6138:
----------------------------------------

I think the discussion at LUCENE-6137 applies here.