[jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length

2014-12-30 Thread Massimo Pasquini (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261470#comment-14261470
 ] 

Massimo Pasquini commented on LUCENE-6138:
--

I take the point about "it's not your work"; after all, this is the good thing 
about open source software. I also definitely agree with your second point, 
"these kinds of changes need real evaluation work too": that's exactly what I 
was thinking before opening this issue. 
I think I will try to fix the ItalianLightStemmer myself and see what the 
results are, but considering the wide range of tests necessary in this 
specific domain, it will take a while before I can find enough time to do it.
As for this issue, I suppose it may be closed as "won't fix" then.

> ItalianLightStemmer doesn't apply on words shorter then 6 chars in length
> -
>
>                 Key: LUCENE-6138
>                 URL: https://issues.apache.org/jira/browse/LUCENE-6138
>             Project: Lucene - Core
>          Issue Type: Bug
>          Components: modules/analysis
>    Affects Versions: 4.10.2
>            Reporter: Massimo Pasquini
>            Priority: Minor
>
> I expect a stemmer to reduce the singular and plural forms of a noun to a 
> shorter common form. The ItalianLightStemmer implementation doesn't apply 
> any stemming to words shorter than 6 characters. This leads to some 
> annoying results:
>
> 4 | 5 chars in length (no stemming on either form):
>   singular form  | plural form
>   alga -> alga   | alghe -> alghe
>   fuga -> fuga   | fughe -> fughe
>   lega -> lega   | leghe -> leghe
>
> 5 | 6 chars in length (stemming only on the plural form):
>   singular form  | plural form
>   vanga -> vanga | vanghe -> vang
>   verga -> verga | verghe -> verg
>
> I suppose that this limit on word length is meant to avoid side effects on 
> other short words not in the set above, but I think the code should be 
> reviewed to get better results.
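
The behaviour reported above can be reproduced with a small standalone model. This is only a sketch, not the Lucene source: the class name, the `MIN_LENGTH` constant and the two suffix rules are assumptions chosen to match the examples in the report.

```java
// Illustrative model only, NOT the Lucene implementation.
// MIN_LENGTH and the suffix rules are assumptions taken from the report.
final class LengthGuardSketch {
    static final int MIN_LENGTH = 6; // assumed threshold: shorter words are skipped

    static String stem(String word) {
        if (word.length() < MIN_LENGTH) {
            return word; // short words are returned untouched
        }
        if (word.endsWith("he")) {
            // plural in -he: drop the vowel and the 'h' (vanghe -> vang)
            return word.substring(0, word.length() - 2);
        }
        char last = word.charAt(word.length() - 1);
        if (last == 'a' || last == 'e' || last == 'i' || last == 'o') {
            return word.substring(0, word.length() - 1); // drop the final vowel
        }
        return word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"alga", "alghe", "vanga", "vanghe"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

With a hard length limit, "vanga" (5 characters) and "vanghe" (6 characters) fall on opposite sides of the threshold, so the singular and plural forms end up with different stems.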



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length

2014-12-30 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261247#comment-14261247
 ] 

Erick Erickson commented on LUCENE-6138:


Unfortunately, I don't quite know; I've used the stemmers "as is".

Sorry I can't help...




[jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length

2014-12-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261242#comment-14261242
 ] 

Robert Muir commented on LUCENE-6138:
-

One nice thing about the Porter stemmer is that it does some basic syllable 
counting instead of using arbitrary character limits like the light 
stemmers do. This is more complicated than checking a limit, but it is probably 
the best solution for this problem. It's probably easier to implement this 
logic in the Snowball language itself (for this language, the logic already 
exists there), but it could be done in Java too. However, these kinds of 
changes need real evaluation work too. Currently we don't invent these light 
stemmers; we just use other people's work.
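
As a rough illustration of the syllable-counting idea, the sketch below replaces the length limit with a Porter-style "measure" (the number of vowel-to-consonant transitions in the candidate stem). It is a simplified assumption, not the Porter or Snowball Italian algorithm: the class, the vowel set (unaccented `aeiou` only) and the suffix handling are hypothetical and have not been evaluated on real text.

```java
// Hypothetical sketch: guard suffix stripping with a Porter-style measure
// instead of a fixed character limit. Not the Snowball Italian algorithm.
final class SyllableGuardSketch {
    static boolean isVowel(char c) {
        return "aeiou".indexOf(c) >= 0; // simplification: accented vowels ignored
    }

    // Counts vowel-to-consonant transitions in the first len characters,
    // i.e. the number of VC sequences (Porter's measure m).
    static int measure(String s, int len) {
        int m = 0;
        boolean prevVowel = false;
        for (int i = 0; i < len; i++) {
            boolean vowel = isVowel(s.charAt(i));
            if (prevVowel && !vowel) {
                m++;
            }
            prevVowel = vowel;
        }
        return m;
    }

    static String stem(String word) {
        int len = word.length();
        if (len < 2 || !isVowel(word.charAt(len - 1))) {
            return word;
        }
        char last = word.charAt(len - 1);
        int stemLen = len - 1; // drop the final vowel
        if ((last == 'e' || last == 'i') && word.charAt(stemLen - 1) == 'h') {
            stemLen--; // -he / -hi plurals: drop the 'h' as well
        }
        // Strip only if the remaining stem contains at least one VC sequence.
        return measure(word, stemLen) >= 1 ? word.substring(0, stemLen) : word;
    }

    public static void main(String[] args) {
        for (String w : new String[] {"alga", "alghe", "vanga", "vanghe", "tre"}) {
            System.out.println(w + " -> " + stem(w));
        }
    }
}
```

Under these assumptions "alga" and "alghe" both reduce to "alg", while a word such as "tre" is left alone because its candidate stem "tr" has measure 0. Whether such a rule helps or hurts retrieval would still need the evaluation work mentioned above.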






[jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length

2014-12-30 Thread Massimo Pasquini (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14261227#comment-14261227
 ] 

Massimo Pasquini commented on LUCENE-6138:
--

Could you please give me the link to the right place to post issues about the 
stemmers? I cannot find any link to the project. Thanks




[jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length

2014-12-28 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259684#comment-14259684
 ] 

Erick Erickson commented on LUCENE-6138:


Right, the important part of that discussion (I should have pointed it out) was 
that the stemmers are not part of the Solr code base. They come from another 
project, and that project would be the place to raise possible bugs or submit 
patches:

bq: Can you propose your changes to 
http://members.unine.ch/jacques.savoy/clef/index.html?

Sorry for the confusion.






[jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length

2014-12-28 Thread Massimo Pasquini (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14259655#comment-14259655
 ] 

Massimo Pasquini commented on LUCENE-6138:
--

The issue you pointed to concerns a different stemmer, the one for the Russian 
language. I see no connection to the Italian light stemmer. Given the rules of 
Italian grammar, I think the bug I found can be fixed (which may not be 
possible in the Russian stemmer, according to what I've read in the other 
post).

So I suppose the ItalianLightStemmer can evolve into a better implementation: 
it is possible to find some simple rules on word suffixes to decide whether to 
apply stemming to short words (shorter than 6 characters).
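
One way such suffix rules for short words might look is sketched below. Everything in it is a hypothetical illustration: the suffix list (-ga/-ghe, -ca/-che), the `MIN_LENGTH` and `MIN_STEM` constants and the class itself are assumptions made for the examples in this issue, not rules taken from the stemmer or validated against Italian text.

```java
// Hypothetical suffix rules for words below the length limit.
// The suffix list and both constants are illustrative assumptions only.
final class ShortWordSuffixSketch {
    static final int MIN_LENGTH = 6; // words this long go to the regular stemmer
    static final int MIN_STEM = 3;   // never leave a stem shorter than this

    // Number of trailing characters to drop, or 0 if no rule applies.
    static int cutFor(String w) {
        if (w.endsWith("ghe") || w.endsWith("che")) {
            return 2; // plural: drop "he"
        }
        if (w.endsWith("ga") || w.endsWith("ca")) {
            return 1; // singular: drop "a"
        }
        return 0;
    }

    static String stemShort(String w) {
        if (w.length() >= MIN_LENGTH) {
            return w; // not a short word: out of scope for this sketch
        }
        int cut = cutFor(w);
        if (cut == 0 || w.length() - cut < MIN_STEM) {
            return w; // no rule matched, or the stem would be too short
        }
        return w.substring(0, w.length() - cut);
    }

    public static void main(String[] args) {
        for (String w : new String[] {"alga", "alghe", "lega", "leghe", "che"}) {
            System.out.println(w + " -> " + stemShort(w));
        }
    }
}
```

In this sketch "lega" and "leghe" both become "leg", while "che" is left unchanged because stripping would leave a single character.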

Note that my thoughts are not based on a deep and accurate study of the 
problem, though. But I think it would be worth investigating. I would suggest 
submitting this issue to the author of the code to see if he has a better 
solution (I saw he works in the field of computational linguistics). According 
to the notes in the source, the algorithm was written in 2005 as the result of 
some experiments. We don't know whether they have put further effort into 
investigating the problem, or whether they have since written a better 
algorithm that they would agree to publish under Lucene's license.

I don't expect the stemmer to be 100% successful, but the issue I pointed out 
affects an important range of words.




[jira] [Commented] (LUCENE-6138) ItalianLightStemmer doesn't apply on words shorter then 6 chars in length

2014-12-25 Thread Erick Erickson (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-6138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14258906#comment-14258906
 ] 

Erick Erickson commented on LUCENE-6138:


I think the discussion at LUCENE-6137 applies here.
