Re: Bad fieldNorm when using morphologic synonyms
I attached a patch to the JIRA issue. Reviews are welcome.

On Thu, Dec 19, 2013 at 7:24 PM, Isaac Hebsh isaac.he...@gmail.com wrote:
> Roman, do you have any results? I created SOLR-5561. Robert, if I'm wrong, you are welcome to close that issue.
Re: Bad fieldNorm when using morphologic synonyms
Roman, do you have any results? I created SOLR-5561. Robert, if I'm wrong, you are welcome to close that issue.

On Mon, Dec 9, 2013 at 10:50 PM, Isaac Hebsh isaac.he...@gmail.com wrote:
> You can see the norm value in the explain text when setting debugQuery=true. If the same item gets a different norm before and after, that's it. Note that this configuration is in schema.xml (not solrconfig.xml).
Re: Bad fieldNorm when using morphologic synonyms
In order to set discountOverlaps to true you must have added <similarity class="solr.DefaultSimilarityFactory"/> to schema.xml, which is commented out by default! Since this param is false by default, the situation above is expected even with correct positioning, as I said. To fix the field norms you'd have to reindex with the similarity class that initializes the param to true.

Cheers,
Manu
Re: Bad fieldNorm when using morphologic synonyms
No, it's turned on by default in the default similarity. As I said, all that is necessary is to fix your analyzer to emit the proper position increments.

On Mon, Dec 9, 2013 at 12:27 PM, Manuel Le Normand manuel.lenorm...@gmail.com wrote:
> In order to set discountOverlaps to true you must have added <similarity class="solr.DefaultSimilarityFactory"/> to schema.xml, which is commented out by default! Since this param is false by default, the situation above is expected even with correct positioning, as I said. To fix the field norms you'd have to reindex with the similarity class that initializes the param to true.
>
> Cheers,
> Manu
Re: Bad fieldNorm when using morphologic synonyms
Hi Robert and Manuel.

The DefaultSimilarity indeed sets discountOverlaps to true by default. BUT the *factory*, aka DefaultSimilarityFactory, when called by IndexSchema (the getSimilarity method), explicitly sets this value to the value of its corresponding class member. This class member is initialized to FALSE when the instance is created (like every boolean field in Java). It should be set when the init method is called. If the parameter is not set in schema.xml, the default is true.

Everything seems to be all right, but the issue is that the init method is NOT called if the similarity is not *explicitly* declared in schema.xml. In that case, the discountOverlaps member (of the factory class) remains FALSE, and getSimilarity explicitly calls setDiscountOverlaps with a value of FALSE.

This is very easy to reproduce and debug.

On Mon, Dec 9, 2013 at 9:19 PM, Robert Muir rcm...@gmail.com wrote:
> No, it's turned on by default in the default similarity. As I said, all that is necessary is to fix your analyzer to emit the proper position increments.
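
The initialization path described in this message can be sketched in plain Java. This is a simplified illustration under stated assumptions, not the Solr source: `SketchSimilarity`, `SketchSimilarityFactory`, and the `Map`-based `init` signature are hypothetical stand-ins that only mirror the names used in the thread.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical stand-in for the similarity: its own default is true.
class SketchSimilarity {
    private boolean discountOverlaps = true;

    void setDiscountOverlaps(boolean value) { discountOverlaps = value; }
    boolean getDiscountOverlaps() { return discountOverlaps; }
}

// Hypothetical stand-in for the factory described in the message.
class SketchSimilarityFactory {
    // Java initializes a boolean field to false.
    private boolean discountOverlaps;

    // Assumed to run only when the similarity is declared explicitly in the schema.
    void init(Map<String, String> params) {
        discountOverlaps =
            Boolean.parseBoolean(params.getOrDefault("discountOverlaps", "true"));
    }

    SketchSimilarity getSimilarity() {
        SketchSimilarity sim = new SketchSimilarity();
        // Overwrites the similarity's own default with the factory field.
        sim.setDiscountOverlaps(discountOverlaps);
        return sim;
    }
}

class FactoryInitDemo {
    public static void main(String[] args) {
        // Implicit case: init() is never called, so the field stays false.
        SketchSimilarityFactory implicit = new SketchSimilarityFactory();
        System.out.println(implicit.getSimilarity().getDiscountOverlaps()); // false

        // Explicit case: init() runs and applies the documented default.
        SketchSimilarityFactory explicit = new SketchSimilarityFactory();
        explicit.init(new HashMap<>());
        System.out.println(explicit.getSimilarity().getDiscountOverlaps()); // true
    }
}
```

In this sketch the similarity's own default never matters, because the factory always overwrites it; the only question is whether `init` ran first.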
Re: Bad fieldNorm when using morphologic synonyms
Isaac, is there an easy way to recognize this problem? We also index synonym tokens in the same position (like you do, and I'm sure that our positions are set correctly). I could test whether the default similarity factory in solrconfig.xml had any effect (before/after reindexing).

--roman

On Mon, Dec 9, 2013 at 2:42 PM, Isaac Hebsh isaac.he...@gmail.com wrote:
> Hi Robert and Manuel.
>
> The DefaultSimilarity indeed sets discountOverlaps to true by default. BUT the *factory*, aka DefaultSimilarityFactory, when called by IndexSchema (the getSimilarity method), explicitly sets this value to the value of its corresponding class member. This class member is initialized to FALSE when the instance is created (like every boolean field in Java). It should be set when the init method is called. If the parameter is not set in schema.xml, the default is true.
>
> Everything seems to be all right, but the issue is that the init method is NOT called if the similarity is not *explicitly* declared in schema.xml. In that case, the discountOverlaps member (of the factory class) remains FALSE, and getSimilarity explicitly calls setDiscountOverlaps with a value of FALSE.
>
> This is very easy to reproduce and debug.
Re: Bad fieldNorm when using morphologic synonyms
You can see the norm value in the explain text when setting debugQuery=true. If the same item gets a different norm before and after, that's it. Note that this configuration is in schema.xml (not solrconfig.xml).

On Monday, December 9, 2013, Roman Chyla wrote:
> Isaac, is there an easy way to recognize this problem? We also index synonym tokens in the same position (like you do, and I'm sure that our positions are set correctly). I could test whether the default similarity factory in solrconfig.xml had any effect (before/after reindexing).
>
> --roman
Re: Bad fieldNorm when using morphologic synonyms
Robert, your last reply is not accurate. It's true that the field norms and termVectors are independent. But this issue of higher norms is expected in this case even with well-assigned positions. The LengthNorm is assigned as FieldInvertState.length, which is the count of incrementToken calls and not the number of positions! The same happens with wordDelimiterFilter or ReversedWildcardFilter, which do change the norm when expanding a term.
Re: Bad fieldNorm when using morphologic synonyms
It's accurate, you are wrong. Please look at setDiscountOverlaps in your similarity. This is really easy to understand.

On Sun, Dec 8, 2013 at 7:23 AM, Manuel Le Normand manuel.lenorm...@gmail.com wrote:
> Robert, your last reply is not accurate. It's true that the field norms and termVectors are independent. But this issue of higher norms is expected in this case even with well-assigned positions. The LengthNorm is assigned as FieldInvertState.length, which is the count of incrementToken calls and not the number of positions! The same happens with wordDelimiterFilter or ReversedWildcardFilter, which do change the norm when expanding a term.
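
The disagreement here comes down to which token count feeds the length norm. The sketch below is a minimal illustration, assuming the norm is 1/sqrt(numTerms) with overlap tokens subtracted when discountOverlaps is on; it ignores the index-time boost and the lossy single-byte encoding of the stored norm, so the values will not match explain output exactly. `LengthNormSketch` and its method are hypothetical names, not Lucene API.

```java
// Hypothetical helper illustrating the length-norm arithmetic discussed in the thread.
final class LengthNormSketch {

    // length: total tokens emitted for the field
    // numOverlap: tokens emitted with a position increment of zero
    static float lengthNorm(int length, int numOverlap, boolean discountOverlaps) {
        int numTerms = discountOverlaps ? length - numOverlap : length;
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Ten words, each also indexed as a stem at the same position:
        // 20 tokens in total, 10 of them overlaps.
        System.out.println(lengthNorm(20, 10, true));   // same as an unstemmed 10-word field
        System.out.println(lengthNorm(20, 10, false));  // lower: stems count toward the length
        System.out.println(lengthNorm(10, 0, false));   // unstemmed 10-word field
    }
}
```

With discounting on, the stemmed and unstemmed documents get the same norm; with it off, the stemmed document is treated as twice as long, which is the score penalty reported in the original question.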
Re: Bad fieldNorm when using morphologic synonyms
Your analyzer needs to set positionIncrement correctly: sounds like it's broken.

On Thu, Dec 5, 2013 at 1:53 PM, Isaac Hebsh isaac.he...@gmail.com wrote:
> Hi, we implemented a morphologic analyzer, which stems words at index time. For some reasons, we index both the original word and the stem (at the same position, of course). The stemming is done for a specific language, so other languages are not stemmed at all.
>
> Because of that, two documents with the same number of terms may have different termVector sizes. A document that contains many words that are stemmed will have a double-sized termVector. This behaviour affects the relevance score in a BAD way: the fieldNorm of these documents reduces their score. This is NOT the wanted behaviour in our case.
>
> We are looking for a way to mark the stemmed words (at index time, of course) so they won't affect the fieldNorm. Does such a way exist? Do you have another idea?
Re: Bad fieldNorm when using morphologic synonyms
1) Positions look all right (to me).
2) fieldNorm is determined by the size of the termVector, isn't it? The termVector size isn't affected by the positions.

On Fri, Dec 6, 2013 at 10:46 AM, Robert Muir rcm...@gmail.com wrote:
> Your analyzer needs to set positionIncrement correctly: sounds like it's broken.
Re: Bad fieldNorm when using morphologic synonyms
Termvectors have nothing to do with any of this. Please fix your analyzer first. If you want to add a synonym, it should have a position increment of zero. I bet exact phrase queries aren't working correctly either.

On Fri, Dec 6, 2013 at 12:50 AM, Isaac Hebsh isaac.he...@gmail.com wrote:
> 1) Positions look all right (to me).
> 2) fieldNorm is determined by the size of the termVector, isn't it? The termVector size isn't affected by the positions.
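
What "a position increment of zero" means for a stem can be sketched without Lucene. The example below is an assumption-laden toy, not a real TokenFilter: `PositionSketch`, `Token`, and the trailing-"s" stemmer are hypothetical, and the counting only imitates the idea that every emitted token adds to the field length while zero-increment tokens are also counted as overlaps.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.UnaryOperator;

final class PositionSketch {

    // Hypothetical token: a term plus its position increment.
    static final class Token {
        final String term;
        final int positionIncrement;

        Token(String term, int positionIncrement) {
            this.term = term;
            this.positionIncrement = positionIncrement;
        }
    }

    // Emit each word with increment 1, then its stem (if different) with increment 0.
    static List<Token> analyze(String[] words, UnaryOperator<String> stemmer) {
        List<Token> out = new ArrayList<>();
        for (String word : words) {
            out.add(new Token(word, 1));
            String stem = stemmer.apply(word);
            if (!stem.equals(word)) {
                out.add(new Token(stem, 0)); // stacked on the original word's position
            }
        }
        return out;
    }

    // Returns {length, overlaps}: all tokens, and those with a zero increment.
    static int[] lengthAndOverlaps(List<Token> tokens) {
        int length = 0;
        int overlaps = 0;
        for (Token t : tokens) {
            length++;
            if (t.positionIncrement == 0) {
                overlaps++;
            }
        }
        return new int[] {length, overlaps};
    }

    public static void main(String[] args) {
        // Toy stemmer: strips a trailing "s"; purely illustrative.
        UnaryOperator<String> toyStemmer =
            w -> w.endsWith("s") ? w.substring(0, w.length() - 1) : w;

        List<Token> tokens = analyze(new String[] {"cats", "chase", "mice"}, toyStemmer);
        int position = -1;
        for (Token t : tokens) {
            position += t.positionIncrement;
            System.out.println(t.term + " @ " + position);
        }
        int[] counts = lengthAndOverlaps(tokens);
        System.out.println("length=" + counts[0] + " overlaps=" + counts[1]);
    }
}
```

Here "cat" shares position 0 with "cats", so phrase queries still see three positions, and the one overlap is exactly what a similarity with discountOverlaps enabled would subtract from the field length.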
Re: Bad fieldNorm when using morphologic synonyms
Hi Isaac,

Did you consider omitting norms completely for that field (omitNorms=true)? Are you using solr.RemoveDuplicatesTokenFilterFactory?

On Thursday, December 5, 2013 8:55 PM, Isaac Hebsh isaac.he...@gmail.com wrote:
> Hi, we implemented a morphologic analyzer, which stems words at index time. For some reasons, we index both the original word and the stem (at the same position, of course). The stemming is done for a specific language, so other languages are not stemmed at all.
>
> Because of that, two documents with the same number of terms may have different termVector sizes. A document that contains many words that are stemmed will have a double-sized termVector. This behaviour affects the relevance score in a BAD way: the fieldNorm of these documents reduces their score. This is NOT the wanted behaviour in our case.
>
> We are looking for a way to mark the stemmed words (at index time, of course) so they won't affect the fieldNorm. Does such a way exist? Do you have another idea?
Re: Bad fieldNorm when using morphologic synonyms
The field is our main textual field. In the standard case, length normalization does significant work alongside tf-idf, and we don't want to give it up.

Removing duplicates won't help here, because the terms are not duplicates: one term is stemmed and the other is not.

On Fri, Dec 6, 2013 at 9:48 AM, Ahmet Arslan iori...@yahoo.com wrote:
> Hi Isaac,
>
> Did you consider omitting norms completely for that field (omitNorms=true)? Are you using solr.RemoveDuplicatesTokenFilterFactory?