Re: Bad fieldNorm when using morphologic synonyms

2013-12-26 Thread Isaac Hebsh
I attached a patch to the JIRA issue (SOLR-5561).
Reviews are welcome.



Re: Bad fieldNorm when using morphologic synonyms

2013-12-19 Thread Isaac Hebsh
Roman, do you have any results?

I created SOLR-5561.

Robert, if I'm wrong, you are welcome to close that issue.


Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Manuel Le Normand
To set discountOverlaps to true, you must add the
<similarity class="solr.DefaultSimilarityFactory"/> declaration to
schema.xml; it is commented out by default!

Because this parameter is false by default, the situation above is expected
even with correct positions, as I said.

To fix the field norms, you'd have to reindex with a similarity class that
initializes the parameter to true.

Cheers,
Manu


Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Robert Muir
No, it's turned on by default in the default similarity.

As I said, all that is necessary is to fix your analyzer to emit the
proper position increments.


Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Isaac Hebsh
Hi Robert and Manuel.

DefaultSimilarity does indeed set discountOverlaps to true by default.
BUT the *factory*, DefaultSimilarityFactory, when called by IndexSchema
(the getSimilarity method), explicitly sets this value to the value of its
own corresponding class member.
This class member is FALSE when the instance is created (like every
uninitialized boolean field in Java). It is supposed to be set when the
init method is called; if the parameter is not set in schema.xml, init
defaults it to true.

Everything seems all right, but the issue is that the init method is NOT
called if the similarity is not *explicitly* declared in schema.xml. In
that case, the discountOverlaps member (of the factory class) remains
FALSE, and getSimilarity explicitly calls setDiscountOverlaps with a value
of FALSE.

This is very easy to reproduce and debug.
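
A minimal sketch of the behaviour described above. It uses simplified
stand-in classes, not the real Solr and Lucene code: SketchSimilarity,
SketchSimilarityFactory and the helper methods are hypothetical names, and
the only thing modelled is the order of events (field default, optional
init, then getSimilarity overriding the similarity's own default).

```java
public class FactoryDefaultSketch {

    /** Stand-in for the similarity: overlap discounting is on by default. */
    static class SketchSimilarity {
        private boolean discountOverlaps = true;

        void setDiscountOverlaps(boolean value) { discountOverlaps = value; }

        boolean getDiscountOverlaps() { return discountOverlaps; }
    }

    /** Stand-in for the factory described in the message (hypothetical). */
    static class SketchSimilarityFactory {
        // Java initialises boolean fields to false.
        private boolean discountOverlaps;

        /** Assumed to run only when the similarity is declared explicitly. */
        void init(Boolean configured) {
            // A missing parameter defaults to true, as the message describes.
            discountOverlaps = (configured == null) ? true : configured;
        }

        SketchSimilarity getSimilarity() {
            SketchSimilarity similarity = new SketchSimilarity();
            // Overrides the similarity's own default with the factory field.
            similarity.setDiscountOverlaps(discountOverlaps);
            return similarity;
        }
    }

    /** Similarity not declared in the schema: init is never called. */
    static boolean withoutInit() {
        return new SketchSimilarityFactory().getSimilarity().getDiscountOverlaps();
    }

    /** Similarity declared without the parameter: init applies the default. */
    static boolean withInit() {
        SketchSimilarityFactory factory = new SketchSimilarityFactory();
        factory.init(null);
        return factory.getSimilarity().getDiscountOverlaps();
    }

    public static void main(String[] args) {
        System.out.println("init skipped: " + withoutInit());
        System.out.println("init called:  " + withInit());
    }
}
```

In this sketch the two paths disagree only because the factory copies its
own field into the similarity unconditionally.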



Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Roman Chyla
Isaac, is there an easy way to recognize this problem? We also index
synonym tokens in the same position (like you do, and I'm sure that our
positions are set correctly). I could test whether the default similarity
factory in solrconfig.xml had any effect (before/after reindexing).

--roman



Re: Bad fieldNorm when using morphologic synonyms

2013-12-09 Thread Isaac Hebsh
You can see the norm value in the explain text when you set
debugQuery=true.
If the same document gets a different norm before and after reindexing,
that's it.

Note that this configuration is in schema.xml (not solrconfig.xml...)


Re: Bad fieldNorm when using morphologic synonyms

2013-12-08 Thread Manuel Le Normand
Robert, your last reply is not accurate.
It's true that field norms and termVectors are independent. But the higher
norms in this case are expected even with well-assigned positions. The
length norm is computed from FieldInvertState.length, which is the number
of incrementToken calls and not the number of positions! The same happens
with WordDelimiterFilter or ReversedWildcardFilter, which do change the
norm when expanding a term.


Re: Bad fieldNorm when using morphologic synonyms

2013-12-08 Thread Robert Muir
It's accurate; you are wrong.

Please look at setDiscountOverlaps in your similarity. This is really
easy to understand.
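
To make the role of setDiscountOverlaps concrete, here is a rough sketch of
how a length norm could be derived from the token count. It follows the
idea behind the default similarity (tokens with a position increment of
zero are subtracted when discountOverlaps is true), but the class and
method names are hypothetical, the index-time boost is assumed to be 1, and
any encoding of the norm for storage is left out.

```java
public class LengthNormSketch {

    /**
     * Computes 1 / sqrt(numTerms) for a field.
     *
     * @param positionIncrements position increment of each emitted token
     * @param discountOverlaps   whether tokens with increment 0 are ignored
     */
    static float lengthNorm(int[] positionIncrements, boolean discountOverlaps) {
        int length = positionIncrements.length; // every emitted token counts here
        int numOverlap = 0;
        for (int increment : positionIncrements) {
            if (increment == 0) {
                numOverlap++; // token stacked on the previous position
            }
        }
        int numTerms = discountOverlaps ? length - numOverlap : length;
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Four words, each followed by its stem at the same position.
        int[] increments = {1, 0, 1, 0, 1, 0, 1, 0};
        System.out.println("discountOverlaps=true:  " + lengthNorm(increments, true));
        System.out.println("discountOverlaps=false: " + lengthNorm(increments, false));
    }
}
```

With discounting enabled, the stems stop affecting the norm only if they
are emitted with a position increment of zero.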


Re: Bad fieldNorm when using morphologic synonyms

2013-12-06 Thread Robert Muir
Your analyzer needs to set positionIncrement correctly: sounds like it's broken.


Re: Bad fieldNorm when using morphologic synonyms

2013-12-06 Thread Isaac Hebsh
1) Positions look all right (to me).
2) fieldNorm is determined by the size of the termVector, isn't it? The
termVector size isn't affected by the positions.



Re: Bad fieldNorm when using morphologic synonyms

2013-12-06 Thread Robert Muir
Term vectors have nothing to do with any of this.

Please fix your analyzer first. If you want to add a synonym, it
should have a position increment of zero.

I bet exact phrase queries aren't working correctly either.
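
As an illustration of why the increment matters for phrase queries, the
sketch below turns per-token position increments into absolute positions.
The class and method names are hypothetical and no Lucene API is used; it
only shows the arithmetic, with the first token landing on position 0.

```java
import java.util.Arrays;

public class PositionIncrementSketch {

    /** Converts per-token position increments into absolute positions. */
    static int[] positions(int[] increments) {
        int[] result = new int[increments.length];
        int position = -1; // the first increment of 1 moves this to 0
        for (int i = 0; i < increments.length; i++) {
            position += increments[i];
            result[i] = position;
        }
        return result;
    }

    public static void main(String[] args) {
        // Tokens: "running", "run" (its stem), "fast"
        int[] stacked = {1, 0, 1}; // stem shares the position of the original
        int[] shifted = {1, 1, 1}; // stem wrongly takes a position of its own
        System.out.println("stacked: " + Arrays.toString(positions(stacked)));
        System.out.println("shifted: " + Arrays.toString(positions(shifted)));
    }
}
```

In the shifted case "fast" is no longer adjacent to "running", which is the
kind of mismatch that would break an exact phrase query.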


Bad fieldNorm when using morphologic synonyms

2013-12-05 Thread Isaac Hebsh
Hi,
We implemented a morphologic analyzer, which stems words at index time.
For various reasons, we index both the original word and the stem (at the
same position, of course).
The stemming is done for one specific language, so other languages are not
stemmed at all.

Because of that, two documents with the same number of terms may have
different termVector sizes. A document that contains many stemmed words
will have a termVector twice as large. This behaviour affects the
relevance score in a BAD way: the fieldNorm of these documents reduces
their score. This is NOT the wanted behaviour in our case.

We are looking for a way to mark the stemmed words (at index time, of
course) so they won't affect the fieldNorm. Does such a way exist?

Do you have another idea?


Re: Bad fieldNorm when using morphologic synonyms

2013-12-05 Thread Ahmet Arslan
Hi Isaac,

Did you consider omitting norms completely for that field (omitNorms=true)?
Are you using solr.RemoveDuplicatesTokenFilterFactory?




Re: Bad fieldNorm when using morphologic synonyms

2013-12-05 Thread Isaac Hebsh
The field is our main textual field. In the standard case, length
normalization contributes significantly to tf-idf scoring, and we don't
want to give it up.

Removing duplicates won't help here, because the terms are not duplicates:
one term is stemmed and the other is not.

