Re: Length norm not functioning in solr queries.

2014-12-11 Thread S.L
Mikhail,

Thank you for confirming this; however, Ahmet's proposal seems simpler for me to implement.


Re: Length norm not functioning in solr queries.

2014-12-11 Thread S.L
Ahmet,

Thank you. As the configurations in SolrCloud are uploaded to ZooKeeper, are
there any special steps that need to be taken to make this work in SolrCloud?


Re: Length norm not functioning in solr queries.

2014-12-11 Thread Ahmet Arslan
Hi,

No special steps need to be taken for the cloud setup. Please note that for both
solutions a re-index is mandatory.

Ahmet




Re: Length norm not functioning in solr queries.

2014-12-11 Thread S.L
Yes, I understand that reindexing is necessary; however, for some reason I was
not able to invoke the JS script from the update processor, so I ended up using
a Java-only solution at index time.

Thanks.
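For reference, the thread does not show the actual code, but a "Java-only solution at index time" could be as simple as computing the count on the client before the document is sent. A minimal SolrJ sketch under that assumption (Solr 4.x-era API; the URL, id and field names are illustrative only):

import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.common.SolrInputDocument;

public class IndexWithWordCount {
  public static void main(String[] args) throws Exception {
    // assumed collection URL; adjust for your setup
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    String phoneName = "Apple iPhone 4S 16GB for Net10, No Contract, White";

    SolrInputDocument doc = new SolrInputDocument();
    doc.addField("id", "doc2");
    doc.addField("phoneName", phoneName);
    // naive whitespace word count; the analyzer's token count may differ slightly
    doc.addField("wordCount", phoneName.trim().split("\\s+").length);

    solr.add(doc);
    solr.commit();
    solr.shutdown();
  }
}

With SolrCloud one would more likely use CloudSolrServer pointed at ZooKeeper, but the wordCount computation is the same either way.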


Re: Length norm not functioning in solr queries.

2014-12-10 Thread S.L
Hi Ahmet,

Is there already an implementation of the suggested workaround? Thanks.


Re: Length norm not functioning in solr queries.

2014-12-10 Thread Ahmet Arslan
Hi,

You mean update processor factory?

Here is an augmented (wordCount field added) version of your example:

doc1:

phoneName:Details about  Apple iPhone 4s - 16GB - White (Verizon)
Smartphone Factory Unlocked
wordCount: 11

doc2:

phoneName:Apple iPhone 4S 16GB for Net10, No Contract, White
wordCount: 9


The first task is simply to calculate the wordCount values. You can do it in your
indexing code, or in other places.
I quickly skimmed the existing update processors but I couldn't find a stock
implementation.
CountFieldValuesUpdateProcessorFactory fooled me, but it looks like it is all
about multivalued fields.

I guess a simple JavaScript snippet that splits on whitespace and returns the
produced array size would do the trick: StatelessScriptUpdateProcessorFactory



At this point you have an int field named wordCount. boost=div(1,wordCount)
should work, or you can come up with a more sophisticated formula.

Ahmet
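
To make the boost concrete, here is a rough SolrJ sketch of a query that applies the div(1,wordCount) penalty suggested above through the edismax boost parameter (a sketch only; the host is a placeholder, and it assumes the wordCount field from the example above):

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;

public class WordCountPenaltyQuery {
  public static void main(String[] args) throws Exception {
    HttpSolrServer solr = new HttpSolrServer("http://localhost:8983/solr/collection1");

    SolrQuery q = new SolrQuery("iphone 4s 16gb");
    q.set("defType", "edismax");
    q.set("qf", "phoneName");
    q.set("pf", "phoneName");
    q.set("mm", "1");
    // multiplicative boost: the fewer words a document has, the higher it scores
    q.set("boost", "div(1,wordCount)");

    QueryResponse rsp = solr.query(q);
    System.out.println(rsp.getResults());
    solr.shutdown();
  }
}

The tie-break alternative from the follow-up message could be set the same way, e.g. q.set("sort", "score desc, wordCount asc"), if you only want the word count to matter when scores are identical.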


Re: Length norm not functioning in solr queries.

2014-12-10 Thread Ahmet Arslan
Hi,

Or even better, you can use your new field for tie-break purposes, where scores
are identical,
e.g. sort=score desc, wordCount asc

Ahmet



Re: Length norm not functioning in solr queries.

2014-12-10 Thread Mikhail Khludnev
S.L,

I briefly skimmed Lucene50NormsConsumer.writeNormsField(); my conclusion
is: if you supply your own similarity that avoids squashing the float into a byte
in Similarity.computeNorm(FieldInvertState), you get exactly that value back in
Similarity.decodeNormValue(long).
You may be surprised, but this is exactly what is done in PreciseDefaultSimilarity
in TestLongNormValueSource. I think you can just use it.
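
For readers who don't want to dig out the test class, the idea is roughly the following. This is a sketch only, written against the Lucene 4.10-era TFIDFSimilarity API and modelled on DefaultSimilarity's formulas (ignoring the overlap discount for brevity); the class name below is made up, and the PreciseDefaultSimilarity helper in TestLongNormValueSource is the authoritative version:

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.TFIDFSimilarity;
import org.apache.lucene.util.BytesRef;

public class PreciseLengthNormSimilarity extends TFIDFSimilarity {

  // Same scoring factors as DefaultSimilarity.
  @Override public float tf(float freq) { return (float) Math.sqrt(freq); }
  @Override public float idf(long docFreq, long numDocs) {
    return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
  }
  @Override public float lengthNorm(FieldInvertState state) {
    // essentially DefaultSimilarity's 1/sqrt(numTerms), but it will NOT be squashed into a byte below
    return state.getBoost() * (float) (1.0 / Math.sqrt(state.getLength()));
  }
  @Override public float coord(int overlap, int maxOverlap) { return overlap / (float) maxOverlap; }
  @Override public float queryNorm(float sumOfSquaredWeights) {
    return (float) (1.0 / Math.sqrt(sumOfSquaredWeights));
  }
  @Override public float sloppyFreq(int distance) { return 1.0f / (distance + 1); }
  @Override public float scorePayload(int doc, int start, int end, BytesRef payload) { return 1f; }

  // The crucial part: keep the full float instead of SmallFloat's lossy one-byte encoding.
  @Override public long encodeNormValue(float f) { return Float.floatToRawIntBits(f); }
  @Override public float decodeNormValue(long norm) { return Float.intBitsToFloat((int) norm); }
}

In Solr you would reference such a class from a <similarity> element (directly or via a small SimilarityFactory) in schema.xml and then re-index.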


Re: Length norm not functioning in solr queries.

2014-12-09 Thread S.L
Hi,

Thanks Mikhail, I looked at the explain output and this is what I see for the two
documents in question: they have identical scores even though document 2 has a
shorter productName field, and I do not see any lengthNorm-related information
in the explain.

Also, I am not exactly clear on what needs to be looked at in the API?

*Search Query* : q=iphone+4s+16gb&qf=productName&mm=1&pf=productName&ps=1&pf2=productName&pf3=productName&stopwords=true&lowercaseOperators=true

*productName Details about Apple iPhone 4s 16GB Smartphone AT&T Factory
Unlocked *


   - *100%* 10.649221 sum of the following:
  - *10.58%* 1.1270299 sum of the following:
 - *2.1%* 0.22383358 productName:iphon
 - *3.47%* 0.36922288 productName:4 s
 - *5.01%* 0.53397346 productName:16 gb
  - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
  - *27.79%* 2.959255 sum of the following:
 - *10.97%* 1.1680154 productName:iphon 4 s~1
 - *16.82%* 1.7912396 productName:4 s 16 gb~1
  - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1


*productName Apple iPhone 4S 16GB for Net10, No Contract, White*


   - *100%* 10.649221 sum of the following:
  - *10.58%* 1.1270299 sum of the following:
 - *2.1%* 0.22383358 productName:iphon
 - *3.47%* 0.36922288 productName:4 s
 - *5.01%* 0.53397346 productName:16 gb
  - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1
  - *27.79%* 2.959255 sum of the following:
 - *10.97%* 1.1680154 productName:iphon 4 s~1
 - *16.82%* 1.7912396 productName:4 s 16 gb~1
  - *30.81%* 3.2814684 productName:iphon 4 s 16 gb~1





Re: Length norm not functioning in solr queries.

2014-12-09 Thread Ahmet Arslan
Hi,

The default length norm is not the best option for differentiating very short
documents, like product names.
Please see:
http://find.searchhub.org/document/b3f776512ab640ec#b3f776512ab640ec

I suggest you create an additional integer field that holds the number of
tokens. You can populate it via an update processor, and then penalise (using
function queries) according to that field. This way you have finer-grained and
more flexible control over it.

Ahmet
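
As a concrete illustration of the "populate it via an update processor" part, a bare-bones custom processor might look roughly like this (a sketch, not tested; the class and field names are made up, and it would still have to be registered in an updateRequestProcessorChain in solrconfig.xml):

import java.io.IOException;

import org.apache.solr.common.SolrInputDocument;
import org.apache.solr.request.SolrQueryRequest;
import org.apache.solr.response.SolrQueryResponse;
import org.apache.solr.update.AddUpdateCommand;
import org.apache.solr.update.processor.UpdateRequestProcessor;
import org.apache.solr.update.processor.UpdateRequestProcessorFactory;

public class WordCountUpdateProcessorFactory extends UpdateRequestProcessorFactory {
  @Override
  public UpdateRequestProcessor getInstance(SolrQueryRequest req, SolrQueryResponse rsp,
                                            UpdateRequestProcessor next) {
    return new UpdateRequestProcessor(next) {
      @Override
      public void processAdd(AddUpdateCommand cmd) throws IOException {
        SolrInputDocument doc = cmd.getSolrInputDocument();
        Object name = doc.getFieldValue("productName");
        if (name != null) {
          // naive whitespace token count; the analyzer's own token count may differ slightly
          doc.setField("wordCount", name.toString().trim().split("\\s+").length);
        }
        super.processAdd(cmd);
      }
    };
  }
}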




Re: Length norm not functioning in solr queries.

2014-12-09 Thread Mikhail Khludnev
I wonder why your explains are so brief; mine looks like this:

<str>
0.4500489 = (MATCH) weight(text:inc in 17) [DefaultSimilarity], result of:
  0.4500489 = fieldWeight in 17, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    2.880313 = idf(docFreq=8, maxDocs=59)
    0.15625 = fieldNorm(doc=17)
</str>
<str>
0.4500489 = (MATCH) weight(text:inc in 27) [DefaultSimilarity], result of:
  0.4500489 = fieldWeight in 27, product of:
    1.0 = tf(freq=1.0), with freq of:
      1.0 = termFreq=1.0
    2.880313 = idf(docFreq=8, maxDocs=59)
    0.15625 = fieldNorm(doc=27)
</str>

Here we can see the fieldNorm factors. These two docs are rather different,
yet their norm factors are equal.

 Also I am not exactly clear on what needs to be looked at in the API?

Because there you can see exactly how precision is lost when the float field
norm is stored in a byte.
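
To see that quantisation directly, here is a tiny standalone sketch (assuming a Lucene 4.x classpath, where DefaultSimilarity encodes norms with SmallFloat.floatToByte315) that prints the stored byte and the decoded value for a few field lengths:

import org.apache.lucene.util.SmallFloat;

public class NormPrecisionDemo {
  public static void main(String[] args) {
    // DefaultSimilarity's lengthNorm is boost * 1/sqrt(numTerms); that float is then
    // squashed into a single byte, so nearby lengths can end up with identical norms.
    for (int terms = 8; terms <= 13; terms++) {
      float lengthNorm = (float) (1.0 / Math.sqrt(terms));
      byte encoded = SmallFloat.floatToByte315(lengthNorm);
      float decoded = SmallFloat.byte315ToFloat(encoded);
      System.out.printf("%2d terms: lengthNorm=%.4f  stored byte=%d  decoded=%.4f%n",
          terms, lengthNorm, encoded, decoded);
    }
  }
}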




Re: Length norm not functioning in solr queries.

2014-12-08 Thread Mikhail Khludnev
It's worth looking into the explain output to check the particular scoring values.
But the most likely suspect is the loss of precision when float norms are stored
as byte values. See the javadoc for DefaultSimilarity.encodeNormValue(float).


On Mon, Dec 8, 2014 at 5:49 PM, S.L simpleliving...@gmail.com wrote:

 I have two documents doc1 and doc2 and each one of those has a field called
 phoneName.

 doc1:phoneName:Details about  Apple iPhone 4s - 16GB - White (Verizon)
 Smartphone Factory Unlocked

 doc2:phoneName:Apple iPhone 4S 16GB for Net10, No Contract, White

 Here if I search for

 q=iphone+4s+16gb&qf=phoneName&mm=1&pf=phoneName&ps=1&pf2=phoneName&pf3=phoneName&stopwords=true&lowercaseOperators=true

 Doc1 and Doc2 both get the same score, but since the phoneName field in doc2 is
 shorter I would expect it to score higher; instead both have an identical score
 of 9.961212.

 The phoneName field is defined as follows. As you can see, nowhere am I
 specifying omitNorms=true, yet the behavior seems to be that the length norm is
 not functioning at all. Can someone let me know what the issue is here?

 <field name="phoneName" type="text_en_splitting" indexed="true"
        stored="true" required="true" />

 <fieldType name="text_en_splitting" class="solr.TextField"
            positionIncrementGap="100" autoGeneratePhraseQueries="true">
   <analyzer type="index">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <!-- in this example, we will only use synonyms at query time
          <filter class="solr.SynonymFilterFactory"
                  synonyms="index_synonyms.txt" ignoreCase="true"
                  expand="false"/> -->
     <!-- Case insensitive stop word removal. add
          enablePositionIncrements=true
          in both the index and query analyzers to leave a 'gap'
          for more accurate phrase queries. -->
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="lang/stopwords_en.txt"
             enablePositionIncrements="true"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1"
             catenateWords="1" catenateNumbers="1" catenateAll="0"
             splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
   <analyzer type="query">
     <tokenizer class="solr.WhitespaceTokenizerFactory"/>
     <filter class="solr.SynonymFilterFactory"
             synonyms="synonyms.txt"
             ignoreCase="true" expand="true"/>
     <filter class="solr.StopFilterFactory" ignoreCase="true"
             words="lang/stopwords_en.txt"
             enablePositionIncrements="true"/>
     <filter class="solr.WordDelimiterFilterFactory"
             generateWordParts="1" generateNumberParts="1"
             catenateWords="0" catenateNumbers="0" catenateAll="0"
             splitOnCaseChange="1"/>
     <filter class="solr.LowerCaseFilterFactory"/>
     <filter class="solr.KeywordMarkerFilterFactory"
             protected="protwords.txt"/>
     <filter class="solr.PorterStemFilterFactory"/>
   </analyzer>
 </fieldType>




-- 
Sincerely yours
Mikhail Khludnev
Principal Engineer,
Grid Dynamics

http://www.griddynamics.com
mkhlud...@griddynamics.com