RE: Default value from another field?

2017-10-04 Thread jimi.hullegard
Thank you Alexandre! It worked great. :)

And here is how it is configured, if someone else wants to do this, but is too 
busy to read the documentation for these classes:



<updateRequestProcessorChain name="...">
  <processor class="solr.CloneFieldUpdateProcessorFactory">
    <str name="source">source_field</str>
    <str name="dest">target_field</str>
  </processor>
  <processor class="solr.FirstFieldValueUpdateProcessorFactory">
    <str name="fieldName">target_field</str>
  </processor>
  [...]
</updateRequestProcessorChain>

/Jimi

-Original Message-
From: Alexandre Rafalovitch [mailto:arafa...@gmail.com] 
Sent: Tuesday, October 3, 2017 7:02 PM
To: solr-user 
Subject: Re: Default value from another field?

I believe you should be able to use a combination of:
http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/CloneFieldUpdateProcessorFactory.html
and
http://www.solr-start.com/javadoc/solr-lucene/org/apache/solr/update/processor/FirstFieldValueUpdateProcessorFactory.html

So, if you have no value in the target field, you end up with one value (the
copied one). And if you did, you still end up with one (the original one).

This is available in stock Solr; you just need to configure it.

Regards,
   Alex.

http://www.solr-start.com/ - Resources for Solr users, new and experienced


On 3 October 2017 at 12:14,   wrote:
> Hi Emir,
>
> Thanks for the tip about DefaultValueUpdateProcessorFactory. But even though
> I agree that it most likely isn't too hard to write custom code that does
> this, the overhead is a bit too much I think, considering we currently run a
> vanilla Solr with no custom code deployed. We would need to set up a new
> project and a new deployment procedure, and that is a bit of overkill
> considering that this is a feature that would help me only a little as a
> developer and administrator (i.e. a nice-to-have), and at the same time would
> have no impact on any end user (except possibly negative side effects because
> of bugs etc.).
>
> Regards
> /Jimi
>
>
> -Original Message-
> From: Emir Arnautović [mailto:emir.arnauto...@sematext.com]
> Sent: Tuesday, October 3, 2017 3:28 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Default value from another field?
>
> Hi Jimi,
> I don’t think that you can do it using the schema, but you could do it using
> a custom update request processor chain. I quickly scanned to see if there is
> such a processor and could not find one. The closest one is
> https://lucene.apache.org/solr/6_6_0/solr-core/org/apache/solr/update/processor/DefaultValueUpdateProcessorFactory.html
> It should not be too hard to adjust it to do what you need.
>
> HTH,
> Emir
>
> --
> Monitoring - Log Management - Alerting - Anomaly Detection Solr & 
> Elasticsearch Consulting Support Training - http://sematext.com/
>
>
>
>> On 3 Oct 2017, at 14:10, jimi.hulleg...@svensktnaringsliv.se wrote:
>>
>> Hi,
>>
>> Is it possible using some Solr schema magic to make Solr get the default
>> value for a field from another field? I.e., if the value is specified in the
>> document to be indexed, then that value is used; otherwise it uses the value
>> of another field. As far as I understand it, the field property "default"
>> only takes a static value, not a reference to another field. And the
>> copyField element doesn't solve this problem either, since it will result in
>> two values if the field was specified in the document, and I only want a
>> single value.
>>
>> /Jimi
>


RE: Default value from another field?

2017-10-03 Thread jimi.hullegard
Hi Emir,

Thanks for the tip about DefaultValueUpdateProcessorFactory. But even though I
agree that it most likely isn't too hard to write custom code that does this,
the overhead is a bit too much I think, considering we currently run a vanilla
Solr with no custom code deployed. We would need to set up a new project and a
new deployment procedure, and that is a bit of overkill considering that this
is a feature that would help me only a little as a developer and administrator
(i.e. a nice-to-have), and at the same time would have no impact on any end
user (except possibly negative side effects because of bugs etc.).

Regards
/Jimi


-Original Message-
From: Emir Arnautović [mailto:emir.arnauto...@sematext.com] 
Sent: Tuesday, October 3, 2017 3:28 PM
To: solr-user@lucene.apache.org
Subject: Re: Default value from another field?

Hi Jimi,
I don’t think that you can do it using the schema, but you could do it using a
custom update request processor chain. I quickly scanned to see if there is
such a processor and could not find one. The closest one is
https://lucene.apache.org/solr/6_6_0/solr-core/org/apache/solr/update/processor/DefaultValueUpdateProcessorFactory.html
It should not be too hard to adjust it to do what you need.
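
For reference, the stock processor only takes a static value, along these
lines (a sketch; the field name and value are illustrative):

<processor class="solr.DefaultValueUpdateProcessorFactory">
  <str name="fieldName">target_field</str>
  <str name="value">some static default</str>
</processor>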

HTH,
Emir

--
Monitoring - Log Management - Alerting - Anomaly Detection Solr & Elasticsearch 
Consulting Support Training - http://sematext.com/



> On 3 Oct 2017, at 14:10, jimi.hulleg...@svensktnaringsliv.se wrote:
> 
> Hi,
> 
> Is it possible using some Solr schema magic to make Solr get the default
> value for a field from another field? I.e., if the value is specified in the
> document to be indexed, then that value is used; otherwise it uses the value
> of another field. As far as I understand it, the field property "default"
> only takes a static value, not a reference to another field. And the
> copyField element doesn't solve this problem either, since it will result in
> two values if the field was specified in the document, and I only want a
> single value.
> 
> /Jimi



Default value from another field?

2017-10-03 Thread jimi.hullegard
Hi,

Is it possible using some Solr schema magic to make Solr get the default value
for a field from another field? I.e., if the value is specified in the document
to be indexed, then that value is used; otherwise it uses the value of another
field. As far as I understand it, the field property "default" only takes a
static value, not a reference to another field. And the copyField element
doesn't solve this problem either, since it will result in two values if the
field was specified in the document, and I only want a single value.

/Jimi


RE: Can't get spelling suggestions to work properly

2017-01-13 Thread jimi.hullegard
I just noticed why setting maxResultsForSuggest to a high value was not a good
thing: now it shows spelling suggestions even for correctly spelled words.

I think what I would need is the logic of SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX,
but with a configurable limit instead of it being hard-coded to 0, i.e. just as
maxQueryFrequency works.
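
For reference, this is roughly where maxQueryFrequency lives today, as a
sketch of a solrconfig.xml spellchecker definition (the component, dictionary
and field names are illustrative; the 0.001 value is from the thread below):

<searchComponent name="spellcheck" class="solr.SpellCheckComponent">
  <lst name="spellchecker">
    <str name="name">default</str>
    <str name="field">spelling</str>
    <str name="classname">solr.DirectSolrSpellChecker</str>
    <float name="maxQueryFrequency">0.001</float>
  </lst>
</searchComponent>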

/Jimi

-Original Message-
From: jimi.hulleg...@svensktnaringsliv.se 
[mailto:jimi.hulleg...@svensktnaringsliv.se] 
Sent: Friday, January 13, 2017 5:56 PM
To: solr-user@lucene.apache.org
Subject: RE: Can't get spelling suggestions to work properly

Hi Alessandro,

Thanks for your explanation. It helped a lot. Although setting
"spellcheck.maxResultsForSuggest" to a value higher than zero was not enough; I
also had to set "spellcheck.alternativeTermCount". With that done, I now get
suggestions when searching for 'mycet' (a misspelling of the Swedish word
'mycket'), which didn't return suggestions before.
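
In other words, a minimal sketch of the request parameters (the exact values
are illustrative):

spellcheck=true
&spellcheck.alternativeTermCount=5
&spellcheck.maxResultsForSuggest=100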

However, I'm still not able to fully understand how to configure this properly,
because with this change there are now other misspelled searches that no longer
give suggestions. The problem here is stemming, I suspect: the main search
fields use stemming, so in some cases one can get lots of results for spellings
that don't exist in the index at all (or at least not in the spelling field).
How can I configure this component so that those suggestions are still
included? Do I need to set maxResultsForSuggest to a really high number, like
Integer.MAX_VALUE? I feel that such a setting would defeat the purpose of that
parameter, in a way, but I'm not sure how else to solve this.

Also, there is one other thing I wonder about the spelling suggestions that you
might have the answer to. Is there a way to make the logic case insensitive,
but the presentation case sensitive? For example, a search for
'georg washington' now returns 'george washington' as a suggestion, but
'Georg Washington' would be even better.

Regards
/Jimi


-Original Message-
From: alessandro.benedetti [mailto:abenede...@apache.org] 
Sent: Thursday, January 12, 2017 5:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Can't get spelling suggestions to work properly

Hi Jimi,
taking a look at the *maxQueryFrequency* param:

Your understanding is correct.

1) we don't provide misspelled suggestions if we set the param to 1 and the
term has a doc freq of at least 1.

2) we don't provide misspelled suggestions if the doc frequency of the term is
greater than the max limit set.

Let us explore the code:

if (suggestMode == SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX && docfreq > 0) {
  return new SuggestWord[0];
}
// If we are working in "not in index" mode with a document frequency > 0,
// we get no misspelled corrections.

int maxDoc = ir.maxDoc();

if (maxQueryFrequency >= 1f && docfreq > maxQueryFrequency) {
  return new SuggestWord[0];
} else if (docfreq > (int) Math.ceil(maxQueryFrequency * (float) maxDoc)) {
  return new SuggestWord[0];
}
// then maxQueryFrequency, as you correctly stated, enters the game

...

Let's explore how you can end up in the first scenario:

if (maxResultsForSuggest == null || hits <= maxResultsForSuggest) {
  SuggestMode suggestMode = SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX;
  if (onlyMorePopular) {
    suggestMode = SuggestMode.SUGGEST_MORE_POPULAR;
  } else if (alternativeTermCount > 0) {
    suggestMode = SuggestMode.SUGGEST_ALWAYS;
  }

You did not set maxResultsForSuggest (and neither onlyMorePopular nor
alternativeTermCount), so you ended up with:
SuggestMode suggestMode = SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX;

From the Solr javadoc:

/**
 * If left unspecified, the default behavior will prevail. That is,
 * "correctlySpelled" will be false and suggestions will be returned only if
 * one or more of the query terms are absent from the dictionary and/or index.
 * If set to zero, the "correctlySpelled" flag will be false only if the
 * response returns zero hits. If set to a value greater than zero,
 * suggestions will be returned even if hits are returned (up to the specified
 * number). This number also will serve as the threshold in determining the
 * value of "correctlySpelled". Specifying a value greater than zero is useful
 * for creating "did-you-mean" suggestions for queries that return a low
 * number of hits.
 */
public static final String SPELLCHECK_MAX_RESULTS_FOR_SUGGEST =
    SPELLCHECK_PREFIX + "maxResultsForSuggest";

You probably want to bypass the other parameters and just set the proper
maxResultsForSuggest param for your spellchecker.

Cheers





RE: Can't get spelling suggestions to work properly

2017-01-13 Thread jimi.hullegard
Hi Alessandro,

Thanks for your explanation. It helped a lot. Although setting
"spellcheck.maxResultsForSuggest" to a value higher than zero was not enough; I
also had to set "spellcheck.alternativeTermCount". With that done, I now get
suggestions when searching for 'mycet' (a misspelling of the Swedish word
'mycket'), which didn't return suggestions before.

However, I'm still not able to fully understand how to configure this properly,
because with this change there are now other misspelled searches that no longer
give suggestions. The problem here is stemming, I suspect: the main search
fields use stemming, so in some cases one can get lots of results for spellings
that don't exist in the index at all (or at least not in the spelling field).
How can I configure this component so that those suggestions are still
included? Do I need to set maxResultsForSuggest to a really high number, like
Integer.MAX_VALUE? I feel that such a setting would defeat the purpose of that
parameter, in a way, but I'm not sure how else to solve this.

Also, there is one other thing I wonder about the spelling suggestions that you
might have the answer to. Is there a way to make the logic case insensitive,
but the presentation case sensitive? For example, a search for
'georg washington' now returns 'george washington' as a suggestion, but
'Georg Washington' would be even better.

Regards
/Jimi


-Original Message-
From: alessandro.benedetti [mailto:abenede...@apache.org] 
Sent: Thursday, January 12, 2017 5:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Can't get spelling suggestions to work properly

Hi Jimi,
taking a look at the *maxQueryFrequency* param:

Your understanding is correct.

1) we don't provide misspelled suggestions if we set the param to 1 and the
term has a doc freq of at least 1.

2) we don't provide misspelled suggestions if the doc frequency of the term is
greater than the max limit set.

Let us explore the code:

if (suggestMode == SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX && docfreq > 0) {
  return new SuggestWord[0];
}
// If we are working in "not in index" mode with a document frequency > 0,
// we get no misspelled corrections.

int maxDoc = ir.maxDoc();

if (maxQueryFrequency >= 1f && docfreq > maxQueryFrequency) {
  return new SuggestWord[0];
} else if (docfreq > (int) Math.ceil(maxQueryFrequency * (float) maxDoc)) {
  return new SuggestWord[0];
}
// then maxQueryFrequency, as you correctly stated, enters the game

...

Let's explore how you can end up in the first scenario:

if (maxResultsForSuggest == null || hits <= maxResultsForSuggest) {
  SuggestMode suggestMode = SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX;
  if (onlyMorePopular) {
    suggestMode = SuggestMode.SUGGEST_MORE_POPULAR;
  } else if (alternativeTermCount > 0) {
    suggestMode = SuggestMode.SUGGEST_ALWAYS;
  }

You did not set maxResultsForSuggest (and neither onlyMorePopular nor
alternativeTermCount), so you ended up with:
SuggestMode suggestMode = SuggestMode.SUGGEST_WHEN_NOT_IN_INDEX;

From the Solr javadoc:

/**
 * If left unspecified, the default behavior will prevail. That is,
 * "correctlySpelled" will be false and suggestions will be returned only if
 * one or more of the query terms are absent from the dictionary and/or index.
 * If set to zero, the "correctlySpelled" flag will be false only if the
 * response returns zero hits. If set to a value greater than zero,
 * suggestions will be returned even if hits are returned (up to the specified
 * number). This number also will serve as the threshold in determining the
 * value of "correctlySpelled". Specifying a value greater than zero is useful
 * for creating "did-you-mean" suggestions for queries that return a low
 * number of hits.
 */
public static final String SPELLCHECK_MAX_RESULTS_FOR_SUGGEST =
    SPELLCHECK_PREFIX + "maxResultsForSuggest";

You probably want to bypass the other parameters and just set the proper
maxResultsForSuggest param for your spellchecker.

Cheers





RE: Can't get spelling suggestions to work properly

2017-01-10 Thread jimi.hullegard
No one has any input on my post below about the spelling suggestions? I just
find it a bit frustrating not being able to understand this feature better, and
why it doesn't give the expected results. A built-in "explain" feature really
would have helped.

/Jimi

-Original Message-
From: jimi.hulleg...@svensktnaringsliv.se 
[mailto:jimi.hulleg...@svensktnaringsliv.se] 
Sent: Friday, December 16, 2016 9:58 PM
To: solr-user@lucene.apache.org
Subject: Can't get spelling suggestions to work properly

Hi,

I'm trying to add the spelling suggestion feature to our search, but I'm having
problems getting suggestions for some misspellings.

For example, the Swedish word 'mycket' exists in ~14,000 of a total of ~40,000
documents in our index.

A search for the incorrect spelling 'myket' (a missing 'c') gives several
spelling suggestions, and the top one is 'mycket'. This is the wanted/expected
behavior.

But a search for the incorrect spelling 'mycet' (a missing 'k') gives no
spelling suggestions.

The only difference between these two searches is that the one that results in
spelling suggestions had zero results, while the other one had two (2) results:
those two documents contain the incorrect spelling ('mycet'). Can this be the
cause of the missing spelling suggestions? But I have set 'maxQueryFrequency'
to 0.001, and with 40,000 documents in the index that should mean the word can
exist in up to 40 documents, and since 2 is less than 40 I argue that this word
should be considered a spelling mistake. But for some reason the Solr
spellchecker considers 'myket' an incorrect spelling, while 'mycet' is
incorrectly considered a correct spelling.

Also, I tried spellcheck.accuracy=0 just to rule out a too-high accuracy
setting, but that didn't help.

Can someone see what I'm doing wrong, or give some tips on configuration
changes and/or how I can troubleshoot this? For example, is there any way to
debug the spellchecker function?
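
For troubleshooting, spellcheck.extendedResults=true adds the frequency of the
original term (origFreq) to the response, which shows directly what document
frequency the spellchecker sees for 'mycet'. A minimal sketch of the extra
request parameters:

&spellcheck.extendedResults=true
&spellcheck.collateExtendedResults=true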


Here are the searches:

Search for 'myket':

http://localhost:8080/solr/s2/select/?q=myket&rows=100&sort=score+desc&fl=*%2Cscore%2C%5Bexplain+style%3Dtext%5D&defType=edismax&qf=title%5E2&qf=swedishText1%5E1&spellcheck=true&spellcheck.accuracy=0&spellcheck.maxCollationTries=200&fq=%2Bactivatedate%3A%5B*+TO+NOW%5D+%2Bexpiredate%3A%5BNOW+TO+*%5D+%2B%28state%3Apublished+OR+state%3Adraft-published+OR+state%3Asubmitted-published+OR+state%3Aapproved-published%29&wt=xml&indent=true

Spellcheck output for 'myket':


 
  

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="myket">
      <int name="numFound">16</int>
      <int name="startOffset">0</int>
      <int name="endOffset">5</int>
      <int name="origFreq">0</int>
      <arr name="suggestion">
        <lst>
          <str name="word">mycket</str>
          <int name="freq">14039</int>
        </lst>
        [...]
      </arr>
    </lst>
  </lst>
  <bool name="correctlySpelled">false</bool>
  <lst name="collations">
    <lst name="collation">
      <str name="collationQuery">mycket</str>
      <int name="hits">14005</int>
      <lst name="misspellingsAndCorrections">
        <str name="myket">mycket</str>
      </lst>
    </lst>
    [...]
  </lst>
</lst>



Spellcheck output for 'mycet':

http://localhost:8080/solr/s2/select/?q=mycet&rows=100&sort=score+desc&fl=*%2Cscore%2C%5Bexplain+style%3Dtext%5D&defType=edismax&qf=title%5E2&qf=swedishText1%5E1&spellcheck=true&spellcheck.accuracy=0&spellcheck.maxCollationTries=200&fq=%2Bactivatedate%3A%5B*+TO+NOW%5D+%2B

RE: ICUFoldingFilter with swedish characters, and tokens with the keyword attribute?

2017-01-10 Thread jimi.hullegard
Hi Eric.

> But that's not the most important bit. Have you considered something like
> MappingCharFilterFactory?
> Unfortunately that's a charFilter, which transforms everything before it
> gets to the KeywordRepeatFilter, so you'd have to use two fields.

Yes, that is actually what I tried after giving up on the idea of being able to
tweak (i.e. configure) the ICUFoldingFilter to work the way I wanted. With a
custom modification of an existing mapping file, with all the Swedish
characters commented out, it worked just fine.
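
Something along these lines (a sketch; the file name is made up, and the
mappings shown are just the relevant excerpt of the stock
mapping-ISOLatin1Accent.txt format):

<charFilter class="solr.MappingCharFilterFactory"
            mapping="mapping-ISOLatin1Accent-sv.txt"/>

# in mapping-ISOLatin1Accent-sv.txt:
"é" => "e"
# "å" => "a"   commented out: a distinct Swedish letter
# "ä" => "a"
# "ö" => "o"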

The fact that it is a charFilter, and the problems that causes, is something I
have already thought about and consider to be OK. For now we won't add another
field (for the non-folded text), but we might very well do that in the future.

/Jimi


RE: Same score listing order

2017-01-10 Thread jimi.hullegard
Hi Kshitij,

Quoting Yonik, the creator of Solr:

"Ties are the same as in lucene... internal docid (equiv to the order in which 
they were added to the index)."

Also, you can have multiple sort clauses, where score can be the first one,
like sort=score desc, publishDate desc. But I think the recommended approach is
to use boosting on date (etc.) to affect the score instead.
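
For example, a minimal sketch (field names are illustrative; recip/ms is the
standard recency-boost idiom from the Solr function query docs):

sort=score desc, publishDate desc

or, boosting recent documents into the score instead:

q=foo&defType=edismax&boost=recip(ms(NOW/HOUR,publishDate),3.16e-11,1,1)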

Hope this helps.
/Jimi

-Original Message-
From: kshitij tyagi [mailto:kshitij.shopcl...@gmail.com] 
Sent: Tuesday, January 10, 2017 5:11 PM
To: solr-user@lucene.apache.org
Subject: Same score listing order

Hi,

I need to understand the order in which documents are listed for a query when
all documents have the same score.

Regards,
Kshitij


ICUFoldingFilter with swedish characters, and tokens with the keyword attribute?

2017-01-09 Thread jimi.hullegard
Hi,

I wasn't happy with how our current Solr configuration handled diacritics (like
'é') in the text and in search queries, since it simply treated a letter with a
diacritic as a distinct letter, i.e. 'é' didn't match 'e', and vice versa.
Except for a handful of rare words where the diacritic in 'é' actually changes
the word's meaning, it is usually used in names of people and places, and the
expected behavior when searching is to not have to type the diacritics and
still get the expected results (like searching for 'Penelope Cruz' and getting
hits for 'Penélope Cruz').

When reading online about how to handle diacritics in Solr, the general
recommendation, when no language-specific solution exists that handles this,
seems to be to use the ICUFoldingFilter. However, this filter doesn't really
come with a lot of documentation, and doesn't seem to have any configuration
options at all (at least none documented).

So what I ended up doing was simply to add the ICUFoldingFilterFactory in the
middle of the existing analyzer chain, like this:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="..."/>
    [...]
    <filter class="solr.ICUFoldingFilterFactory"/>
    [...]
  </analyzer>
</fieldType>

But that didn't really give me the results I want. For example, using the
analysis debug tool I see that the text 'café åäö' becomes 'cafe caf aao'. And
there are two problems with that result:

1. It doesn't respect the keyword attribute
2. It folds the Swedish characters 'åäö' into 'aao'

The disregard of the keyword attribute is bad enough, but the mangling of the
Swedish language is really a show stopper for us. The Swedish language doesn't
consider 'ö', for example, to be the letter 'o' with two diacritical dots above
it, just as 'Q' isn't considered to be the letter 'O' with a diacritical
"squiggly line" at the bottom. So when handling Swedish text, these characters
('åäöÅÄÖ') shouldn't be folded, because then there will be too many
"collisions".

For example, when searching for 'påstå' ('claim'), one doesn't want hits about
'pasta' (you guessed it, it means 'pasta'), just as one doesn't want to get
hits about 'aga' ('corporal punishment, usually against children') when
searching for 'äga' ('to own'). Or even worse, when searching for 'höra' ('to
hear'), one most likely doesn't want hits about 'hora' ('prostitute'). And I
can go on... :)

So, is there a way for us to make the ICUFoldingFilter work in a better way,
i.e. configure it to respect the keyword attribute and leave the 'åäö'
characters alone when folding, but otherwise fold all diacritical characters
into their non-diacritical form? Or how would you recommend us to configure our
analyzer chain to accomplish this?

Regards
/Jimi


Can't get spelling suggestions to work properly

2016-12-16 Thread jimi.hullegard
Hi,

I'm trying to add the spelling suggestion feature to our search, but I'm having
problems getting suggestions for some misspellings.

For example, the Swedish word 'mycket' exists in ~14,000 of a total of ~40,000
documents in our index.

A search for the incorrect spelling 'myket' (a missing 'c') gives several
spelling suggestions, and the top one is 'mycket'. This is the wanted/expected
behavior.

But a search for the incorrect spelling 'mycet' (a missing 'k') gives no
spelling suggestions.

The only difference between these two searches is that the one that results in
spelling suggestions had zero results, while the other one had two (2) results:
those two documents contain the incorrect spelling ('mycet'). Can this be the
cause of the missing spelling suggestions? But I have set 'maxQueryFrequency'
to 0.001, and with 40,000 documents in the index that should mean the word can
exist in up to 40 documents, and since 2 is less than 40 I argue that this word
should be considered a spelling mistake. But for some reason the Solr
spellchecker considers 'myket' an incorrect spelling, while 'mycet' is
incorrectly considered a correct spelling.

Also, I tried spellcheck.accuracy=0 just to rule out a too-high accuracy
setting, but that didn't help.

Can someone see what I'm doing wrong, or give some tips on configuration
changes and/or how I can troubleshoot this? For example, is there any way to
debug the spellchecker function?


Here are the searches:

Search for 'myket':

http://localhost:8080/solr/s2/select/?q=myket&rows=100&sort=score+desc&fl=*%2Cscore%2C%5Bexplain+style%3Dtext%5D&defType=edismax&qf=title%5E2&qf=swedishText1%5E1&spellcheck=true&spellcheck.accuracy=0&spellcheck.maxCollationTries=200&fq=%2Bactivatedate%3A%5B*+TO+NOW%5D+%2Bexpiredate%3A%5BNOW+TO+*%5D+%2B%28state%3Apublished+OR+state%3Adraft-published+OR+state%3Asubmitted-published+OR+state%3Aapproved-published%29&wt=xml&indent=true

Spellcheck output for 'myket':


 
  

<lst name="spellcheck">
  <lst name="suggestions">
    <lst name="myket">
      <int name="numFound">16</int>
      <int name="startOffset">0</int>
      <int name="endOffset">5</int>
      <int name="origFreq">0</int>
      <arr name="suggestion">
        <lst>
          <str name="word">mycket</str>
          <int name="freq">14039</int>
        </lst>
        [...]
      </arr>
    </lst>
  </lst>
  <bool name="correctlySpelled">false</bool>
  <lst name="collations">
    <lst name="collation">
      <str name="collationQuery">mycket</str>
      <int name="hits">14005</int>
      <lst name="misspellingsAndCorrections">
        <str name="myket">mycket</str>
      </lst>
    </lst>
    [...]
  </lst>
</lst>


Spellcheck output for 'mycet':

http://localhost:8080/solr/s2/select/?q=mycet&rows=100&sort=score+desc&fl=*%2Cscore%2C%5Bexplain+style%3Dtext%5D&defType=edismax&qf=title%5E2&qf=swedishText1%5E1&spellcheck=true&spellcheck.accuracy=0&spellcheck.maxCollationTries=200&fq=%2Bactivatedate%3A%5B*+TO+NOW%5D+%2Bexpiredate%3A%5BNOW+TO+*%5D+%2B%28state%3Apublished+OR+state%3Adraft-published+OR+state%3Asubmitted-published+OR+state%3Aapproved-published%29&wt=xml&indent=true

Search for 'mycet':

http://localhost:8080/solr/s2/select/?q=mycet&rows=100&sort=score+desc&fl=*%2Cscore%2C%5Bexplain+style%3Dtext%5D&defType=edismax&qf=title%5E2&qf=swedishText1%5E1&spellcheck=true&spellcheck.accuracy=0&spellcheck.maxCollationTries=200&fq=%2Bactivatedate%3A%5B*+TO+NOW%5D+%2Bexpiredate%3A%5BNOW+TO+*%5D+%2B%28state%3Apublished+OR+

Re: Replicas for same shard not in sync

2016-04-23 Thread jimi.hullegard
Hi,

An extra tip, on top of everything that Erick said:

Add an extra field to all documents that contains the date the document was
indexed. That way you can always compare the Solr documents on different
machines and quickly see which "version" exists on each machine.

And you don't have to add this date to the document data yourself; you can make
Solr do it for you:

In your schema.xml, add a field like this:

<field name="indexationtime" type="tdate" indexed="true" stored="true"/>

(The tdate is a regular solr.TrieDateField.)

And then in your solrconfig.xml, add this to the beginning of your
updateRequestProcessorChain definition:

<processor class="solr.TimestampUpdateProcessorFactory">
  <str name="fieldName">indexationtime</str>
</processor>
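With such a field in place, you can compare a document on each replica by
querying the cores directly; a hedged sketch (host and collection names are
made up; distrib=false keeps the query from being forwarded):

http://host1:8983/solr/mycollection/select?q=id:SOMEDOC&fl=id,indexationtime&distrib=false
http://host2:8983/solr/mycollection/select?q=id:SOMEDOC&fl=id,indexationtime&distrib=false
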
/Jimi

From: tedsolr 
Sent: Friday, April 22, 2016 8:48 PM
To: solr-user@lucene.apache.org
Subject: Replicas for same shard not in sync

I have a SolrCloud setup with v5.2.1 - just two hosts. A ZK ensemble of 3
hosts. Just today, customers searching in one specific collection reported
seeing varying results with the same search. I could confirm this by looking
at the logs - the same search returned different hits depending on the Solr
host. In the admin console I see that the Replication section on one node is
highlighted - drawing your attention to a version and generation difference
between the listing of "master" and "slave".

The shard has the same number of docs on each host, they are just at
different generations. What's the proper way to re-sync? Should I restart
the host with the out of sync collection? Or click the "optimize" button for
the one shard? Or reload the collection? Do I need to delete the replica and
build a new one?

Thanks!





RE: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-21 Thread jimi.hullegard
Hi Ahmet,

Yes, I have also come to that conclusion: I need to do one of those things if I
want this feature, since Solr/Lucene is lacking in this area. Although, after
some discussion with my coworkers, we decided to simply disable norms for the
title field and not do anything more for now. Hopefully all the other boosting
logic we use will give a reasonable user experience even without a length norm
for the title.

Thanks for your help. :)

/Jimi

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com] 
Sent: Thursday, April 21, 2016 7:10 PM
To: solr-user@lucene.apache.org; Hullegård, Jimi 

Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Jimi,

Please do either:

1) write your own similarity that saves the document length (docValues) in a
lossless way and implement whatever punishment/algorithm you want,

or

2) disable norms altogether, add an integer field (title_length), and populate
it (outside Solr) with the number of words in the title field. Then use some
function query to influence the score, e.g.
q=something&boost=someFunctionQuery(title_length)
https://cwiki.apache.org/confluence/display/solr/Function+Queries
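
As a concrete, hypothetical instance of option 2 (div and sqrt are standard
Solr function queries; the qf fields are taken from the earlier mails in this
thread, and title_length is the assumed integer field):

q=john&defType=edismax&qf=title^2 swedishText1^1&boost=div(1,sqrt(title_length))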

Ahmet



On Thursday, April 21, 2016 9:37 AM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
Yes, it definitely seems to be the main problem for us. I did some simple tests
of the encoding and decoding calculations in DefaultSimilarity, and my findings
are:

* For input between 1.0 and 0.5, a difference of 0.01 in the input causes the
output to change by 0 or 0.125, depending on whether it crosses an edge case
* For input between 0.5 and 0.25, a difference of 0.01 in the input causes the
output to change by 0 or 0.0625
* For input between 0.25 and 0.125, a difference of 0.01 in the input causes
the output to change by 0 or 0.015625
* And so on, with smaller and smaller differences in the output value at the
edge cases

I would say that the main problem is for input values between 1.0 and 0.5. So
if one could tweak the SweetSpotSimilarity to start its "raw" (i.e. not
encoded) lengthNorm values at 0.5 instead of 1.0, it would solve my problem for
the title field. This would of course worsen the precision for longer text
values, but since this is a title field that is not a problem.

So, is there a way to configure SweetSpotSimilarity to use 0.5 as its highest
lengthNorm value, instead of 1.0?

/Jimi



From: Ahmet Arslan 
Sent: Thursday, April 21, 2016 2:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Jim,

The fieldNorm encode/decode step causes some precision loss.
This may be a problem when dealing with very short documents.
You can find many discussions on this topic.

ahmet



On Thursday, April 21, 2016 3:10 AM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
Ok sure, I can try and give some examples :)

Let's say that we have the following documents:

Id: 1
Title: John Doe

Id: 2
Title: John Doe Jr.

Id: 3
Title: John Lennon: The Life

Id: 4
Title: John Thompson's Modern Course for the Piano: First Grade Book

Id: 5
Title: I Rode With Stonewall: Being Chiefly The War Experiences of the Youngest 
Member of Jackson's Staff from John Brown's Raid to the Hanging of Mrs. Surratt


And in general, when a search word matches the title, I would like the length
of the title field to influence the score, so that matching documents with
shorter titles get a higher score than documents with longer titles, all else
considered equal.

So, when a user searches for "John", I would like the results to be pretty much
in the order presented above. Though it is not crucial that, for example,
document 1 comes before document 2, I would surely want documents 1-3 to come
before documents 4 and 5.

In my mind, the fieldNorm is a perfect solution for this. At least in theory.
In practice, the encoding of the fieldNorm seems to make this function much
less useful for this use case. Unless I have missed something.

Is there another way to achieve something like this? Note that I don't want a
general boost on documents with short titles; I only want to boost them if the
title field actually matched the query.

/Jimi



From: Jack Krupansky 
Sent: Thursday, April 21, 2016 1:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

I'm not sure I fully follow what distinction you're trying to focus on. I mean, 
traditionally length normalization has simply tried to distinguish a title 
field (rarely more than a dozen words) from a full body of text, or maybe an 
abstract, not things like exactly how many words were in a title.
Or, as another example, a short newswire article of a few paragraphs vs. a 
feature-length article, paper, or even book. IOW, traditionally it

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
Yes, it definitely seems to be the main problem for us. I did some simple tests
of the encoding and decoding calculations in DefaultSimilarity, and my findings
are:

* For input between 1.0 and 0.5, a difference of 0.01 in the input causes the
output to change by 0 or 0.125, depending on whether it crosses an edge case
* For input between 0.5 and 0.25, a difference of 0.01 in the input causes the
output to change by 0 or 0.0625
* For input between 0.25 and 0.125, a difference of 0.01 in the input causes
the output to change by 0 or 0.015625
* And so on, with smaller and smaller differences in the output value at the
edge cases

I would say that the main problem is for input values between 1.0 and 0.5. So
if one could tweak the SweetSpotSimilarity to start its "raw" (i.e. not
encoded) lengthNorm values at 0.5 instead of 1.0, it would solve my problem for
the title field. This would of course worsen the precision for longer text
values, but since this is a title field that is not a problem.

So, is there a way to configure SweetSpotSimilarity to use 0.5 as its highest
lengthNorm value, instead of 1.0?
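
For reference, a small custom similarity could rescale the length norm into
[0.5, 1.0]. A minimal sketch against the Lucene 4.x/5.x API (the class name and
scaling formula are hypothetical, and Solr would additionally need a
SimilarityFactory wrapper to plug it in):

import org.apache.lucene.index.FieldInvertState;
import org.apache.lucene.search.similarities.DefaultSimilarity;

// Hypothetical sketch: squeeze the length norm into [0.5, 1.0] so that one
// extra word only nudges the score instead of shoving it. Note that the
// result is still encoded to a single byte at index time, so the precision
// loss discussed in this thread remains.
public class GentleLengthNormSimilarity extends DefaultSimilarity {
  @Override
  public float lengthNorm(FieldInvertState state) {
    int numTerms = discountOverlaps
        ? state.getLength() - state.getNumOverlap()
        : state.getLength();
    return state.getBoost()
        * (0.5f + 0.5f / (float) Math.sqrt(Math.max(1, numTerms)));
  }
}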

/Jimi


From: Ahmet Arslan 
Sent: Thursday, April 21, 2016 2:24 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Jim,

The fieldNorm encode/decode step causes some precision loss.
This may be a problem when dealing with very short documents.
You can find many discussions on this topic.

ahmet



On Thursday, April 21, 2016 3:10 AM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
Ok sure, I can try and give some examples :)

Let's say that we have the following documents:

Id: 1
Title: John Doe

Id: 2
Title: John Doe Jr.

Id: 3
Title: John Lennon: The Life

Id: 4
Title: John Thompson's Modern Course for the Piano: First Grade Book

Id: 5
Title: I Rode With Stonewall: Being Chiefly The War Experiences of the Youngest 
Member of Jackson's Staff from John Brown's Raid to the Hanging of Mrs. Surratt


And in general, when a search word matches the title, I would like the length
of the title field to influence the score, so that matching documents with
shorter titles get a higher score than documents with longer titles, all else
considered equal.

So, when a user searches for "John", I would like the results to be pretty much
in the order presented above. Though it is not crucial that, for example,
document 1 comes before document 2, I would surely want documents 1-3 to come
before documents 4 and 5.

In my mind, the fieldNorm is a perfect solution for this. At least in theory.
In practice, the encoding of the fieldNorm seems to make this function much
less useful for this use case. Unless I have missed something.

Is there another way to achieve something like this? Note that I don't want a
general boost on documents with short titles; I only want to boost them if the
title field actually matched the query.

/Jimi



From: Jack Krupansky 
Sent: Thursday, April 21, 2016 1:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

I'm not sure I fully follow what distinction you're trying to focus on. I
mean, traditionally length normalization has simply tried to distinguish a
title field (rarely more than a dozen words) from a full body of text, or
maybe an abstract, not things like exactly how many words were in a title.
Or, as another example, a short newswire article of a few paragraphs vs. a
feature-length article, paper, or even book. IOW, traditionally it was more
of a boolean than a broad range of values. Sure, yes, you absolutely can
define a custom similarity with a custom norm that supports a wide range of
lengths, but you'll have to decide what you really want to achieve to tune
it.

Maybe you could give a couple examples of field values that you feel should
be scored differently based on length.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 7:17 PM, 
wrote:

> I am talking about the title field. And for the title field, a sweetspot
> interval of 1 to 50 makes very little sense. I want to have a fieldNorm
> value that differentiates between for example 2, 3, 4 and 5 terms in the
> title, but only very little.
>
> The 20% number I got by simply calculating the difference in the title
> fieldNorm of two documents, where one title was one word longer than the
> other title. And one fieldNorm value was 20% larger than the other as a
> result of that. And since we use multiplicative scoring calculation, a 20%
> increase in the fieldNorm results in a 20% increase in the final score.
>
> I'm not talking about "scores as percentages". I'm simply noting that this
> minor change in the text data (adding or removing one single word) causes
> the score to change by almost 20%. I noted this when I renamed a
> document, removing a word from the title,

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
Yes, we do edismax per-field boosting, with explicit boosting of the title
field. So it sure makes length normalization less relevant. But not
*completely* irrelevant, which is why I still want to have it as part of the
scoring, just with much less impact than it currently has.

/Jimi

From: Jack Krupansky 
Sent: Thursday, April 21, 2016 4:46 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Or should this one be rated higher for "new york", since it's shorter:

* New York

Another thought on length norms: with the advent of multi-field dismax with
per-field boosting, people tend to explicitly boost the title field so that
the traditional length normalization is less relevant.


-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:39 PM, Walter Underwood 
wrote:

> Sure, here are some real world examples from my time at Netflix.
>
> Is this movie twice as much about “new york”?
>
> * New York, New York
>
> Which one of these is the best match for “blade runner”:
>
> * Blade Runner: The Final Cut
> * Blade Runner: Theatrical & Director’s Cut
> * Blade Runner: Workprint
>
> http://dvd.netflix.com/Search?v1=blade+runner <
> http://dvd.netflix.com/Search?v1=blade+runner>
>
> At Netflix (when I was there), those were shown in popularity order with a
> boost function.
>
> And for stemming, should the movie “Saw” match “see”? Maybe not.
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/  (my blog)
>
>
> > On Apr 20, 2016, at 5:28 PM, Jack Krupansky 
> wrote:
> >
> > Maybe it's a cultural difference, but I can't imagine why on a query for
> > "John", any of those titles would be treated as anything other than
> equals
> > - namely, that they are all about John. Maybe the issue is that this
> seems
> > like a contrived example, and I'm asking for a realistic example. Or,
> maybe
> > you have some rule of relevance that you haven't yet shared - and I mean
> > rule that a user would comprehend and consider valuable, not simply a
> > mechanical rule.
> >
> >
> >
> > -- Jack Krupansky
> >
> > On Wed, Apr 20, 2016 at 8:10 PM, 
> > wrote:
> >
> >> Ok sure, I can try and give some examples :)
> >>
> >> Let's say that we have the following documents:
> >>
> >> Id: 1
> >> Title: John Doe
> >>
> >> Id: 2
> >> Title: John Doe Jr.
> >>
> >> Id: 3
> >> Title: John Lennon: The Life
> >>
> >> Id: 4
> >> Title: John Thompson's Modern Course for the Piano: First Grade Book
> >>
> >> Id: 5
> >> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
> >> Youngest Member of Jackson's Staff from John Brown's Raid to the
> Hanging of
> >> Mrs. Surratt
> >>
> >>
> >> And in general, when a search word matches the title, I would like the
> >> length of the title field to influence the score, so that matching
> >> documents with shorter titles get a higher score than documents with
> >> longer titles, all else considered equal.
> >>
> >> So, when a user searches for "John", I would like the results to be
> >> pretty much in the order presented above. Though it is not crucial that,
> >> for example, document 1 comes before document 2, I would surely want
> >> documents 1-3 to come before documents 4 and 5.
> >>
> >> In my mind, the fieldNorm is a perfect solution for this. At least in
> >> theory. In practice, the encoding of the fieldNorm seems to make this
> >> function much less useful for this use case. Unless I have missed
> something.
> >>
> >> Is there another way to achieve something like this? Note that I don't
> >> want a general boost on documents with short titles; I only want to
> >> boost them if the title field actually matched the query.
> >>
> >> /Jimi
> >>
> >> 
> >> From: Jack Krupansky 
> >> Sent: Thursday, April 21, 2016 1:28 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Re: Is it possible to configure a minimum field length for the
> >> fieldNorm value?
> >>
> >> I'm not sure I fully follow what distinction you're trying to focus on.
> I
> >> mean, traditionally length normalization has simply tried to
> distinguish a
> >> title field (rarely more than a dozen words) from a full body of text,
> or
> >> maybe an abstract, not things like exactly how many words were in a
> title.
> >> Or, as another example, a short newswire article of a few paragraphs
> vs. a
> >> feature-length article, paper, or even book. IOW, traditionally it was
> more
> >> of a boolean than a broad range of values. Sure, yes, you absolutely can
> >> define a custom similarity with a custom norm that supports a wide
> range of
> >> lengths, but you'll have to decide what you really want  to achieve to
> tune
> >> it.
> >>
> >> Maybe you could give a couple examples of field values that you feel
> should
> >> be scored differently based on length.
> >>
> >> -- Jack Krupansky
> >>
> >> On Wed, Apr 20, 2016 at 7:17 PM, 
> >> wrote:
> >>
> >>>

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
Yes, the example was contrived. Partly because our documents are mostly Swedish
text, but mostly because I thought the example should be simple enough to focus
on the thing discussed (even though I simplified it to such a degree that I
left out the current main problem with the fieldNorm: the fact that the values
are too coarse when encoded). And we do have titles with lengths varying from 2
words to about 30 words.

For me it makes perfect sense to have the shorter titles come up first in this
example. It is basically the tf-idf principle: it is more likely that the
document titled "John Doe" focuses on "John" than it is for the document titled
"I Rode With Stonewall: Being Chiefly The War Experiences of the Youngest
Member of Jackson's Staff from John Brown's Raid to the Hanging of Mrs.
Surratt".

Now, having said that, I never said that the title length should have a *big*
impact on the score. In fact, this is the main problem I'm trying to solve: I
want the impact to be very, very small. Basically I want this factor to only
*nudge* the document score. It should work in such a way that, if one first
considers the score without this factor, only when two documents have scores
quite close to each other should this factor have any real effect on the
resulting order in the search results. That could be achieved if the fieldNorm
only changed, for example, from 0.79 to 0.74, like the resulting values from
SweetSpotSimilarity for two example documents I tested. But when these values
are encoded and decoded, they become 0.75 and 0.625, causing a much bigger
impact on the final score.

/Jimi

From: Jack Krupansky 
Sent: Thursday, April 21, 2016 2:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Maybe it's a cultural difference, but I can't imagine why on a query for
"John", any of those titles would be treated as anything other than equals
- namely, that they are all about John. Maybe the issue is that this seems
like a contrived example, and I'm asking for a realistic example. Or, maybe
you have some rule of relevance that you haven't yet shared - and I mean
rule that a user would comprehend and consider valuable, not simply a
mechanical rule.



-- Jack Krupansky

On Wed, Apr 20, 2016 at 8:10 PM, 
wrote:

> Ok sure, I can try and give some examples :)
>
> Let's say that we have the following documents:
>
> Id: 1
> Title: John Doe
>
> Id: 2
> Title: John Doe Jr.
>
> Id: 3
> Title: John Lennon: The Life
>
> Id: 4
> Title: John Thompson's Modern Course for the Piano: First Grade Book
>
> Id: 5
> Title: I Rode With Stonewall: Being Chiefly The War Experiences of the
> Youngest Member of Jackson's Staff from John Brown's Raid to the Hanging of
> Mrs. Surratt
>
>
> And in general, when a search word matches the title, I would like the
> length of the title field to influence the score, so that matching
> documents with shorter titles get a higher score than documents with longer
> titles, all else considered equal.
>
> So, when a user searches for "John", I would like the results to be pretty
> much in the order presented above. Though, it is not crucial that for
> example document 1 comes before document 2. But I would surely want
> document 1-3 to come before document 4 and 5.
>
> In my mind, the fieldNorm is a perfect solution for this. At least in
> theory. In practice, the encoding of the fieldNorm seems to make this
> function much less useful for this use case. Unless I have missed something.
>
> Is there another way to achieve something like this? Note that I don't want
> a general boost on documents with short titles; I only want to boost them
> if the title field actually matched the query.
>
> /Jimi
>
> 
> From: Jack Krupansky 
> Sent: Thursday, April 21, 2016 1:28 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> I'm not sure I fully follow what distinction you're trying to focus on. I
> mean, traditionally length normalization has simply tried to distinguish a
> title field (rarely more than a dozen words) from a full body of text, or
> maybe an abstract, not things like exactly how many words were in a title.
> Or, as another example, a short newswire article of a few paragraphs vs. a
> feature-length article, paper, or even book. IOW, traditionally it was more
> of a boolean than a broad range of values. Sure, yes, you absolutely can
> define a custom similarity with a custom norm that supports a wide range of
> lengths, but you'll have to decide what you really want  to achieve to tune
> it.
>
> Maybe you could give a couple examples of field values that you feel should
> be scored differently based on length.
>
> -- Jack Krupansky
>
> On Wed, Apr 20, 2016 at 7:17 PM, 
> wrote:
>
> > I am talki

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
Ok sure, I can try and give some examples :)

Let's say that we have the following documents:

Id: 1
Title: John Doe

Id: 2
Title: John Doe Jr.

Id: 3
Title: John Lennon: The Life

Id: 4
Title: John Thompson's Modern Course for the Piano: First Grade Book

Id: 5
Title: I Rode With Stonewall: Being Chiefly The War Experiences of the Youngest 
Member of Jackson's Staff from John Brown's Raid to the Hanging of Mrs. Surratt


And in general, when a search word matches the title, I would like the length
of the title field to influence the score, so that matching documents with
shorter titles get a higher score than documents with longer titles, all else
considered equal.

So, when a user searches for "John", I would like the results to be pretty much
in the order presented above. Though it is not crucial that, for example,
document 1 comes before document 2, I would surely want documents 1-3 to come
before documents 4 and 5.

In my mind, the fieldNorm is a perfect solution for this. At least in theory. 
In practice, the encoding of the fieldNorm seems to make this function much 
less useful for this use case. Unless I have missed something.

Is there another way to achieve something like this? Note that I don't want a
general boost on documents with short titles; I only want to boost them if the
title field actually matched the query.

/Jimi


From: Jack Krupansky 
Sent: Thursday, April 21, 2016 1:28 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

I'm not sure I fully follow what distinction you're trying to focus on. I
mean, traditionally length normalization has simply tried to distinguish a
title field (rarely more than a dozen words) from a full body of text, or
maybe an abstract, not things like exactly how many words were in a title.
Or, as another example, a short newswire article of a few paragraphs vs. a
feature-length article, paper, or even book. IOW, traditionally it was more
of a boolean than a broad range of values. Sure, yes, you absolutely can
define a custom similarity with a custom norm that supports a wide range of
lengths, but you'll have to decide what you really want to achieve to tune
it.

Maybe you could give a couple examples of field values that you feel should
be scored differently based on length.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 7:17 PM, 
wrote:

> I am talking about the title field. And for the title field, a sweetspot
> interval of 1 to 50 makes very little sense. I want to have a fieldNorm
> value that differentiates between for example 2, 3, 4 and 5 terms in the
> title, but only very little.
>
> The 20% number I got by simply calculating the difference in the title
> fieldNorm of two documents, where one title was one word longer than the
> other title. And one fieldNorm value was 20% larger than the other as a
> result of that. And since we use multiplicative scoring calculation, a 20%
> increase in the fieldNorm results in a 20% increase in the final score.
>
> I'm not talking about "scores as percentages". I'm simply noting that this
> minor change in the text data (adding or removing one single word) causes
> the score to change by almost 20%. I noted this when I renamed a document,
> removing a word from the title, and that single change caused the document
> to move up several positions in the result list. We don't want such minor
> modifications to have such a big impact on the resulting score.
>
> I'm not sure I can agree with you that "the effect of document length
> normalization factor is minimal". Then why does it impact our results in
> such a big way? And as I said, we don't want to disable it completely, we
> just want it to have a much lesser effect, even on really short texts.
>
> /Jimi
>
> 
> From: Ahmet Arslan 
> Sent: Thursday, April 21, 2016 12:10 AM
> To: solr-user@lucene.apache.org
> Subject: Re: Is it possible to configure a minimum field length for the
> fieldNorm value?
>
> Hi Jimi,
>
> Please define a meaningful document-length range, like min=1 max=50.
> By the way, you need to reindex every time you change something.
>
> Regarding the 20% score change, I am not sure how you calculated that number,
> and I assume it is correct.
> What really matters is the relative order of documents; the fact that adding
> a word decreases the initial score by x% doesn't mean much on its own.
> Please see:
> https://wiki.apache.org/lucene-java/ScoresAsPercentages
>
> There is an information retrieval heuristic which says that the addition of
> a non-query term should decrease the score.
>
> Lucene's default document length normalization may favor short documents too
> much. But folks blend the score with other structural fields (popularity),
> or even completely bypass the relevancy score and order by price, production
> date etc. I mean there are many use cases where the effect of the document
> length normalization factor is minimal.

Re: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
I am talking about the title field. And for the title field, a sweetspot 
interval of 1 to 50 makes very little sense. I want to have a fieldNorm value 
that differentiates between for example 2, 3, 4 and 5 terms in the title, but 
only very little.

The 20% number I got by simply calculating the difference in the title
fieldNorm of two documents, where one title was one word longer than the other
title. And one fieldNorm value was 20% larger than the other as a result of
that. And since we use multiplicative scoring calculation, a 20% increase in
the fieldNorm results in a 20% increase in the final score.

I'm not talking about "scores as percentages". I'm simply noting that this
minor change in the text data (adding or removing one single word) causes the
score to change by almost 20%. I noted this when I renamed a document, removing
a word from the title, and that single change caused the document to move up
several positions in the result list. We don't want such minor modifications to
have such a big impact on the resulting score.

I'm not sure I can agree with you that "the effect of document length
normalization factor is minimal". Then why does it impact our results in such a
big way? And as I said, we don't want to disable it completely, we just want it
to have a much lesser effect, even on really short texts.

/Jimi


From: Ahmet Arslan 
Sent: Thursday, April 21, 2016 12:10 AM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Jimi,

Please define a meaningful document-length range, like min=1 max=50.
By the way, you need to reindex every time you change something.

Regarding the 20% score change, I am not sure how you calculated that number,
and I assume it is correct. What really matters is the relative order of
documents; the fact that adding a word decreases the initial score by x%
doesn't mean much on its own. Please see:
https://wiki.apache.org/lucene-java/ScoresAsPercentages

There is an information retrieval heuristic which says that the addition of a
non-query term should decrease the score.

Lucene's default document length normalization may favor short documents too
much. But folks blend the score with other structural fields (popularity), or
even completely bypass the relevancy score and order by price, production date
etc. I mean there are many use cases where the effect of the document length
normalization factor is minimal.

Lucene/Solr is highly pluggable and very easy to customize.

Ahmet


On Wednesday, April 20, 2016 11:05 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
Hi Ahmet,

SweetSpotSimilarity seems quite nice. Some simple testing by throwing some 
different values at the class gives quite good results. Setting ln_min=1, 
ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or less 
what I want. At least for the title field. I'm not sure what the actual effect 
of those settings would be on longer text fields, so maybe I will use the 
SweetSpotSimilarity only for the title field to start with.
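
For reference, a hypothetical sketch of what those settings could look like as
a per-field-type similarity in schema.xml, if I read SweetSpotSimilarityFactory
right (a per-field similarity requires the global solr.SchemaSimilarityFactory;
the field type name and analyzer are illustrative):

<similarity class="solr.SchemaSimilarityFactory"/>

<fieldType name="text_title" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.SweetSpotSimilarityFactory">
    <int name="lengthNormMin">1</int>
    <int name="lengthNormMax">2</int>
    <float name="lengthNormSteepness">0.1</float>
    <bool name="discountOverlaps">true</bool>
  </similarity>
</fieldType>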

Of course I understand that there are many things that can be considered
domain-specific requirements, like whether to favor or punish short, medium, or
long texts, and how. I was just wondering how many actual use cases there are
where one wants a ~20% difference in score between two documents, where the
only difference is that one of the documents has one extra word in one field.
(And now I'm talking about an extra word that doesn't affect anything except
the fieldNorm value.) I for one find it hard to come up with such a use case; I
would consider it very special, and a more lenient calculation a better fit for
most use cases (and therefore most domains). :)

/Jimi


-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
Sent: Wednesday, April 20, 2016 8:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Jimi,

SweetSpotSimilarity allows you to define a document length range, so that all
documents in that range get the same fieldNorm value. In your case, you can say
that from 1 word up to 100 words there is no document length punishment; if a
document is longer than 100 words, some punishment applies.

By the way, favoring or punishing short, middle, or long documents is a
domain-specific thing. You are free to decide what to do.

Ahmet



On Wednesday, April 20, 2016 7:46 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
OK. Well, still, the fact that the score increases almost 20% because of just 
one extra term in the field, is not really reasonable if you ask me. But you 
seem to say that this is expected, reasonable and wanted behavior for most use 
cases?

I'm not sure that I feel comfortable replacing the default Similarity 
implementation with a custom one. That would just increase the complexity of 
our setup and would make future upgrades harder (we w

RE: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
Hang on... It didn't work out as I wanted. But the problem seems to be in the 
encoding of the fieldNorm value. The encoding is so coarse that two values that 
were quite close to each other originally can end up quite far apart after 
encoding and decoding.

For example, when testing this with two documents, the calculated fieldNorm 
value for the title field is 0.7905694 and 0.745356 respectively. Ie the 
difference is only about 0.05. But the encoded values become 122 and 121 
respectively, and when these values are decoded, they become 0.75 and 0.625. 
The difference now is 0.125. That is quite a big step, if you ask me. In fact, 
it is so big that it more or less makes this whole thing with SweetSpotSimilarity 
useless for me.
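
For anyone who wants to reproduce this, here is a minimal sketch (assuming 
Lucene 4.x, where DefaultSimilarity encodes norms as a 3-mantissa-bit "small 
float" via SmallFloat.floatToByte315):

import org.apache.lucene.util.SmallFloat;

public class NormEncodingDemo {
    public static void main(String[] args) {
        float[] norms = {0.7905694f, 0.745356f};
        for (float norm : norms) {
            byte encoded = SmallFloat.floatToByte315(norm);    // quantize to one byte
            float decoded = SmallFloat.byte315ToFloat(encoded); // what scoring later sees
            // Prints 0.7905694 -> 122 -> 0.75 and 0.745356 -> 121 -> 0.625
            System.out.println(norm + " -> " + (encoded & 0xff) + " -> " + decoded);
        }
    }
}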

Am I missing something here? Is it really so that one can have a really great 
similarity implementation that spits out great values, only to have them 
butchered because of the way Lucene stores the data? Can I do something to 
remedy this?

/Jimi

-Original Message-
From: jimi.hulleg...@svensktnaringsliv.se 
[mailto:jimi.hulleg...@svensktnaringsliv.se] 
Sent: Wednesday, April 20, 2016 10:05 PM
To: solr-user@lucene.apache.org
Subject: RE: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Ahmet,

SweetSpotSimilarity seems quite nice. Some simple testing by throwing some 
different values at the class gives quite good results. Setting ln_min=1, 
ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or less 
what I want. At least for the title field. I'm not sure what the actual effect 
of those settings would be on longer text fields, so maybe I will use the 
SweetSpotSimilarity only for the title field to start with.

Of course I understand that there are many things that can be considered domain 
specific requirements, like whether to favor/punish short/medium/long texts, and 
how. I was just wondering how many actual use cases there are where one wants 
a ~20% difference in score between two documents, where the only difference is 
that one of the documents has one extra word in one field. (And now I'm talking 
about an extra word that doesn't affect anything else except the fieldNorm 
value). I for one find it hard to find such a use case, and would consider it a 
very special use case, and would consider a more lenient calculation a better 
fit for most use cases (and therefore most domains). :)

/Jimi

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID]
Sent: Wednesday, April 20, 2016 8:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Jimi,

SweetSpotSimilarity allows you to define a document length range, so that all 
documents in that range will get the same fieldNorm value.
In your case, you can say that from 1 word up to 100 words, no document length 
punishment is employed. If a document is longer than 100 words, some punishment is applied.

By the way; favoring/punishing short, middle, or long documents is a domain 
specific thing. You are free to decide what to do.

Ahmet



On Wednesday, April 20, 2016 7:46 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
OK. Well, still, the fact that the score increases almost 20% because of just 
one extra term in the field, is not really reasonable if you ask me. But you 
seem to say that this is expected, reasonable and wanted behavior for most use 
cases?

I'm not sure that I feel comfortable replacing the default Similarity 
implementation with a custom one. That would just increase the complexity of 
our setup and would make future upgrades harder (we would for example have to 
remember to check if the default similarity configuration or implementation 
changes).

No, if it really is the case that most people like and want this, and there is 
no way to configure Solr/Lucene to calculate fieldNorm in a more reasonable way 
(in my book) for short field values, then I just think we are forced to set 
omitNorms="true", maybe in combination with a simple field boost for shorter 
fields.

/Jimi



-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
Sent: Wednesday, April 20, 2016 5:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

FWIW, length for normalization is measured in terms (tokens), not characters.

With TF-IDF similarity (the default before 6.0), the normalization is based on 
the inverse square root of the number of terms in the field:

return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

That code is in ClassicSimilarity:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115

You can always write your own custom Similarity class to override that 
calculation.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 10:43 AM, 
wr

RE: Questions about tie parameter for dismax/edismax

2016-04-20 Thread jimi.hullegard
Thanks Ahmet! The second I read that part about the "albino elephant" query I 
remembered that I had read that before, but just forgotten about it. That 
explanation is really good, and really should be part of the regular 
documentation if you ask me. :)

/Jimi

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: Wednesday, April 20, 2016 8:23 PM
To: solr-user@lucene.apache.org
Subject: Re: Questions about tie parameter for dismax/edismax

Hi Jimi,

Field based scoring, where you query multiple fields (title,body,keywords etc) 
with multiple query terms, is an unsolved problem. 

(E)dismax is a heuristic approach to attack the problem.

Please see the javadoc of DisjunctionMaxQuery :
https://lucene.apache.org/core/6_0_0/core/org/apache/lucene/search/DisjunctionMaxQuery.html

Some folks try to obtain optimum parameters of edismax from training data.
Others employ learning to rank techniques ...

Ahmet


On Wednesday, April 20, 2016 6:18 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
Hi,

I have been looking a bit at the tie parameter, and I think I understand how it 
works, but I still have a few questions about it.

1. It is not documented anywhere (as far as I have seen) what the default value 
is. Some testing indicates that the default value is 0, and it makes perfect 
sense. But shouldn't that fact be documented?

2. There is very little information about how to think when choosing a tie 
value. Are there really no general recommendations based on some different use 
cases? Or is it simply a matter of "try different values and see what happens"?

3. Some recommendations I have seen mention that a really low value is the best 
option. But can someone explain why? I understand that one moves further away 
from the dismax "philosophy" the higher the tie value one uses. But I care only 
about the quality of the score calculation. Can someone explain why the score 
has a higher quality with a lower tie?

4. Regarding the dismax "philosophy". On the dismax wiki page it says:

"Max means that if your word 'foo' matches both title and body, the max score 
of these two (probably title match) is added to the score, not the sum of the 
two as a simple OR query would do. This gives more control over your ranking."

But it doesn't explain *why* this gives "more control over your ranking". Can 
someone explain the logic behind that statement? I'm not claiming that it is 
incorrect, I just want to understand it. :)

Regards
/Jimi 


RE: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
Hi Ahmet,

SweetSpotSimilarity seems quite nice. Some simple testing by throwing some 
different values at the class gives quite good results. Setting ln_min=1, 
ln_max=2, steepness=0.1 and discountOverlaps=true should give me more or less 
what I want. At least for the title field. I'm not sure what the actual effect 
of those settings would be on longer text fields, so maybe I will use the 
SweetSpotSimilarity only for the title field to start with.
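
For reference, if I do enable it for the title field, I assume the schema 
configuration would look roughly like this (just a sketch; the fieldType name 
and analyzer are placeholders, and I am assuming a global 
solr.SchemaSimilarityFactory is declared so the similarity can be overridden 
per fieldType):

<fieldType name="text_title" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
  <similarity class="solr.SweetSpotSimilarityFactory">
    <bool name="discountOverlaps">true</bool>
    <int name="ln_min">1</int>
    <int name="ln_max">2</int>
    <float name="steepness">0.1</float>
  </similarity>
</fieldType>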

Of course I understand that there are many things that can be considered domain 
specific requirements, like whether to favor/punish short/medium/long texts, and 
how. I was just wondering how many actual use cases there are where one wants 
a ~20% difference in score between two documents, where the only difference is 
that one of the documents has one extra word in one field. (And now I'm talking 
about an extra word that doesn't affect anything else except the fieldNorm 
value). I for one find it hard to find such a use case, and would consider it a 
very special use case, and would consider a more lenient calculation a better 
fit for most use cases (and therefore most domains). :)

/Jimi

-Original Message-
From: Ahmet Arslan [mailto:iori...@yahoo.com.INVALID] 
Sent: Wednesday, April 20, 2016 8:14 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

Hi Jimi,

SweetSpotSimilarity allows you to define a document length range, so that all 
documents in that range will get the same fieldNorm value.
In your case, you can say that from 1 word up to 100 words, no document length 
punishment is employed. If a document is longer than 100 words, some punishment is applied.

By the way; favoring/punishing short, middle, or long documents is a domain 
specific thing. You are free to decide what to do.

Ahmet



On Wednesday, April 20, 2016 7:46 PM, "jimi.hulleg...@svensktnaringsliv.se" 
 wrote:
OK. Well, still, the fact that the score increases almost 20% because of just 
one extra term in the field, is not really reasonable if you ask me. But you 
seem to say that this is expected, reasonable and wanted behavior for most use 
cases?

I'm not sure that I feel comfortable replacing the default Similarity 
implementation with a custom one. That would just increase the complexity of 
our setup and would make future upgrades harder (we would for example have to 
remember to check if the default similarity configuration or implementation 
changes).

No, if it really is the case that most people like and want this, and there is 
no way to configure Solr/Lucene to calculate fieldNorm in a more reasonable way 
(in my book) for short field values, then I just think we are forced to set 
omitNorms="true", maybe in combination with a simple field boost for shorter 
fields.

/Jimi



-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
Sent: Wednesday, April 20, 2016 5:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

FWIW, length for normalization is measured in terms (tokens), not characters.

With TF-IDF similarity (the default before 6.0), the normalization is based on 
the inverse square root of the number of terms in the field:

return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

That code is in ClassicSimilarity:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115

You can always write your own custom Similarity class to override that 
calculation.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 10:43 AM, 
wrote:

> Hi,
>
> In general I think that the fieldNorm factor in the score calculation 
> is quite good. But when the text is short I think that the effect is too big.
>
> Ie with two documents that have a short text in the same field, just a 
> few characters extra in one of the documents lowers the fieldNorm factor too much.
> In one test the text in document 1 is 30 characters long and has 
> fieldNorm 0.4375, and in document 2 the text is 37 characters long and 
> has fieldNorm 0.375. That means that the first document gets almost a 
> 20% higher score simply because of the 7 character difference.
>
> What are my options if I want to change this behavior? Can I set a 
> lower character limit, meaning that all fields with a length below 
> this limit get the same fieldNorm value?
>
> I know I can force fieldNorm to be 1 by setting omitNorms="true" for 
> that field, but I would prefer to still have it, just limit its effect 
> on short texts.
>
> Regards
> /Jimi
>
>
>


RE: Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
OK. Well, still, the fact that the score increases almost 20% because of just 
one extra term in the field, is not really reasonable if you ask me. But you 
seem to say that this is expected, reasonable and wanted behavior for most use 
cases?

I'm not sure that I feel comfortable replacing the default Similarity 
implementation with a custom one. That would just increase the complexity of 
our setup and would make future upgrades harder (we would for example have to 
remember to check if the default similarity configuration or implementation 
changes).

No, if it really is the case that most people like and want this, and there is 
no way to configure Solr/Lucene to calculate fieldNorm in a more reasonable way 
(in my book) for short field values, then I just think we are forced to set 
omitNorms="true", maybe in combination with a simple field boost for shorter 
fields.

/Jimi

 
-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com] 
Sent: Wednesday, April 20, 2016 5:18 PM
To: solr-user@lucene.apache.org
Subject: Re: Is it possible to configure a minimum field length for the 
fieldNorm value?

FWIW, length for normalization is measured in terms (tokens), not characters.

With TF-IDF similarity (the default before 6.0), the normalization is based on 
the inverse square root of the number of terms in the field:

return state.getBoost() * ((float) (1.0 / Math.sqrt(numTerms)));

That code is in ClassicSimilarity:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/lucene/core/src/java/org/apache/lucene/search/similarities/ClassicSimilarity.java#L115

You can always write your own custom Similarity class to override that 
calculation.

-- Jack Krupansky

On Wed, Apr 20, 2016 at 10:43 AM, 
wrote:

> Hi,
>
> In general I think that the fieldNorm factor in the score calculation 
> is quite good. But when the text is short I think that the effect is too big.
>
> Ie with two documents that have a short text in the same field, just a 
> few characters extra in one of the documents lowers the fieldNorm factor too much.
> In one test the text in document 1 is 30 characters long and has 
> fieldNorm 0.4375, and in document 2 the text is 37 characters long and 
> has fieldNorm 0.375. That means that the first document gets almost a 
> 20% higher score simply because of the 7 character difference.
>
> What are my options if I want to change this behavior? Can I set a 
> lower character limit, meaning that all fields with a length below 
> this limit get the same fieldNorm value?
>
> I know I can force fieldNorm to be 1 by setting omitNorms="true" for 
> that field, but I would prefer to still have it, just limit its effect 
> on short texts.
>
> Regards
> /Jimi
>
>
>


Questions about tie parameter for dismax/edismax

2016-04-20 Thread jimi.hullegard
Hi,

I have been looking a bit at the tie parameter, and I think I understand how it 
works, but I still have a few questions about it.

1. It is not documented anywhere (as far as I have seen) what the default value 
is. Some testing indicates that the default value is 0, and it makes perfect 
sense. But shouldn't that fact be documented?

2. There is very little information about how to think when choosing a tie 
value. Are there really no general recommendations based on some different use 
cases? Or is it simply a matter of "try different values and see what happens"?

3. Some recommendations I have seen mention that a really low value is the best 
option. But can someone explain why? I understand that one moves further away 
from the dismax "philosophy" the higher the tie value one uses. But I care only 
about the quality of the score calculation. Can someone explain why the score 
has a higher quality with a lower tie?
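
For what it's worth, my current understanding (from the DisjunctionMaxQuery 
javadoc) is that for each query term:

score = max(per-field scores) + tie * (sum of the other per-field scores)

So, as a made-up example, if 'foo' scores 2.0 in title and 0.5 in body, then 
tie=0.0 gives 2.0 (pure dismax), tie=0.1 gives 2.0 + 0.1*0.5 = 2.05, and 
tie=1.0 gives 2.5 (the plain OR-style sum).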

4. Regarding the dismax "philosophy". On the dismax wiki page it says:

"Max means that if your word 'foo' matches both title and body, the max score 
of these two (probably title match) is added to the score, not the sum of the 
two as a simple OR query would do. This gives more control over your ranking."

But it doesn't explain *why* this gives "more control over your ranking". Can 
someone explain the logic behind that statement? I'm not claiming that it is 
incorrect, I just want to understand it. :)

Regards
/Jimi


Is it possible to configure a minimum field length for the fieldNorm value?

2016-04-20 Thread jimi.hullegard
Hi,

In general I think that the fieldNorm factor in the score calculation is quite 
good. But when the text is short I think that the effect is too big.

Ie with two documents that have a short text in the same field, just a few 
characters extra in one of the documents lowers the fieldNorm factor too much. In 
one test the text in document 1 is 30 characters long and has fieldNorm 0.4375, 
and in document 2 the text is 37 characters long and has fieldNorm 0.375. That 
means that the first document gets almost a 20% higher score simply because of 
the 7 character difference.
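
(If I understand the default formula correctly, fieldNorm is 1/sqrt(numTerms) 
quantized to a single byte, so those two values plausibly correspond to 5 and 7 
terms: 1/sqrt(5) ≈ 0.447 is stored as 0.4375, 1/sqrt(7) ≈ 0.378 is stored as 
0.375, and 0.4375/0.375 ≈ 1.17.)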

What are my options if I want to change this behavior? Can I set a lower 
character limit, meaning that all fields with a length below this limit get 
the same fieldNorm value?

I know I can force fieldNorm to be 1 by setting omitNorms="true" for that 
field, but I would prefer to still have it, just limit its effect on short 
texts.
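
Just to be clear, by omitNorms I mean something like this in the schema (the 
field and type names are just examples):

<field name="title" type="text_sv" indexed="true" stored="true" omitNorms="true" />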

Regards
/Jimi




RE: Can't get phrase field boosting to work using edismax

2016-04-06 Thread jimi.hullegard
On Wednesday, April 6, 2016 2:50 PM, apa...@elyograg.org wrote:
> 
> If you can only create a service desk request, then you might be clicking the 
> "Service Desk" menu item, 
> or maybe you're clicking the little down arrow on the right side of the big 
> red "Create" button.  
> Try clicking the main (left) part of the Create button.
> 
> https://www.dropbox.com/s/u8tq8v9qvb0aq0z/solr-issue-create.png?dl=0

Ah, thanks. It never occurred to me that clicking on the text "Create" would 
give me a different result compared to clicking on the arrow. In my mind, 
"Create" was simply the label, and the arrow indicating a dropdown option for 
"things to create".

/Jimi


RE: How to use TZ parameter in a query

2016-04-06 Thread jimi.hullegard
I think that this parameter is only used to interpret the dates provided in the 
query, such as in query filters. At least that is how I interpret the wiki text. Your 
interpretation makes more sense in general though; it would be nice if it were 
possible to modify the timezone for both the query and the result.
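
For example, something like this (timestamp_dt is just a made-up field name) 
should make the date math in the filter round to Berlin midnight instead of UTC 
midnight, even though the dates in the response are still rendered in UTC:

q=*:*&fq=timestamp_dt:[NOW/DAY+TO+NOW/DAY%2B1DAY]&TZ=Europe/Berlin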

/Jimi

-Original Message-
From: Bogdan Marinescu [mailto:bogdan.marine...@awinta.com] 
Sent: Wednesday, April 6, 2016 11:20 AM
To: solr-user@lucene.apache.org
Subject: How to use TZ parameter in a query

Hi,

According to the wiki
https://wiki.apache.org/solr/CoreQueryParameters#TZ I can use the TZ param to 
specify the timezone.
I tried to make a query and put in the raw section TZ=Europe/Berlin or any 
other found in https://en.wikipedia.org/wiki/List_of_tz_database_time_zones but 
no luck. The date that I get back is still in UTC format.

Any ideas what I'm doing wrong?

Thanks


RE: Can't get phrase field boosting to work using edismax

2016-04-06 Thread jimi.hullegard
OK, well I'm not sure I agree with you. First of all, you ask me to point my 
"pf" towards a tokenized field, but I already do that (the fact that all text 
is tokenized into a single token doesn't change that fact). Also, I don't agree 
with the view that a single term phrase never is valid/reasonable. In this 
specific case, with a KeywordTokenizer, I see it as very reasonable indeed. And 
I would consider a "single term keyword phrase" solution more logical than a 
workaround using special magical characters inserted in the text. Just my two 
cents... :)

Oh, hang on... If a phrase is defined as multiple tokens, and pf is used for 
phrase boosting, does that mean that even with a regular tokenizer the pf 
won't work for fields that only contain one word? For example if the title of 
one document is "John", and the user searches for 'John' (without any 
surrounding phrase-characters), will edismax not boost this document?

/Jimi

-Original Message-
From: Jan Høydahl [mailto:jan@cominvent.com] 
Sent: Wednesday, April 6, 2016 10:43 AM
To: solr-user@lucene.apache.org
Subject: Re: Can't get phrase field boosting to work using edismax

Hi,

Phrase match via “pf” requires the target field to contain a phrase. A phrase 
is defined as multiple tokens. Yours does not contain a phrase since you use 
the KeywordTokenizer, leaving only one token in the field. eDismax pf will thus 
never kick in. Please point your “pf” towards a tokenized field.

If what you are trying to achieve is to boost only when the whole query exactly 
matches the full content of the field, then have a look at my solution here 
https://github.com/cominvent/exactmatch

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> 5. apr. 2016 kl. 19.10 skrev jimi.hulleg...@svensktnaringsliv.se:
> 
> Some more input, before I call it a day. Just for the heck of it, I tried 
> changing minClauseSize to 0 using the Eclipse debugger, so that it didn't 
> return null at line 1203, but instead returned the TermQuery on line 1205. 
> Then everything worked exactly as it should. The matching document got 
> boosted as expected. And in the explain output, this can be seen:
> 
> [...]
> 11.274228 = (MATCH) weight(exactTitle:some words^100.0 in 172) 
> [DefaultSimilarity], result of:
> [...]
> 
> So. In my case, having minClauseSize=2 on line 550 (line 565 for solr 5.5.0) 
> is the culprit. Is this a bug, or am I using the pf in the wrong way? Can 
> someone explain why minClauseSize can't be set to 0 here? The comment simply 
> states "we need at least two or there shouldn't be a boost", but no 
> explanation of *why* at least two is needed.
> 
> Regards
> /Jimi
> 
> -Original Message-
> From: jimi.hulleg...@svensktnaringsliv.se 
> [mailto:jimi.hulleg...@svensktnaringsliv.se]
> Sent: Tuesday, April 5, 2016 6:51 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Can't get phrase field boosting to work using edismax
> 
> I now used the Eclipse debugger, to try and see if I can understand what is 
> happening, and it seems like the ExtendedDismaxQParser simply ignores my pf 
> parameter, since it doesn't interpret it as a phrase query.
> 
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.6.0/
> solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java
> 
> On line 1180 I get a query object of type TermQuery (with the term 
> "exactTitle:some words"). And in the if statements starting at line it is 
> quite clear that if it is not a PhraseQuery or a MultiPhraseQuery, or if the 
> minClauseSize > 1 (and it is set to 2 on line 550) the method simply returns 
> null (ie ignoring my pf parameter). Why is this happening?
> 
> I use Solr 4.6 by the way... I forgot to mention that in my original message.
> 
> 
> -Original Message-
> From: jimi.hulleg...@svensktnaringsliv.se 
> [mailto:jimi.hulleg...@svensktnaringsliv.se]
> Sent: Tuesday, April 5, 2016 5:36 PM
> To: solr-user@lucene.apache.org
> Subject: RE: Can't get phrase field boosting to work using edismax
> 
> OK. Interesting. But... I added a solr.TrimFilterFactory at the end of my 
> analyzer definition. Shouldn't that take care of the added space at the end? 
> The admin analysis page indicates that it works as it should, but I still 
> can't get edismax to boost.
> 
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Tuesday, April 5, 2016 4:42 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Can't get phrase field boosting to work using edismax
> 
> It looks like the code constructing the boost phrase for pf will always add a 
> trailing blank, which is never a problem when a normal tokenizer is used that 
> removes white space, but the keyword tokenizer will preserve that extra 
> space, which prevents an exact match.
> 
> See line 531:
> https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/
> solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java
> 
> I'd say

RE: Can't get phrase field boosting to work using edismax

2016-04-06 Thread jimi.hullegard
I guess I can conclude that this is a bug. But I wasn't able to report it in 
Jira. I just got to some servicedesk form 
(https://issues.apache.org/jira/servicedesk/customer/portal/5/create/27) that 
didn't seem related to solr in any way (the affects/fix version fields didn't 
correspond to any solr version I have heard of). 

Can't a newly created jira user create bug issues straight away? If so, 
where/how exactly?

/Jimi

-Original Message-
From: jimi.hulleg...@svensktnaringsliv.se 
[mailto:jimi.hulleg...@svensktnaringsliv.se] 
Sent: Tuesday, April 5, 2016 7:11 PM
To: solr-user@lucene.apache.org
Subject: RE: Can't get phrase field boosting to work using edismax

Some more input, before I call it a day. Just for the heck of it, I tried 
changing minClauseSize to 0 using the Eclipse debugger, so that it didn't 
return null at line 1203, but instead returned the TermQuery on line 1205. Then 
everything worked exactly as it should. The matching document got boosted as 
expected. And in the explain output, this can be seen:

[...]
11.274228 = (MATCH) weight(exactTitle:some words^100.0 in 172) 
[DefaultSimilarity], result of:
[...]

So. In my case, having minClauseSize=2 on line 550 (line 565 for solr 5.5.0) is 
the culprit. Is this a bug, or am I using the pf in the wrong way? Can someone 
explain why minClauseSize can't be set to 0 here? The comment simply states "we 
need at least two or there shouldn't be a boost", but no explanation of *why* at 
least two is needed.

Regards
/Jimi

-Original Message-
From: jimi.hulleg...@svensktnaringsliv.se 
[mailto:jimi.hulleg...@svensktnaringsliv.se]
Sent: Tuesday, April 5, 2016 6:51 PM
To: solr-user@lucene.apache.org
Subject: RE: Can't get phrase field boosting to work using edismax

I now used the Eclipse debugger, to try and see if I can understand what is 
happening, and it seems like the ExtendedDismaxQParser simply ignores my pf 
parameter, since it doesn't interpret it as a phrase query.

https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.6.0/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java

On line 1180 I get a query object of type TermQuery (with the term 
"exactTitle:some words"). And in the if statements starting at line it is quite 
clear that if it is not a PhraseQuery or a MultiPhraseQuery, or if the 
minClauseSize > 1 (and it is set to 2 on line 550) the method simply returns 
null (ie ignoring my pf parameter). Why is this happening?

I use Solr 4.6 by the way... I forgot to mention that in my original message.


-Original Message-
From: jimi.hulleg...@svensktnaringsliv.se 
[mailto:jimi.hulleg...@svensktnaringsliv.se]
Sent: Tuesday, April 5, 2016 5:36 PM
To: solr-user@lucene.apache.org
Subject: RE: Can't get phrase field boosting to work using edismax

OK. Interesting. But... I added a solr.TrimFilterFactory at the end of my 
analyzer definition. Shouldn't that take care of the added space at the end? 
The admin analysis page indicates that it works as it should, but I still can't 
get edismax to boost.

-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
Sent: Tuesday, April 5, 2016 4:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Can't get phrase field boosting to work using edismax

It looks like the code constructing the boost phrase for pf will always add a 
trailing blank, which is never a problem when a normal tokenizer is used that 
removes white space, but the keyword tokenizer will preserve that extra space, 
which prevents an exact match.

See line 531:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java

I'd say it's a bug, but more a narrow use case that wasn't considered or tested.

-- Jack Krupansky

On Tue, Apr 5, 2016 at 7:50 AM,  wrote:

> Hi,
>
> I'm trying to boost documents using phrase field boosting (ie the pf 
> parameter for edismax), but I can't get it to work (ie boosting 
> documents where the pf field matches the query as a phrase).
>
> As far as I can tell, solr, or more specifically the edismax handler, 
> does
> *something* when I add this parameter. I know this because the QTime 
> increases from around 5-10ms to around 30-40 ms, and the score explain 
> structure is *slightly* modified (though with the same final score for 
> all documents). But nowhere in the explain structure can I see 
> anything about the pf. And I can't understand that. Shouldn't it be 
> included in the explain? If not, is there any way to force it to be included 
> somehow?
>
> The query looks something like this:
>
>
> ?q=some+words&rows=10&sort=score+desc&debugQuery=true&fl=objectid,exac
> tTitle,score%2C%5Bexplain+style%3Dtext%5D&qf=title%5E2&qf=swedishText1
> %5E1&defType=edismax&pf=exactTitle%5E5&wt=xml&indent=true
>
>
> I have one document that has the title "some words", and when I do a 
> simple query filter with exactTitle:"some words" I get a match for 
> that 

RE: Can't get phrase field boosting to work using edismax

2016-04-05 Thread jimi.hullegard
Some more input, before I call it a day. Just for the heck of it, I tried 
changing minClauseSize to 0 using the Eclipse debugger, so that it didn't 
return null at line 1203, but instead returned the TermQuery on line 1205. Then 
everything worked exactly as it should. The matching document got boosted as 
expected. And in the explain output, this can be seen:

[...]
11.274228 = (MATCH) weight(exactTitle:some words^100.0 in 172) 
[DefaultSimilarity], result of:
[...]

So. In my case, having minClauseSize=2 on line 550 (line 565 for solr 5.5.0) is 
the culprit. Is this a bug, or am I using the pf in the wrong way? Can someone 
explain why minClauseSize can't be set to 0 here? The comment simply states "we 
need at least two or there shouldn't be a boost", but no explanation of *why* at 
least two is needed.

Regards
/Jimi

-Original Message-
From: jimi.hulleg...@svensktnaringsliv.se 
[mailto:jimi.hulleg...@svensktnaringsliv.se] 
Sent: Tuesday, April 5, 2016 6:51 PM
To: solr-user@lucene.apache.org
Subject: RE: Can't get phrase field boosting to work using edismax

I now used the Eclipse debugger, to try and see if I can understand what is 
happening, and it seems like the ExtendedDismaxQParser simply ignores my pf 
parameter, since it doesn't interpret it as a phrase query.

https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.6.0/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java

On line 1180 I get a query object of type TermQuery (with the term 
"exactTitle:some words"). And in the if statements starting at line it is quite 
clear that if it is not a PhraseQuery or a MultiPhraseQuery, or if the 
minClauseSize > 1 (and it is set to 2 on line 550) the method simply returns 
null (ie ignoring my pf parameter). Why is this happening?

I use Solr 4.6 by the way... I forgot to mention that in my original message.


-Original Message-
From: jimi.hulleg...@svensktnaringsliv.se 
[mailto:jimi.hulleg...@svensktnaringsliv.se]
Sent: Tuesday, April 5, 2016 5:36 PM
To: solr-user@lucene.apache.org
Subject: RE: Can't get phrase field boosting to work using edismax

OK. Interesting. But... I added a solr.TrimFilterFactory at the end of my 
analyzer definition. Shouldn't that take care of the added space at the end? 
The admin analysis page indicates that it works as it should, but I still can't 
get edismax to boost.

-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
Sent: Tuesday, April 5, 2016 4:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Can't get phrase field boosting to work using edismax

It looks like the code constructing the boost phrase for pf will always add a 
trailing blank, which is never a problem when a normal tokenizer is used that 
removes white space, but the keyword tokenizer will preserve that extra space, 
which prevents an exact match.

See line 531:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java

I'd say it's a bug, but more a narrow use case that wasn't considered or tested.

-- Jack Krupansky

On Tue, Apr 5, 2016 at 7:50 AM,  wrote:

> Hi,
>
> I'm trying to boost documents using phrase field boosting (ie the pf 
> parameter for edismax), but I can't get it to work (ie boosting 
> documents where the pf field matches the query as a phrase).
>
> As far as I can tell, solr, or more specifically the edismax handler, 
> does
> *something* when I add this parameter. I know this because the QTime 
> increases from around 5-10ms to around 30-40 ms, and the score explain 
> structure is *slightly* modified (though with the same final score for 
> all documents). But nowhere in the explain structure can I see 
> anything about the pf. And I can't understand that. Shouldn't it be 
> included in the explain? If not, is there any way to force it to be included 
> somehow?
>
> The query looks something like this:
>
>
> ?q=some+words&rows=10&sort=score+desc&debugQuery=true&fl=objectid,exac
> tTitle,score%2C%5Bexplain+style%3Dtext%5D&qf=title%5E2&qf=swedishText1
> %5E1&defType=edismax&pf=exactTitle%5E5&wt=xml&indent=true
>
>
> I have one document that has the title "some words", and when I do a 
> simple query filter with exactTitle:"some words" I get a match for 
> that document. So then I would expect that the query above would boost 
> this document, and include information about this in the explain. But 
> nothing like this happens, and I can't understand why.
>
> The field looks like this:
>
> <field name="exactTitle" ... required="false" multiValued="false" />
>
> And the fieldType looks like this:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.HTMLStripCharFilterFactory" />
>     <tokenizer class="solr.KeywordTokenizerFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
>

RE: Can't get phrase field boosting to work using edismax

2016-04-05 Thread jimi.hullegard
I now used the Eclipse debugger, to try and see if I can understand what is 
happening, and it seems like the ExtendedDismaxQParser simply ignores my pf 
parameter, since it doesn't interpret it as a phrase query.

https://github.com/apache/lucene-solr/blob/releases/lucene-solr/4.6.0/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java

On line 1180 I get a query object of type TermQuery (with the term 
"exactTitle:some words"). And in the if statements starting at line it is quite 
clear that if it is not a PhraseQuery or a MultiPhraseQuery, or if the 
minClauseSize > 1 (and it is set to 2 on line 550) the method simply returns 
null (ie ignoring my pf parameter). Why is this happening?

I use Solr 4.6 by the way... I forgot to mention that in my original message.


-Original Message-
From: jimi.hulleg...@svensktnaringsliv.se 
[mailto:jimi.hulleg...@svensktnaringsliv.se] 
Sent: Tuesday, April 5, 2016 5:36 PM
To: solr-user@lucene.apache.org
Subject: RE: Can't get phrase field boosting to work using edismax

OK. Interesting. But... I added a solr.TrimFilterFactory at the end of my 
analyzer definition. Shouldn't that take care of the added space at the end? 
The admin analysis page indicates that it works as it should, but I still can't 
get edismax to boost.

-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
Sent: Tuesday, April 5, 2016 4:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Can't get phrase field boosting to work using edismax

It looks like the code constructing the boost phrase for pf will always add a 
trailing blank, which is never a problem when a normal tokenizer is used that 
removes white space, but the keyword tokenizer will preserve that extra space, 
which prevents an exact match.

See line 531:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java

I'd say it's a bug, but more a narrow use case that wasn't considered or tested.

-- Jack Krupansky

On Tue, Apr 5, 2016 at 7:50 AM,  wrote:

> Hi,
>
> I'm trying to boost documents using phrase field boosting (ie the pf 
> parameter for edismax), but I can't get it to work (ie boosting 
> documents where the pf field matches the query as a phrase).
>
> As far as I can tell, solr, or more specifically the edismax handler, 
> does
> *something* when I add this parameter. I know this because the QTime 
> increases from around 5-10ms to around 30-40 ms, and the score explain 
> structure is *slightly* modified (though with the same final score for 
> all documents). But nowhere in the explain structure can I see 
> anything about the pf. And I can't understand that. Shouldn't it be 
> included in the explain? If not, is there any way to force it to be included 
> somehow?
>
> The query looks something like this:
>
>
> ?q=some+words&rows=10&sort=score+desc&debugQuery=true&fl=objectid,exac
> tTitle,score%2C%5Bexplain+style%3Dtext%5D&qf=title%5E2&qf=swedishText1
> %5E1&defType=edismax&pf=exactTitle%5E5&wt=xml&indent=true
>
>
> I have one document that has the title "some words", and when I do a 
> simple query filter with exactTitle:"some words" I get a match for 
> that document. So then I would expect that the query above would boost 
> this document, and include information about this in the explain. But 
> nothing like this happens, and I can't understand why.
>
> The field looks like this:
>
> <field name="exactTitle" ... required="false" multiValued="false" />
>
> And the fieldType looks like this:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.HTMLStripCharFilterFactory" />
>     <tokenizer class="solr.KeywordTokenizerFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
>
>
> I have also tried boosting this document using a boost query, ie 
> bq=exactTitle:"some words", and this works as expected. The document 
> score is boosted, and the explain states this very clearly, with this segment:
>
> [...]
> 9.870669 = (MATCH) weight(exactTitle:some words^5.0 in 12) 
> [DefaultSimilarity], result of:
> [...]
>
> Why is this working, but q=some+words&pf=exactTitle^5 not? Shouldn't 
> edismax rewrite my "pf query" into something very similar to the "bq query"?
>
> Regards
> /Jimi
>


RE: Can't get phrase field boosting to work using edismax

2016-04-05 Thread jimi.hullegard
OK. Interesting. But... I added a solr.TrimFilterFactory at the end of my 
analyzer definition. Shouldn't that take care of the added space at the end? 
The admin analysis page indicates that it works as it should, but I still can't 
get edismax to boost.

-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com] 
Sent: Tuesday, April 5, 2016 4:42 PM
To: solr-user@lucene.apache.org
Subject: Re: Can't get phrase field boosting to work using edismax

It looks like the code constructing the boost phrase for pf will always add a 
trailing blank, which is never a problem when a normal tokenizer is used that 
removes white space, but the keyword tokenizer will preserve that extra space, 
which prevents an exact match.

See line 531:
https://github.com/apache/lucene-solr/blob/releases/lucene-solr/5.5.0/solr/core/src/java/org/apache/solr/search/ExtendedDismaxQParser.java

I'd say it's a bug, but more a narrow use case that wasn't considered or tested.

-- Jack Krupansky

On Tue, Apr 5, 2016 at 7:50 AM,  wrote:

> Hi,
>
> I'm trying to boost documents using phrase field boosting (ie the pf 
> parameter for edismax), but I can't get it to work (ie boosting 
> documents where the pf field matches the query as a phrase).
>
> As far as I can tell, solr, or more specifically the edismax handler, 
> does
> *something* when I add this parameter. I know this because the QTime 
> increases from around 5-10ms to around 30-40 ms, and the score explain 
> structure is *slightly* modified (though with the same final score for 
> all documents). But nowhere in the explain structure can I see 
> anything about the pf. And I can't understand that. Shouldn't it be 
> included in the explain? If not, is there any way to force it to be included 
> somehow?
>
> The query looks something like this:
>
>
> ?q=some+words&rows=10&sort=score+desc&debugQuery=true&fl=objectid,exac
> tTitle,score%2C%5Bexplain+style%3Dtext%5D&qf=title%5E2&qf=swedishText1
> %5E1&defType=edismax&pf=exactTitle%5E5&wt=xml&indent=true
>
>
> I have one document that has the title "some words", and when I do a 
> simple query filter with exactTitle:"some words" I get a match for 
> that document. So then I would expect that the query above would boost 
> this document, and include information about this in the explain. But 
> nothing like this happens, and I can't understand why.
>
> The field looks like this:
>
> <field name="exactTitle" ... required="false" multiValued="false" />
>
> And the fieldType looks like this:
>
> <fieldType name="..." class="solr.TextField" positionIncrementGap="100">
>   <analyzer>
>     <charFilter class="solr.HTMLStripCharFilterFactory" />
>     <tokenizer class="solr.KeywordTokenizerFactory" />
>     <filter class="solr.LowerCaseFilterFactory" />
>   </analyzer>
> </fieldType>
>
>
> I have also tried boosting this document using a boost query, ie 
> bq=exactTitle:"some words", and this works as expected. The document 
> score is boosted, and the explain states this very clearly, with this segment:
>
> [...]
> 9.870669 = (MATCH) weight(exactTitle:some words^5.0 in 12) 
> [DefaultSimilarity], result of:
> [...]
>
> Why is this working, but q=some+words&pf=exactTitle^5 not? Shouldn't 
> edismax rewrite my "pf query" into something very similar to the "bq query"?
>
> Regards
> /Jimi
>


Can't get phrase field boosting to work using edismax

2016-04-05 Thread jimi.hullegard
Hi,

I'm trying to boost documents using phrase field boosting (ie the pf 
parameter for edismax), but I can't get it to work (ie boosting documents where 
the pf field matches the query as a phrase).

As far as I can tell, solr, or more specifically the edismax handler, does 
*something* when I add this parameter. I know this because the QTime increases 
from around 5-10ms to around 30-40 ms, and the score explain structure is 
*slightly* modified (though with the same final score for all documents). But 
nowhere in the explain structure can I see anything about the pf. And I can't 
understand that. Shouldn't it be included in the explain? If not, is there any 
way to force it to be included somehow?

The query looks something like this:

?q=some+words&rows=10&sort=score+desc&debugQuery=true&fl=objectid,exactTitle,score%2C%5Bexplain+style%3Dtext%5D&qf=title%5E2&qf=swedishText1%5E1&defType=edismax&pf=exactTitle%5E5&wt=xml&indent=true


I have one document that has the title "some words", and when I do a simple 
query filter with exactTitle:"some words" I get a match for that document. So 
then I would expect that the query above would boost this document, and include 
information about this in the explain. But nothing like this happens, and I 
can't understand why.

The field looks like this:

<field name="exactTitle" ... required="false" multiValued="false" />
And the fieldType looks like this:

<fieldType name="..." class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <charFilter class="solr.HTMLStripCharFilterFactory" />
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.LowerCaseFilterFactory" />
  </analyzer>
</fieldType>


I have also tried boosting this document using a boost query, ie 
bq=exactTitle:"some words", and this works as expected. The document score is 
boosted, and the explain states this very clearly, with this segment:

[...]
9.870669 = (MATCH) weight(exactTitle:some words^5.0 in 12) [DefaultSimilarity], 
result of:
[...]

Why is this working, but q=some+words&pf=exactTitle^5 not? Shouldn't edismax 
rewrite my "pf query" into something very similar to the "bq query"?

Regards
/Jimi


RE: Explain style json? Without using wt=json...

2016-03-20 Thread jimi.hullegard
Forgot to add that we use Solr 4.6.0.

-Original Message-
From: jimi.hulleg...@svensktnaringsliv.se 
[mailto:jimi.hulleg...@svensktnaringsliv.se] 
Sent: Wednesday, March 16, 2016 9:39 PM
To: solr-user@lucene.apache.org
Subject: Explain style json? Without using wt=json...

Hi,

We are using Solrj to query our solr server, and it works great. However, it 
uses the binary format wt=javabin, and now when I'm trying to get better debug 
output, I notice a problem with this. The thing is, I want to include the 
explain data for each search result, by adding "[explain]" as a field for the 
fl parameter. And when using [explain style=nl] combined with wt=json, the 
explain output is proper and valid json. However, when I use something other 
than wt=json, the explain output is not proper json.

Is there any way for the explain segment to be proper, valid json, without 
using wt=json? Because Solrj forces wt=javabin, without any option to change 
it, as far as I can see.

And, the reason I want the explain segment in proper json format is that I want 
to turn it into a JSONObject, in order to get proper indentation for easier 
reading. Because the regular output doesn't have proper indentation.

Regards
/Jimi


RE: Why is multiplicative boost preferred over additive?

2016-03-19 Thread jimi.hullegard
On Friday, March 18, 2016 2:19 PM, apa...@elyograg.org wrote:
> 
> The "max score" of a particular query can vary widely, and only has meaning 
> within the context of that query.  
> One query on an index might produce a max score of 0.944, so *every* document 
> has a score less than one, 
> while another query *on the same index* (that might even have some of the 
> same result documents) 
> might produce a max score of 12.7, so the top docs have a score *much* higher 
> than one.
> 
> If your additive boost is 5, this represents a relative boost of over 500 
> percent for the top docs 
> of the first query I talked about above, but less than 50% for the top docs 
> of the second.

Thanks Shawn. I think I understand. I guess I was stuck in the mindset of 
having all original scores within a defined interval. 

Although I still don't fully understand why solr can't normalize the score, so 
it is always between say 0.0 and 100.0. Because surely solr knows what the 
maximum "raw score" is.

Sure, I have read the page "Scores As Percentages", but the main argument there 
against a normalized score seems to be that it still doesn't make different 
queries truly "comparable", but that's not what I'm after anyway. I would only 
use the normalized score in my own boost calculation, nothing else.

But, anyway... Since the score(1+boost...) suggestion from Upayavira solves the 
problem with weights, I guess I will start using multiplicative boosts now. :)

But it would be nice to see how other people handle weighted boosts. And, in 
general I find it a bit hard to find concrete examples of queries where one 
combines multiple boost factors (like date recency, popularity, document type 
etc). Most documentation seems to focus on *one* factor only. Like "this is how 
you sort/score based on popularity", "this is how you get more recent documents 
first" etc...

/Jimi


RE: Why is multiplicative boost preferred over additive?

2016-03-19 Thread jimi.hullegard
On Friday, March 18, 2016 3:53 PM, wun...@wunderwood.org wrote:
> 
> Popularity has a very wide range. Try my example, scale 1 million and 100 
> into the same 1.0-0.0 range. Even with log popularity.

Well, in our case, we don't really care to differentiate between documents with 
low popularity. And if we know roughly what the popularity distribution is, it 
is not hard to normalize it to a value between 0.0 and 1.0. The simplest 
approach is to focus on the maximum value, and map that value to 
1.0, so basically the normalization function is: 
normalizedValue=value/maxValue. But knowing the mean and median, or other 
statistical information, one could of course use a more advanced calculation.
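
For example (with made-up numbers): if the most popular article has 1,000,000 
page views, then 1,000,000 maps to 1.0, 250,000 maps to 0.25, and 100 maps to 0.0001.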

In essence, if one can answer the question "How popular is this 
document/movie/item?", using "extremely popular", "very popular", "quite 
popular", "average", "not very popular" and "very unpopular" (ie popularity 
normalized down to 6 possible values), it should not be that hard to normalize 
the popularity to a value between 0.0 and 1.0.

/Jimi


RE: Why is multiplicative boost preferred over additive?

2016-03-19 Thread jimi.hullegard
On Friday, March 18, 2016 4:25 PM, wun...@wunderwood.org wrote:
> 
> That works fine if you have a query that matches things with a wide range of 
> popularities. But that is the easy case.
> 
> What about the query "twilight", which matches all the Twilight movies, all 
> of which are popular (millions of views).

Well, like I said, I focused on our use case. And we deal with articles, not 
movies. And the raw popularity value is basically just "the number of page 
views the last N days". We want to boost documents that many people have 
visited recently, but don't really care about the exact search result position 
when comparing documents with roughly the same popularity. So if all the 
matched documents have *roughly* the same popularity, then we basically don't 
want the popularity to influence the score much at all.

> Or "Lord of the Rings" which only matches movies with hundreds of views? 
> People really will notice when 
> the 1978 animated version shows up before the Peter Jackson films.

Well, doesn't the Peter Jackson "Lord of the Rings" films have more than just a 
few hundred views?

/Jimi


Explain style json? Without using wt=json...

2016-03-19 Thread jimi.hullegard
Hi,

We are using Solrj to query our solr server, and it works great. However, it 
uses the binary format wt=javabin, and now when I'm trying to get better debug 
output, I notice a problem with this. The thing is, I want to include the 
explain data for each search result, by adding "[explain]" as a field for the 
fl parameter. And when using [explain style=nl] combined with wt=json, the 
explain output is proper and valid json. However, when I use something other 
than wt=json, the explain output is not proper json.

Is there any way for the explain segment to be proper, valid json, without 
using wt=json? Because Solrj forces wt=javabin, without any option to change 
it, as far as I can see.

And, the reason I want the explain segment in proper json format is that I want 
to turn it into a JSONObject, in order to get proper indentation for easier 
reading. Because the regular output doesn't have proper indentation.
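
The best workaround I have come up with so far is a sketch like this (assuming 
SolrJ 4.x, and assuming that the transformed value comes back under the 
pseudo-field key "[explain]" as a structured NamedList when style=nl is used), 
keeping wt=javabin and serializing the explain myself:

import org.apache.solr.client.solrj.SolrQuery;
import org.apache.solr.client.solrj.impl.HttpSolrServer;
import org.apache.solr.client.solrj.response.QueryResponse;
import org.apache.solr.common.SolrDocument;

public class ExplainDemo {
    public static void main(String[] args) throws Exception {
        HttpSolrServer server = new HttpSolrServer("http://localhost:8983/solr/collection1");
        SolrQuery query = new SolrQuery("test");
        // style=nl returns the explain as a structured NamedList instead of plain text
        query.setFields("objectid", "score", "[explain style=nl]");
        QueryResponse rsp = server.query(query);
        for (SolrDocument doc : rsp.getResults()) {
            // Over javabin this should be a NamedList, which can be walked
            // recursively and fed into a JSONObject for pretty-printing.
            Object explain = doc.getFieldValue("[explain]");
            System.out.println(doc.getFieldValue("objectid") + ": " + explain);
        }
        server.shutdown();
    }
}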

Regards
/Jimi


Why is multiplicative boost preferred over additive?

2016-03-19 Thread jimi.hullegard
Hi,

After reading a bit on various sites, and especially the blog post "Comparing 
boost methods in Solr", it seems that the preferred boosting type is the 
multiplicative one, over the additive one. But I can't really get my head 
around *why* that is so, since in most boosting problems I can think of, it 
seems that an additive boost would suit better.

For example, in our project we want to boost documents depending on various 
factors, but in essence they can be summarized as:

- Regular edismax logic, like qf=title^2 mainText^1
- Multiple custom document fields, with weights specified at query time

So, first off, the custom fields... It became obvious to me quite quickly that 
multiplicative logic here would totally ruin the purpose of the weights, since 
something like "(f1 *  w1) * (f2 * w2)" is the same as "(f1 *  w2) * (f2 * 
w1)". So, I ended up using additive boost here.

Then we have the combination of the edismax boost, and my custom boost. As far 
as I understand it, when using the boost field with edismax, this combination 
is always performed using multiplicative logic. But the same problem exists 
here as it did with my custom fields. Because if I boost the aggregated result 
of the custom fields using some weight, it doesn't affect the order of the 
documents because that weight influences the edismax boost just as much. What I 
want is to have the weight only influence my custom boost value, so that I can 
control how much (or little) the final score should be effected by the custom 
boost.

So, in both cases I find myself wanting to use the additive boost. But surely I 
must be missing something, right? Am I thinking backwards or something?

I don't use any out-of-the-box example indexes, so I can't provide you with a 
working URL that shows exactly what I am doing. But in essence my query looks 
like this:

- q=test
- defType=edismax
- qf=title^2&qf=mainText1^1
- 
totalRanking=div(sum(product(random1,1),product(random2,1.5),product(random3,2),product(random4,2.5),product(random5,3)),5)
- weightedTotalRanking=product($totalRanking,1.5)
- bf=$weightedTotalRanking
- fl=*,score,[explain style=text],$weightedTotalRanking

random1 to random5 are document fields of type double, with random values 
between 0.0 and 1.0.

With this setup, I can change the overall importance of my custom boosting 
using the factor in weightedTotalRanking (1.5 above). But that is only because 
bf is additive. If I switch to the boost parameter, I can no longer influence 
the order of the documents using this factor, no matter how high a value I 
choose.

Am I looking at this the wrong way? Is there a much better approach to 
achieve what I want?

Regards
/Jimi


RE: Why is multiplicative boost preferred over additive?

2016-03-19 Thread jimi.hullegard
On Friday, March 18, 2016 5:11 PM, wun...@wunderwood.org wrote:
> 
> I used a popularity score based on the DVD being in people's queues and the 
> streaming views. 
> The Peter Jackson films were DVD only. They were in about 100 subscriber 
> queues. 
> The first Twilight film was in 1.25 million queues.
> Now think about the query "twilight zone". How do you make "Twilight" not be 
> the first hit for that?

1. Maybe your popularity value should include more types of "views", maybe both 
Netflix-internal (like total dvd rental, the last X days back, or since 
forever), and netflix-external (like total number of movie-tickets sold, 
worldwide or in the same country).
2. Shouldn't the word "zone" exclude the twilight movies altogether, or at 
least boost the results with that word in the title?
3. Maybe the popularity has a too much of influence on the score?
4. I never said my reasoning about normalizing popularity was applicable to 
your use case. On the contrary, like I said before, I focused on our own use 
case.

/Jimi


RE: Why is multiplicative boost preferred over additive?

2016-03-19 Thread jimi.hullegard
On Thursday, March 17, 2016 11:21 PM, u...@odoko.co.uk wrote:
> 
> If you use additive boosting, when you add a boost to a search with one term, 
> (e.g. between 0 and 1) 
> you get a different effect compared to when you add the same boost to a 
> search with four terms (e.g. between 0 and 4).

Wouldn't that be solvable by multiplying my boost with the max value? Ie in the 
search with one term, my boost is multiplied by 1, and in the case with four 
terms it is multiplied by four. Ie some kind of normalization should solve 
this, right?


> If, for example, you want to add a recency boost, say with recip, where the 
> boost value is between 0 and 1, 
> then use score*(1+boost). This way, a boost of 0 has no effect on the score, 
> whereas a boost of 1 doubles the score. 
> If you use plain multiplicative here, a boost of 0 wipes out the score 
> entirely, which can have nasty effects (it has, at least, for me).

I understand what you mean, but can you still call that a multiplicative 
function? Because score*(1+boost) is the same as score + score*boost. Ie, you 
basically take your boost, multiply it by the original score, and then *add* 
the original score.

But sure, maybe this is technically still called multiplicative (which I guess 
it is, in a way, since you can achieve this using the boost function in 
edismax, which is declared as multiplicative).

/Jimi


RE: Why is multiplicative boost preferred over additive?

2016-03-18 Thread jimi.hullegard
On Thursday, March 17, 2016 7:58 PM, wun...@wunderwood.org wrote:
> 
> Think about using popularity as a boost. If one movie has a million rentals 
> and one has a hundred rentals, there is no additive formula that balances 
> that with text relevance. Even with log(popularity), it doesn't work.

I'm not sure I follow your logic now. If one can express the popularity as a 
value between 0.0 and 1.0, why can't one use that, together with a weight 
(indicating how much the popularity should influence the score, in general) and 
add that to the text relevance score? And how, exactly, would I achieve that 
using any multiplicative formula?

The logic of the weight, in this case, is that I want to be able to tweak how 
much influence the popularity has on the final score (and thus the sort order 
of the documents), where a weight of 0.0 would have the same effect as if the 
popularity wasn't included in the boost logic at all, and a high enough weight 
would have the same effect as if one sorted the documents solely on popularity.
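
(To make that concrete, the additive formula I have in mind is something like 
the following, assuming a popularity field holding values between 0.0 and 1.0; 
popWeight is just a name I made up:

- bf=product(popularity,$popWeight)
- popWeight=0.5

With popWeight=0 nothing is added to the text relevance score, and with a high 
enough popWeight the popularity term dominates it.)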

/Jimi


RE: ExtendedDisMax configuration nowhere to be found

2016-02-29 Thread jimi.hullegard
Thanks Shawn, 

I had more or less assumed that the cwiki site was focused on the latest Solr 
version, but never really noticed that the "reference guide" was available in 
version-specific releases. I guess that is partly because I prefer googling 
about a specific topic, instead of reading some reference guide cover to cover. 
And from a google search for "edismax" (for example), it's not really trivial 
to click one's way into a version-specific reference guide on that topic. 
Instead, one tends to land on the wiki pages (with the old wiki as the first 
hit, sometimes).

Regards
/Jimi

-Original Message-
From: Shawn Heisey [mailto:apa...@elyograg.org] 
Sent: Monday, February 29, 2016 3:45 PM
To: solr-user@lucene.apache.org
Subject: Re: ExtendedDisMax configuration nowhere to be found

On 2/29/2016 7:00 AM, jimi.hulleg...@svensktnaringsliv.se wrote:
> So, should I assume that the "Confluence wiki" is the correct place for all 
> documentation, even for solr 4.6?

If you want documentation specifically for 4.6, there are version-specific 
releases of the guide:

https://archive.apache.org/dist/lucene/solr/ref-guide/

The confluence wiki is the "live" version of the reference guide, applicable to 
whatever version of Solr is being worked on at the moment, not the released 
versions.  Because it's such a large documentation set and Solr evolves 
incrementally, quite a lot of the confluence wiki is applicable to older 
versions, but the wiki as a whole is not intended for those older versions.

The project is gearing up to begin the work on releasing version 6.0, so you 
can expect a LOT of change activity on the confluence wiki in the near future.  
I have no idea how long it will take to finish 6.0.  The last two major 
releases (4.0 and 5.0) took months, but there's strong hope on the team that it 
will only take a few weeks this time.

If you want to keep an eye on the pulse of the project, join the dev list.

http://lucene.apache.org/solr/resources.html#mailing-lists

In addition to a fair number of messages from real people, the dev list 
receives automated email from back-end systems in the project infrastructure, 
which creates very high traffic.  The ability to create filters to move mail 
between folders may help you keep your sanity.

Also listed on the page linked above is the commit notification list, which 
offers a particularly verbose look into what's happening to the project.

Thanks,
Shawn



RE: ExtendedDisMax configuration nowhere to be found

2016-02-29 Thread jimi.hullegard
Well, I have to say that I strongly disagree with you. No regular user should 
have to resort to the source code to understand that edismax is preconfigured. 
Because that is what this is all about, in essence. The current documentation 
doesn't mention this, and the only documentation about configuration I could 
find states that the configuration can be found in the example solrconfig.xml.

I don't know why you start talking about implementation details. That is a 
totally different discussion. I am talking about "What do I need to do to make 
feature X work?", and if the answer to that is "Nothing, feature X is already 
available and configured out-of-the-box", well, then the documentation should 
simply state that fact. There is no reason to hide the fact that the feature is 
already built in. This is what I'm talking about. But to be crystal clear what 
I'm talking about:

1. Something works automatically, with no indication why -> This I consider 
"automagical", and in most cases bad (at least from a professional point of 
view).
2. Something works automatically, with an indication why (even just a few words 
can be OK) -> This is good.

Now, regarding my responsibility as a novice solr user. Somehow I should have 
*known* that one seemingly official solr documentation site was old/deprecated, 
even though I didn't see any information about that on that site? And then when 
I found the correct documentation site, I somehow should have *known* that 
edismax was preconfigured? (Otherwise, how would I know what documentation 
changes to suggest?)

Now, I could make some rude comment about your tone. But I know that I 
might not have articulated myself in the perfect manner in some of my 
messages in this thread. So maybe it is best that we assume that we both in a 
way misunderstood each other a bit (I never meant that the hard work of Solr 
developers/commiters was unappreciated, and you probably never meant that 
important information should be hidden from the user).

Also, I hope that you don't find my critique about the documentation in any way 
as some kind of "demands". I know I am not a paying customer, so I know I can't 
"demand" anything, and would never think of doing that. But that doesn't mean 
that one can't look at this from a crass "business" perspective. If one's goal 
is to have many happy and educated users, then it is not bad to think about 
having the documentation up-to-date and clear about things, even things that 
most experienced Solr users find obvious.

Regarding the old wiki page. I found it using google. If I search for 
"ExtendedDisMax", "edismax configuration", or even just "edismax", I get that 
page as the very first hit. So I can only assume that I am not the only one 
stumbling over that old wiki. Some kind of tombstone warning sounds smart.

Regards
/Jimi


-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com] 
Sent: Monday, February 29, 2016 2:40 PM
To: solr-user@lucene.apache.org
Subject: Re: ExtendedDisMax configuration nowhere to be found

There is nothing wrong with features that appear to be automagical - that 
should in fact be a goal for all modern software systems. Of course, there is 
no magic, it's all real logic and any magic is purely appearance - it's just 
that the underlying logic may be complex and not obvious to an uninformed 
observer. Deliberately hiding information from users (e.g., implementation 
details) is indeed a goal for Solr - no mere mortal should be exposed to the 
intricate detail of the underlying Lucene search library or the apparent magic 
of edismax. In truth, nothing is hidden - the source code of both Solr and 
Lucene are readily available. But to the user it may (and should) appear to be 
magical and even automagical.

OTOH, maybe some of the doc on edismax was not as clear as it could have been, 
in which case it is up to you to point out which specific passage(s) caused 
your difficulty. AFAICT, nothing at all was hidden - the examples in the doc 
(which I pointed you to) seem very simple and direct to the point.
If you experienced them otherwise, it is up to you to point out any problems 
that you had. And as I pointed out, you had started with the old wiki when you 
should have started with the current Solr Reference Guide.

The old edismax wiki should in fact have a tombstone warning that indicates 
that it is obsolete and redirect people to the new doc. Out of curiosity, how 
did you get to that old wiki page in the first place?

-- Jack Krupansky

On Mon, Feb 29, 2016 at 3:20 AM, 
wrote:

> There is no need to deliberately misinterpret what I wrote. What I was 
> trying to say was that "automagical" things don't belong in a 
> professional environment, because it is hiding important information 
> from people. And this is bad as it is, but if it on top of that is the 
> *intended* meaning for things in solr to be "automagical", ie 
> *deliberately* hiding information from the solr users, 

RE: ExtendedDisMax configuration nowhere to be found

2016-02-29 Thread jimi.hullegard
Hi Jan,

Well, I have very likely mistaken some old documentation for being up to date, but 
all I did was to google for "ExtendedDisMax" and clicked on the first result:

https://wiki.apache.org/solr/ExtendedDisMax

I could only assume that this was a valid page since it belongs to 
wiki.apache.org/solr/ and nowhere on the page does it say anything about the 
documentation being deprecated or having moved somewhere else. I can now see 
that the last modification date of that page is August 2013, but that is 
written in a small font at the very bottom of the page, so not something I 
noticed before.

So, should I assume that the "Confluence wiki" is the correct place for all 
documentation, even for solr 4.6?

I hope you can understand my confusion, since there seem to be 3 official 
documentation sites (wiki.apache.org/solr/, 
cwiki.apache.org/confluence/display/solr/ and lucene.apache.org/solr/), with no 
clear indication anywhere (that I could find) which site should be used or how 
they complement each other.

Also I would like to thank you for explaining that the query parsers, response 
writers etc are built into Solr now, so they don't need to be configured at 
all. I am just a bit confused about the fact that nobody thought that the 
documentation should be just as clear about this as you have been here. :)

Regards
/Jimi

-Original Message-
From: Jan Høydahl [mailto:jan@cominvent.com] 
Sent: Monday, February 29, 2016 12:52 PM
To: solr-user@lucene.apache.org
Subject: Re: ExtendedDisMax configuration nowhere to be found

Hi Jimi,

I believe you may be confusing some old config and thus were led to believe 
that you need to pre-configure something in order to start using dismax / 
edismax.

In earlier versions of Solr, there was both a <queryParser> named “dismax” as 
well as a <requestHandler> named “dismax”, setting the defType parameter to 
“dismax”. This caused some confusion and was removed in 3.1 (see 
https://issues.apache.org/jira/browse/SOLR-2363).

All the built-in Solr Query parsers mentioned in the Reference Guide, such as 
“lucene”, “dismax”, “edismax” etc are pre-registered with these keywords, so no 
need for explicit <queryParser> tags. And you can test them out through the 
normal /select end point without configuring a new <requestHandler> section in 
solrconfig.xml. You may of course choose to create your own <requestHandler> 
section with predefined defaults for the defType, qf, pf, pf2, pf3, tie, mm… 
parameters once you have all of those defined, but in the simplest case, all you 
need to do is as described in the RefGuide:

   
http://localhost:8983/solr/techproducts/select?q=foo&defType=edismax&qf=title^10,body^2
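
For illustration, a minimal sketch of such a <requestHandler> section -- the 
handler name and fields are made up, so adjust them to your own schema:

   <requestHandler name="/mysearch" class="solr.SearchHandler">
     <lst name="defaults">
       <str name="defType">edismax</str>
       <str name="qf">title^10 body^2</str>
       <str name="tie">0.1</str>
     </lst>
   </requestHandler>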

Feel free to ask further questions if you don’t get it right.

PS: Solr has a rich library of built-in components ready to use out of the box. 
There are many query parsers, response writers, function queries, field types 
etc that you can start using directly. Then you have some “contrib” features 
which need to be explicitly added to class path and registered in 
solrconfig.xml before using, and the same with 3rd party components you 
download elsewhere. But most of the bundled stuff you read about in the 
RefGuide is already integrated, ready to use.

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

> On 29 Feb 2016, at 09:20, jimi.hulleg...@svensktnaringsliv.se wrote:
> 
> There is no need to deliberately misinterpret what I wrote. What I was trying 
> to say was that "automagical" things don't belong in a professional 
> environment, because it is hiding important information from people. And this 
> is bad as it is, but if it on top of that is the *intended* meaning for 
> things in solr to be "automagical", ie *deliberately* hiding information from 
> the solr users, well that attitude is just baffling in my eyes. I can only 
> hope that I misunderstood you.
> 
> /Jimi
> 
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Sunday, February 28, 2016 11:44 PM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtendedDisMax configuration nowhere to be found
> 
> So, all this hard work that people have put into Solr to make it more like a 
> Disney theme park is just... wasted... on you? Sigh. Okay, I guess we can't 
> please everyone.
> 
> -- Jack Krupansky
> 
> On Sun, Feb 28, 2016 at 5:40 PM, 
> wrote:
> 
>> I have no problem with automatic. It is "automagical" stuff that I 
>> find a bit hard to like. Ie things that are automatic, but don't 
>> explain how and why they are automatic. But Disney Land and Disney 
>> World are actually really good examples of places where the magic 
>> stuff is suitable, ie in theme parks, designed mostly for kids. In the 
>> grown up world of IT, most people prefer logical and documented stuff, not 
>> things that "just work" without explaining why. No offence :)
>> 
>> /Jimi
>> 
>> -Original Message-
>> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
>> Sent: Sunday, February 28, 2016 11:31 PM
>> To: solr-user@lucene

RE: ExtendedDisMax configuration nowhere to be found

2016-02-29 Thread jimi.hullegard
There is no need to deliberately misinterpret what I wrote. What I was trying 
to say was that "automagical" things don't belong in a professional 
environment, because it is hiding important information from people. And this 
is bad as it is, but if it on top of that is the *intended* meaning for things 
in solr to be "automagical", ie *deliberately* hiding information from the solr 
users, well that attitude is just baffling in my eyes. I can only hope that I 
misunderstood you.

/Jimi

-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com] 
Sent: Sunday, February 28, 2016 11:44 PM
To: solr-user@lucene.apache.org
Subject: Re: ExtendedDisMax configuration nowhere to be found

So, all this hard work that people have put into Solr to make it more like a 
Disney theme park is just... wasted... on you? Sigh. Okay, I guess we can't 
please everyone.

-- Jack Krupansky

On Sun, Feb 28, 2016 at 5:40 PM, 
wrote:

> I have no problem with automatic. It is "automagical" stuff that I 
> find a bit hard to like. Ie things that are automatic, but don't 
> explain how and why they are automatic. But Disney Land and Disney 
> World are actually really good examples of places where the magic 
> stuff is suitable, ie in theme parks, designed mostly for kids. In the 
> grown up world of IT, most people prefer logical and documented stuff, not 
> things that "just work" without explaining why. No offence :)
>
> /Jimi
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Sunday, February 28, 2016 11:31 PM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtendedDisMax configuration nowhere to be found
>
> Yes, it absolutely is automagic - just look at those examples in the 
> Confluence ref guide. No special request handler is needed - just the 
> normal default handler. Just the defType and qf parameters are needed 
> - as shown in the wiki examples.
>
> It really is that simple! All you have to supply is the list of fields 
> to query (qf) and your actual query text (q).
>
> I know, I know... some people just can't handle automatic. (Some 
> people hate DisneyLand/World!)
>
> -- Jack Krupansky
>
> On Sun, Feb 28, 2016 at 5:16 PM, 
> wrote:
>
> > I'm sorry, but I am still confused. I'm expecting to see some 
> > <requestHandler> tag somewhere. Why does neither the documentation nor 
> > the example solrconfig.xml contain such a tag?
> >
> > If the edismax requestHandler is defined automatically, the 
> > documentation should explain that. Also, there should still exist 
> > some xml code that corresponds exactly to that default setup, right? 
> > That is what I'm looking for.
> >
> > For now, this edismax thing seems to work "automagically", and I 
> > prefer to understand why and how something works.
> >
> > /Jimi
> >
> > -Original Message-
> > From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> > Sent: Sunday, February 28, 2016 10:58 PM
> > To: solr-user@lucene.apache.org
> > Subject: Re: ExtendedDisMax configuration nowhere to be found
> >
> > Consult the Confluence wiki for more recent doc:
> >
> > https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
> >
> > You can specify all the parameters on your query request as in the 
> > examples, or by placing the parameters in the "defaults" section for 
> > your request handler in solrconfig.xml.
> >
> >
> > -- Jack Krupansky
> >
> > On Sun, Feb 28, 2016 at 2:42 PM, 
> > 
> > wrote:
> >
> > > Hi,
> > >
> > > I want to setup ExtendedDisMax in our solr 4.6 server, but I can't 
> > > seem to find any example configuration for this. Ie the 
> > > configuration needed in solrconfig.xml. In the wiki page 
> > > http://wiki.apache.org/solr/ExtendedDisMax it simply says:
> > >
> > > "Extended DisMax is already configured in the example 
> > > configuration, with the name edismax."
> > >
> > > But this is not true for the solrconfig.xml in our setup (it only 
> > > contains an example for dismax, not edismax), and I downloaded the 
> > > latest solr zip file (solr 5.5.0), and it didn't have either 
> > > dismax or edismax in any of its solrconfig.xml files.
> > >
> > > Why is it so hard to find this configuration? Am I missing 
> > > something obvious?
> > >
> > > Regards
> > > /Jimi
> > >
> >
>


RE: ExtendedDisMax configuration nowhere to be found

2016-02-28 Thread jimi.hullegard
I have no problem with automatic. It is "automagical" stuff that I find a bit 
hard to like. Ie things that are automatic, but don't explain how and why 
they are automatic. But Disney Land and Disney World are actually really good 
examples of places where the magic stuff is suitable, ie in theme parks, 
designed mostly for kids. In the grown up world of IT, most people prefer 
logical and documented stuff, not things that "just work" without explaining 
why. No offence :)

/Jimi

-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com] 
Sent: Sunday, February 28, 2016 11:31 PM
To: solr-user@lucene.apache.org
Subject: Re: ExtendedDisMax configuration nowhere to be found

Yes, it absolutely is automagic - just look at those examples in the Confluence 
ref guide. No special request handler is needed - just the normal default 
handler. Just the defType and qf parameters are needed - as shown in the wiki 
examples.

It really is that simple! All you have to supply is the list of fields to query 
(qf) and your actual query text (q).

I know, I know... some people just can't handle automatic. (Some people hate 
DisneyLand/World!)

-- Jack Krupansky

On Sun, Feb 28, 2016 at 5:16 PM, 
wrote:

> I'm sorry, but I am still confused. I'm expecting to see some 
> <requestHandler> tag somewhere. Why does neither the documentation nor the 
> example solrconfig.xml contain such a tag?
>
> If the edismax requestHandler is defined automatically, the 
> documentation should explain that. Also, there should still exist some 
> xml code that corresponds exactly to that default setup, right? That 
> is what I'm looking for.
>
> For now, this edismax thing seems to work "automagically", and I 
> prefer to understand why and how something works.
>
> /Jimi
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Sunday, February 28, 2016 10:58 PM
> To: solr-user@lucene.apache.org
> Subject: Re: ExtendedDisMax configuration nowhere to be found
>
> Consult the Confluence wiki for more recent doc:
>
> https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser
>
> You can specify all the parameters on your query request as in the 
> examples, or by placing the parameters in the "defaults" section for 
> your request handler in solrconfig.xml.
>
>
> -- Jack Krupansky
>
> On Sun, Feb 28, 2016 at 2:42 PM, 
> wrote:
>
> > Hi,
> >
> > I want to setup ExtendedDisMax in our solr 4.6 server, but I can't 
> > seem to find any example configuration for this. Ie the 
> > configuration needed in solrconfig.xml. In the wiki page 
> > http://wiki.apache.org/solr/ExtendedDisMax it simply says:
> >
> > "Extended DisMax is already configured in the example configuration, 
> > with the name edismax."
> >
> > But this is not true for the solrconfig.xml in our setup (it only 
> > contains an example for dismax, not edismax), and I downloaded the 
> > latest solr zip file (solr 5.5.0), and it didn't have either dismax 
> > or edismax in any of its solrconfig.xml files.
> >
> > Why is it so hard to find this configuration? Am I missing something 
> > obvious?
> >
> > Regards
> > /Jimi
> >
>


RE: ExtendedDisMax configuration nowhere to be found

2016-02-28 Thread jimi.hullegard
I'm sorry, but I am still confused. I'm expecting to see some <requestHandler> 
tag somewhere. Why does neither the documentation nor the example solrconfig.xml 
contain such a tag?

If the edismax requestHandler is defined automatically, the documentation 
should explain that. Also, there should still exist some xml code that 
corresponds exactly to that default setup, right? That is what I'm looking for.

For now, this edismax thing seems to work "automagically", and I prefer to 
understand why and how something works.

/Jimi

-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com] 
Sent: Sunday, February 28, 2016 10:58 PM
To: solr-user@lucene.apache.org
Subject: Re: ExtendedDisMax configuration nowhere to be found

Consult the Confluence wiki for more recent doc:
https://cwiki.apache.org/confluence/display/solr/The+Extended+DisMax+Query+Parser

You can specify all the parameters on your query request as in the examples, or 
by placing the parameters in the "defaults" section for your request handler in 
solrconfig.xml.


-- Jack Krupansky

On Sun, Feb 28, 2016 at 2:42 PM, 
wrote:

> Hi,
>
> I want to setup ExtendedDisMax in our solr 4.6 server, but I can't 
> seem to find any example configuration for this. Ie the configuration 
> needed in solrconfig.xml. In the wiki page 
> http://wiki.apache.org/solr/ExtendedDisMax it simply says:
>
> "Extended DisMax is already configured in the example configuration, 
> with the name edismax."
>
> But this is not true for the solrconfig.xml in our setup (it only 
> contains an example for dismax, not edismax), and I downloaded the 
> latest solr zip file (solr 5.5.0), and it didn't have either dismax or 
> edismax in any of its solrconfig.xml files.
>
> Why is it so hard to find this configuration? Am I missing something 
> obvious?
>
> Regards
> /Jimi
>


ExtendedDisMax configuration nowhere to be found

2016-02-28 Thread jimi.hullegard
Hi,

I want to setup ExtendedDisMax in our solr 4.6 server, but I can't seem to find 
any example configuration for this. Ie the configuration needed in 
solrconfig.xml. In the wiki page http://wiki.apache.org/solr/ExtendedDisMax it 
simply says:

"Extended DisMax is already configured in the example configuration, with the 
name edismax."

But this is not true for the solrconfig.xml in our setup (it only contains an 
example for dismax, not edismax), and I downloaded the latest solr zip file 
(solr 5.5.0), and it didn't have either dismax or edismax in any of its 
solrconfig.xml files.

Why is it so hard to find this configuration? Am I missing something obvious?

Regards
/Jimi


Index time or query time boost, and help with boost syntax

2016-02-22 Thread jimi.hullegard
Hi,

We have a use case where we want to influence the score of the documents based 
on the document type, and I am a bit unsure what is the best way to achieve 
this. In essence we have about 100,000 documents, of about 15 different 
document types. And we more or less want to tweak the score differently for 
each document type (ie it is not just one document type that should be boosted 
over all the others).

How would you suggest that we do this? First I thought that query time boosting 
would be perfect for this, because that way we can tweak and fine tune the 
boost levels without having to reindex everything each time. But to be honest, 
I really don't understand how I would put such a query together, using the 
edismax parser. I can't seem to find one single example for edismax for this, 
using the multiplicative boost, that boosts like this: documentType:person^1.8 
documentType:publication^1.5 documentType:news^1.5 documentType:event^1.3 
etc... Can someone help me out with the syntax?
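
(The closest I have come so far -- and this is just my own sketch, so please 
correct me if it is wrong -- is chaining function queries in the boost 
parameter, since boost in edismax is multiplicative:

boost=if(termfreq(documentType,'person'),1.8,
     if(termfreq(documentType,'publication'),1.5,
     if(termfreq(documentType,'news'),1.5,
     if(termfreq(documentType,'event'),1.3,1.0))))

written as one line in the actual request. But that gets unwieldy with 15 
document types, so a cleaner syntax would be much appreciated.)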

Another approach could be that we use index time boost. That would simplify the 
queries, and to be honest I don't think that we need to modify the boosting 
factors much after the initial tweaking is done, and also our indexing process 
is fairly quick and lightweight, so it isn't a big deal to perform a full 
reindex.
But here I am also unsure of how to set that up properly. Basically we want to 
boost the documents based on document type, regardless of the query. According 
to the documentation, this is what happens when one uses the boost attribute on 
the doc element in the xml. However the documentation also mentions that this 
is just "a convenience mechanism equivalent to specifying a boost attribute on 
each of the individual fields that support norms". This leaves me wondering:

1. If boost is defined on both the doc and field level, how is that 
interpreted? Are the values merged using 
add/multiply/max/some-other-math-function? Or is the doc boost just used as a 
default value for fields that don't define their own boost?
2. What about fields that don't have norms? If a query matches such a field, 
wouldn't that affect the score, without me being able to influence it?
3. On a general note: Is the score I'm boosting really the 
total/outermost/final score of the document? So that a boost of 2.0 would 
double the final score of that document, all else equal? Or am I simply 
boosting one "inner score", which in turn is used in some complex math 
expression, so that it might not influence the final score at all in some 
circumstances, and other times might only influence it in a much smaller 
way?

An alternative I guess could be to start out with query time boosting like 
above, to find the appropriate boosting levels. And then convert this to some 
kind of hybrid solution afterwards, where the boost factor is stored in a field 
in the document (thus being set at index time), and then being used in a boost 
function in the query. With this solution, I guess that it would also be 
possible to have multiple "boost fields" in the documents, each with different 
relative boost values based on document type, and then be able to choose at 
query time what boost field we want. Would that be a good solution you think? 
But would it be possible to go from a query boost of the type 
"documentType:person^1.8 ..." to a function query boost that uses a document 
field with that value? Ie, would the resulting scores be the same for 
"documentType:person^1.8 ..." on one hand, and a function boost query with a 
field that has the value 1.8 for documents of type person? Or could the boost 
values from these different boost styles result in different final scores?
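
(For clarity, the hybrid I am picturing would store the factor in a 
hypothetical typeBoost double field at index time -- 1.8 for person documents, 
1.5 for publications, and so on -- and then apply it at query time with 
something like:

boost=field(typeBoost)

so that each document's score is multiplied by its own stored factor.)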

Regards
/Jimi


RE: Mix Solr 4 and 5?

2016-01-22 Thread jimi.hullegard
Oh, one more thing. Would this setup still be possible if we wanted the new 
5.x Solr server to be the SolrCloud version? I'm not saying that 
SolrCloud is a requirement for us (it might even not be suitable, since our 
index is not that large), but still would be good to know.

/Jimi

-Original Message-
From: jimi.hulleg...@svensktnaringsliv.se 
[mailto:jimi.hulleg...@svensktnaringsliv.se] 
Sent: Friday, January 22, 2016 11:03 PM
To: solr-user@lucene.apache.org
Subject: RE: Mix Solr 4 and 5?

OK, so just to be clear. As far as you know, and from your point of view, you 
would consider it a better solution to stick with the 4.6 solrj client jar for 
both the 4.6 and 5.x communication, rather than switching the 4.6 solrj client 
jar to the 5.x version and hoping that the CMS solr-specific code (written with 
4.x in mind) still would work?

And by the way, thanks for all the feedback and input all you guys. I really 
appreciate it.
/Jimi

-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
Sent: Friday, January 22, 2016 10:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Mix Solr 4 and 5?

Personally, I think the Solr project should endeavor to commit to guaranteeing 
that a SolrJ x.y client will be compatible with a Solr x+1.y2 Solr server. 
AFAICT there currently isn't such a formal compat commitment or promise, but 
also AFAIK there is no known non-compat issue between SolrJ 4.y and Solr 5.y2. 
Let's see if anybody else knows of any. There might be issues if you do extreme 
things like examining the detailed cluster state from Zookeeper or use some of 
the non-traditional APIs introduced since 4.6 that may have been works in 
progress, but as long as you keep your app usage of these more advanced 
features to a minimum, you should be able to sidestep such issues.

If you try it and do run into a compat issue, we should make an effort to 
consider that an unacceptable bug since upgrading clients is not always such an 
easy or even feasible process, and if the old clients aren't using any new 
features there would be a reasonable expectation that they should continue to 
work.


-- Jack Krupansky

On Fri, Jan 22, 2016 at 10:40 AM, 
wrote:

> Yeah, sort of. Solr isn't bundled in the CMS, it is in a separate 
> Tomcat instance. But our code is running on the same Tomcat as the 
> CMS, and the CMS uses solrj 4.x to talk with its solr. And now we want 
> to be able to talk with our own separate solr, running solr 5.x, and 
> would prefer to use solrj for this.
>
> /Jimi
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Friday, January 22, 2016 10:11 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Mix Solr 4 and 5?
>
> Just to be clear, are you talking about a single app that does SolrJ 
> calls to both your CMS and your free text search index? So, one Java 
> app that is simultaneously sending requests to two Solr instances (one 4, 
> one 5)?
>
> -- Jack Krupansky
>
> On Fri, Jan 22, 2016 at 1:57 AM, 
> wrote:
>
> > Hi,
> >
> > Long story short, we use a CMS that is integrated with Solr 4.6, 
> > with the solrj jar file in the global/common Tomcat classpath. We 
> > currently use a Google Search Appliance machine for our own freetext 
> > search needs, but plan to replace that with some other solution in 
> > the near future. Since we already work with solr because of the CMS 
> > integration, we would like to select solr for this project.
> >
> > But I would prefer to use the latest version, ie solr 5, and I am 
> > not sure how that would work in our situation. Can we use the solrj 
> > client for solr
> > 4 when indexing and searching on a solr 5 server? If so, would we 
> > miss some important feature, and would this setup be future proof?
> >
> > Or can we somehow use both solr4 and solr 5 client libraries at the 
> > same time, in the same context? It is not possible to upgrade the 
> > solr server that the CMS is using, and it is not possible to remove 
> > the 4.6 solrj jar from the common classpath in Tomcat. That is, 
> > unless the solr 5 version of solrj is backwards compatible, so that 
> > we can switch the jar files and our CMS would still be able to index 
> > and search in
> its own solr 4 server.
> >
> > What would you say that our options are? I would really not like 
> > having do low level http calls to the solr 5 server.
> >
> > Regards
> > /Jimi
> >
>


RE: Mix Solr 4 and 5?

2016-01-22 Thread jimi.hullegard
OK, so just to be clear. As far as you know, and from your point of view, you 
would consider it a better solution to stick with the 4.6 solrj client jar for 
both the 4.6 and 5.x communication, rather than switching the 4.6 solrj client 
jar to the 5.x version and hoping that the CMS solr-specific code (written with 
4.x in mind) still would work?

And by the way, thanks for all the feedback and input all you guys. I really 
appreciate it.
/Jimi

-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com] 
Sent: Friday, January 22, 2016 10:52 PM
To: solr-user@lucene.apache.org
Subject: Re: Mix Solr 4 and 5?

Personally, I think the Solr project should endeavor to commit to guaranteeing 
that a SolrJ x.y client will be compatible with a Solr x+1.y2 Solr server. 
AFAICT there currently isn't such a formal compat commitment or promise, but 
also AFAIK there is no known non-compat issue between SolrJ 4.y and Solr 5.y2. 
Let's see if anybody else knows of any. There might be issues if you do extreme 
things like examining the detailed cluster state from Zookeeper or use some of 
the non-traditional APIs introduced since 4.6 that may have been works in 
progress, but as long as you keep your app usage of these more advanced 
features to a minimum, you should be able to sidestep such issues.

If you try it and do run into a compat issue, we should make an effort to 
consider that an unacceptable bug since upgrading clients is not always such an 
easy or even feasible process, and if the old clients aren't using any new 
features there would be a reasonable expectation that they should continue to 
work.


-- Jack Krupansky

On Fri, Jan 22, 2016 at 10:40 AM, 
wrote:

> Yeah, sort of. Solr isn't bundled in the CMS, it is in a separate 
> Tomcat instance. But our code is running on the same Tomcat as the 
> CMS, and the CMS uses solrj 4.x to talk with its solr. And now we want 
> to be able to talk with our own separate solr, running solr 5.x, and 
> would prefer to use solrj for this.
>
> /Jimi
>
> -Original Message-
> From: Jack Krupansky [mailto:jack.krupan...@gmail.com]
> Sent: Friday, January 22, 2016 10:11 PM
> To: solr-user@lucene.apache.org
> Subject: Re: Mix Solr 4 and 5?
>
> Just to be clear, are you talking about a single app that does SolrJ 
> calls to both your CMS and your free text search index? So, one Java 
> app that is simultaneously sending requests to two Solr instances (one 4, 
> one 5)?
>
> -- Jack Krupansky
>
> On Fri, Jan 22, 2016 at 1:57 AM, 
> wrote:
>
> > Hi,
> >
> > Long story short, we use a CMS that is integrated with Solr 4.6, 
> > with the solrj jar file in the global/common Tomcat classpath. We 
> > currently use a Google Search Appliance machine for our own freetext 
> > search needs, but plan to replace that with some other solution in 
> > the near future. Since we already work with solr because of the CMS 
> > integration, we would like to select solr for this project.
> >
> > But I would prefer to use the latest version, ie solr 5, and I am 
> > not sure how that would work in our situation. Can we use the solrj 
> > client for solr
> > 4 when indexing and searching on a solr 5 server? If so, would we 
> > miss some important feature, and would this setup be future proof?
> >
> > Or can we somehow use both solr4 and solr 5 client libraries at the 
> > same time, in the same context? It is not possible to upgrade the 
> > solr server that the CMS is using, and it is not possible to remove 
> > the 4.6 solrj jar from the common classpath in Tomcat. That is, 
> > unless the solr 5 version of solrj is backwards compatible, so that 
> > we can switch the jar files and our CMS would still be able to index 
> > and search in
> its own solr 4 server.
> >
> > What would you say that our options are? I would really not like 
> > having do low level http calls to the solr 5 server.
> >
> > Regards
> > /Jimi
> >
>


RE: Mix Solr 4 and 5?

2016-01-22 Thread jimi.hullegard
Shawn wrote:
> 
> If you are NOT running SolrCloud, then that should work with no problem. 
> The HTTP API is fairly static and has not seen any major upheaval recently.
> If you're NOT running SolrCloud, you may even be able to replace the 
> SolrJ jar in your existing system with the 5.4.1 version (and update 
> SolrJ's dependent jars) and have everything continue to work.
> 
> If you ARE running SolrCloud, I would not try mixing 4.x and 5.x, 
> in either direction.  SolrCloud is evolving very quickly ... I wouldn't
> even mix *minor* versions, much less *major* versions.  
> There are differences in how the zookeeper database is laid out, 
> and mixing versions is not guaranteed to work, especially if SolrJ 
> is older than Solr.  If the version difference is small and SolrJ is newer 
> than Solr, there's a chance of success, but with the situation you 
> have described, SolrCloud would likely not work.

When you talk about not mixing 4.x and 5.x when using SolrCloud, you mean 
between the client and the server that talk to each other, right? Or would it 
be a problem keeping our existing non-cloud solr 4.x server, upgrading the 
client solrj jar to 5.x (assuming this works, like you and others here seem to 
think it should/could), and then adding a new SolrCloud 5.x server? That way, 
the two separate communication "channels" are solrj 5.x <--> solr 4.x 
server, and solrj 5.x <--> solrcloud 5.x.

Or does the mere presence of a solr 4.x server and a SolrCloud 5.x server on 
the same network cause problems, even when they don't know about each other?

Regards
/Jimi


RE: Mix Solr 4 and 5?

2016-01-22 Thread jimi.hullegard
Yeah, sort of. Solr isn't bundled in the CMS, it is in a separate Tomcat 
instance. But our code is running on the same Tomcat as the CMS, and the CMS 
uses solrj 4.x to talk with its solr. And now we want to be able to talk with 
our own separate solr, running solr 5.x, and would prefer to use solrj for this.

/Jimi

-Original Message-
From: Jack Krupansky [mailto:jack.krupan...@gmail.com] 
Sent: Friday, January 22, 2016 10:11 PM
To: solr-user@lucene.apache.org
Subject: Re: Mix Solr 4 and 5?

Just to be clear, are you talking about a single app that does SolrJ calls to 
both your CMS and your free text search index? So, one Java app that is 
simultaneously sending requests to two Solr instances (one 4, one 5)?

-- Jack Krupansky

On Fri, Jan 22, 2016 at 1:57 AM, 
wrote:

> Hi,
>
> Long story short, we use a CMS that is integrated with Solr 4.6, with 
> the solrj jar file in the global/common Tomcat classpath. We currently 
> use a Google Search Appliance machine for our own freetext search 
> needs, but plan to replace that with some other solution in the near 
> future. Since we already work with solr because of the CMS 
> integration, we would like to select solr for this project.
>
> But I would prefer to use the latest version, ie solr 5, and I am not 
> sure how that would work in our situation. Can we use the solrj client 
> for solr
> 4 when indexing and searching on a solr 5 server? If so, would we miss 
> some important feature, and would this setup be future proof?
>
> Or can we somehow use both solr4 and solr 5 client libraries at the 
> same time, in the same context? It is not possible to upgrade the solr 
> server that the CMS is using, and it is not possible to remove the 4.6 
> solrj jar from the common classpath in Tomcat. That is, unless the 
> solr 5 version of solrj is backwards compatible, so that we can switch 
> the jar files and our CMS would still be able to index and search in its own 
> solr 4 server.
>
> What would you say that our options are? I would really not like 
> having do low level http calls to the solr 5 server.
>
> Regards
> /Jimi
>


Mix Solr 4 and 5?

2016-01-21 Thread jimi.hullegard
Hi,

Long story short, we use a CMS that is integrated with Solr 4.6, with the solrj 
jar file in the global/common Tomcat classpath. We currently use a Google 
Search Appliance machine for our own freetext search needs, but plan to replace 
that with some other solution in the near future. Since we already work with 
solr because of the CMS integration, we would like to select solr for this 
project.

But I would prefer to use the latest version, ie solr 5, and I am not sure how 
that would work in our situation. Can we use the solrj client for solr 4 when 
indexing and searching on a solr 5 server? If so, would we miss some important 
feature, and would this setup be future proof?

Or can we somehow use both solr4 and solr 5 client libraries at the same time, 
in the same context? It is not possible to upgrade the solr server that the CMS 
is using, and it is not possible to remove the 4.6 solrj jar from the common 
classpath in Tomcat. That is, unless the solr 5 version of solrj is backwards 
compatible, so that we can switch the jar files and our CMS would still be able 
to index and search in its own solr 4 server.

What would you say that our options are? I would really not like having do low 
level http calls to the solr 5 server.

Regards
/Jimi


SV: Date truncation and time zone when searching

2014-05-21 Thread jimi.hullegard
Thanks! I totally forgot to add the word "math" (as in 'solr date math time 
zone') when searching for a solution to this, so I never stumbled upon that 
jira issue. Will give it a try.
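
(For anyone else landing here: the idea is to append the TZ parameter to the 
request so that the date math rounds in a local time zone, for example

q=datefield:[* TO NOW/MONTH]&TZ=Europe/Stockholm

Europe/Stockholm matches our case; substitute your own zone id.)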

/Jimi

> -Original Message-
> From: Erick Erickson [mailto:erickerick...@gmail.com]
> Sent: May 21, 2014 17:25
> To: solr-user@lucene.apache.org
> Subject: Re: Date truncation and time zone when searching
> 
> Try the TZ parameter on the query, as blah&TZ=GMT-4
> 
> There's a good discussion of why PDT is ambiguous here:
> https://issues.apache.org/jira/browse/SOLR-2690.
> 
> You can define whatever default parameters you want in your handler in the
> <defaults> section, the TZ parameter included.
> 
> 
> Best
> Erick
> 
> On Wed, May 21, 2014 at 7:30 AM,  
> wrote:
> > OK. Feels a bit strange that one would have to do this manual calculation in
> every place that performs searches like this.
> > Would be much more logical if solr supported specifying the timezone in
> the query (with a default setting being possible to configure in solrconfig),
> and that solr itself did this conversion behind the scenes. Maybe that will
> come in the future?
> >
> > /Jimi
> >
> >> -Original Message-
> >> From: Michael Ryan [mailto:mr...@moreover.com]
> >> Sent: May 21, 2014 16:23
> >> To: solr-user@lucene.apache.org
> >> Subject: RE: Date truncation and time zone when searching
> >>
> >> Well for CEST, which is 2 hours ahead, I would think you could just do...
> >>
> >> datefield:[* TO NOW/MONTH-2HOURS]
> >>
> >> That would give you everything up to 2014-04-30 22:00:00 GMT, which
> >> is
> >> 2014-05-01 00:00:00 CEST.
> >>
> >> Always always always store the correct value.
> >>
> >> -Michael
> >>
> >> -Original Message-
> >> From: jimi.hulleg...@svensktnaringsliv.se
> >> [mailto:jimi.hulleg...@svensktnaringsliv.se]
> >> Sent: Wednesday, May 21, 2014 10:12 AM
> >> To: solr-user@lucene.apache.org
> >> Subject: Date truncation and time zone when searching
> >>
> >> Hi,
> >>
> >> What is the preferred way to do searches with date truncation with
> >> respect to a specific time zone? The dates are stored correctly, ie I
> >> can see the UTC date in the index and if I add 1 or 2 hours
> >> (depending on daylight saving time or
> >> not) I get the time in our time zone (CET/CEST). But when I search I
> >> want to be able to do something like:
> >>
> >> datefield:[* TO NOW/MONTH]
> >> or:
> >> datefield:[* TO NOW/DAY+3DAYS]
> >> etc...
> >>
> >> ...and I want the MONTH or DAY truncation to correspond to our time
> zone.
> >> How can I achieve this? Some of our dates have the time set to
> >> 00:00 (CET/CEST), but since the date truncation logic in solr only
> >> handles UTC that would classify those documents to "belong to the day
> >> before", and if the date is on 00:00 on January 1, then the document
> >> is even considered to belong to the year before its actual year!
> >>
> >> Surely I can't be the only one hindered by this. What are the common
> >> workarounds used today? Some people mention that they trick solr by
> >> using the date and time expressed as for their own time zone, and
> >> then claiming that the time zone is UTC. Ie saying
> >> "2000-01-01T00:00:00Z" for the beginning of this millennium, even
> >> though it should be "1999-12-31T23:00:00Z". But I don't like this
> >> idea of storing incorrect date values in the index, since then we need to
> "convert" them again before we display them.
> >>
> >> Regards
> >> /Jimi


SV: Date truncation and time zone when searching

2014-05-21 Thread jimi.hullegard
OK. It feels a bit strange that one would have to do this manual calculation in 
every place that performs searches like this.
It would be much more logical if Solr supported specifying the time zone in the 
query (with a default setting configurable in solrconfig) and did this 
conversion itself behind the scenes. Maybe that will come in the future?

/Jimi

> -Original Message-
> From: Michael Ryan [mailto:mr...@moreover.com]
> Sent: May 21, 2014 16:23
> To: solr-user@lucene.apache.org
> Subject: RE: Date truncation and time zone when searching
> 
> Well for CEST, which is 2 hours ahead, I would think you could just do...
> 
> datefield:[* TO NOW/MONTH-2HOURS]
> 
> That would give you everything up to 2014-04-30 22:00:00 GMT, which is
> 2014-05-01 00:00:00 CEST.
> 
> Always always always store the correct value.
> 
> -Michael
> 
> -Original Message-
> From: jimi.hulleg...@svensktnaringsliv.se
> [mailto:jimi.hulleg...@svensktnaringsliv.se]
> Sent: Wednesday, May 21, 2014 10:12 AM
> To: solr-user@lucene.apache.org
> Subject: Date truncation and time zone when searching
> 
> Hi,
> 
> What is the preferred way to do searches with date truncation with respect to
> a specific time zone? The dates are stored correctly, ie I can see the UTC 
> date
> in the index and if I add 1 or 2 hours (depending on daylight saving time or
> not) I get the time in our time zone (CET/CEST). But when I search I want to
> be able to do something like:
> 
> datefield:[* TO NOW/MONTH]
> or:
> datefield:[* TO NOW/DAY+3DAYS]
> etc...
> 
> ...and I want the MONTH or DAY truncation to correspond to our time zone.
> How can I achieve this? Some of our dates have the time set to 00:00
> (CET/CEST), but since the date truncation logic in solr only handles UTC that
> would classify those documents to "belong to the day before", and if the
> date is on 00:00 on January 1, then the document is even considered to
> belong to the year before its actual year!
> 
> Surely I can't be the only one hindered by this. What are the common
> workarounds used today? Some people mention that they trick solr by using
> the date and time expressed as for their own time zone, and then claiming
> that the time zone is UTC. Ie saying "2000-01-01T00:00:00Z" for the beginning
> of this millennium, even though it should be "1999-12-31T23:00:00Z". But I
> don't like this idea of storing incorrect date values in the index, since 
> then we
> need to "convert" them again before we display them.
> 
> Regards
> /Jimi


Date truncation and time zone when searching

2014-05-21 Thread jimi.hullegard
Hi,

What is the preferred way to do searches with date truncation with respect to a 
specific time zone? The dates are stored correctly, ie I can see the UTC date 
in the index and if I add 1 or 2 hours (depending on daylight saving time or 
not) I get the time in our time zone (CET/CEST). But when I search I want to be 
able to do something like:

datefield:[* TO NOW/MONTH]
or:
datefield:[* TO NOW/DAY+3DAYS]
etc...

...and I want the MONTH or DAY truncation to correspond to our time zone. How 
can I achieve this? Some of our dates have the time set to 00:00 
(CET/CEST), but since the date truncation logic in solr only handles UTC that 
would classify those documents as "belonging to the day before", and if the date 
is on 00:00 on January 1, then the document is even considered to belong to the 
year before its actual year!

Surely I can't be the only one hindered by this. What are the common 
workarounds used today? Some people mention that they trick solr by using the 
date and time expressed as for their own time zone, and then claiming that the 
time zone is UTC. Ie saying "2000-01-01T00:00:00Z" for the beginning of this 
millennium, even though it should be "1999-12-31T23:00:00Z". But I don't like 
this idea of storing incorrect date values in the index, since then we need to 
"convert" them again before we display them.

Regards
/Jimi


Invalid use of SingleClientConnManager: connection still allocated

2013-11-16 Thread jimi.hullegard
Hi,

We have a problem with solr in our test environment in our new project. The 
stack trace can be seen below. The thing is that this only effects the search 
that is performed by the CMS itself. Our own custom searches still works fine. 
Anyone know what could cause this error? A restart doesn't help.

org.apache.solr.client.solrj.SolrServerException: Error executing query
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:98)
at org.apache.solr.client.solrj.SolrServer.query(SolrServer.java:301)
...
Caused by: java.lang.IllegalStateException: Invalid use of 
SingleClientConnManager: connection still allocated.
Make sure to release the connection before allocating another one.
at 
org.apache.http.impl.conn.SingleClientConnManager.getConnection(SingleClientConnManager.java:217)
at 
org.apache.http.impl.conn.SingleClientConnManager$1.getConnection(SingleClientConnManager.java:190)
at 
org.apache.http.impl.client.DefaultRequestDirector.execute(DefaultRequestDirector.java:401)
at 
org.apache.http.impl.client.AbstractHttpClient.execute(AbstractHttpClient.java:820)
at 
org.apache.http.impl.client.cache.CachingHttpClient.callBackend(CachingHttpClient.java:715)
at 
org.apache.http.impl.client.cache.CachingHttpClient.handleCacheMiss(CachingHttpClient.java:497)
at 
org.apache.http.impl.client.cache.CachingHttpClient.execute(CachingHttpClient.java:433)
at 
org.apache.http.impl.client.cache.CachingHttpClient.execute(CachingHttpClient.java:352)
at 
org.apache.http.impl.client.cache.CachingHttpClient.execute(CachingHttpClient.java:346)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:352)
at 
org.apache.solr.client.solrj.impl.HttpSolrServer.request(HttpSolrServer.java:181)
at 
org.apache.solr.client.solrj.request.QueryRequest.process(QueryRequest.java:90)
... 230 more

Regards
/Jimi


Re: Date range faceting with various gap sizes?

2013-11-15 Thread jimi.hullegard
> Chris Hostetter wrote:
>
> You can see that in the resulting URL you got the params are duplicated -- the
> problem is that when expressed this way, Solr doesn't know when the
> different values of the start/end/gap params should be applied -- it just
> loops over each of the facet.range fields (in your case: the same field
> twice) and then looks for a coorisponding start/end/gap value and finds the
> first one since there are duplicates.

OK, that explains it. I thought that it would match the params, so that it 
would match the first parameter "facet.range=scheduledate_start_tdate" with the 
first parameter 
"f.scheduledate_start_tdate.facet.range.start=1990-01-01T11:00:00.000Z", the 
second parameter "facet.range=scheduledate_start_tdate" with the second 
parameter 
"f.scheduledate_start_tdate.facet.range.start=2011-01-01T11:00:00.000Z" and so 
on.

> what you want to do can be accomplished (as of Solr 4.3 - see SOLR-1351) by
> using "local params" in the facet.range (or facet.date) params...
> 
> http://localhost:8983/solr/select?q=*:*&rows=0&facet=true&facet.range={!facet.range.start=NOW/MONTH%20facet.range.end=NOW/MONTH%2B1MONTH%20facet.range.gap=%2B1DAY}manufacturedate_dt&facet.range={!facet.range.start=NOW/MONTH%20facet.range.end=NOW/MONTH%2B1MONTH%20facet.range.gap=%2B5DAY}manufacturedate_dt

Thanks for this info. I'm not sure it is easy to upgrade Solr for us though, 
since it is more or less integrated into the CMS we use.

But I actually realized that for this particular case, we don't need different 
gap sizes. We can use the "before" and "after" metadata instead.
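
(Concretely, if I read the docs right, something like 
f.scheduledate_start_tdate.facet.range.other=all should give separate "before" 
and "after" counts outside the start/end window, without needing a second gap 
size.)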
 
Regards
/Jimi




SV: Date range faceting with various gap sizes?

2013-11-12 Thread jimi.hullegard
Directly after I sent my email, I tested using two different field names, 
instead of the same field name for both range facets. And then it worked.

So, it seems there is a bug where Solr can't handle multiple range facets for the 
same field name. A workaround is to use a copyField to another field, and do 
the second range facet for that second field name. But surely this is a bug in 
Solr, right? Maybe it has already been reported, but I couldn't find anything.
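
(Concretely, the workaround looks something like this -- the second field name 
is my own invention:

solrQuery.addDateRangeFacet("scheduledate_start_tdate", date1, date2, "+1YEAR");
solrQuery.addDateRangeFacet("scheduledate_start_tdate2", date3, date4, "+1MONTH");

with scheduledate_start_tdate2 populated in schema.xml via 
<copyField source="scheduledate_start_tdate" dest="scheduledate_start_tdate2"/>.)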

/Jimi

> -Original Message-
> From: jimi.hulleg...@svensktnaringsliv.se
> [mailto:jimi.hulleg...@svensktnaringsliv.se]
> Sent: November 12, 2013 15:08
> To: solr-user@lucene.apache.org
> Subject: Date range faceting with various gap sizes?
> 
> Hi,
> 
> I'm experimenting with date range faceting, and would like to use different
> gaps depending on how old the date is. But I am not sure on how to do that.
> 
> This is what I have tried, using the java API Solrj 4.0.0 and Solr 4.1.0:
> 
> solrQuery.addDateRangeFacet("scheduledate_start_tdate", date1, date2, "+1YEAR");
> solrQuery.addDateRangeFacet("scheduledate_start_tdate", date3, date4, "+1MONTH");
> solrQuery.setFacetMinCount(1);
> 
> The first date interval is between 1990 and 2011, and the second interval is
> 2011 to 2014.
> 
> This results in this URL:
> 
> http://localhost:8080/solr/select?q=*:*&facet.range=scheduledate_start_tdate&facet.range=scheduledate_start_tdate&f.scheduledate_start_tdate.facet.range.start=1990-01-01T11%3A00%3A00.000Z&f.scheduledate_start_tdate.facet.range.start=2011-01-01T11%3A00%3A00.000Z&f.scheduledate_start_tdate.facet.range.end=2011-01-01T10%3A59%3A59.999Z&f.scheduledate_start_tdate.facet.range.end=2014-01-01T11%3A00%3A00.000Z&f.scheduledate_start_tdate.facet.range.gap=%2B1YEAR&f.scheduledate_start_tdate.facet.range.gap=%2B1MONTH&facet=true&facet.mincount=1&wt=xml&indent=true
> 
> And the response contains this:
> 
> <lst name="facet_ranges">
>  <lst name="scheduledate_start_tdate">
>   <lst name="counts">
>    <int name="2006-01-01T11:00:00Z">207</int>
>    <int name="2007-01-01T11:00:00Z">818</int>
>    <int name="2008-01-01T11:00:00Z">811</int>
>    <int name="2009-01-01T11:00:00Z">618</int>
>    <int name="2010-01-01T11:00:00Z">612</int>
>   </lst>
>   <str name="gap">+1YEAR</str>
>   <date name="start">1990-01-01T11:00:00Z</date>
>   <date name="end">2011-01-01T11:00:00Z</date>
>  </lst>
>  <lst name="scheduledate_start_tdate">
>   <lst name="counts">
>    <int name="2006-01-01T11:00:00Z">207</int>
>    <int name="2007-01-01T11:00:00Z">818</int>
>    <int name="2008-01-01T11:00:00Z">811</int>
>    <int name="2009-01-01T11:00:00Z">618</int>
>    <int name="2010-01-01T11:00:00Z">612</int>
>   </lst>
>   <str name="gap">+1YEAR</str>
>   <date name="start">1990-01-01T11:00:00Z</date>
>   <date name="end">2011-01-01T11:00:00Z</date>
>  </lst>
> </lst>
> 
> Right away I notice that this is incorrect. The second facet range incorrectly
> uses the same gap, start and end-values as the first one. Can someone
> understand why? And is there a way to make this work?
> 
> Regards
> /Jimi



Date range faceting with various gap sizes?

2013-11-12 Thread jimi.hullegard
Hi,

I'm experimenting with date range faceting, and would like to use different 
gaps depending on how old the date is. But I am not sure how to do that.

This is what I have tried, using the java API Solrj 4.0.0 and Solr 4.1.0:

solrQuery.addDateRangeFacet("scheduledate_start_tdate", date1, date2, "+1YEAR");
solrQuery.addDateRangeFacet("scheduledate_start_tdate", date3, date4, 
"+1MONTH");
solrQuery.setFacetMinCount(1);

The first date interval is between 1990 and 2011, and the second interval is 
2011 to 2014.

This results in this URL:

http://localhost:8080/solr/select?q=*:*&facet.range=scheduledate_start_tdate&facet.range=scheduledate_start_tdate&f.scheduledate_start_tdate.facet.range.start=1990-01-01T11%3A00%3A00.000Z&f.scheduledate_start_tdate.facet.range.start=2011-01-01T11%3A00%3A00.000Z&f.scheduledate_start_tdate.facet.range.end=2011-01-01T10%3A59%3A59.999Z&f.scheduledate_start_tdate.facet.range.end=2014-01-01T11%3A00%3A00.000Z&f.scheduledate_start_tdate.facet.range.gap=%2B1YEAR&f.scheduledate_start_tdate.facet.range.gap=%2B1MONTH&facet=true&facet.mincount=1&wt=xml&indent=true

And the response contains this:

<lst name="facet_ranges">
  <lst name="scheduledate_start_tdate">
    <lst name="counts">
      <int name="2006-01-01T11:00:00Z">207</int>
      <int name="2007-01-01T11:00:00Z">818</int>
      <int name="2008-01-01T11:00:00Z">811</int>
      <int name="2009-01-01T11:00:00Z">618</int>
      <int name="2010-01-01T11:00:00Z">612</int>
    </lst>
    <str name="gap">+1YEAR</str>
    <date name="start">1990-01-01T11:00:00Z</date>
    <date name="end">2011-01-01T11:00:00Z</date>
  </lst>
  <lst name="scheduledate_start_tdate">
    <lst name="counts">
      <int name="2006-01-01T11:00:00Z">207</int>
      <int name="2007-01-01T11:00:00Z">818</int>
      <int name="2008-01-01T11:00:00Z">811</int>
      <int name="2009-01-01T11:00:00Z">618</int>
      <int name="2010-01-01T11:00:00Z">612</int>
    </lst>
    <str name="gap">+1YEAR</str>
    <date name="start">1990-01-01T11:00:00Z</date>
    <date name="end">2011-01-01T11:00:00Z</date>
  </lst>
</lst>
Right away I notice that this is incorrect: the second facet range incorrectly 
uses the same gap, start, and end values as the first one. Does anyone 
understand why? And is there a way to make this work?

Regards
/Jimi


RE: Any way to dynamically rename fields in the schema?

2013-10-22 Thread jimi.hullegard
Thanks, Jan, for the links and the quick explanation.

In our case Solr is integrated into the CMS we use, so I think upgrading to a 
4.4+ version (which supports write calls to the Schema API) is not an option at 
the moment.

The field alias function for the result set sounds simple enough, as long as we 
are able to modify the search request in all the places where we need to search 
on such fields (there might be cases where the CMS gets in the way).

The field alias function for queries sounds a bit more complex. But if I 
understood things correctly, this too can be configured simply by adding URL 
parameters to the search request, right?

Regards
/Jimi


From: Jan Høydahl [jan@cominvent.com]
Sent: Tuesday, October 22, 2013 1:43 PM
To: solr-user@lucene.apache.org
Subject: Re: Any way to dynamically rename fields in the schema?

Hi,

In general, if you use dynamic fields you will have to live with some 
predictable naming pattern.
If you want to stick with dynamic fields, but use clean names, there are a few 
features worth looking at:

Field name aliases:
https://cwiki.apache.org/confluence/display/solr/Common+Query+Parameters#CommonQueryParameters-FieldNameAliases
This renames fields in the result set

Aliases in edismax
http://wiki.apache.org/solr/ExtendedDisMax#Field_aliasing_.2F_renaming
Lets your users query pretty field names instead of the dynamic ones
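
Concretely, both features are driven by plain request parameters. A sketch,
using the dynamic field name from the question quoted below:

  # clean name in the result set
  fl=subtitle:subtitle_string_indexed_stored

  # clean name in user queries (edismax field aliasing)
  defType=edismax&f.subtitle.qf=subtitle_string_indexed_stored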


If this is not for you, please consider the new Schema API, in which you can 
programmatically add fields to the schema right before you need them:
https://cwiki.apache.org/confluence/display/solr/Schema+API
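
A rough sketch of adding a field that way (the exact endpoint and payload
have changed between Solr versions, and this assumes a managed schema, so
treat it as illustrative only):

  curl -X POST 'http://localhost:8983/solr/collection1/schema/fields' \
    -H 'Content-type: application/json' \
    -d '[{"name":"subtitle", "type":"text_general", "indexed":true, "stored":true}]'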

--
Jan Høydahl, search solution architect
Cominvent AS - www.cominvent.com

On 22 Oct 2013, at 12:53, jimi.hulleg...@svensktnaringsliv.se wrote:

> Hi,
>
> Is there a way to dynamically change a field name using some magic regex or 
> similar in the schema file?
> For example, if we have a field named "subtitle_string_indexed_stored", then 
> we could have a dynamic field that matches "*_string_indexed_stored" and 
> renames it to simply "subtitle" (i.e. removes the "_string_indexed_stored" 
> part).
>
> The reasons we want this are:
>
> 1. We don't want to have to update the schema every time we need to 
> index/store a new field.
>
> 2. We want to be able to specify how each field should be indexed/stored.
>
> 3. We want to be able to use simple and clean field names when searching and 
> fetching stored fields. Like "subtitle".
>
>
> Number one above means that we must use dynamic fields of some type (right?). 
> And number two means that we must have a separate dynamic field for each 
> configuration combination, and the only way to control that is by some 
> naming standard in the field names.
>
> So these two combined mean that we will end up with field names with ugly 
> suffixes that control the internal behavior of Solr, i.e. it doesn't sit 
> well with the third point in the list above.
>
> Is there a way to solve this?
>
>
> Regards
> /Jimi


Any way to dynamically rename fields in the schema?

2013-10-22 Thread jimi.hullegard
Hi,

Is there a way to dynamically change a field name using some magic regex or 
similar in the schema file?
For example, if we have a field named "subtitle_string_indexed_stored", then we 
could have a dynamic field that matches "*_string_indexed_stored" and renames 
it to simply "subtitle" (i.e. removes the "_string_indexed_stored" part).

The reasons we want this are:

1. We don't want to have to update the schema every time we need to index/store 
a new field.

2. We want to be able to specify how each field should be indexed/stored.

3. We want to be able to use simple and clean field names when searching and 
fetching stored fields. Like "subtitle".


Number one above means that we must use dynamic fields of some type (right?). 
And number two means that we must have a separate dynamic field for each 
configuration combination, and the only way to control that is by some naming 
standard in the field names.

So these two combined mean that we will end up with field names with ugly 
suffixes that control the internal behavior of Solr, i.e. it doesn't sit well 
with the third point in the list above.
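
For concreteness, the kind of declaration this naming convention implies is
something like the following (a sketch; it assumes a field type named "string"
exists in the schema):

<dynamicField name="*_string_indexed_stored" type="string" indexed="true" stored="true"/>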

Is there a way to solve this?


Regards
/Jimi