mm, tie, qs, ps and CJKBigramFilter and edismax and dismax

2013-09-03 Thread Naomi Dushay
When I have a field using CJKBigramFilter, a query with CJK characters gets a different 
parsedQuery structure than a non-CJK query does.

  (旧小说 is 3 chars, so 2 bigrams)

args sent in:   q={!qf=bi_fld}旧小说&pf=&pf2=&pf3=

 debugQuery:
   <str name="rawquerystring">{!qf=bi_fld}旧小说</str>
   <str name="querystring">{!qf=bi_fld}旧小说</str>
   <str name="parsedquery">(+DisjunctionMaxQuery((((bi_fld:旧小 bi_fld:小说)~2))~0.01) ())/no_coord</str>
   <str name="parsedquery_toString">+(((bi_fld:旧小 bi_fld:小说)~2))~0.01 ()</str>


If I use a non-CJK query string with the same field:

args sent in:  q={!qf=bi_fld}foo bar&pf=&pf2=&pf3=

debugQuery:
   <str name="rawquerystring">{!qf=bi_fld}foo bar</str>
   <str name="querystring">{!qf=bi_fld}foo bar</str>
   <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:foo)~0.01) DisjunctionMaxQuery((bi_fld:bar)~0.01))~2))/no_coord</str>
   <str name="parsedquery_toString">+(((bi_fld:foo)~0.01 (bi_fld:bar)~0.01)~2)</str>


Why are the parsedquery_toString formulas different?  And is there any 
difference in the actual relevancy formula?

How can you tell the difference between the MinNrShouldMatch and a qs or ps or 
tie value, if they are all represented as ~n  in the parsedQuery string?


To try to get a handle on qs, ps, tie and mm:

 args:  q={!qf=bi_fld pf=bi_fld}"a b" c d&qs=5&ps=4

debugQuery:
  <str name="rawquerystring">{!qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="querystring">{!qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:"a b"~5)~0.01) DisjunctionMaxQuery((bi_fld:c)~0.01) DisjunctionMaxQuery((bi_fld:d)~0.01))~3) DisjunctionMaxQuery((bi_fld:"c d"~4)~0.01))/no_coord</str>
  <str name="parsedquery_toString">+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01 (bi_fld:d)~0.01)~3) (bi_fld:"c d"~4)~0.01</str>


I get that qs, the query slop, is for explicit phrases in the query, so bi_fld:"a b"~5 
makes sense.   I also get that ps is for boosting of phrases, so I get 
(bi_fld:"c d"~4) … but where is   (bi_fld:"a b c d"~4)  ?
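
For what it's worth, my reading of the ~n values in that parsed query (correct me if I have this wrong):

  ~5  on bi_fld:"a b"       qs (query slop) applied to the explicit phrase in q
  ~4  on bi_fld:"c d"       ps (phrase slop) on the pf phrase
  ~3  on the clause group   mm (minNrShouldMatch) over the 3 clauses
  ~0.01                     the tie value on each DisjunctionMaxQuery

(and in the CJK example above, the ~2 would then be mm over the 2 bigram clauses)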


Using dismax (instead of edismax):

args:   q={!dismax qf=bi_fld pf=bi_fld}"a b" c d&qs=5&ps=4

debugQuery:
  <str name="rawquerystring">{!dismax qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="querystring">{!dismax qf=bi_fld pf=bi_fld}"a b" c d</str>
  <str name="parsedquery">(+((DisjunctionMaxQuery((bi_fld:"a b"~5)~0.01) DisjunctionMaxQuery((bi_fld:c)~0.01) DisjunctionMaxQuery((bi_fld:d)~0.01))~3) DisjunctionMaxQuery((bi_fld:"a b c d"~4)~0.01))/no_coord</str>
  <str name="parsedquery_toString">+(((bi_fld:"a b"~5)~0.01 (bi_fld:c)~0.01 (bi_fld:d)~0.01)~3) (bi_fld:"a b c d"~4)~0.01</str>


So is this an edismax bug?



FYI,   I am running Solr 4.4. I have fields defined like so:
<fieldtype name="text_cjk_bi" class="solr.TextField"
    positionIncrementGap="1" autoGeneratePhraseQueries="false">
  <analyzer>
    <tokenizer class="solr.ICUTokenizerFactory" />
    <filter class="solr.CJKWidthFilterFactory"/>
    <filter class="solr.ICUTransformFilterFactory" id="Traditional-Simplified"/>
    <filter class="solr.ICUTransformFilterFactory" id="Katakana-Hiragana"/>
    <filter class="solr.ICUFoldingFilterFactory"/>
    <filter class="solr.CJKBigramFilterFactory" han="true" hiragana="true"
        katakana="true" hangul="true" outputUnigrams="false" />
  </analyzer>
</fieldtype>

The request handler uses edismax:

<requestHandler name="search" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">edismax</str>
    <str name="q.alt">*:*</str>
    <str name="mm">6&lt;-1 6&lt;90%</str>
    <int name="qs">1</int>
    <int name="ps">0</int>

Re: mm, tie, qs, ps and CJKBigramFilter and edismax and dismax

2013-09-03 Thread Naomi Dushay
Re the relevancy changes I noted for edismax in my earlier message, there are already some 
issues filed:

pertaining to the difference in how the phrase queries are merged into the main 
query:
  See Michael Dodsworth's comment of 25/Sep/12  on this issue:   
https://issues.apache.org/jira/browse/SOLR-2058  -- ticket is closed, but this 
issue is not addressed.

and pertaining to skipping terms in phrase boosting when part of the query is a 
phrase:
  https://issues.apache.org/jira/browse/SOLR-4130

- Naomi





Re: ICUTokenizer class not found with Solr 4.4

2013-08-27 Thread Naomi Dushay
Hi Tom,

Sorry - I was meeting with the East-Asia librarians …

Perhaps you are missing the following from your solrconfig

<lib dir="/home/blacklight/solr-home/lib" />

(this is the top of my solrconfig.xml):


<config>
  <!-- NOTE: various comments and unused configuration possibilities have been purged
       from this file.  Please refer to http://wiki.apache.org/solr/SolrConfigXml,
       as well as the default solrconfig file included with Solr -->

  <abortOnConfigurationError>${solr.abortOnConfigurationError:true}</abortOnConfigurationError>

  <luceneMatchVersion>4.4</luceneMatchVersion>

  <lib dir="/home/blacklight/solr-home/lib" />

  <dataDir>/data/solr/cjk-icu</dataDir>

  <directoryFactory name="DirectoryFactory"
      class="${solr.directoryFactory:solr.NRTCachingDirectoryFactory}"/>
  <codecFactory class="solr.SchemaCodecFactory"/>
  <schemaFactory class="ClassicIndexSchemaFactory"/>

  <indexConfig>
…


and here is my solr.xml, if it matters:

note the sharedLib value

<?xml version="1.0" encoding="UTF-8" ?>
<solr persistent="true" sharedLib="lib">
  <cores defaultCoreName="current" adminPath="/admin/cores">
    <core name="current" collection="current" dataDir="/data/solr/"
        loadOnStartup="true" instanceDir="./" transient="false"/>
  </cores>
</solr>
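
If the per-core lib directory still isn't picked up, another thing to try is pointing <lib> directives in solrconfig.xml straight at the dist and contrib directories.  Something along these lines (the paths here are only illustrative; they depend on where Solr is unpacked relative to the core's instanceDir):

  <lib dir="../../../contrib/analysis-extras/lib" regex=".*\.jar" />
  <lib dir="../../../dist/" regex="solr-analysis-extras-\d.*\.jar" />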





On Aug 27, 2013, at 3:29 PM, Tom Burton-West wrote:

 Hello all,
 
 According to the README.txt in solr-4.4.0/solr/example/solr/collection1, all 
 we have to do is create a collection1/lib directory and put whatever jars we 
 want in there. 
 
 .../lib
     If it exists, Solr will load any Jars
     found in this directory and use them to resolve any plugins
     specified in your solrconfig.xml or schema.xml
 
 
   I did so  (see below).  However, I keep getting a class not found error 
 (see below).
 
 Has the default changed from what is documented in the README.txt file?
 Is there something I have to change in solrconfig.xml or solr.xml to make 
 this work?
 
 I looked at SOLR-4852, but don't understand.   It sounds like maybe there is 
 a problem if the collection1/lib directory is also specified in 
 solrconfig.xml.  But I didn't do that. (i.e. out of the box solrconfig.xml)
 Does this mean that by following what it says in the README.txt, I am making 
 some kind of a configuration error?  I also don't understand the workaround 
 in SOLR-4852.
 
 Is this an ICU issue?  A Java 7 issue?  A Solr 4.4 issue?  Or did I simply 
 not understand the README.txt?
 
 
 
 Tom
 
 --
 
 
 org.apache.solr.common.SolrException; null:java.lang.NoClassDefFoundError: 
 org/apache/lucene/analysis/icu/segmentation/ICUTokenizer
 
  ls collection1/lib
 icu4j-49.1.jar  
 lucene-analyzers-icu-4.4-SNAPSHOT.jar  
 solr-analysis-extras-4.4-SNAPSHOT.jar
 
 https://issues.apache.org/jira/browse/SOLR-4852
 
 Collection1/README.txt excerpt:
 
  lib/
 This directory is optional.  If it exists, Solr will load any Jars
 found in this directory and use them to resolve any plugins
 specified in your solrconfig.xml or schema.xml (ie: Analyzers,
 Request Handlers, etc...).  Alternatively you can use the lib
 syntax in conf/solrconfig.xml to direct Solr to your plugins.  See 
 the example conf/solrconfig.xml file for details.
 



Re: [solrmarc-tech] apostrophe / ayn / alif

2012-05-24 Thread Naomi Dushay
The alif and ayn can also be used as diacritic-like characters in Korean;  this 
is a known practice.   But thanks anyway.

On May 24, 2012, at 9:30 AM, Charles Riley wrote:

 Hi Naomi,
 
 I don't have a conclusive answer for you on this yet, but let me pick up on a 
 few points.
 
 First, the apostrophe is probably being handled through ignoring punctuation 
 in the ICUCollationKeyFilterFactory.  
 
 Alif isn't a diacritic but a letter, and its character properties would be 
 handled as such, apparently also outside the scope of what the folding filter 
 factory does unless it's tailored.
 
 From the solrwiki, this looks like a helpful rule of thumb:
 
 When To use a CharFilter vs a TokenFilter
 There are several pairs of CharFilters and TokenFilters that have related 
 (ie: MappingCharFilter and ASCIIFoldingFilter) or nearly identical 
 functionality (ie: PatternReplaceCharFilterFactory and 
 PatternReplaceFilterFactory) and it may not always be obvious which is the 
 best choice.
 
 The ultimate decision depends largely on what Tokenizer you are using, and 
 whether you need to out smart it by preprocessing the stream of characters.
 
 For example, maybe you have a tokenizer such as StandardTokenizer and you are 
 pretty happy with how it works overall, but you want to customize how some 
 specific characters behave.
 
 In such a situation you could modify the rules and re-build your own 
 tokenizer with javacc, but perhaps its easier to simply map some of the 
 characters before tokenization with a CharFilter.
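 
 For instance, a mapping CharFilter declared ahead of the tokenizer might look
 roughly like this -- the mapping file name and the exact codepoints for ayn/alif
 below are only illustrative, not tested:
 
   <charFilter class="solr.MappingCharFilterFactory" mapping="mapping-alif-ayn.txt"/>
 
 with a mapping file containing something like:
 
   # alif and ayn folded to apostrophe (codepoints illustrative)
   "\u02BE" => "'"
   "\u02BF" => "'"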
 
 
 Charles
 
 On Tue, May 15, 2012 at 2:47 PM, Naomi Dushay ndus...@stanford.edu wrote:
 We are using the ICUFoldingFilterFactory with great success to fold 
 diacritics so searches with and without the diacritics get the same results.
 
 We recently discovered we have some Korean records that use an alif diacritic 
 instead of an apostrophe, and this diacritic is NOT getting folded.   Has 
 anyone experienced this for alif or ayn characters?   Do you have a solution?
 
 
 - Naomi
 
 
 
 
 
 -- 
 Charles L. Riley
 Catalog Librarian for Africana
 Sterling Memorial Library, Yale University
 zenodo...@gmail.com
 203-432-7566
 
 



apostrophe / ayn / alif

2012-05-15 Thread Naomi Dushay
We are using the ICUFoldingFilterFactory with great success to fold diacritics 
so searches with and without the diacritics get the same results.

We recently discovered we have some Korean records that use an alif diacritic 
instead of an apostrophe, and this diacritic is NOT getting folded.   Has 
anyone experienced this for alif or ayn characters?   Do you have a solution?


- Naomi

autoGeneratePhraseQueries sort of silently set to false

2012-02-23 Thread Naomi Dushay
Another thing I noticed when upgrading from Solr 1.4 to Solr 3.5 had to do with 
results when there were hyphenated words:   aaa-bbb.   Erik Hatcher pointed me 
to the autoGeneratePhraseQueries attribute now available on fieldtype 
definitions in schema.xml.  This is a great feature, and everything is peachy 
if you start with Solr 3.4.   But many of us started earlier and are upgrading, 
and that's a different story.

It was surprising to me that

a.  the default for this new feature caused different search results than Solr 
1.4 

b.  it wasn't documented clearly, IMO

http://wiki.apache.org/solr/SchemaXml   makes no mention of it


In the schema.xml example, there is this at the top:

<!-- attribute "name" is the name of this schema and is only used for display purposes.
     Applications should change this to reflect the nature of the search collection.
     version="1.4" is Solr's version number for the schema syntax and semantics.  It should
     not normally be changed by applications.
     1.0: multiValued attribute did not exist, all fields are multiValued by nature
     1.1: multiValued attribute introduced, false by default
     1.2: omitTermFreqAndPositions attribute introduced, true by default
          except for text fields.
     1.3: removed optional field compress feature
     1.4: default auto-phrase (QueryParser feature) to off
  -->

And there was this in a couple of field definitions:

<fieldType name="text_en_splitting" class="solr.TextField"
    positionIncrementGap="100" autoGeneratePhraseQueries="true">
<fieldType name="text_ja" class="solr.TextField" positionIncrementGap="100"
    autoGeneratePhraseQueries="false">

But that was it.
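
So if you are upgrading and want the old, Solr 1.4-ish behavior back for a western-language text field, it appears you have to set the attribute explicitly yourself, e.g. something like:

<fieldType name="text" class="solr.TextField" positionIncrementGap="100"
    autoGeneratePhraseQueries="true">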



Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-23 Thread Naomi Dushay
Robert,

You found it!   it is the phrase slop.  What do I do now?   I am using Solr 
from trunk from December, and all those JIRA tixes are marked fixed …

- Naomi


Solr 1.4:

luceneQueryParser:

URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~3
final query:  all_search:"the beatl as musician revolv through the antholog"~3

got result


Solr 3.5

luceneQueryParser:

URL: q=all_search:"The Beatles as musicians : Revolver through the Anthology"~3
final query:  all_search:"the beatl as musician revolv through the antholog"~3

NO result



 lucene QueryParser:
 
 URL:  q=all_search:"The Beatles as musicians : Revolver through the Anthology"
 final query:  all_search:"the beatl as musician revolv through the antholog"




On Feb 22, 2012, at 7:34 PM, Robert Muir [via Lucene] wrote:

 On Wed, Feb 22, 2012 at 7:35 PM, Naomi Dushay [hidden email] wrote: 
  Jonathan has brought it to my attention that BOTH of my failing searches 
  happen to have 8 terms, and one of the terms is repeated: 
  
   The Beatles as musicians : Revolver through the Anthology 
   Color-blindness [print/digital]; its dangers and its detection 
  
  but this is a PHRASE search. 
  
 
 Can you take your same phrase queries, and simply add some slop to 
 them (e.g. ~3) and ensure they still match with the lucene 
 queryparser? SloppyPhraseQuery has a bit of a history with repeats 
 since Lucene 2.9 that you were using. 
 
 https://issues.apache.org/jira/browse/LUCENE-3068
 https://issues.apache.org/jira/browse/LUCENE-3215
 https://issues.apache.org/jira/browse/LUCENE-3412
 
 -- 
 lucidimagination.com 
 
 




Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-23 Thread Naomi Dushay
Robert,

I will create a jira issue with the documentation.  FYI, I tried ps values of 
3, 2, 1 and 0 and none of them worked with dismax;   For lucene QueryParser, 
only the value of 0 got results.

- Naomi


On Feb 23, 2012, at 11:12 AM, Robert Muir [via Lucene] wrote:

 Is it possible to also provide your document? 
 If you could attach the document and the analysis config and queries 
 to a JIRA issue, that would be most ideal. 
 


Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-23 Thread Naomi Dushay
Robert -

Did you mean for me to attach my docs to an existing ticket (which one?) or 
just want to make sure I attach the docs to the new issue?

- Naomi

On Feb 23, 2012, at 11:39 AM, Robert Muir [via Lucene] wrote:

 Please attach your docs if you dont mind. 
 
 I worked up tests for this (in general for ANY phrase query, 
 increasing the slop should never remove results, only potentially 
 enlarge them). 
 
 It fails already... but its good to also have your test case too... 
 



Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-23 Thread Naomi Dushay
Ticket created:

https://issues.apache.org/jira/browse/SOLR-3158

(perhaps it's a lucene problem, not a Solr one -- feel free to move it or 
whatever.)

- Naomi


On Feb 23, 2012, at 11:55 AM, Robert Muir [via Lucene] wrote:

 Please make a new one if you dont mind! 
 

result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-22 Thread Naomi Dushay
I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem.   I 
have a test checking for a search result in Solr, and the test passes in Solr 
1.4, but fails in Solr 3.5.   Dismax is the desired QueryParser -- I just 
included output from lucene QueryParser to prove the document exists and is 
found 

I am completely stumped.


Here are the debugQuery details:

***Solr 3.5***

lucene QueryParser: 

URL:   q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query:  all_search:"the beatl as musician revolv through the antholog"

6.0562754 = (MATCH) weight(all_search:the beatl as musician revolv through the 
antholog in 1064395), product of:
  1.0 = queryWeight(all_search:the beatl as musician revolv through the 
antholog), product of:
48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 
revolv=872 through=81366 the=3531140 antholog=11611)
0.02063975 = queryNorm
  6.0562754 = fieldWeight(all_search:the beatl as musician revolv through the 
antholog in 1064395), product of:
1.0 = tf(phraseFreq=1.0)
48.450203 = idf(all_search: the=3531140 beatl=398 as=645923 musician=11805 
revolv=872 through=81366 the=3531140 antholog=11611)
0.125 = fieldNorm(field=all_search, doc=1064395)

dismax QueryParser:   
URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"
final query:   +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~3)~0.01

(no matches)


***Solr 1.4***

lucene QueryParser:   

URL:  q=all_search:"The Beatles as musicians : Revolver through the Anthology"
final query:  all_search:"the beatl as musician revolv through the antholog"

5.2676983 = fieldWeight(all_search:the beatl as musician revolv through the 
antholog in 3469163), product of:
  1.0 = tf(phraseFreq=1.0)
  48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 
revolv=820 through=88238 the=3542123 antholog=11205)
  0.109375 = fieldNorm(field=all_search, doc=3469163)

dismax QueryParser:   
URL:  qf=all_search&pf=all_search&q="The Beatles as musicians : Revolver through the Anthology"
final query:  +(all_search:"the beatl as musician revolv through the antholog"~1)~0.01 (all_search:"the beatl as musician revolv through the antholog"~3)~0.01

score:

7.449651 = (MATCH) sum of:
  3.7248254 = weight(all_search:the beatl as musician revolv through the 
antholog~1 in 3469163), product of:
0.7071068 = queryWeight(all_search:the beatl as musician revolv through 
the antholog~1), product of:
  48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 
revolv=820 through=88238 the=3542123 antholog=11205)
  0.014681898 = queryNorm
5.2676983 = fieldWeight(all_search:the beatl as musician revolv through 
the antholog in 3469163), product of:
  1.0 = tf(phraseFreq=1.0)
  48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 
revolv=820 through=88238 the=3542123 antholog=11205)
  0.109375 = fieldNorm(field=all_search, doc=3469163)
  3.7248254 = weight(all_search:the beatl as musician revolv through the 
antholog~3 in 3469163), product of:
0.7071068 = queryWeight(all_search:the beatl as musician revolv through 
the antholog~3), product of:
  48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 
revolv=820 through=88238 the=3542123 antholog=11205)
  0.014681898 = queryNorm
5.2676983 = fieldWeight(all_search:the beatl as musician revolv through 
the antholog in 3469163), product of:
  1.0 = tf(phraseFreq=1.0)
  48.16181 = idf(all_search: the=3542123 beatl=391 as=749890 musician=11955 
revolv=820 through=88238 the=3542123 antholog=11205)
  0.109375 = fieldNorm(field=all_search, doc=3469163)





Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-22 Thread Naomi Dushay
I forgot to include the field definition information:

schema.xml:
  <field name="all_search" type="text" indexed="true" stored="false" />

solr 3.5:
  <fieldtype name="text" class="solr.TextField"
      positionIncrementGap="100" autoGeneratePhraseQueries="true">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory" />
      <filter class="solr.ICUFoldingFilterFactory"/>
      <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
          splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
          catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
      <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    </analyzer>
  </fieldtype>

solr 1.4:
  <fieldtype name="text" class="solr.TextField" positionIncrementGap="100">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory" />
      <filter class="schema.UnicodeNormalizationFilterFactory"
          version="icu4j" composed="false" remove_diacritics="true"
          remove_modifiers="true" fold="true" />
      <filter class="solr.WordDelimiterFilterFactory"
          splitOnCaseChange="1" generateWordParts="1" catenateWords="1"
          splitOnNumerics="0" generateNumberParts="1" catenateNumbers="1"
          catenateAll="0" preserveOriginal="0" stemEnglishPossessive="1" />
      <filter class="solr.LowerCaseFilterFactory" />
      <filter class="solr.EnglishPorterFilterFactory" protected="protwords.txt" />
      <filter class="solr.RemoveDuplicatesTokenFilterFactory" />
    </analyzer>
  </fieldtype>


And the analysis page shows the same results for Solr 3.5 and 1.4


Solr 3.5:

position      1      2      3      4         5       6        7      8
term text     the    beatl  as     musician  revolv  through  the    antholog
keyword       false  false  false  false     false   false    false  false
startOffset   0      4      12     15        27      36       44     48
endOffset     3      11     14     24        35      43       47     57
type          word   word   word   word      word    word     word   word

Solr 1.4:

term position     1      2      3      4         5       6        7      8
term text         the    beatl  as     musician  revolv  through  the    antholog
term type         word   word   word   word      word    word     word   word
source start,end  0,3    4,11   12,14  15,24     27,35   36,43    44,47  48,57

- Naomi



Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-22 Thread Naomi Dushay
Jonathan,

I have the same problem without the colon - I tested that, but didn't mention it.

mm can't be the issue either:  in Solr 3.5, if I remove one of the occurrences 
of "the" (it doesn't matter which), I get results.  Removing any other word does 
NOT get results.   And if the query isn't a phrase query, it gets results.

And no, it can't be related to what you refer to as the "dismax stopwords 
problem", since I can demonstrate the problem with a single field; mm can't be 
the issue.


I have run into problems in the past with a non-alpha character surrounded by 
spaces tanking my search results for dismax … but I fixed that with this 
fieldType:

<!-- single token with punctuation terms removed so dismax doesn't look for
     punctuation terms in these fields -->
<!-- On client side, Lucene query parser breaks things up by whitespace
     *before* field analysis for dismax -->
<!-- so punctuation terms ( : ;) are stopwords to allow results from other
     fields when these chars are surrounded by spaces in query -->
<!-- do not lowercase -->
<fieldType name="string_punct_stop" class="solr.TextField" omitNorms="true">
  <analyzer type="index">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose" />
  </analyzer>
  <analyzer type="query">
    <tokenizer class="solr.KeywordTokenizerFactory" />
    <filter class="solr.ICUNormalizer2FilterFactory" name="nfkc" mode="compose" />
    <!-- removing punctuation for Lucene query parser issues -->
    <filter class="solr.StopFilterFactory" ignoreCase="true"
        words="stopwords_punctuation.txt" enablePositionIncrements="true" />
  </analyzer>
</fieldType>

My stopwords_punctuation.txt file is

#Punctuation characters we want to ignore in queries
:
;

/

and used this type instead of string for fields in my dismax qf.Thus, the 
punctuation terms in the query are not present for the fields that were 
formerly string fields.

- Naomi

On Feb 22, 2012, at 3:41 PM, Jonathan Rochkind wrote:

 So I don't really know what I'm talking about, and I'm not really sure if 
 it's related or not, but your particular query:
 
 The Beatles as musicians : Revolver through the Anthology
 
 With the lone word that's a ':', reminds me of a dismax stopwords-type 
 problem I ran into. Now, I ran into it on 1.4.  I don't know why it would be 
 different on 1.4 and 3.x. And I see you aren't even using a multi-field 
 dismax in your sample query, so it couldn't possibly be what I ran into... I 
 don't think. But I'll write this anyway in case it gives someone some ideas.
 
 The problem I ran into is caused by different analysis in two fields both 
 used in a dismax, one that ends up keeping : as a token, and one that 
 doesn't.  Which ends up having the same effect as the famous 'dismax 
 stopwords problem'.
 
 Maybe somehow your schema changed such to produce this problem in 3.x but not 
 in 1.4? Although again I realize the fact that you are only using a single 
 field in your demo dismax query kind of suggests it's not this problem. 
 Wonder if you try the query without the :, if the problem goes away, that 
 might be a hint. Or, maybe someone more skilled at understanding what's in 
 those Solr debug statements than I am (it's kind of all greek to me) will be 
 able to take this hint and rule out or confirm that it may have something to 
 do with your problem.
 
 Here I write up the issue I ran into (which may or may not have anything to 
 do with what you ran into)
 
 http://bibwild.wordpress.com/2011/06/15/more-dismax-gotchas-varying-field-analysis-and-mm/
 
 
 Also, you don't say what your 'mm' is in your dismax queries, that could be 
 relevant if it's got anything to do with anything similar to the issue I'm 
 talking about.
 
 Hmm, I wonder if Solr 3.x changes the way dismax calculates number of tokens 
 for 'mm' in such a way that the 'varying field analysis dismax gotcha' can 
 manifest with only one field, if the way dismax counts tokens for 'mm' 
 differs from number of tokens the single field's analysis produces?
 
 Jonathan
 
 On 2/22/2012 2:55 PM, Naomi Dushay wrote:
 I am working on upgrading Solr from 1.4 to 3.5, and I have hit a problem.   
 I have a test checking for a search result in Solr, and the test passes in 
 Solr 1.4, but fails in Solr 3.5.   Dismax is the desired QueryParser -- I 
 just included output from lucene QueryParser to prove the document exists 
 and is found
 
 I am completely stumped.
 
 
 Here are the debugQuery details:
 
 ***Solr 3.5***
 
 lucene QueryParser:
 
 URL:   q=all_search:The Beatles as musicians : Revolver through the 
 Anthology
 final query:  all_search:the beatl as musician revolv through the antholog
 
 6.0562754 = (MATCH) weight(all_search:the beatl as musician revolv through 
 the antholog in 1064395), product of:
   1.0 = queryWeight(all_search:the beatl as musician revolv through the 
 antholog), product of:
 48.450203 = idf(all_search

Re: result present in Solr 1.4, but missing in Solr 3.5, dismax only

2012-02-22 Thread Naomi Dushay
Jonathan has brought it to my attention that BOTH of my failing searches happen 
to have 8 terms, and one of the terms is repeated:

 The Beatles as musicians : Revolver through the Anthology
 Color-blindness [print/digital]; its dangers and its detection

but this is a PHRASE search.  

In case it's relevant, both Solr 1.4 and Solr 3.5:
 do NOT use stopwords in the fieldtype
 mm is  6<-1 6<90%  for dismax
 qs is 1
 ps is 3

And both use this filter last

<filter class="solr.RemoveDuplicatesTokenFilterFactory" />

… but I believe that filter is only used for consecutive tokens.

Lastly, 

 "Color-blindness [print/digital]; its and its detection"   works   ("dangers" 
is removed, rather than one of the repeated "its")

- Naomi




Re: hierarchical faceting in Solr?

2011-08-23 Thread Naomi Dushay

Chris Beer just did a revamp of the wiki page at:

  http://wiki.apache.org/solr/HierarchicalFaceting

Yay Chris!

- Naomi
( ... and I helped!)


On Aug 22, 2011, at 10:49 AM, Naomi Dushay wrote:


Chris,

Is there a document somewhere on how to do this?  If not, might you  
create one?   I could even imagine such a document living on the  
Solr wiki ...  this one has mostly ancient content:


http://wiki.apache.org/solr/HierarchicalFaceting

- Naomi




hierarchical faceting in Solr?

2011-08-22 Thread Naomi Dushay

Chris,

Is there a document somewhere on how to do this?  If not, might you  
create one?   I could even imagine such a document living on the Solr  
wiki ...  this one has mostly ancient content:


http://wiki.apache.org/solr/HierarchicalFaceting

- Naomi


Re: defType argument weirdness

2011-07-19 Thread Naomi Dushay
qf_dismax and pf_dismax   are irrelevant -- I shouldn't have included  
that info.  They are passed in the url and they work;   they do not  
affect this problem.


Your reminder of debugQuery  was a good one - I use that a lot but  
forgot in this case.


Regardless, I thought that defType=dismax&q=*:* is supposed to be equivalent to 
q={!defType=dismax}*:* and also equivalent to q={!dismax}*:*



defType=dismax&q=*:*   DOESN'T WORK
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">+() ()</str>
<str name="parsedquery_toString">+() ()</str>

leaving out the explicit query
defType=dismax   WORKS
<null name="rawquerystring"/>
<null name="querystring"/>
<str name="parsedquery">+MatchAllDocsQuery(*:*)</str>
<str name="parsedquery_toString">+*:*</str>


q={!dismax}*:*   DOESN'T WORK
<str name="rawquerystring">*:*</str>
<str name="querystring">*:*</str>
<str name="parsedquery">+() ()</str>
<str name="parsedquery_toString">+() ()</str>

leaving out the explicit query:
q={!dismax}   WORKS
<str name="rawquerystring">{!dismax}</str>
<str name="querystring">{!dismax}</str>
<str name="parsedquery">+MatchAllDocsQuery(*:*)</str>
<str name="parsedquery_toString">+*:*</str>


q={!defType=dismax}*:*   WORKS
<str name="rawquerystring">{!defType=dismax}*:*</str>
<str name="querystring">{!defType=dismax}*:*</str>
<str name="parsedquery">MatchAllDocsQuery(*:*)</str>
<str name="parsedquery_toString">*:*</str>

leaving out the explicit query:
q={!defType=dismax}   DOESN'T WORK
org.apache.lucene.queryParser.ParseException: Cannot parse '': 
Encountered "<EOF>" at line 1, column 0.
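
FWIW, the match-all case does seem to work if I leave q out entirely and let the 
q.alt=*:* from the handler defaults supply it, i.e. a request along the lines of:

   defType=dismax&q.alt=*:*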


On Jul 18, 2011, at 5:44 PM, Erick Erickson wrote:


What are qf_dismax and pf_dismax? They are meaningless to
Solr. Try adding debugQuery=on to your URL and you'll
see the parsed query, which helps a lot here

If you change these to the proper dismax values (qf and pf)
you'll get beter results. As it is, I think you'll see output like:

<str name="parsedquery">+() ()</str>

showing that your query isn't actually going against
any fields

Best
Erick





defType argument weirdness

2011-07-18 Thread Naomi Dushay
I found a weird behavior with the Solr  defType argument, perhaps with  
respect to default queries?


 defType=dismax&q=*:*  no hits

 q={!defType=dismax}*:* hits

 defType=dismax hits


Here is the request handler, which I explicitly indicate:

<requestHandler name="search" class="solr.SearchHandler" default="true">
  <lst name="defaults">
    <str name="defType">lucene</str>

    <!-- lucene params -->
    <str name="df">has_model_s</str>
    <str name="q.op">AND</str>

    <!-- dismax params -->
    <str name="mm"> 2&lt;-1 5&lt;-2 6&lt;90% </str>
    <str name="q.alt">*:*</str>
    <str name="qf_dismax">id^0.8 id_t^0.8 title_t^0.3 mods_t^0.2 text</str>
    <str name="pf_dismax">id^0.9  id_t^0.9 title_t^0.5 mods_t^0.2 text</str>

    <int name="ps">100</int>
    <float name="tie">0.01</float>
  </lst>
</requestHandler>


Solr Specification Version: 1.4.0
Solr Implementation Version: 1.4.0 833479 - grantingersoll -  
2009-11-06 12:33:40

Lucene Specification Version: 2.9.1
Lucene Implementation Version: 2.9.1 832363 - 2009-11-03 04:37:25

- Naomi


a Solr search recall problem you probably don't even know you're having

2010-11-05 Thread Naomi Dushay
(sorry for cross postings - I think this is important information to  
disseminate)


Executive Summary:  you probably need to increase your query slop.  A  
lot.



We recently had a feedback ticket that a title search with a hyphen  
wasn't working properly.  This is especially curious because we solved  
a bunch of problems with hyphen searching AND WROTE TESTS in the  
process, and all the existing hyphen tests pass.  Tests like hyphens  
with no spaces before or after, 3 significant terms, 2 stopwords pass.


Our metadata contains:
record A with title:   Red-rose chain.
record B with title:   Prisoner in a red-rose chain.

A title search:  prisoner in a red-rose chain  returns no results

Further exploration (the following are all title searches):
red-rose chain  ==  record A only
red rose chain ==  record A only
red rose chain == record A only
red-rose chain == record A only
red rose chain ==  records A and B
red rose chain ==  records A and B  (!!)

For more details and more about the solution, see  
http://discovery-grindstone.blogspot.com/2010/11/solr-and-hyphenated-words.html
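
(With dismax, the query slop is the qs param, so bumping it is just a line like the 
one below in the request handler defaults -- the value 3 here is only an example; 
the blog post covers what we actually settled on:)

  <int name="qs">3</int>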

- Naomi Dushay
Senior Developer
Stanford University Libraries
 

Re: a Solr search recall problem you probably don't even know you're having

2010-11-05 Thread Naomi Dushay

Robert,

Thanks!   I've been using Solr 1.5 from trunk back in March - time to  
upgrade!  I also like the "put the stopword filter after the WDF  
filter" fix.


- Naomi

On Nov 5, 2010, at 12:36 PM, Robert Muir wrote:

On Fri, Nov 5, 2010 at 3:04 PM, Naomi Dushay ndus...@stanford.edu  
wrote:

(sorry for cross postings - I think this is important information to
disseminate)

Executive Summary:  you probably need to increase your query slop.   
A lot.




I looked at your example, and it really looks a lot like
https://issues.apache.org/jira/browse/SOLR-1852

This was fixed, and released in Solr 1.4.1... and of course from the
upgrading notes:
However, a reindex is needed for some of the analysis fixes to take  
effect.


Your example "Prisoner in a red-rose chain" in Solr 1.4.1 no longer
has the positions 1,4,7,8, but instead 1,4,5,6.

I recommend upgrading to this bugfix release and re-indexing if you
are having problems like this




facet data cleanup

2010-06-08 Thread Naomi Dushay

Hi folks,

We have a data cleanup effort going on here, and I thought I would  
share some information about how to poke around your facet values.   
Most of this comes from:

http://wiki.apache.org/solr/SimpleFacetParameters


Exploring Facet Values:
---

facet field to examine: facet.field=
number of values to return: facet.limit=n
offset into the values: facet.offset=n
sort the facets alphabetically: facet.sort=index

http://your.solr.baseurl/select?rows=0&facet.field=ffldname&facet.sort=index&facet.limit=250&facet.offset=0


Missing Facet Values:
---

to find how many documents are missing values:
		facet.missing=true&facet.mincount=(really big number)

http://your.solr.baseurl/select?rows=0&facet.field=ffldname&facet.mincount=1000&facet.missing=true

to find the documents with missing values:
		http://your.solr.baseurl/select?qt=standard&q=+uniquekey:[* TO *] -ffldname:[* TO *]


number of rows: rows=
offset: start=



- Naomi Dushay
Stanford University Libraries
http://searchworks.stanford.edu   --  Blacklight on top of Solr


indexversion not updating on master

2010-04-13 Thread Naomi Dushay
I'm having trouble with replication, and I believe it's because the  
indexversion isn't updating on master.


My solrconfig.xml on master:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="master">
    <str name="replicateAfter">startup</str>
    <str name="replicateAfter">commit</str>
    <str name="replicateAfter">optimize</str>
    <!-- <str name="backupAfter">optimize</str> -->
    <str name="confFiles">solrconfig-slave.xml:solrconfig.xml,schema.xml,stopwords.txt</str>
  </lst>
</requestHandler>

  BTW, I am certain that this does NOT work:
    <str name="replicateAfter">startup,commit,optimize</str>
  it MUST be separate elements.


My solrconfig.xml on slave:

<requestHandler name="/replication" class="solr.ReplicationHandler">
  <lst name="slave">
    <str name="masterUrl">http://my_host:8983/solr/replication</str>
    <!-- Format is HH:mm:ss -->
    <str name="pollInterval">00:15:00</str>
  </lst>
</requestHandler>


/replication?command=details   on master:

(I don't understand why there are two indexVersion and two generation  
entries in this data)


<lst name="details">
  <str name="indexSize">19.91 GB</str>
  <str name="indexPath">/data/solr/index</str>
  <arr name="commits">
    <lst>
      <long name="indexVersion">1270535894533</long>
      <long name="generation">32</long>
      <arr name="filelist">
        <str>_1xv.fdt</str>
        ...
        <str>_1xv.frq</str>
        <str>segments_w</str>
      </arr>
    </lst>
  </arr>
  <str name="isMaster">true</str>
  <str name="isSlave">false</str>
  <long name="indexVersion">1270535894534</long>
  <long name="generation">33</long>
</lst>


master log shows the commit:

INFO: start  
commit 
(optimize=false,waitFlush=false,waitSearcher=true,expungeDeletes=false)

Apr 12, 2010 4:00:54 PM org.apache.solr.search.SolrIndexSearcher <init>
INFO: Opening searc...@31dd7736 main
Apr 12, 2010 4:00:54 PM org.apache.solr.update.DirectUpdateHandler2  
commit

INFO: end_commit_flush
Apr 12, 2010 4:00:54 PM org.apache.solr.search.SolrIndexSearcher warm


but indexversion  is  the OLD one, not the NEW one:

<response>
  <lst name="responseHeader">
    <int name="status">0</int>
    <int name="QTime">0</int>
  </lst>
  <long name="indexversion">1270535894533</long>
  <long name="generation">32</long>
</response>
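
For reference, here is a minimal sketch of polling the handler for
these values from a script (Python; host names are placeholders, and
wt=json support on the handler is an assumption):

import json
import urllib.request

# placeholder URLs; point these at the actual master and slave cores
MASTER = "http://my_host:8983/solr/replication"
SLAVE = "http://my_slave_host:8983/solr/replication"

def index_version(handler_url):
    # ask the ReplicationHandler which version/generation it advertises
    url = handler_url + "?command=indexversion&wt=json"
    with urllib.request.urlopen(url) as resp:
        data = json.load(resp)
    return data.get("indexversion"), data.get("generation")

print("master:", index_version(MASTER))
print("slave: ", index_version(SLAVE))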


What's going on?
- Naomi

Re: indexversion not updating on master

2010-04-13 Thread Naomi Dushay
Does it matter that my last index update did NOT add any new documents
and did NOT delete any existing documents?   (For testing, I just
re-ran the last update.)


- Naomi





termsComponent and filter queries

2010-01-19 Thread Naomi Dushay
I have a field that has millions of values, and I need to get the  
next X values in alpha order.  The terms component works fabulously  
for this.


Here is a cooked up example of the terms

a
b
f
q
r
rr
rrr
y
z
zzz

So if I ask for the 3 terms after r, I get rr, rrr and y.
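
In case it helps others, the request looks roughly like this (handler
path and field name are placeholders, and a /terms handler wired to
the TermsComponent is assumed):

http://your.solr.baseurl/terms?terms=true&terms.fl=myfield&terms.lower=r&terms.lower.incl=false&terms.limit=3&terms.sort=index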

But now I'd like to apply a filter query on a different field.  After  
the filter, my terms might be:


b
q
r
y
z
zzz

So the 3 terms after r, given the filter, become y, z, and zzz.

Given that I have millions of terms, and they are not predictable for
range queries ... how can I get the next X values of my field after
one or more filters are applied?

- Naomi


java doc error local params syntax for dismax

2009-09-23 Thread Naomi Dushay

The javadoc for  DisMaxQParserPlugin states:

{!dismax qf=myfield,mytitle^2}foo creates a dismax query

but actually, that gives an error.

The correct syntax is

{!dismax qf="myfield mytitle^2"}foo

(could use single quote instead of double quote).

- Naomi

Re: java doc error local params syntax for dismax

2009-09-23 Thread Naomi Dushay
It's not just the spaces - it's that the quotes (single or double
flavor) are required as well.



On Sep 23, 2009, at 3:10 PM, Yonik Seeley wrote:

On Wed, Sep 23, 2009 at 5:59 PM, Naomi Dushay ndus...@stanford.edu  
wrote:

The javadoc for  DisMaxQParserPlugin states:

{!dismax qf=myfield,mytitle^2}foo creates a dismax query

but actually, that gives an error.

The correct syntax is

{!dismax qf="myfield mytitle^2"}foo

(could use single quote instead of double quote).


Thanks, I always forget that dismax uses space separated, not comma
separated lists.

-Yonik




Re: java doc error local params syntax for dismax

2009-09-23 Thread Naomi Dushay

Okay, but

{!dismax qf="myfield mytitle^2"}foo works

{!dismax qf=myfield mytitle^2}foo does NOT work

- Naomi

On Sep 23, 2009, at 5:52 PM, Yonik Seeley wrote:

On Wed, Sep 23, 2009 at 8:24 PM, Naomi Dushay ndus...@stanford.edu  
wrote:
It's not just the spaces - it's that the quotes (single or double
flavor) are required as well.


LocalParams are space delimited, so the original example would have
worked if the dismax parser accepted comma delimited fields.

-Yonik
http://www.lucidimagination.com










Re: range queries on string field with millions of values

2008-11-29 Thread Naomi Dushay

Hi Hoss,

Thanks for this.

The terms component approach, if I understand it correctly, will be  
problematic.  I need to present not only the next X call numbers in  
sequence, but other fields in those documents (e.g. title, author).  I  
assume the Terms Component approach will only give me the next X call  
number values, not the documents.


It sounds like Glen Newton's suggestion of mapping the call numbers to  
a float number is the most likely solution.
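
Roughly, the idea as I understand it (a sketch of my own, not Glen's
actual code) is to fold the leading characters of the normalized shelf
key into an order-preserving number, e.g. in Python:

# Map a sortable shelf-key string to a float in [0, 1) whose numeric
# order matches the string order of the first `width` characters.
# Keys that agree on those characters collide, so width bounds the
# resolution of the mapping.
ALPHABET = " 0123456789abcdefghijklmnopqrstuvwxyz"   # space sorts first

def shelfkey_to_float(key, width=9):
    key = key.lower()[:width].ljust(width)
    value = 0.0
    for ch in key:
        digit = ALPHABET.index(ch) if ch in ALPHABET else len(ALPHABET) - 1
        value = value * len(ALPHABET) + digit
    return value / float(len(ALPHABET) ** width)

print(shelfkey_to_float("A123 B34 1970"))   # sorts below ...
print(shelfkey_to_float("A123 B44"))        # ... this one

The resulting float could then be indexed in a numeric field, so an
upper bound Y for a [X TO Y] range query can be computed
arithmetically.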


I know it sounds ridiculous to do all this for a call number browse,  
but our faculty have explicitly asked for this.  For humanities  
scholars especially, they know the call numbers that are of interest  
to them, and they browse the stacks that way (ML 1500s are opera, V35  
is Verdi ...).   They are using the research methods that have been  
successful for their entire careers.  Plus, library materials are  
going to off-site, high-density storage, so the only way for them to  
browse all materials, regardless of location, via call number is  
online.   I doubt they'll find this feature as useful as they expect,  
but it behooves us to give the users what they ask for.


So yeah, our user needs are perhaps a little outside of your  
expectations.  :-)


- Naomi


On Nov 29, 2008, at 2:58 PM, Chris Hostetter wrote:



: The results are correct.  But the response time sucks.
:
: Reading the docs about caches, I thought I could populate the  
query result
: cache with an autowarming query and the response time would be  
okay.  But that

: hasn't worked.  (See excerpts from my solrConfig file below.)
:
: A repeated query is very fast, implying caching happens for a  
particular

: starting point (42 above).
:
: Is there a way to populate the cache with the ENTIRE sorted list  
of values for
: the field, so any arbitrary starting point will get results from  
the cache,
: rather than grabbing all results from (x) to the end, then sorting  
all these

: results, then returning the first 10?

there's two caches that come into play for something like this...

the first cache is a low level Lucene cache called the FieldCache that
is completely hidden from you (and for the most part: from Solr).
anytime you sort on a field, it gets built, and reused for all sorts on
that field.  My original concern was that it wasn't getting warmed on
newSearcher (because you have to be explicit about that).

the second cache is the queryResultsCache which caches a window of an
ordered list of documents based on a query, and a sort.  you can see
this cache in your Solr stats, and yes: these two requests result in
different cache keys for the queryResultsCache...

   q=yourField:[42+TO+*]&sort=yourField+asc&rows=10
   q=yourField:[52+TO+*]&sort=yourField+asc&rows=10

...BUT! ... the two queries below will result in the same cache key,  
and

the second will be a cache hit, provided a sufficient value for
the queryResultWindowSize ...

   q=yourField:[42+TO+*]&sort=yourField+asc&rows=10
   q=yourField:[42+TO+*]&sort=yourField+asc&rows=10&start=10

so perhaps the key to your problem is to just make sure that once  
the user
gives you an id to start with, you scroll by increasing the start  
param
(not altering the id) ... the first query might be slow but every  
query
after that should be a cache hit (depending on your page size, and  
how far

you expect people to scroll, you should consider increasing
queryResultWindowSize)
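
i.e. something like this (hypothetical shelfkey queries, URL escaping
omitted):

   q=shelfkey:[A123 B34 1970 TO *]&sort=shelfkey asc&rows=10&start=0
   q=shelfkey:[A123 B34 1970 TO *]&sort=shelfkey asc&rows=10&start=10
   q=shelfkey:[A123 B34 1970 TO *]&sort=shelfkey asc&rows=10&start=20

only the first of those should miss the queryResultsCache, given a
large enough queryResultWindowSize.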

But as Yonik said: the new TermsComponent may actually be a better  
option
for you -- doing two requests for every page (the first to get the N  
Terms
in your id field starting with your input, the second to do a query  
for

docs matching any of those N ids) might actually be faster even though
there won't likely even be any cache hits.
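
Here is a rough sketch of that two-request flow (Python; the shelfkey
field name, the /terms handler, and the wt=json / json.nl=map response
layout are all assumptions on my part, not something from your setup):

import json
import urllib.parse
import urllib.request

SOLR = "http://localhost:8983/solr"   # placeholder

def next_page(after, n=10):
    # request 1: the next n terms in the shelfkey field after the
    # user's starting key (exclusive)
    params = urllib.parse.urlencode({
        "terms": "true", "terms.fl": "shelfkey",
        "terms.lower": after, "terms.lower.incl": "false",
        "terms.limit": n, "wt": "json", "json.nl": "map"})
    with urllib.request.urlopen(SOLR + "/terms?" + params) as resp:
        term_counts = json.load(resp)["terms"]["shelfkey"]
    terms = list(term_counts)   # response order == index order
    if not terms:
        return []

    # request 2: fetch the documents whose shelfkey is one of those terms
    q = " OR ".join('shelfkey:"%s"' % t for t in terms)
    params = urllib.parse.urlencode({
        "q": q, "rows": n, "sort": "shelfkey asc", "wt": "json"})
    with urllib.request.urlopen(SOLR + "/select?" + params) as resp:
        return json.load(resp)["response"]["docs"]

for doc in next_page("A123 B34 1970"):
    print(doc.get("shelfkey"), doc.get("title"))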


My opinion:  Your use case sounds like a waste of effort.  I can't
imagine anyone using a library catalog system ever wanting to look up
a call number, and then scroll through all possible books with similar
call numbers -- it seems much more likely that i'd want to look at
other books with similar authors, or keywords, or tags ... all things
that are actually *easier* to do with Solr.  (but then again: i don't
work in a library.  i trust that you know something i don't about what
your users want.)


-Hoss



Naomi Dushay
[EMAIL PROTECTED]





Re: range queries on string field with millions of values

2008-11-28 Thread Naomi Dushay
The point isn't really how the exact sort works - it's the performance  
issues, coupled with an unpredictable distribution along the entire  
possible sort space.


the sort works
the range queries work
the performance sucks

and I haven't thought of a clever workaround.

- Naomi

On Nov 27, 2008, at 9:41 AM, Alexander Ramos Jardim wrote:

I did not even understand what you are considering to be the order on
your call numbers.

--
Alexander Ramos Jardim


Naomi Dushay
[EMAIL PROTECTED]





Re: range queries on string field with millions of values

2008-11-28 Thread Naomi Dushay

Gosh,  I'm sorry to be so unclear.  Hmm.  Trying to clarify below:

On Nov 28, 2008, at 3:52 PM, Chris Hostetter wrote:

Having read through this thread, i'm not sure i understand what  
exactly

the problem is.  my naive understanding is...

1) you want to sort by a field
2) you want to be able to paginate through all docs in order of this
field.
3) you want to be able to start your pagination at any arbitrary  
value for

this field.

so (assuming the field is a simple number for now) you could use
something like

  q=yourField:[42 TO *]&sort=yourField+asc&rows=10&start=0

where 42 is the arbitrary ID someone wants to start at.



perfect.  This is the query I'm using.

The results are correct.  But the response time sucks.

Reading the docs about caches, I thought I could populate the query  
result cache with an autowarming query and the response time would be  
okay.  But that hasn't worked.  (See excerpts from my solrConfig file  
below.)


A repeated query is very fast, implying caching happens for a  
particular starting point (42 above).


Is there a way to populate the cache with the ENTIRE sorted list of  
values for the field, so any arbitrary starting point will get results  
from the cache, rather than grabbing all results from (x) to the end,  
then sorting all these results, then returning the first 10?



This sentence below seems to imply that you have a solution which  
produces

correct results, but doesn't produce results quickly...


right.

: I have a performance problem and I haven't thought of a clever way  
around it.


...however this lines seems to suggest that you're having trouble
getting at least 10 results from any query (?)

: Call numbers are squirrelly, so we can't predict the string that  
will
: appropriately grab at least 10 subsequent documents.  They are  
certainly not

: consecutive!
:
: so from
: A123 B34 1970
:
: we're unable to predict if any of these will return at least 10  
results:


I was trying to express that I couldn't do this:

myfield:[X TO Y]

because I can't algorithmically compute Y.

Glen Newton suggested a workaround, whereby I represent my squirrelly,
but sortable, field values as floating-point numbers, and then I can
compute Y.


...but i'm not sure what exactly that means.  for any given field,
there are always going to be some values X such that myField:[X TO *]
won't return at least 10 docs ... they are the last values in the
index in order -- surely it's okay for your app to have an end state
when you run out of data? :)


yes.  Understood.  This is not an issue.


Oh, and BTW...

: numbers in sort order.  I have also mucked about with the cache
: initialization, but that's not working either:
:
: listener event=firstSearcher  
class=solr.QuerySenderListener


...make sure you also do a newSearcher listener that does the same
thing, otherwise your FieldCache (used for sorting) may not be warmed
when commits happen.


Yup yup yup.

from solrconfig:

<filterCache
  class="solr.LRUCache"
  size="2000"
  initialSize="1000"
  autowarmCount="50"/>

<queryResultCache
  class="solr.LRUCache"
  size="1000"
  initialSize="500"
  autowarmCount="500"/>

<listener event="newSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- populate query result cache for sorted queries -->
    <lst>
      <str name="q">shelfkey:[0 TO *]</str>
      <str name="sort">shelfkey asc</str>
    </lst>
  </arr>
</listener>

<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- populate query result cache for sorted queries -->
    <lst>
      <str name="q">shelfkey:[0 TO *]</str>
      <str name="sort">shelfkey asc</str>
    </lst>
  </arr>
</listener>



range queries on string field with millions of values

2008-11-26 Thread Naomi Dushay
I have a performance problem and I haven't thought of a clever way  
around it.


I work at the Stanford University Libraries.  We have a collection of  
over 8 million items.  Each item has a call number.  I have been asked  
to provide a way to browse forward and backward from an arbitrary call  
number.


I have managed to create fields that present the call numbers in  
appropriate sorts, both forward and reverse.  (This is necessary  
because raw call numbers don't sort properly:   A123 AZ27 B99 B999  
BBB11).


We can ignore the reverse sorted range query problem;  it's the same  
as the forward sorted range query.


So I use a query like this:

sortCallNum:[A123 B34 1970 TO *]&rows=10


Call numbers are squirrelly, so we can't predict the string that will  
appropriately grab at least 10 subsequent documents.  They are  
certainly not consecutive!


so from
A123 B34 1970

we're unable to predict if any of these will return at least 10 results:

A123 B34 1980  or
A123 B34 V.8  or
A123 B44 or
A123 B67 or
A123 C27 or
A124* or
A22* or
AA* or

You get the idea.

I have not figured out a way to efficiently query for the next 10  
call numbers in sort order.  I have also mucked about with the cache  
initialization, but that's not working either:


<listener event="firstSearcher" class="solr.QuerySenderListener">
  <arr name="queries">
    <!-- populate query result cache for sorted queries -->
    <lst>
      <str name="q">shelfkey:[0 TO *]</str>
      <str name="sort">shelfkey asc</str>
    </lst>
  </arr>
</listener>

Can anyone help me with this?

- Naomi



single character terms in index - why?

2008-05-12 Thread Naomi Dushay
I'm experienced with Lucene, less so with SOLR.  I am looking at two  
systems built on top of SOLR for a library discovery service:   
Blacklight and VuFind.


I checked the raw Lucene index using Luke and noticed that both of  
these indexes have single-character terms in the index, such as "d" or  
"f".   I asked about this on the vufind list, and was told I didn't  
understand SOLR and why it would need these.


So I'm now asking:  why would SOLR want single-character terms?  "a"  
is usually a stopword.  I know the library MARC data from which the  
index is derived has a lot of these characters because they denote  
subfields in the data.  But why would we want them to be searchable?
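
As an aside (and not something I know either project to do), if you
did want to keep stray single-character tokens out of an index, a
length filter in the analysis chain is one way, e.g.:

<fieldtype name="text_no_short_tokens" class="solr.TextField">
  <analyzer>
    <tokenizer class="solr.WhitespaceTokenizerFactory"/>
    <!-- drop tokens shorter than 2 characters -->
    <filter class="solr.LengthFilterFactory" min="2" max="100"/>
  </analyzer>
</fieldtype>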


Naomi Dushay
[EMAIL PROTECTED]