Re: How does query on AND work

Per Steffensen Fri, 23 May 2014 08:31:30 -0700

I can answer some of this myself now that I have dived into it to understand what Solr/Lucene does and to see if it can be done better.

* In current Solr/Lucene (or at least in 4.4), the indices on both "no_dlng_doc_ind_sto" and "timestamp_dlng_doc_ind_sto" are used, and the doc-id sets found are intersected to get the final set of doc-ids.
* It IS more efficient to use only the index for the "no_dlng_doc_ind_sto" part of the request to get the doc-ids that match that part, and then fetch the timestamp doc-values for those doc-ids to filter out the docs that do not match the "timestamp_dlng_doc_ind_sto" part of the query. I have made changes to our version of Solr (and Lucene) to do that, and response times go from about 10 secs to about 1 sec (of course dependent on what's in the file cache etc.) in cases where "no_dlng_doc_ind_sto" hits about 500-1000 docs and "timestamp_dlng_doc_ind_sto" hits about 3-4 billion.
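To make the difference between the two strategies concrete, here is a minimal sketch in plain Python. It is not Lucene code and not the patch described above: sorted Python lists stand in for the doc-id sets produced by the indices, a list indexed by doc-id stands in for the timestamp doc-values, and all names and the toy data are assumptions made for illustration.

```python
def intersect_sorted(a, b):
    """Strategy 1: walk two sorted doc-id lists and keep the ids present in both."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            out.append(a[i])
            i += 1
            j += 1
        elif a[i] < b[j]:
            i += 1
        else:
            j += 1
    return out


def filter_by_doc_values(candidates, timestamp_of, start, end):
    """Strategy 2: keep only the candidates whose per-document timestamp is in [start, end]."""
    return [doc for doc in candidates if start <= timestamp_of[doc] <= end]


# Toy data (hypothetical): 10,000 docs, "no" = doc_id % 100, "timestamp" = doc_id * 7.
NUM_DOCS = 10_000
timestamp_of = [doc * 7 for doc in range(NUM_DOCS)]   # stand-in for doc-values
start, end = 7_000, 63_000                            # inclusive range, as in [A TO B]

# Doc-id set for the selective term, e.g. no:(42): small.
no_hits = [doc for doc in range(NUM_DOCS) if doc % 100 == 42]
# Doc-id set for the timestamp range: large, and it must be materialized for strategy 1.
ts_hits = [doc for doc in range(NUM_DOCS) if start <= timestamp_of[doc] <= end]

via_intersection = intersect_sorted(no_hits, ts_hits)
via_doc_values = filter_by_doc_values(no_hits, timestamp_of, start, end)
```

Both strategies return the same doc-ids. The difference is in the work done: the first has to produce the large timestamp doc-id set before intersecting, while the second touches only the small candidate set and does one value lookup per candidate.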


Regards, Per Steffensen


On 19/05/14 13:33, Per Steffensen wrote:

Hi
Let's say I have a Solr collection (running across several servers) containing 5 billion documents. Among other things, each document has a value for the field "no_dlng_doc_ind_sto" (a long) and the field "timestamp_dlng_doc_ind_sto" (also a long). Both "no_dlng_doc_ind_sto" and "timestamp_dlng_doc_ind_sto" are doc-value, indexed and stored. Like this in schema.xml:

  <dynamicField name="*_dlng_doc_ind_sto" type="dlng" indexed="true" stored="true" required="true" docValues="true"/>
  <fieldType name="dlng" class="solr.TrieLongField" precisionStep="0" positionIncrementGap="0" docValuesFormat="Disk"/>

I make queries like this:

  no_dlng_doc_ind_sto:(<NO>) AND timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])

* The "no_dlng_doc_ind_sto:(<NO>)" part of a typical query will hit between 500 and 1000 documents out of the total 5 billion.
* The "timestamp_dlng_doc_ind_sto:([<TIME_START> TO <TIME_END>])" part of a typical query will hit between 3 and 4 billion documents out of the total 5 billion.
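For reference, the query shape above can be assembled from the three placeholders as follows. This is only a sketch: the helper name and the sample values are hypothetical, and it assumes non-negative long values so that no query-syntax escaping is needed.

```python
def build_query(no, time_start, time_end):
    """Fill <NO>, <TIME_START> and <TIME_END> into the query shape from the message."""
    # int() rejects anything that is not a whole number before it reaches the query.
    no, time_start, time_end = int(no), int(time_start), int(time_end)
    return (
        f"no_dlng_doc_ind_sto:({no}) AND "
        f"timestamp_dlng_doc_ind_sto:([{time_start} TO {time_end}])"
    )


# Sample values are made up for illustration.
query = build_query(12345, 1400000000000, 1400500000000)
```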
The question is how Solr/Lucene deals with such requests.

I am thinking that using the indices on both "no_dlng_doc_ind_sto" and "timestamp_dlng_doc_ind_sto" to get two sets of doc-ids and then making an intersection of those might not be the most efficient. You are making an intersection of two doc-id sets of size 500-1000 and 3-4 billion. It might be faster to use only the index for "no_dlng_doc_ind_sto" to get the doc-ids for the 500-1000 documents, and then for each of those fetch its "timestamp_dlng_doc_ind_sto" value (using doc-values) to filter out the ones among the 500-1000 that do not match the timestamp part of the query.

But what does Solr/Lucene actually do? Is it Solr code or Lucene code that makes the decision on what to do? Can you somehow "hint" to the search engine that you want one or the other method used?

Solr 4.4 (and corresponding Lucene), BTW, if that makes a difference.

Regards, Per Steffensen
