I was just running some performance tests against my Solr instance
(using the standard query language), and I discovered a shocking (at
least to me) speed difference between queries that contain phrases
(i.e. stuff between quotation marks) and queries that don't.

For instance, here are some log snippets for queries without phrases.
Note the QTimes:

  <message>[exhibitcore] webapp=/solr path=/select/
params={fl=score,companyname&amp;hl.fl=body&amp;hl.snippets=3&amp;hl=on&amp;q=Tall}
hits=10 status=0 QTime=672 </message>
  <message>[exhibitcore] webapp=/solr path=/select/
params={fl=score,companyname&amp;hl.fl=body&amp;hl.snippets=3&amp;hl=on&amp;q=euro}
hits=10 status=0 QTime=656 </message>
  <message>[exhibitcore] webapp=/solr path=/select/
params={fl=score,companyname&amp;hl.fl=body&amp;hl.snippets=3&amp;hl=on&amp;q=(((goldman+AND+ernst)+AND+young)+AND+split)}
hits=10 status=0 QTime=1563 </message>
  <message>[exhibitcore] webapp=/solr path=/select/
params={fl=score,companyname&amp;hl.fl=body&amp;hl.snippets=3&amp;hl=on&amp;q=(reclassification+OR+recapitalization)}
hits=10 status=0 QTime=2156 </message>

And here are some with phrases. Again, note the QTimes:

  <message>[exhibitcore] webapp=/solr path=/select/
params={fl=score,companyname&amp;hl.fl=body&amp;hl.snippets=3&amp;hl=on&amp;q="Merriman+Curhan"}
hits=10 status=0 QTime=5703 </message>
  <message>[exhibitcore] webapp=/solr path=/select/
params={fl=score,companyname&amp;hl.fl=body&amp;hl.snippets=3&amp;hl=on&amp;q="Hang+Seng"}
hits=10 status=0 QTime=17734 </message>
  <message>[exhibitcore] webapp=/solr path=/select/
params={fl=score,companyname&amp;hl.fl=body&amp;hl.snippets=3&amp;hl=on&amp;q="the+mississippi+band+of+choctaw"}
hits=10 status=0 QTime=44015 </message>
  <message>[exhibitcore] webapp=/solr path=/select/
params={fl=score,companyname&amp;hl.fl=body&amp;hl.snippets=3&amp;hl=on&amp;q="Finite+reinsurance"}
hits=10 status=0 QTime=80531 </message>
  <message>[exhibitcore] webapp=/solr path=/select/
params={fl=score,companyname&amp;hl.fl=body&amp;hl.snippets=3&amp;hl=on&amp;q="latham+initial+purchaser+by+shearman+sterling"}
hits=0 status=0 QTime=914572 </message>
  <message>[exhibitcore] webapp=/solr path=/select/
params={fl=score,companyname&amp;hl.fl=body&amp;hl.snippets=3&amp;hl=on&amp;q="strategic+review+committee"}
hits=10 status=0 QTime=2829467 </message>

I've tried to pick these samples in a sort-of-random manner from my
log file. (I then sorted them.)

Now I gather that phrase queries are inherently slower than non-phrase
queries, but 1-3 orders of magnitude difference seems noteworthy.
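
In case anyone wants to run the same comparison on their own logs, here
is a minimal sketch in Python. It assumes the log format shown above
(q= is the last parameter inside params={...}, and the query itself
contains no closing brace); the function names are just ones I made up
for illustration.

```python
import math
import re
import statistics

# Assumes q= is the last parameter inside params={...}, as in the
# snippets above, and that QTime follows later in the same message.
LOG_RE = re.compile(r"[;&{]q=(?P<q>[^}]*)\}.*?QTime=(?P<qtime>\d+)", re.DOTALL)

def parse_entries(log_text):
    """Yield (query, qtime_ms, is_phrase) for each log message found."""
    for m in LOG_RE.finditer(log_text):
        q = m.group("q")
        # Treat any query containing a double quote as a phrase query.
        yield q, int(m.group("qtime")), '"' in q

def summarize(log_text):
    """Compare median QTime of phrase vs. non-phrase queries."""
    phrase, plain = [], []
    for _, qtime, is_phrase in parse_entries(log_text):
        (phrase if is_phrase else plain).append(qtime)
    plain_med = statistics.median(plain)
    phrase_med = statistics.median(phrase)
    return {
        "plain_median_ms": plain_med,
        "phrase_median_ms": phrase_med,
        # log10 of the ratio, i.e. the gap in orders of magnitude
        "orders_of_magnitude": math.log10(phrase_med / plain_med),
    }
```

Calling summarize() on the text of the log file returns the two medians
and the gap between them, which is less sensitive to a single outlier
than eyeballing individual QTimes.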

This is on Solr r654965, which I don't think is *too* far behind the
trunk version. 1200 MB of RAM allocated to Solr. 8M documents. Lots of
compressed, stored fields. Most docs are probably around 50 KB, but some
of them might be 10 MB or even 100 MB. The index as a whole is 106 GB.
maxFieldLength=10000. The index was recently optimized. (It has only
one segment right now.)

I'm thinking that even supposing I've indexed everything in a horribly
inefficient manner, and even supposing my machine is woefully
underpowered, that wouldn't really explain why the phrase queries
would be *that* much slower, would it? Any ideas? Indexing with
termPositions wouldn't help, would it? (Right now I'm not using
termPositions or termVectors.) Or what if I used an alternative query
parser, so phrase queries could be implemented in terms of the
SpanNearQuery class rather than the PhraseQuery class?
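
To make the question concrete, here is a toy sketch in Python of the
general difference as I understand it: a boolean AND only has to
intersect document IDs, while an exact phrase match also has to look at
term positions inside every candidate document. This is a generic
positional-index illustration with made-up function names, not how
PhraseQuery or SpanNearQuery is actually implemented.

```python
from collections import defaultdict

def build_index(docs):
    """Map term -> {doc_id: [positions]} for a dict of doc_id -> text."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(pos)
    return index

def and_query(index, terms):
    """Boolean AND: intersect doc IDs only; positions are never read."""
    doc_sets = [set(index.get(t, {})) for t in terms]
    return set.intersection(*doc_sets) if doc_sets else set()

def phrase_query(index, terms):
    """Exact phrase: run the AND first, then check positions per doc."""
    hits = set()
    for doc_id in and_query(index, terms):
        for start in index[terms[0]][doc_id]:
            # Every following term must sit at the next consecutive position.
            if all(start + i in index[t][doc_id] for i, t in enumerate(terms)):
                hits.add(doc_id)
                break
    return hits
```

In this toy version the phrase query does strictly more work than the
AND query, and the extra work grows with how often the terms occur in
each candidate document, which is why I'd expect common words in large
documents to hurt the most.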

Thanks,
Chris
