I was just running some performance tests against my Solr instance (using the standard query language), and I discovered a shocking (at least to me) speed difference between queries involving phrase queries (i.e. stuff between quotation marks) and ones that don't.
For instance, here are some log snippets for queries without phrases. Note the QTimes:

<message>[exhibitcore] webapp=/solr path=/select/ params={fl=score,companyname&hl.fl=body&hl.snippets=3&hl=on&q=Tall} hits=10 status=0 QTime=672 </message>
<message>[exhibitcore] webapp=/solr path=/select/ params={fl=score,companyname&hl.fl=body&hl.snippets=3&hl=on&q=euro} hits=10 status=0 QTime=656 </message>
<message>[exhibitcore] webapp=/solr path=/select/ params={fl=score,companyname&hl.fl=body&hl.snippets=3&hl=on&q=(((goldman+AND+ernst)+AND+young)+AND+split)} hits=10 status=0 QTime=1563 </message>
<message>[exhibitcore] webapp=/solr path=/select/ params={fl=score,companyname&hl.fl=body&hl.snippets=3&hl=on&q=(reclassification+OR+recapitalization)} hits=10 status=0 QTime=2156 </message>

And here are some with phrases. Again, note the QTimes:

<message>[exhibitcore] webapp=/solr path=/select/ params={fl=score,companyname&hl.fl=body&hl.snippets=3&hl=on&q="Merriman+Curhan"} hits=10 status=0 QTime=5703 </message>
<message>[exhibitcore] webapp=/solr path=/select/ params={fl=score,companyname&hl.fl=body&hl.snippets=3&hl=on&q="Hang+Seng"} hits=10 status=0 QTime=17734 </message>
<message>[exhibitcore] webapp=/solr path=/select/ params={fl=score,companyname&hl.fl=body&hl.snippets=3&hl=on&q="the+mississippi+band+of+choctaw"} hits=10 status=0 QTime=44015 </message>
<message>[exhibitcore] webapp=/solr path=/select/ params={fl=score,companyname&hl.fl=body&hl.snippets=3&hl=on&q="Finite+reinsurance"} hits=10 status=0 QTime=80531 </message>
<message>[exhibitcore] webapp=/solr path=/select/ params={fl=score,companyname&hl.fl=body&hl.snippets=3&hl=on&q="latham+initial+purchaser+by+shearman+sterling"} hits=0 status=0 QTime=914572 </message>
<message>[exhibitcore] webapp=/solr path=/select/ params={fl=score,companyname&hl.fl=body&hl.snippets=3&hl=on&q="strategic+review+committee"} hits=10 status=0 QTime=2829467 </message>

I've tried to pick these samples in a sort-of-random manner from my log file. (I then sorted them.)

Now I gather that phrase queries are inherently slower than non-phrase queries, but a difference of 1-3 orders of magnitude seems noteworthy.

This is on Solr r654965, which I don't think is *too* far behind the trunk version. 1200 MB of RAM allocated to Solr. 8M documents. Lots of compressed, stored fields. Most docs are probably around 50 KB, but some of them might be 10 MB or even 100 MB. The index as a whole is 106 GB. maxFieldLength=10000. The index was recently optimized. (It has only one segment right now.)

I'm thinking that even supposing I've indexed everything in a horribly inefficient manner, and even supposing my machine is woefully underpowered, that wouldn't really explain why the phrase queries would be *that* much slower, would it?

Any ideas? Indexing with termPositions wouldn't help, would it? (Right now I'm not using termPositions or termVectors.) Or what if I used an alternative query parser, so that phrase queries could be implemented in terms of the SpanNearQuery class rather than the PhraseQuery class?

Thanks,
Chris
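
To put a number on the gap between the two groups of log lines above, a small script can pull the QTime out of each line and compare medians. This is only a minimal sketch, assuming log lines shaped like the samples in this message. The function names (`parse_line`, `median_qtimes`) and the rule that any query containing a double quote counts as a phrase query are illustrative assumptions, not anything provided by Solr.

```python
import re
from statistics import median

# Assumed line shape: params={...} hits=N status=N QTime=N (as in the samples above).
LINE_RE = re.compile(
    r"params=\{(?P<params>.*?)\}\s+hits=\d+\s+status=\d+\s+QTime=(?P<qtime>\d+)"
)


def parse_line(line):
    """Return (query, qtime_ms, is_phrase) for one log line, or None if it does not match."""
    match = LINE_RE.search(line)
    if match is None:
        return None
    query = None
    # The params blob is '&'-separated; pick out the q parameter.
    for param in match.group("params").split("&"):
        if param.startswith("q="):
            query = param[len("q="):]
    if query is None:
        return None
    # Assumption: a double quote anywhere in q marks a phrase query.
    return query, int(match.group("qtime")), '"' in query


def median_qtimes(lines):
    """Median QTime in ms for phrase and non-phrase queries (None for an empty group)."""
    phrase, plain = [], []
    for line in lines:
        parsed = parse_line(line)
        if parsed is None:
            continue  # skip lines that are not request log entries
        _, qtime, is_phrase = parsed
        (phrase if is_phrase else plain).append(qtime)
    return {
        "phrase": median(phrase) if phrase else None,
        "plain": median(plain) if plain else None,
    }
```

Passing an open log file (or any iterable of lines) to `median_qtimes` gives one median per group. Medians are used rather than means so that a single extreme query does not dominate the comparison.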