Could you index your 'phrase tags' as single tokens? Then your phrase queries become simple TermQuerys, which skip the position checks a PhraseQuery has to do.
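
To make that concrete, here is a toy sketch in plain Python (not Solr or Lucene code) of the idea: each whole tag is collapsed into one token at index time, so a "phrase" lookup is a single exact-term probe. The tag values, the underscore join, and the function names are all made up for illustration. In Solr the rough equivalent would presumably be an untokenized field (a string field type, or a KeywordTokenizer plus lowercasing), but I have not checked that against your schema.

```python
from collections import defaultdict


def phrase_to_token(phrase):
    """Collapse a multi-word tag into one token, e.g. 'Hard Rock' -> 'hard_rock'."""
    return "_".join(phrase.lower().split())


def build_index(docs):
    """Map each single-token tag to the set of doc ids that carry it."""
    index = defaultdict(set)
    for doc_id, tags in docs.items():
        for tag in tags:
            index[phrase_to_token(tag)].add(doc_id)
    return index


def term_lookup(index, phrase):
    """Exact-term lookup: one dictionary probe, no word positions involved."""
    return index.get(phrase_to_token(phrase), set())


# Hypothetical documents; the tags are invented for this example.
docs = {
    1: ["hard rock", "female vocalist"],
    2: ["Hard  Rock", "rock ballad"],
    3: ["rock", "hard bop"],
}
index = build_index(docs)
print(sorted(term_lookup(index, "hard rock")))
```

The trade-off is that a tag only matches as a whole: searching for "hard" alone finds nothing here, which is fine if your 200 seed values are always complete tags.
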
On Wed, Oct 24, 2012 at 12:26 PM, Robert Muir <rcm...@gmail.com> wrote:
> On Wed, Oct 24, 2012 at 11:09 AM, Aaron Daubman <daub...@gmail.com> wrote:
> > Greetings,
> >
> > We have a solr instance in use that gets some perhaps atypical queries
> > and suffers from poor (>2 second) QTimes.
> >
> > Documents (~2,350,000) in this instance are mainly comprised of
> > various "descriptive fields", such as multi-word (phrase) tags - an
> > average document contains 200-400 phrases like this across several
> > different multi-valued field types.
> >
> > A custom QueryComponent has been built that functions somewhat like a
> > very specific MoreLikeThis. A seed document is specified via the
> > incoming query, its terms are retrieved, boosted both by query
> > parameters as well as fields within the document that specify term
> > weighting, sorted by this custom boosting, and then a second query is
> > crafted by taking the top 200 (sorted by the custom boosting)
> > resulting field values paired with their fields and searching for
> > documents matching these 200 values.
>
> a few more ideas:
> * use shingles e.g. to turn two-word phrases into single terms (how
> long is your average phrase?).
> * in addition to the above, maybe for phrases with > 2 terms, consider
> just a boolean conjunction of the shingled phrases instead of a "real"
> phrase query: e.g. "more like this" -> (more_like AND like_this). This
> would have some false positives.
> * use a more aggressive stopwords list for your "MorePhrasesLikeThis".
> * reduce this number 200, and instead work harder to prune out which
> phrases are the "most descriptive" from the seed document, e.g. based
> on some heuristics like their frequency or location within that seed
> document, so your query isnt so massive.
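
For reference, the shingle and conjunction suggestions quoted above can be sketched in a few lines of plain Python. This is only an illustration of the idea, not Solr's shingle filter: the `tags` field name, the underscore separator, and the helper names are assumptions. The last line shows the kind of false positive Robert mentions, where a document contains both shingles without containing the phrase itself.

```python
def shingles(phrase, size=2):
    """Return word n-grams joined by '_', e.g. 'more like this' -> ['more_like', 'like_this']."""
    words = phrase.lower().split()
    if not words:
        return []
    if len(words) < size:
        # Phrase shorter than the shingle size: keep it as one token.
        return ["_".join(words)]
    return ["_".join(words[i:i + size]) for i in range(len(words) - size + 1)]


def conjunction_query(phrase, field="tags"):
    """Build a boolean AND over the shingles instead of a positional phrase query."""
    terms = [f"{field}:{s}" for s in shingles(phrase)]
    if not terms:
        return ""
    if len(terms) == 1:
        return terms[0]
    return "(" + " AND ".join(terms) + ")"


def conjunction_matches(doc_phrase, query_phrase):
    """True if every shingle of the query also occurs among the document's shingles."""
    return set(shingles(query_phrase)) <= set(shingles(doc_phrase))


print(conjunction_query("more like this"))
print(conjunction_query("hard rock"))
# Both shingles are present, but the words never appear as 'more like this'.
print(conjunction_matches("like this more like", "more like this"))
```

A two-word phrase collapses to a single term, so only phrases of three or more words pay for the extra AND clauses.
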