Hi Mike,

On 06.08.2012 20:29, Mike Sokolov wrote:
> There was some interesting work done on optimizing queries including
> very common words (stop words) that I think overlaps with your problem.
> See this blog post
> http://www.hathitrust.org/blogs/large-scale-search/slow-queries-and-common-words-part-2
> from the Hathi Trust.
>
> The upshot in a nutshell was that queries including terms with very
> large postings lists (ie high occurrences) were slow, and the approach
> they took to dealing with this was to index n-grams (ie pairs and
> triplets of adjacent tokens). However I'm not sure this would help much
> if your queries will typically include only a single token.

This is very interesting for our use case indeed. However, you are right
that indexing n-grams is not per se a solution to my problem, because I'm
working on an application that uses multiple indexes. A query for one
isolated frequent term will presumably be rare, or at least rare enough to
tolerate slow response times, but its results will typically have to be
intersected with results from other indexes.

To illustrate this more practically: the index I described (the one with
relatively few distinct tokens, some of them extremely frequent) indexes
part-of-speech (POS) tags with positional information stored in the
payload. A parallel index holds the actual text. A typical query looks for
a certain POS tag in one index and for a word X at the same position, with
a matching payload, in the other index. So both indexes need to be queried
completely before the intersection can be performed. I've appended two
rough sketches at the end of this mail to make this more concrete.

Best,
Carsten

--
Institut für Deutsche Sprache | http://www.ids-mannheim.de
Projekt KorAP                 | http://korap.ids-mannheim.de
Tel. +49-(0)621-43740789      | schno...@ids-mannheim.de

Korpusanalyseplattform der nächsten Generation
Next Generation Corpus Analysis Platform
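
Sketch 1, the indexing side: a minimal TokenFilter that attaches positional
information to each POS-tag token as a payload. This is a simplified sketch
against the Lucene 3.x payload API (PayloadAttribute / Payload); the class
name and the plain start/end-offset byte layout are made up for this mail,
our actual encoding is a bit different.

import java.io.IOException;
import java.nio.ByteBuffer;

import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.OffsetAttribute;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.index.Payload;

/**
 * Attaches the start/end character offsets of each POS-tag token as a
 * payload so that a hit in the POS index can later be aligned with the
 * parallel text index. The 8-byte layout is only an example.
 */
public final class PositionPayloadFilter extends TokenFilter {

  private final OffsetAttribute offsetAtt = addAttribute(OffsetAttribute.class);
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);

  public PositionPayloadFilter(TokenStream input) {
    super(input);
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (!input.incrementToken()) {
      return false;
    }
    // Encode start and end offset as two 4-byte ints and store them
    // as the token's payload.
    byte[] data = ByteBuffer.allocate(8)
        .putInt(offsetAtt.startOffset())
        .putInt(offsetAtt.endOffset())
        .array();
    payloadAtt.setPayload(new Payload(data));
    return true;
  }
}

In the analyzer for the POS field this filter simply sits at the end of the
token stream chain.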
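
Sketch 2, the query side: conceptually the intersection step looks roughly
like the following, where the complete postings of one POS tag are
materialized before they can be intersected with a word from the parallel
text index. Again simplified: index paths, field names and the terms
"ADJA" / "schnell" are placeholders, doc IDs and token positions are
assumed to be aligned between the two indexes, and for brevity the
comparison is done on raw token positions instead of the payload bytes.

import java.io.File;
import java.io.IOException;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermPositions;
import org.apache.lucene.store.FSDirectory;

/**
 * Rough illustration of why a frequent POS tag hurts: the complete
 * postings list of the tag is collected from the POS index first, and
 * only then intersected with the postings of a word from the parallel
 * text index.
 */
public class ParallelIndexIntersection {

  public static void main(String[] args) throws IOException {
    IndexReader posReader  = IndexReader.open(FSDirectory.open(new File("pos-index")));
    IndexReader textReader = IndexReader.open(FSDirectory.open(new File("text-index")));

    // 1) Query the POS index completely: doc -> set of token positions
    //    where the tag occurs. For a frequent tag this map is huge.
    Map<Integer, Set<Integer>> tagHits = new HashMap<Integer, Set<Integer>>();
    TermPositions posTps = posReader.termPositions(new Term("pos", "ADJA"));
    while (posTps.next()) {
      Set<Integer> positions = new HashSet<Integer>();
      for (int i = 0; i < posTps.freq(); i++) {
        positions.add(posTps.nextPosition());
      }
      tagHits.put(posTps.doc(), positions);
    }

    // 2) Query the text index and intersect: keep only documents where
    //    the word occurs at one of the tag's positions.
    TermPositions textTps = textReader.termPositions(new Term("text", "schnell"));
    while (textTps.next()) {
      Set<Integer> positions = tagHits.get(textTps.doc());
      if (positions == null) {
        continue;
      }
      for (int i = 0; i < textTps.freq(); i++) {
        if (positions.contains(textTps.nextPosition())) {
          System.out.println("match in doc " + textTps.doc());
          break;
        }
      }
    }

    posReader.close();
    textReader.close();
  }
}

The first loop is the expensive part: for a very frequent tag it touches
nearly every document, no matter how selective the word on the text side is.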