On 22-Jan-08, at 11:05 AM, Phillip Farber wrote:
> Currently 1M docs @ ~1.4 MB/doc. Scaling to 7M docs. This is OCR, so
> we are talking perhaps 50K words total to index, so as you point out
> the index might not be too big. It's the *data* that is big, not
> the *index*, right? So I don't think SOLR-303 (distributed
> search) is required here.
> Obviously as the number of documents increases, the index size must
> increase to some degree -- linearly, I think? But what index size
> will result for 7M documents over 50K words, where we're talking
> just 2 fields per doc: one id field and one OCR field of ~1.4 MB?
> Ballpark?
That's 280K tokens per document, assuming ~5 chars/word. That's 2
trillion tokens. Lucene's posting list compression is decent, but
you're still talking about a minimum of 2-4TB for the index (that's
assuming 1 or 2 bytes per token).
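Those figures can be reproduced with a quick back-of-envelope sketch; the ~5 chars/word and 1-2 bytes per posting numbers are the assumptions already stated above, not measurements:

```java
// Back-of-envelope index size for 7M OCR docs of ~1.4 MB each.
public class IndexEstimate {
    static long tokensPerDoc(long bytesPerDoc, long bytesPerToken) {
        return bytesPerDoc / bytesPerToken;
    }

    public static void main(String[] args) {
        long docs = 7_000_000L;
        long perDoc = tokensPerDoc(1_400_000L, 5); // ~5 chars/word -> 280K tokens/doc
        long totalTokens = docs * perDoc;          // ~2 trillion postings
        System.out.println(perDoc + " tokens/doc, " + totalTokens + " total");
        // At 1-2 bytes per compressed posting:
        System.out.printf("index: %.1f - %.1f TB%n",
                totalTokens * 1.0 / 1e12, totalTokens * 2.0 / 1e12);
    }
}
```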
> Regarding single word queries, do you think, say, 0.5 sec/query to
> return 7M score-ranked IDs is possible/reasonable in this scenario?
Well, the average compressed posting list will be at least 80MB, which
needs to be read from the NAS, decoded, and ranked. Since the sizes
are highly skewed (roughly Zipf-distributed), common terms will be
much bigger and rarer terms much smaller.
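The 80MB average falls out directly from the same assumptions (~2 trillion postings spread over ~50K terms, ~2 bytes per compressed posting):

```java
// Average compressed posting list size, same assumptions as above.
public class PostingListEstimate {
    static long avgPostingsPerTerm(long totalPostings, long terms) {
        return totalPostings / terms;
    }

    public static void main(String[] args) {
        long totalPostings = 2_000_000_000_000L;                 // ~2 trillion
        long perTerm = avgPostingsPerTerm(totalPostings, 50_000L); // 40M postings/term
        long bytes = perTerm * 2;                                // ~2 bytes each -> 80 MB
        System.out.println(perTerm + " postings/term, ~" + bytes / 1_000_000 + " MB");
    }
}
```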
You want to return all 7M ids for every query? That in itself would
be hundreds of MB of XML to generate, transfer, and parse.
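For a rough sense of that response size: assuming something like 40 bytes of XML per returned id (my estimate, not a number from the thread), it works out to a few hundred MB per query:

```java
// Rough XML payload size for returning all 7M ids per query.
public class ResponseEstimate {
    static long responseBytes(long ids, long bytesPerEntry) {
        return ids * bytesPerEntry;
    }

    public static void main(String[] args) {
        // ~40 bytes per id entry (element tags + id value) is an assumption
        long bytes = responseBytes(7_000_000L, 40L);
        System.out.println(bytes / 1_000_000 + " MB per query"); // ~280 MB
    }
}
```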
0.5s seems a little optimistic.
Since your queries are so simple, I think it might be better to use
Lucene directly. In that case you can read the matching doc ids for a
term straight from its posting list.
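A minimal sketch of that approach with the Lucene 2.x API of the period (the index path, the "ocr" field name, and the query term are placeholders, not details from the thread):

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Read the doc ids matching a single term straight from its posting list,
// bypassing Solr's query and response layers entirely.
public class TermScan {
    public static void main(String[] args) throws IOException {
        IndexReader reader = IndexReader.open("/path/to/index"); // placeholder path
        TermDocs td = reader.termDocs(new Term("ocr", "whale")); // placeholder field/term
        while (td.next()) {
            int docId = td.doc();  // internal Lucene doc id
            int freq = td.freq();  // term frequency, if needed for ranking
            System.out.println(docId + "\t" + freq);
        }
        td.close();
        reader.close();
    }
}
```

This requires lucene-core on the classpath; the internal doc ids would still need a lookup against the stored id field to become external ids.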
-Mike