On 22-Jan-08, at 11:05 AM, Phillip Farber wrote:

Currently 1M docs @ ~1.4M/doc, scaling to 7M docs. This is OCR, so we are talking perhaps 50K distinct words total to index, so as you point out the index might not be too big. It's the *data* that is big, not the *index*, right? So I don't think SOLR-303 (distributed search) is required here.

Obviously as the number of documents increases the index size must increase to some degree -- linearly, I think? But what index size will result for 7M documents over a 50K-word vocabulary, where we're talking just two fields per doc: one id field and one OCR field of ~1.4M? Ballpark?

That's about 280K tokens per document, assuming ~5 chars/word, or roughly 2 trillion tokens across 7M docs. Lucene's posting-list compression is decent, but you're still talking about a minimum of 2-4TB for the index (assuming 1 or 2 bytes per token).
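
For reference, here is that arithmetic spelled out. Every constant is an assumption taken from the numbers in this thread, not a measurement:

// Back-of-envelope index size; all constants are assumptions from the thread.
public class IndexSizeEstimate {
    public static void main(String[] args) {
        double docs = 7e6;             // 7M documents
        double charsPerDoc = 1.4e6;    // ~1.4M of OCR text per doc
        double charsPerToken = 5;      // ~5 chars/word

        double tokensPerDoc = charsPerDoc / charsPerToken;  // ~280K
        double totalTokens = docs * tokensPerDoc;           // ~2e12

        // Assume 1-2 bytes per token in the postings after compression.
        System.out.printf("tokens/doc:   %.0f%n", tokensPerDoc);
        System.out.printf("total tokens: %.2e%n", totalTokens);
        System.out.printf("index size:   %.1f-%.1f TB%n",
                totalTokens / 1e12, totalTokens * 2 / 1e12);
    }
}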

Regarding single-word queries: do you think, say, 0.5 sec/query to return 7M score-ranked IDs is possible/reasonable in this scenario?

Well, the average compressed posting list will be on the order of 80MB (2 trillion tokens at 1-2 bytes each, spread across ~50K terms) that has to be read from the NAS, decoded, and ranked. Since term frequencies are heavily skewed (roughly Zipfian), common terms will have much bigger lists and rare terms much smaller ones.

You want to return all 7M ids for every query? That by itself is hundreds of MB of XML to generate, transfer, and parse.
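
To put rough numbers on both of those points (the 1-2 bytes/posting figure is carried over from above; the NAS read rate and per-hit XML overhead below are purely illustrative guesses):

// Rough per-query cost sketch. The NAS read rate and the per-hit XML
// overhead are illustrative assumptions, not measurements.
public class QueryCostSketch {
    public static void main(String[] args) {
        double totalTokens = 2e12;     // from the index-size estimate above
        double distinctTerms = 5e4;    // ~50K-word OCR vocabulary
        double bytesPerPosting = 2;    // high end of the 1-2 byte assumption

        double avgListBytes = totalTokens / distinctTerms * bytesPerPosting;
        double nasBytesPerSec = 100e6; // assume ~100MB/s sequential read
        System.out.printf("avg posting list: ~%.0f MB, ~%.1f s just to read%n",
                avgListBytes / 1e6, avgListBytes / nasBytesPerSec);

        double hits = 7e6;             // returning every doc id
        double xmlBytesPerHit = 70;    // guess: id + score wrapped in tags
        System.out.printf("full response: ~%.0f MB of XML%n",
                hits * xmlBytesPerHit / 1e6);
    }
}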

0.5s seems a little optimistic.

Since your queries are so simple, it might be better to use Lucene directly. In that case you can read the matching doc ids for a term straight from the posting list.
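
Something along these lines with the raw Lucene API (the index path, the "ocr" field name, and the query term below are just placeholders for whatever you actually use):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Sketch only: walks the posting list for one term and prints matching
// internal doc ids. Path, field, and term are placeholders.
public class TermDocIds {
    public static void main(String[] args) throws Exception {
        IndexReader reader = IndexReader.open("/path/to/index");
        try {
            TermDocs td = reader.termDocs(new Term("ocr", "whale"));
            try {
                while (td.next()) {
                    // td.doc() is Lucene's internal doc id; td.freq() is the
                    // within-doc term frequency, if you want a crude ranking.
                    System.out.println(td.doc() + "\t" + td.freq());
                }
            } finally {
                td.close();
            }
        } finally {
            reader.close();
        }
    }
}

Keep in mind td.doc() is the internal doc id; mapping back to your stored id field means a reader.document(td.doc()).get("id") lookup per hit, which is itself significant at 7M hits.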

-Mike
