On 22-Jan-08, at 11:05 AM, Phillip Farber wrote:
> Currently 1M docs @ ~1.4 MB/doc. Scaling to 7M docs. This is OCR, so
> we are talking perhaps 50K words total to index, so as you point out
> the index might not be too big. It's the *data* that is big, not
> the *index*, right? So I don't think SOLR-303 (distributed
> search) is required here.
> Obviously as the number of documents increases, the index size must
> increase to some degree -- linearly, I think? But what index size
> will result for 7M documents over 50K words, where we're talking
> just 2 fields per doc: one id field and one OCR field of ~1.4 MB?
> Ballpark?
That's 280K tokens per document, assuming ~5 chars/word. That's 2
trillion tokens. Lucene's posting list compression is decent, but
you're still talking about a minimum of 2-4TB for the index (that's
assuming 1 or 2 bytes per token).
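Those figures can be reproduced with a quick back-of-envelope sketch; the ~5 chars/word and 1-2 bytes per posting numbers are the assumptions already stated above, not measurements:

```java
// Back-of-envelope index size for 7M OCR docs of ~1.4 MB each.
public class IndexEstimate {
    static long tokensPerDoc(long bytesPerDoc, long bytesPerToken) {
        return bytesPerDoc / bytesPerToken;
    }

    public static void main(String[] args) {
        long docs = 7_000_000L;
        long perDoc = tokensPerDoc(1_400_000L, 5); // ~5 chars/word -> 280K tokens/doc
        long totalTokens = docs * perDoc;          // ~2 trillion postings
        System.out.println(perDoc + " tokens/doc, " + totalTokens + " total");
        // At 1-2 bytes per compressed posting:
        System.out.printf("index: %.1f - %.1f TB%n",
                totalTokens * 1.0 / 1e12, totalTokens * 2.0 / 1e12);
    }
}
```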
> Regarding single word queries, do you think, say, 0.5 sec/query to
> return 7M score-ranked IDs is possible/reasonable in this scenario?
Well, the average compressed posting list will be at least 80MB, which
needs to be read from the NAS, decoded, and ranked. Since the sizes
are highly skewed (roughly Zipf-distributed), common terms will be
much bigger and rarer terms much smaller.
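The 80MB average falls out directly from the same assumptions (~2 trillion postings spread over ~50K terms, ~2 bytes per compressed posting):

```java
// Average compressed posting list size, same assumptions as above.
public class PostingListEstimate {
    static long avgPostingsPerTerm(long totalPostings, long terms) {
        return totalPostings / terms;
    }

    public static void main(String[] args) {
        long totalPostings = 2_000_000_000_000L;                 // ~2 trillion
        long perTerm = avgPostingsPerTerm(totalPostings, 50_000L); // 40M postings/term
        long bytes = perTerm * 2;                                // ~2 bytes each -> 80 MB
        System.out.println(perTerm + " postings/term, ~" + bytes / 1_000_000 + " MB");
    }
}
```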
You want to return all 7M ids for every query? That in itself would
be hundreds of MB of XML to generate, transfer, and parse.
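For a rough sense of that response size: assuming something like 40 bytes of XML per returned id (my estimate, not a number from the thread), it works out to a few hundred MB per query:

```java
// Rough XML payload size for returning all 7M ids per query.
public class ResponseEstimate {
    static long responseBytes(long ids, long bytesPerEntry) {
        return ids * bytesPerEntry;
    }

    public static void main(String[] args) {
        // ~40 bytes per id entry (element tags + id value) is an assumption
        long bytes = responseBytes(7_000_000L, 40L);
        System.out.println(bytes / 1_000_000 + " MB per query"); // ~280 MB
    }
}
```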
0.5s seems a little optimistic.
Since your queries are so simple, I think it might be better to use
Lucene directly. In that case you can read the matching doc ids for a
term straight from its posting list.
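A minimal sketch of that approach with the Lucene 2.x API of the period (the index path, the "ocr" field name, and the query term are placeholders, not details from the thread):

```java
import java.io.IOException;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermDocs;

// Read the doc ids matching a single term straight from its posting list,
// bypassing Solr's query and response layers entirely.
public class TermScan {
    public static void main(String[] args) throws IOException {
        IndexReader reader = IndexReader.open("/path/to/index"); // placeholder path
        TermDocs td = reader.termDocs(new Term("ocr", "whale")); // placeholder field/term
        while (td.next()) {
            int docId = td.doc();  // internal Lucene doc id
            int freq = td.freq();  // term frequency, if needed for ranking
            System.out.println(docId + "\t" + freq);
        }
        td.close();
        reader.close();
    }
}
```

This requires lucene-core on the classpath; the internal doc ids would still need a lookup against the stored id field to become external ids.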
-Mike