The HathiTrust Large-Scale Search indexes the OCR from 5 million volumes,
with an average of 200-300 pages per volume, so the total number of pages
indexed would be over 1 billion. However, we are not using pages as Solr
documents; we index each entire book as one document, so we have only 5
million Solr documents rather than 1 billion.
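
To illustrate what that looks like at indexing time (the id and field
name below are made up for the example, not our actual schema), each
volume goes in as a single Solr document, with the OCR for all of its
pages concatenated into one big field:

    <add>
      <doc>
        <field name="id">example.volume.0001</field>
        <field name="ocr">[concatenated OCR text of all 200-300 pages]</field>
      </doc>
    </add>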

We are also not storing the OCRed text. Since the total size of the index
for 5 million volumes is over 2 terabytes, we split the index into 10
shards, each indexing about half a million documents.
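
In schema.xml terms that means the OCR field is indexed but not stored,
something along these lines (again, the field and type names are just
placeholders, not our actual schema):

    <field name="ocr" type="text" indexed="true" stored="false"
           multiValued="false"/>

With stored="false" the index holds only the inverted-index data; the
full text itself lives outside Solr.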

Given all that, our indexes are about 250-300GB for each shard of 500,000
books. About 85% of that is the *.prx position index. Unless you have
enough free memory for the OS to keep a significant portion of the index
in the disk cache, disk I/O is the big bottleneck, especially for phrase
queries containing common words.
See http://www.hathitrust.org/blogs/large-scale-search for more details.
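
For what it's worth, a distributed phrase query across the shards looks
roughly like this (the host names are made up, and the real request
lists all 10 shards in the shards parameter):

    http://search-head:8983/solr/select?q=ocr:%22the+history+of+the+united+states%22&shards=shard1:8983/solr,shard2:8983/solr,...,shard10:8983/solr

Every word in that phrase is common, so each shard ends up reading a lot
of position data from its *.prx file to check word positions, and that
is exactly where the disk I/O goes.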

Have you considered storing the OCR separately rather than in the Solr
index, or does your use case require storing the OCR in the index?


Tom Burton-West
Digital Library Production Service
University of Michigan



Wick2804 wrote:
> 
> We are thinking of creating a Lucene/Solr project to store 50 million
> full-text OCRed A4 pages. Is there anyone out there who could provide
> some guidance on the size of index we are likely to generate, and are
> there any gotchas in the standard analysis engines for load and query
> that will cause us issues? Do large indexes cause memory issues on
> servers? Any help or advice greatly appreciated.
> 
