The HathiTrust Large Search indexes the OCR from 5 million volumes, with an average of 200-300 pages per volume, so the total number of pages indexed is over 1 billion. However, we are not using pages as Solr documents; we index the entire book as one document, so we have 5 million Solr documents rather than 1 billion.
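For what it's worth, the page count above works out as a quick back-of-the-envelope check (round figures from this post, not measured values):

```python
# Rough page-count arithmetic from the figures above.
volumes = 5_000_000
avg_pages_low, avg_pages_high = 200, 300  # average pages per volume

total_pages = (volumes * avg_pages_low, volumes * avg_pages_high)
print(total_pages)  # (1000000000, 1500000000) -- over 1 billion pages either way
```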
We also are not storing the OCRed text. Since the total size of the index for 5 million volumes is over 2 terabytes, we split the index into 10 shards, each indexing about half a million documents. Given all that, our indexes are about 250-300GB for each 500,000 books. About 85% of that is the *prx position index. Unless you have enough memory for the OS to keep a significant amount of the index in the disk cache, disk I/O is the big bottleneck, especially for phrase queries with common words. See http://www.hathitrust.org/blogs/large-scale-search for more details.

Have you considered storing the OCR separately rather than in the Solr index, or does your use case require storing the OCR in the index?

Tom Burton-West
Digital Library Production Service
University of Michigan

Wick2804 wrote:
>
> We are thinking of creating a Lucene Solr project to store 50 million full
> text OCRed A4 pages. Is there anyone out there who could provide some kind
> of guidance on the size of index we are likely to generate, and are there
> any gotchas in the standard analysis engines for load and query that will
> cause us issues? Do large indexes cause memory issues on servers? Any
> help or advice greatly appreciated.
>

--
View this message in context: http://old.nabble.com/Solr-Performance-and-Scalability-tp27552013p27553353.html
Sent from the Solr - Dev mailing list archive at Nabble.com.
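For reference, the sharding figures in the reply above tie together in a quick sketch (assumed round numbers taken from the post, not measured values):

```python
# Sharding arithmetic from the figures in the reply (rough, round numbers).
total_volumes = 5_000_000
num_shards = 10
gb_per_shard = (250, 300)   # quoted index size range per ~500,000-book shard
prx_fraction = 0.85         # share taken by the *prx position index

docs_per_shard = total_volumes // num_shards                   # 500,000 books per shard
total_index_gb = tuple(g * num_shards for g in gb_per_shard)   # (2500, 3000) -> "over 2 TB"
prx_gb_per_shard = tuple(round(g * prx_fraction) for g in gb_per_shard)

print(docs_per_shard, total_index_gb, prx_gb_per_shard)
```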