Thank you all for the insight and help. Our Solr instance has multiple collections. Do you know whether the LucidWorks spreadsheet (https://lucidworks.com/blog/2011/09/14/estimating-memory-and-storage-for-lucenesolr/) is meant to calculate sizing per collection, or for the whole Solr instance (which contains multiple collections)?
The reason I am asking is that there are some defaults, like "Transient (MB)" (with a value of 10 MB), specified in the "Disk Space Estimator" sheet; I am not sure whether these default values are per collection or for the whole Solr instance.

Thanks,
Vasu

On Thu, Oct 6, 2016 at 9:42 PM, Walter Underwood <wun...@wunderwood.org> wrote:

> The square-root rule comes from a short paper draft (unpublished) that I
> can’t find right now. But this paper gets the same result:
>
> http://nflrc.hawaii.edu/rfl/April2005/chujo/chujo.html
>
> Perfect OCR would follow this rule, but even great OCR has lots of errors.
> 95% accuracy is good OCR performance, but that makes a huge, pathological
> long tail of non-language terms.
>
> I learned about the OCR problems from the Hathi Trust. They hit the Solr
> vocabulary limit of 2.4 billion terms, then when that was raised, they hit
> memory management issues.
>
> https://www.hathitrust.org/blogs/large-scale-search/too-many-words
> https://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
>
> wunder
> Walter Underwood
> wun...@wunderwood.org
> http://observer.wunderwood.org/ (my blog)
>
>
> > On Oct 6, 2016, at 8:05 AM, Rick Leir <rl...@leirtech.com> wrote:
> >
> > I am curious to know where the square-root assumption is from, and why
> > OCR (without errors) would break it. TIA
> >
> > cheers - - Rick
> >
> > On 2016-10-04 10:51 AM, Walter Underwood wrote:
> >> No, we don’t have OCR’ed text. But if you do, it breaks the assumption
> >> that vocabulary size is the square root of the text size.
> >>
> >> wunder
> >> Walter Underwood
> >> wun...@wunderwood.org
> >> http://observer.wunderwood.org/ (my blog)
> >>
> >>
> >>> On Oct 4, 2016, at 7:14 AM, Rick Leir <rl...@leirtech.com> wrote:
> >>>
> >>> OCR’ed text can have large amounts of garbage such as '';,-d'."
> >>> particularly when there is poor image quality or embedded graphics.
> >>> Is that what is causing your huge vocabularies? I filtered the text,
> >>> removing any word with fewer than 3 alphanumerics or more than 2
> >>> non-alphas.
> >>>
> >>>
> >>> On 2016-10-03 09:30 PM, Walter Underwood wrote:
> >>>> That approach doesn’t work very well for estimates.
> >>>>
> >>>> Some parts of the index size and speed scale with the vocabulary
> >>>> instead of the number of documents. Vocabulary usually grows at
> >>>> about the square root of the total amount of text in the index.
> >>>> OCR’ed text breaks that estimate badly, with huge vocabularies.
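
For anyone doing this estimate themselves, here is a rough, illustrative Python sketch of the two rules of thumb from the quoted thread: the square-root vocabulary estimate and the kind of token filter Rick described. The function names and exact thresholds are my own paraphrase, not code from the spreadsheet or from Rick; treat it as a back-of-the-envelope aid only.

    import math

    def estimated_vocabulary(total_tokens):
        # Square-root rule of thumb for clean (non-OCR) text:
        # unique terms grow roughly as the square root of the total
        # number of tokens indexed (ignoring any constant factor).
        return int(math.sqrt(total_tokens))

    def keep_token(token):
        # Filter along the lines Rick described: drop any word with
        # fewer than 3 alphanumeric characters or more than 2
        # non-alphabetic characters, which removes most OCR garbage
        # like '';,-d'."
        alnum = sum(c.isalnum() for c in token)
        non_alpha = sum(not c.isalpha() for c in token)
        return alnum >= 3 and non_alpha <= 2

    # Example: 1 billion clean tokens -> about 32,000 unique terms
    # under the square-root rule. OCR errors can inflate that by
    # orders of magnitude, which is what breaks the sizing estimate.
    print(estimated_vocabulary(1_000_000_000))
    print([t for t in ["search", "x1", "'';,-d'.\""] if keep_token(t)])

The per-collection vs. per-instance question about the spreadsheet defaults still stands; the sketch above only covers the vocabulary side of the estimate.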