The square-root rule comes from a short paper draft (unpublished) that I can’t find right now. But this paper gets the same result:
http://nflrc.hawaii.edu/rfl/April2005/chujo/chujo.html

Perfect OCR would follow this rule, but even great OCR has lots of errors. 95% accuracy is good OCR performance, but that makes a huge, pathological long tail of non-language terms.

I learned about the OCR problems from HathiTrust. They hit the Solr vocabulary limit of 2.4 billion terms, then when that limit was raised, they hit memory management issues.

https://www.hathitrust.org/blogs/large-scale-search/too-many-words
https://www.hathitrust.org/blogs/large-scale-search/too-many-words-again

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)

> On Oct 6, 2016, at 8:05 AM, Rick Leir <rl...@leirtech.com> wrote:
>
> I am curious to know where the square-root assumption is from, and why OCR
> (without errors) would break it. TIA
>
> cheers -- Rick
>
> On 2016-10-04 10:51 AM, Walter Underwood wrote:
>> No, we don’t have OCR’ed text. But if you do, it breaks the assumption that
>> vocabulary size is the square root of the text size.
>>
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>>
>>
>>> On Oct 4, 2016, at 7:14 AM, Rick Leir <rl...@leirtech.com> wrote:
>>>
>>> OCR’ed text can have large amounts of garbage such as '';,-d'."
>>> particularly when there is poor image quality or embedded graphics. Is that
>>> what is causing your huge vocabularies? I filtered the text, removing any
>>> word with fewer than 3 alphanumerics or more than 2 non-alphas.
>>>
>>>
>>> On 2016-10-03 09:30 PM, Walter Underwood wrote:
>>>> That approach doesn’t work very well for estimates.
>>>>
>>>> Some parts of the index size and speed scale with the vocabulary instead
>>>> of the number of documents.
>>>>
>>>> Vocabulary usually grows at about the square root of the total amount of
>>>> text in the index. OCR’ed text breaks that estimate badly, with huge
>>>> vocabularies.
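The two ideas in the thread can be sketched in a few lines of Python. This is a minimal illustration, not anyone's actual code: the square-root rule is Heaps' law with an exponent of 0.5 (the constant `k` and the function names `estimate_vocabulary` and `keep_token` are my own, illustrative choices), and the filter follows Rick's description of dropping any token with fewer than 3 alphanumerics or more than 2 non-alphanumerics.

```python
def estimate_vocabulary(total_tokens, k=1.0, beta=0.5):
    """Heaps'-law estimate of distinct terms: V ~ k * N**beta.
    beta=0.5 gives the square-root rule discussed in the thread;
    k and beta are illustrative defaults, not fitted values."""
    return k * total_tokens ** beta

def keep_token(token):
    """The filter Rick describes: drop any token with fewer than
    3 alphanumeric characters or more than 2 non-alphanumerics."""
    alnum = sum(c.isalnum() for c in token)
    non_alnum = len(token) - alnum
    return alnum >= 3 and non_alnum <= 2

# OCR junk like '';,-d'." is dropped; ordinary words survive.
tokens = ["search", "'';,-d'.\"", "qu4lity", "ab", "indexing!!"]
print([t for t in tokens if keep_token(t)])  # → ['search', 'qu4lity', 'indexing!!']

# Under the square-root rule, 1M tokens of clean text would yield
# roughly 1,000 distinct terms; OCR error tails blow this up.
print(int(estimate_vocabulary(1_000_000)))  # → 1000
```

OCR errors break the estimate because each misrecognition can mint a brand-new "term", so vocabulary stops flattening with N and the long tail keeps growing.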