The square-root rule comes from a short, unpublished paper draft that I can’t 
find right now, but this paper reaches the same result:

http://nflrc.hawaii.edu/rfl/April2005/chujo/chujo.html

Perfect OCR would follow this rule, but even very good OCR makes plenty of 
errors. 95% accuracy is good OCR performance, yet it still produces a huge, 
pathological long tail of non-language terms, because almost every corrupted 
token is a new unique term.
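
Here’s a rough simulation of the effect (a sketch with assumed parameters, 
not numbers from any of these papers): sample tokens from a Zipf-ish 
vocabulary, corrupt characters at a 5% rate, and count unique terms.

# Sketch only: vocabulary growth with and without simulated OCR noise.
# The 50,000-word Zipf vocabulary and 5% per-character error rate are
# assumptions for illustration.
import random
import string

random.seed(42)

clean_vocab = [f"word{i}" for i in range(50_000)]
# Zipf-like weights: rank r gets weight 1/r.
weights = [1.0 / r for r in range(1, len(clean_vocab) + 1)]

def corrupt(token, char_error_rate=0.05):
    # Replace each character with a random letter at the error rate,
    # mimicking roughly 95%-accurate OCR.
    return "".join(
        random.choice(string.ascii_lowercase)
        if random.random() < char_error_rate else c
        for c in token
    )

for n_tokens in (10_000, 100_000, 1_000_000):
    sample = random.choices(clean_vocab, weights=weights, k=n_tokens)
    clean_terms = len(set(sample))
    noisy_terms = len({corrupt(t) for t in sample})
    print(f"{n_tokens:>9} tokens: {clean_terms:>6} clean terms, "
          f"{noisy_terms:>7} noisy terms")

Even at 95% accuracy, a large fraction of tokens carry at least one error, 
and nearly every corrupted token lands in the long tail as a unique term.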

I learned about the OCR problems from HathiTrust. They hit the Solr 
vocabulary limit of 2.4 billion terms, and when that limit was raised, they 
ran into memory management issues.

https://www.hathitrust.org/blogs/large-scale-search/too-many-words
https://www.hathitrust.org/blogs/large-scale-search/too-many-words-again
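
Filtering the tokens helps a lot. Rick’s heuristic (quoted below) might look 
something like this in Python; the thresholds are his, the regex reading of 
“alphanumerics” and “non-alphas” is mine:

# Sketch of Rick's filter (thresholds from his message below).
import re

ALNUM = re.compile(r"[0-9A-Za-z]")
NON_ALPHA = re.compile(r"[^A-Za-z]")

def keep(token):
    # Keep a token only if it has at least 3 alphanumeric characters
    # and at most 2 non-alphabetic ones.
    return (len(ALNUM.findall(token)) >= 3
            and len(NON_ALPHA.findall(token)) <= 2)

tokens = ["'';,-d'.\"", "hello", "c0rrupt3d!!", "ab", "OCR'd"]
print([t for t in tokens if keep(t)])   # ['hello', "OCR'd"]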

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


> On Oct 6, 2016, at 8:05 AM, Rick Leir <rl...@leirtech.com> wrote:
> 
> I am curious to know where the square-root assumption is from, and why OCR 
> (without errors) would break it. TIA
> 
> cheers - - Rick
> 
> On 2016-10-04 10:51 AM, Walter Underwood wrote:
>> No, we don’t have OCR’ed text. But if you do, it breaks the assumption that 
>> vocabulary size
>> is the square root of the text size.
>> 
>> wunder
>> Walter Underwood
>> wun...@wunderwood.org
>> http://observer.wunderwood.org/  (my blog)
>> 
>> 
>>> On Oct 4, 2016, at 7:14 AM, Rick Leir <rl...@leirtech.com> wrote:
>>> 
>>> OCR’ed text can have large amounts of garbage such as '';,-d'." 
>>> particularly when there is poor image quality or embedded graphics. Is that 
>>> what is causing your huge vocabularies? I filtered the text, removing any 
>>> word with fewer than 3 alphanumerics or more than 2 non-alphas.
>>> 
>>> 
>>> On 2016-10-03 09:30 PM, Walter Underwood wrote:
>>>> That approach doesn’t work very well for estimates.
>>>> 
>>>> Some parts of the index size and speed scale with the vocabulary instead 
>>>> of the number of documents.
>>>> Vocabulary usually grows at about the square root of the total amount of 
>>>> text in the index. OCR’ed text
>>>> breaks that estimate badly, with huge vocabularies.
>>>> 
>>>> 
> 
