I am curious to know where the square-root assumption is from, and why OCR (without errors) would break it. TIA

cheers -- Rick

On 2016-10-04 10:51 AM, Walter Underwood wrote:
No, we don’t have OCR’ed text. But if you do, it breaks the assumption that vocabulary size is the square root of the text size.
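(For context: this is essentially Heaps' law, which says the number of distinct terms grows as a power of the total token count, with an exponent near 0.5 for typical English text. Below is a minimal sketch of the estimate; the constant and exponent are illustrative assumptions, not measured values. OCR garbage tokens are mostly unique, which is what pushes real growth well above this curve.)

    # Heaps' law sketch: vocabulary V ~= K * n**beta. beta ~= 0.5 is the
    # square-root rule of thumb; K and beta vary by corpus, so these
    # constants are assumptions, not measurements.
    def estimated_vocabulary(total_tokens, k=10.0, beta=0.5):
        """Rough estimate of distinct-term count from total token count."""
        return int(k * total_tokens ** beta)

    # Example: 100 million tokens of clean text -> roughly 100,000 distinct
    # terms with these assumed constants.
    print(estimated_vocabulary(100_000_000))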

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Oct 4, 2016, at 7:14 AM, Rick Leir <rl...@leirtech.com> wrote:

OCR’ed text can have large amounts of garbage such as '';,-d'." particularly when there is poor image quality or embedded graphics. Is that what is causing your huge vocabularies? I filtered the text, removing any word with fewer than 3 alphanumerics or more than 2 non-alphas.
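(A minimal sketch of that kind of filter, assuming "word" means a whitespace-separated token and "non-alphas" means non-alphabetic characters; the regexes and token list are illustrative, not the code actually used.)

    import re

    ALNUM = re.compile(r'[A-Za-z0-9]')
    NON_ALPHA = re.compile(r'[^A-Za-z]')

    def keep_token(token):
        """Keep a token only if it has at least 3 alphanumeric characters
        and no more than 2 non-alphabetic characters."""
        return (len(ALNUM.findall(token)) >= 3 and
                len(NON_ALPHA.findall(token)) <= 2)

    # OCR garbage like '';,-d'." is dropped; ordinary words survive.
    tokens = ["budget", "'';,-d'.\"", "co-op", "f1gure", "a1"]
    print([t for t in tokens if keep_token(t)])   # ['budget', 'co-op', 'f1gure']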


On 2016-10-03 09:30 PM, Walter Underwood wrote:
That approach doesn’t work very well for estimates.

Some parts of the index size and speed scale with the vocabulary instead of the number of documents. Vocabulary usually grows at about the square root of the total amount of text in the index. OCR’ed text breaks that estimate badly, with huge vocabularies.


