I am curious to know where the square-root assumption is from, and why OCR (without errors) would break it. TIA

cheers -- Rick

On 2016-10-04 10:51 AM, Walter Underwood wrote:
No, we don’t have OCR’ed text. But if you do, it breaks the assumption that vocabulary size is the square root of the text size.
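(For context: this is essentially Heaps' law, which says the number of distinct terms grows as a power of the total token count, with an exponent near 0.5 for typical English text. Below is a minimal sketch of the estimate; the constant and exponent are illustrative assumptions, not measured values. OCR garbage tokens are mostly unique, which is what pushes real growth well above this curve.)

    # Heaps' law sketch: vocabulary V ~= K * n**beta. beta ~= 0.5 is the
    # square-root rule of thumb; K and beta vary by corpus, so these
    # constants are assumptions, not measurements.
    def estimated_vocabulary(total_tokens, k=10.0, beta=0.5):
        """Rough estimate of distinct-term count from total token count."""
        return int(k * total_tokens ** beta)

    # Example: 100 million tokens of clean text -> roughly 100,000 distinct
    # terms with these assumed constants.
    print(estimated_vocabulary(100_000_000))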

wunder
Walter Underwood
wun...@wunderwood.org
http://observer.wunderwood.org/  (my blog)


On Oct 4, 2016, at 7:14 AM, Rick Leir <rl...@leirtech.com> wrote:

OCR’ed text can have large amounts of garbage such as '';,-d'." particularly when there is poor image quality or embedded graphics. Is that what is causing your huge vocabularies? I filtered the text, removing any word with fewer than 3 alphanumerics or more than 2 non-alphas.
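(A minimal sketch of that kind of filter, assuming "word" means a whitespace-separated token and "non-alphas" means non-alphabetic characters; the regexes and token list are illustrative, not the code actually used.)

    import re

    ALNUM = re.compile(r'[A-Za-z0-9]')
    NON_ALPHA = re.compile(r'[^A-Za-z]')

    def keep_token(token):
        """Keep a token only if it has at least 3 alphanumeric characters
        and no more than 2 non-alphabetic characters."""
        return (len(ALNUM.findall(token)) >= 3 and
                len(NON_ALPHA.findall(token)) <= 2)

    # OCR garbage like '';,-d'." is dropped; ordinary words survive.
    tokens = ["budget", "'';,-d'.\"", "co-op", "f1gure", "a1"]
    print([t for t in tokens if keep_token(t)])   # ['budget', 'co-op', 'f1gure']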


On 2016-10-03 09:30 PM, Walter Underwood wrote:
That approach doesn’t work very well for estimates.

Some parts of the index size and speed scale with the vocabulary instead of the number of documents. Vocabulary usually grows at about the square root of the total amount of text in the index. OCR’ed text breaks that estimate badly, with huge vocabularies.


