Erick Erickson wrote:
I'm surprised, as you are, by the non-linearity. Out of curiosity, what is
your MaxFieldLength? By default only the first 10,000 tokens are added
to a field per document. If you haven't set this higher, that could account
for it.

We set it to a very large number so we index the entire document.
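
Concretely, that amounts to something like the following (a minimal sketch against the Lucene 2.x IndexWriter API; in Solr this corresponds to the maxFieldLength setting in solrconfig.xml, and the index path and analyzer here are just placeholders):

import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.store.FSDirectory;

public class RaiseFieldLimit {
    public static void main(String[] args) throws Exception {
        // Open a writer on a local index (path and analyzer are placeholders).
        IndexWriter writer = new IndexWriter(
            FSDirectory.getDirectory("/path/to/index"),
            new StandardAnalyzer(),
            true);
        // Lucene's default is 10,000 tokens per field; effectively remove the
        // cap so the whole OCR text of each book gets indexed.
        writer.setMaxFieldLength(Integer.MAX_VALUE);
        // ... add documents ...
        writer.close();
    }
}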


As far as I know, optimization shouldn't really affect the index size if you
are not deleting documents, but I'm no expert in that area.

I've indexed OCR data and it's no fun, for the reasons you cite. We had
better search results when we cleaned the data at index time. By "cleaning"
I mean we took out all of the characters that *couldn't* be indexed. What
*can't* be indexed depends upon your requirements, but in our case we could
restrict ourselves to the low-ASCII characters by folding all the accented
characters into their low-ASCII counterparts, because we had no need for
native-language support. We also replaced most non-printing characters
with spaces. A legitimate question is whether indexing single characters
makes sense (in our case, genealogy, it actually does. Siiiiggghhh).
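
The sort of transform we're talking about is roughly this (a sketch in plain Java rather than what we actually ran; the class and method names are made up for the example):

import java.text.Normalizer;

public class OcrCleaner {
    // Fold accented characters to their low-ASCII counterparts and replace
    // non-printing characters with spaces. Note this deliberately throws away
    // non-Latin text, so it's only an option when native-language search is
    // not a requirement.
    public static String clean(String ocr) {
        // Decompose e.g. "é" into "e" plus a combining accent, then strip the accents.
        String folded = Normalizer.normalize(ocr, Normalizer.Form.NFD)
                                  .replaceAll("\\p{M}+", "");
        // Replace control characters with spaces.
        folded = folded.replaceAll("\\p{Cntrl}", " ");
        // Anything else still outside printable ASCII also becomes a space.
        return folded.replaceAll("[^\\x20-\\x7E]", " ");
    }
}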

Fortunately, non-printing characters are not a problem, but we need native-language query support, so limiting to US-ASCII will not work for us. One possibility is to identify the dominant language in the document and use dictionaries to remove junk; however, proper names are a big problem with that approach. Another might be to use heuristics like removing "words" with numbers in the middle of them. Whatever we do will have to be fast.
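
For example, the digits-in-the-middle idea might look something like this (just a sketch; the pattern and the junk test are guesses at this point, not something we've settled on):

import java.util.regex.Pattern;

public class JunkTokenHeuristic {
    // A "word" with digits sandwiched between letters (e.g. "qu1ck") is far
    // more likely to be an OCR error than a real term.
    private static final Pattern LETTERS_DIGITS_LETTERS =
        Pattern.compile(".*\\p{L}\\p{N}+\\p{L}.*");

    public static boolean looksLikeJunk(String token) {
        return LETTERS_DIGITS_LETTERS.matcher(token).matches();
    }
    // looksLikeJunk("qu1ck")   -> true
    // looksLikeJunk("route66") -> false (digits only at the end)
}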


In a mixed-language environment, this provided surprisingly good results
given how crude the transformations were. Of course it's totally
unacceptable to mangle non-English text this crudely if you must
support native-language searching.

Yes.


I'd be interested in how this changes your index size if you do decide
to try it. There's nothing like having somebody else do research for
me <G>.


Best
Erick

On Wed, Aug 13, 2008 at 1:45 PM, Phillip Farber <[EMAIL PROTECTED]> wrote:

We're indexing the OCR for a large number of books. Our experimental
schema is simple: an id field and an ocr text field (not stored).

Currently we just have two data points:

3005 documents = 723 MB index
174237 documents = 51460 MB index

These indexes are not optimized.

If the index size were a linear function of the number of documents, then based
on just these two data points you'd expect the index for 174237 docs to be
approximately 57.98 times the size of the 723 MB index, or about 41921 MB.
Actually it's 51460 MB, or about 23% bigger.

I suspect the non-linear increase is due to dirty OCR that continually
increases the number of unique words that need to be indexed.

Another possibility is that the larger index has a higher proportion of
documents containing characters from non-Latin alphabets, thereby increasing
the number of unique words. I can't verify that at this point.

Are these reasonable assumptions or am I missing other factors that could
contribute to the non-linear growth in index size?

Regards,

Phil


