I have about 5MB of PDF files that I'm trying to index.  I'm using pdftotext
for this, and I thought it was working fine, since most of the text was
indexed correctly.  However, one of the things we need to do is search for
numbers, specifically percentages, though matching just the number part
would be fine.  When I run pdftotext from the command line, the test number
I've been searching for shows up in the resulting output.  When I search
with htdig, however, the number isn't found.  I've set allow_numbers: true,
minimum_word_length: 2, and max_doc_size to something like 10MB.  I've also
tried using extra_word_characters and valid_punctuation, to no avail.
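
The settings described above would look roughly like this in an htdig.conf
file.  This is only a sketch: the byte value for max_doc_size is one reading
of "something like 10MB", and the value shown for extra_word_characters is
an illustrative assumption, since the values actually tried are not listed
here.  valid_punctuation is left out for the same reason.

```
# Treat strings of digits as indexable words.
allow_numbers:          true

# Index words of two or more characters, so "25" qualifies.
minimum_word_length:    2

# Document size limit in bytes (10 * 1024 * 1024); assumed value.
max_doc_size:           10485760

# Assumed value: treat "%" as part of a word, so "25%" stays one token.
extra_word_characters:  %
```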

So I'm guessing this is either a problem with the percent sign (25%, etc.),
or htdig is not indexing _all_ of the words.

Has anyone else run into this?

Phil Varner

