[htdig] pdf files not being processed

Richard Peskin Fri, 04 Mar 2005 14:29:52 -0800

I have 3 directories of PDF files (about 100 files each); these directories are at the same level, at the top level of my "start_url". The start_url directory is indexed in Apache. Running "rundig" with verbose output, it is clear that many files are not being processed (text extraction). Yet if I pick a file (any file) and manually run "pdftotext" on it, I do get a ".txt" output from pdftotext. If I leave that .txt file in place, I can successfully search for words from that file.

I have no clue as to what is happening here. If I can manually run pdftotext, why is this not being done by rundig, particularly given that the directories are indexed? The only hint of a problem I see is the " :NOT HTML" message while rundig is running.
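
The manual check described above can be scripted to see which PDFs pdftotext itself cannot convert. This is only a minimal sketch, assuming pdftotext is on the PATH and that "pdftotext FILE -" writes the extracted text to standard output. The function name find_failed_pdfs and its converter parameter are hypothetical, and the check exercises pdftotext in isolation, not the way rundig invokes it.

```python
import subprocess
from pathlib import Path


def find_failed_pdfs(root, converter=("pdftotext",)):
    """Return PDFs under root for which the converter fails or yields no text.

    `converter` is a hypothetical parameter: the command prefix to run,
    assumed to accept "FILE -" and print the extracted text to stdout.
    """
    failed = []
    # Note: the "*.pdf" pattern is case-sensitive on most Unix filesystems.
    for pdf in sorted(Path(root).rglob("*.pdf")):
        result = subprocess.run(
            [*converter, str(pdf), "-"],
            capture_output=True,
        )
        # Treat a non-zero exit status or empty output as a failed extraction.
        if result.returncode != 0 or not result.stdout.strip():
            failed.append(pdf)
    return failed
```

Calling find_failed_pdfs on the directory that start_url points to returns the files that pdftotext could not extract text from. An empty list would suggest the PDFs themselves are fine and the problem lies in how rundig is configured to handle them.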

Any help is appreciated.

--dick peskin

____________________________________
Richard L. Peskin, RLP Consulting, Londonderry, VT
http://www.rlpcon.com
http://www.caip.rutgers.edu/~peskin
