Hi all! Here’s someone seeking help.

About a month ago I started mission impossible. We have been using htdig on
Linux for as long as I have worked for this company, but we had never used it
on Windows before. Then one of our clients wanted an intranet, hosted on their
Windows server. Since the Microsoft Indexing Server wasn’t what we were
looking for, we decided to try and use htdig. It wasn’t easy to get it running
on our testing machine (Windows 2003), but in the end I pulled it off. Htdig is
indexing a website, including linked Word and PDF files.

BUT, there is also a problem. This is what happens. The client uploaded a PDF
file which contains 1 page. The page was no longer than 100 words. After being
indexed, the document was found and shown in the results page when searching
on one of the words inside the PDF. But some of the words didn’t get indexed.
There were a couple of words used in the PDF file that no other document
contained. Searching for those words resulted in a ‘no results’ page.

Running the same scripts on the original document gave me the content of the
PDF in plain text. All words were there, no errors.

I’ve tweaked the scripts a bit, so that the temporary downloaded file, which
htdig converts and indexes, was not deleted. I found the right tempfile,
renamed it to .pdf and tried to open it. It gave me a couple of errors and
didn’t return the whole document.

Can anyone help me with this problem? I’ve come a long way, and I really like
htdig, so I do not want to have to change to another digger ;) It’s probably a
minor thing which I am overlooking, but any suggestions are welcome! If code
needs altering, please point me to the right place. Recompiling can be done if
needed.

Thanks for your help!

Marco Houtman

PS: INFO!

Htdig version: 3.2.0b6
OS version: Windows 2003
Compiled with: Cygwin (no idea which version, installed ages ago on my PC, but
recently upgraded)

Here is my config file (I’ve stripped out all comments):

======================================
template_dir: c:/htdig/templates
start_url: http://intranet/
limit_urls_to: ${start_url}
common_url_parts: ${limit_urls_to} .html .htm .shtml .php
exclude_urls: /cgi-bin/ .cgi
bad_extensions: .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css
maintainer: [EMAIL PROTECTED]
max_head_length: 1000000
max_doc_size: 20000000
no_excerpt_show_top: false
search_algorithm: exact:1 prefix:0.5 synonyms:0.5 endings:0.1
template_map: Long long ${common_dir}/long.html \
              Short short ${common_dir}/short.html \
              Intranet intranet ${template_dir}/intranet.html
template_name: intranet
nothing_found_file: ${template_dir}/nomatch.html
search_results_footer: ${template_dir}/footer.html
search_results_header: ${template_dir}/header.html
next_page_text: next
no_next_page_text:
prev_page_text: prev
# Extra settings:
matches_per_page: 3000
maximum_pages: 100
external_parsers: application/msword->text/html "c:/htdig/bin/doc2html.bat" \
                  application/pdf->text/html "c:/htdig/bin/pdf2html.bat"
user_agent: Intranet_digger
max_hop_count: 20
max_prefix_matches: 100
minimum_prefix_length: 2
prefix_match_character: *
sort: score
======================================

Pdfinfo on the original document:

Producer: PDFXC Library (vesion 1.0).
CreationDate: Fri Nov 12 12:57:41 2004
Tagged: no
Pages: 1
Encrypted: no
Page size: 595.2 x 841.92 pts (A4)
File size: 5777 bytes
Optimized: no
PDF version: 1.3

Pdfinfo on the htdig tempfile:

Error (0): PDF file is damaged - attempting to reconstruct xref table...   // oops!
Producer: PDFXC Library (vesion 1.0).
CreationDate: Fri Nov 12 12:57:41 2004
Tagged: no
Pages: 1
Encrypted: no
Page size: 595.2 x 841.92 pts (A4)
File size: 5786 bytes   // hmm, file is slightly bigger than original
Optimized: no
PDF version: 1.3
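
A possible next step would be to compare the original PDF and the htdig
tempfile byte by byte, to see where the extra bytes come from. The sketch below
is only an illustration and has not been run against these two files. The
function names and the script name are made up for this example. It also tests
one hypothesis that is purely an assumption, not something confirmed above:
that the tempfile was written in text mode, so every LF byte became CR LF.

```python
import sys


def first_difference(a: bytes, b: bytes) -> int:
    """Return the offset of the first differing byte, or -1 if a == b."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # One input is a prefix of the other (or they are identical).
    return -1 if len(a) == len(b) else min(len(a), len(b))


def compare_bytes(original: bytes, copy: bytes) -> dict:
    """Summarise how `copy` differs from `original`."""
    return {
        "size_delta": len(copy) - len(original),
        "first_difference": first_difference(original, copy),
        "lf_count": original.count(b"\n"),
        # Hypothesis check (an assumption): would rewriting every LF as
        # CR LF, as a text-mode write might, turn the original into the copy?
        "matches_lf_to_crlf": (
            copy != original and copy == original.replace(b"\n", b"\r\n")
        ),
    }


if __name__ == "__main__" and len(sys.argv) == 3:
    # Hypothetical usage: python compare_pdf.py original.pdf tempfile.pdf
    with open(sys.argv[1], "rb") as f:
        original = f.read()
    with open(sys.argv[2], "rb") as f:
        copy = f.read()
    for key, value in compare_bytes(original, copy).items():
        print(f"{key}: {value}")
```

For the two files above, size_delta would be 9 (5786 - 5777 bytes). If
matches_lf_to_crlf came out true, the step that writes the tempfile would be
the place to look. If it came out false, first_difference gives the offset to
inspect in a hex viewer.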

