Hi all!

 

I’m looking for some help. About a month ago I started on a mission impossible. We have been using htdig on Linux for as long as I have worked for this company, but we had never used it on Windows before. Then one of our clients wanted an intranet, hosted on their Windows server. Since the Microsoft Indexing Server wasn’t what we were looking for, we decided to try to use htdig. It wasn’t easy to get it running on our testing machine (Windows 2003), but in the end I pulled it off. Htdig is indexing a website, including linked Word and Pdf files. BUT there is also a problem.

 

This is what happens. The client uploaded a Pdf file which contains 1 page. The page was no longer than 100 words. After it was indexed, searching on one of the words inside the Pdf found the document and showed it on the results page. But some of the words didn’t get indexed. There were a couple of words used in the Pdf file that no other document contained. Searching on those words resulted in a ‘no results’ page. Running the same scripts on the original document gave me the content of the Pdf in plain text. All words were there, no errors.

 

I’ve tweaked the scripts a bit so that the temporary downloaded file, which htdig converts and indexes, was not deleted. I found the right tempfile, renamed it to give it a Pdf extension and tried to open it. It gave me a couple of errors and didn’t return the whole document.

 

Can anyone help me with this problem? I’ve come a long way, and I really like htdig, so I do not want to have to change to another digger ;) It’s probably a minor thing which I am overlooking, but any suggestions are welcome! If code needs altering, please point me to the right place. Recompiling can be done if needed.

 

Thanks for your help!

 

Marco Houtman

 

PS:

 

INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO! INFO!

 

Htdig version:                3.2.0b6

OS version:                   Windows 2003

Compiled with:              Cygwin (no idea which version; installed ages ago on my pc, but recently upgraded)

 

 

Here is my config file (I’ve stripped out all comments)

======================================
database_dir:                c:/htdig/var/htdig

template_dir:                 c:/htdig/templates

start_url:                       http://intranet/

limit_urls_to:                 ${start_url}

common_url_parts:        ${limit_urls_to} .html .htm .shtml .php

exclude_urls:                /cgi-bin/ .cgi

bad_extensions:                        .wav .gz .z .sit .au .zip .tar .hqx .exe .com .gif .jpg .jpeg .aiff .class .map .ram .tgz .bin .rpm .mpg .mov .avi .css

maintainer:                    [EMAIL PROTECTED]

max_head_length:        1000000

max_doc_size:           20000000

no_excerpt_show_top:    false

search_algorithm:       exact:1 prefix:0.5 synonyms:0.5 endings:0.1

template_map: Long long ${common_dir}/long.html \

              Short short ${common_dir}/short.html \

              Intranet intranet ${template_dir}/intranet.html

template_name: intranet

nothing_found_file: ${template_dir}/nomatch.html

search_results_footer: ${template_dir}/footer.html

search_results_header: ${template_dir}/header.html

next_page_text:                        next

no_next_page_text:

prev_page_text:             prev

 

# Extra settings:

matches_per_page:       3000

maximum_pages:                      100

external_parsers:          application/msword->text/html "c:/htdig/bin/doc2html.bat" \

                                    application/pdf->text/html "c:/htdig/bin/pdf2html.bat"

user_agent:     Intranet_digger

max_hop_count:          20

max_prefix_matches:     100

minimum_prefix_length:  2

prefix_match_character: *

sort:                   score

======================================

Pdfinfo on the original document:

 

Producer:       PDFXC Library (vesion 1.0).

CreationDate:   Fri Nov 12 12:57:41 2004

Tagged:         no

Pages:          1

Encrypted:      no

Page size:      595.2 x 841.92 pts (A4)

File size:      5777 bytes

Optimized:      no

PDF version:    1.3

 

 

Pdfinfo on the htdig tempfile:

 

Error (0): PDF file is damaged - attempting to reconstruct xref table...    // oops!

Producer:       PDFXC Library (vesion 1.0).

CreationDate:   Fri Nov 12 12:57:41 2004

Tagged:         no

Pages:          1

Encrypted:      no

Page size:      595.2 x 841.92 pts (A4)

File size:      5786 bytes    // hmm, file is 9 bytes bigger than the original

Optimized:      no

PDF version:    1.3
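
A note on the size difference between the two files: one possible explanation (a guess on my part, not something I have confirmed) is that the tempfile was written in text mode, so that newline bytes were translated on the way to disk. The Python sketch below is one way to check that. It compares two files byte for byte and tests whether the copy equals the original with a CR inserted before LF bytes. The function names and the file names in the usage line are hypothetical.

```python
import sys


def first_difference(a: bytes, b: bytes) -> int:
    """Return the offset of the first differing byte, or -1 if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    # No mismatch in the common part: either equal, or one is a prefix.
    return -1 if len(a) == len(b) else min(len(a), len(b))


def newline_translation_kind(original: bytes, copy: bytes):
    """Guess whether `copy` is `original` after a text-mode style write.

    Returns "every-lf" if a CR was inserted before every LF byte,
    "bare-lf" if a CR was inserted only before LF bytes that were not
    already preceded by a CR, and None if neither pattern matches.
    """
    if copy == original:
        return None
    if copy == original.replace(b"\n", b"\r\n"):
        return "every-lf"
    bare_only = original.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")
    if copy == bare_only:
        return "bare-lf"
    return None


if __name__ == "__main__" and len(sys.argv) == 3:
    # Always read in binary mode so Python itself does not translate anything.
    with open(sys.argv[1], "rb") as f:
        original = f.read()
    with open(sys.argv[2], "rb") as f:
        copy = f.read()

    print("original size:", len(original))
    print("copy size:    ", len(copy))
    print("LF bytes in original:", original.count(b"\n"))
    print("CR bytes in original:", original.count(b"\r"))
    print("CR bytes in copy:    ", copy.count(b"\r"))
    print("first differing offset:", first_difference(original, copy))
    print("newline translation:", newline_translation_kind(original, copy))
```

Usage would be something like: python cmpbytes.py original.pdf tempfile.pdf. If it reports a newline translation, the place to look would be wherever the downloaded file gets written to disk without binary mode.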

 

 
