[htdig] doc2html - indexed but no hits

CHUN KI SHIN Thu, 10 May 2007 05:43:24 -0700

I've been trying to index .pdf and .doc documents in v. 3.2.0b with doc2html/catdoc/pdf2html. Both types appear to be indexed fine (though I'm not sure why the log doesn't show which words and tags have been indexed). See below:


pick: devserverxxx.com, # servers = 1

devserverxxx.com with a traditional HTTP connection

316:33:2:https://devserverxxx.com/library/ADJA/docs/portlet-1_0-fr-spec.pdf: Making HTTPS request on https://devserverxxx.com/library/ADJA/docs/portlet-1_0-fr-spec.pdf

Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 09 May 2007 21:19:01 GMT
Header line: Server: Apache
Header line: Last-Modified: Mon, 07 May 2007 14:08:26 GMT
Header line: ETag: "1f841c-6af5b-d5aeea80"
Discarded header line: ETag: "1f841c-6af5b-d5aeea80"
Header line: Accept-Ranges: bytes
Discarded header line: Accept-Ranges: bytes
Header line: Content-Type: application/pdf
Header line: Content-Length: 438107
Header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Discarded header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)

Retrieving document /library/ADJA/docs/portlet-1_0-fr-spec.pdf on host:devserverxxx.com:443

Http version      : HTTP/1.1
Server            : HTTP/1.1
Status Code       : 200
Reason            : OK
Access Time       : Wed, 09 May 2007 21:19:01 GMT
Modification Time : Mon, 07 May 2007 14:08:26 GMT
Content-type      : application/pdf
Request time: 0 secs
size = 438107


pick: devserverxxx.com, # servers = 1

devserverxxx.com with a traditional HTTP connection

96:39:2:https://devserverxxx.com/library/ADJA/forms/Indexing_Form.doc: Making HTTPS request on https://devserverxxx.com/library/ADJA/forms/Indexing_Form.doc

Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 09 May 2007 21:18:28 GMT
Header line: Server: Apache
Header line: Last-Modified: Tue, 30 Aug 2005 20:19:58 GMT
Header line: ETag: "224003-6a00-55fc3780"
Discarded header line: ETag: "224003-6a00-55fc3780"
Header line: Accept-Ranges: bytes
Discarded header line: Accept-Ranges: bytes
Header line: Content-Type: application/msword
Header line: Content-Length: 27136
Header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Discarded header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)

Retrieving document /library/ADJA/forms/Indexing_Form.doc on host:devserverxxx.com:443

Http version      : HTTP/1.1
Server            : HTTP/1.1
Status Code       : 200
Reason            : OK
Access Time       : Wed, 09 May 2007 21:18:28 GMT
Modification Time : Tue, 30 Aug 2005 20:19:58 GMT
Content-type      : application/msword
Request time: 0 secs
size = 27136
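
Note that the excerpts above appear to confirm only that each document was retrieved (status, content type, size), not that any words were extracted from it. As a rough sketch, the retrieval details could be pulled out of a saved log with something like the following. The function name `summarize_fetches` is hypothetical, and the line patterns are assumptions based solely on the excerpt above:

```python
import re

def summarize_fetches(log_text):
    """Collect path, host, status, content type and size for each
    'Retrieving document' entry found in the verbose log text.
    The patterns are guesses derived from the log excerpt in this post."""
    records = []
    current = None
    for raw_line in log_text.splitlines():
        line = raw_line.strip()
        # search (not match) so the entry is found even if it is
        # fused onto the end of a preceding header line
        m = re.search(r"Retrieving document (\S+) on host:\s*(\S+)", line)
        if m:
            current = {"path": m.group(1), "host": m.group(2)}
            records.append(current)
            continue
        if current is None:
            continue
        m = re.match(r"(Status Code|Content-type)\s*:\s*(.+)", line)
        if m:
            current[m.group(1)] = m.group(2).strip()
            continue
        m = re.match(r"size = (\d+)", line)
        if m:
            current["size"] = int(m.group(1))
    return records
```

A document that shows status 200 and a plausible size here but still yields no hits would suggest the problem lies after retrieval, in the conversion step.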

After indexing, I searched for some terms that are definitely in both the PDF and DOC documents, but got no hits!

So I tried using parse_doc.pl instead of doc2html, with the same paths to catdoc and pdftotext/pdfinfo. That indexed fine and returned some valid hits.


Could anyone help me to figure out why I get no hits under doc2html?

By the way, here is my config file:

# for doc2html
external_parsers:   application/msword->text/html /path/to/doc2html.pl \
                   application/pdf->text/html /path/to/doc2html.pl

# for parse_doc.pl
#external_parsers:   application/msword  /path/to/parse_doc.cgi \
#                    application/pdf /path/to/parse_doc.cgi
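
One way to narrow this down might be to run the converter by hand on a local copy of a document and check whether it emits any text at all. The sketch below is only an illustration: `run_converter` is a hypothetical helper, all paths are placeholders, and the argument order in the usage comment (file, MIME type, URL) is an assumption about how htdig calls external converters that should be checked against the documentation.

```python
import subprocess

def run_converter(command, term):
    """Run a converter command and report whether its standard output
    contains the given term (case-insensitive), along with the exit
    code, output length and anything written to standard error."""
    result = subprocess.run(
        command, capture_output=True, text=True, errors="replace"
    )
    output = result.stdout
    return {
        "exit_code": result.returncode,
        "output_chars": len(output),
        "term_found": term.lower() in output.lower(),
        "stderr": result.stderr.strip(),
    }

# Hypothetical usage; the paths and the argument order are assumptions:
# report = run_converter(
#     ["/path/to/doc2html.pl", "/tmp/sample.pdf", "application/pdf",
#      "https://devserverxxx.com/library/ADJA/docs/portlet-1_0-fr-spec.pdf"],
#     "portlet",
# )
# print(report)
```

If the output is empty or the term is missing, the converter or the helper programs it calls would be the first thing to look at; if the term is present, the issue is more likely in how htdig handles the converted HTML.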


_______________________________________________
ht://Dig general mailing list: <[email protected]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general
