I've been trying to index .pdf and .doc documents with ht://Dig v. 3.2.0b, using doc2html/catdoc/pdf2html. According to the log, both types are indexed fine (though I'm not sure why the log doesn't show which words and tags have been indexed). See below:

pick: devserverxxx.com, # servers = 1
devserverxxx.com with a traditional HTTP connection
316:33:2:https://devserverxxx.com/library/ADJA/docs/portlet-1_0-fr-spec.pdf: Making HTTPS request on https://devserverxxx.com/library/ADJA/docs/portlet-1_0-fr-spec.pdf
Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 09 May 2007 21:19:01 GMT
Header line: Server: Apache
Header line: Last-Modified: Mon, 07 May 2007 14:08:26 GMT
Header line: ETag: "1f841c-6af5b-d5aeea80"
Discarded header line: ETag: "1f841c-6af5b-d5aeea80"
Header line: Accept-Ranges: bytes
Discarded header line: Accept-Ranges: bytes
Header line: Content-Type: application/pdf
Header line: Content-Length: 438107
Header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Discarded header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Retrieving document /library/ADJA/docs/portlet-1_0-fr-spec.pdf on host: devserverxxx.com:443
Http version      : HTTP/1.1
Server            : HTTP/1.1
Status Code       : 200
Reason            : OK
Access Time       : Wed, 09 May 2007 21:19:01 GMT
Modification Time : Mon, 07 May 2007 14:08:26 GMT
Content-type      : application/pdf
Request time: 0 secs
size = 438107

pick: devserverxxx.com, # servers = 1
devserverxxx.com with a traditional HTTP connection
96:39:2:https://devserverxxx.com/library/ADJA/forms/Indexing_Form.doc: Making HTTPS request on https://devserverxxx.com/library/ADJA/forms/Indexing_Form.doc
Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 09 May 2007 21:18:28 GMT
Header line: Server: Apache
Header line: Last-Modified: Tue, 30 Aug 2005 20:19:58 GMT
Header line: ETag: "224003-6a00-55fc3780"
Discarded header line: ETag: "224003-6a00-55fc3780"
Header line: Accept-Ranges: bytes
Discarded header line: Accept-Ranges: bytes
Header line: Content-Type: application/msword
Header line: Content-Length: 27136
Header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Discarded header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Retrieving document /library/ADJA/forms/Indexing_Form.doc on host: devserverxxx.com:443
Http version      : HTTP/1.1
Server            : HTTP/1.1
Status Code       : 200
Reason            : OK
Access Time       : Wed, 09 May 2007 21:18:28 GMT
Modification Time : Tue, 30 Aug 2005 20:19:58 GMT
Content-type      : application/msword
Request time: 0 secs
size = 27136


After indexing, I searched for some terms that are definitely in both the pdf and doc documents, but got no hits!

So I tried using parse_doc.pl instead of doc2html, with the same paths to catdoc and pdftotext/pdfinfo. That indexed fine and returned some valid hits.

Could anyone help me figure out why I get no hits with doc2html?

By the way, here is my config file:

# for doc2html
external_parsers:   application/msword->text/html /path/to/doc2html.pl \
                   application/pdf->text/html /path/to/doc2html.pl

# for parse_doc.pl
#external_parsers:   application/msword  /path/to/parse_doc.cgi \
#                    application/pdf /path/to/parse_doc.cgi
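
One way I could narrow this down is to run the converter by hand and check whether its output contains the terms I'm searching for. Below is a minimal sketch of that, assuming (as I understand the external_parsers documentation) that htdig calls the program with four arguments: the temporary file name, the MIME type, the URL and the config file path. The function names, the timeout and the latin-1 decoding are my own choices for illustration, not anything from ht://Dig or doc2html.

```python
import subprocess
import sys


def run_converter(converter_cmd, doc_path, mime_type, url, config_path):
    """Run an external converter the way htdig is assumed to call it.

    converter_cmd is a list such as ["/path/to/doc2html.pl"].
    Returns (exit code, stdout text, stderr text).
    """
    # Assumed calling convention: file, MIME type, URL, config file.
    args = list(converter_cmd) + [doc_path, mime_type, url, config_path]
    result = subprocess.run(args, capture_output=True, timeout=120)
    # Decode leniently; converter output is not guaranteed to be UTF-8.
    stdout = result.stdout.decode("latin-1", errors="replace")
    stderr = result.stderr.decode("latin-1", errors="replace")
    return result.returncode, stdout, stderr


def contains_term(output, term):
    """Case-insensitive check that a search term appears in the output."""
    return term.lower() in output.lower()


# Hypothetical usage:
#   python check_converter.py /path/to/doc2html.pl local_copy.pdf \
#       application/pdf https://devserverxxx.com/... /path/to/htdig.conf
if __name__ == "__main__" and len(sys.argv) == 6:
    code, out, err = run_converter([sys.argv[1]], *sys.argv[2:6])
    print("exit code:", code)
    print("output length:", len(out))
    if err:
        print("stderr:", err)
```

If the output is empty, or stderr complains about a missing helper, that would point at the converter setup rather than at htdig itself; if the output does contain the terms, the problem is more likely on the indexing side.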
