I've been trying to index .pdf and .doc documents in v. 3.2.0b with
doc2html/catdoc/pdf2html.
Both types appear to be indexed fine (though I'm not sure why the log doesn't
show which words and tags have been indexed). See below:
pick: devserverxxx.com, # servers = 1
devserverxxx.com with a traditional HTTP connection
316:33:2:https://devserverxxx.com/library/ADJA/docs/portlet-1_0-fr-spec.pdf:
Making HTTPS request on
https://devserverxxx.com/library/ADJA/docs/portlet-1_0-fr-spec.pdf
Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 09 May 2007 21:19:01 GMT
Header line: Server: Apache
Header line: Last-Modified: Mon, 07 May 2007 14:08:26 GMT
Header line: ETag: "1f841c-6af5b-d5aeea80"
Discarded header line: ETag: "1f841c-6af5b-d5aeea80"
Header line: Accept-Ranges: bytes
Discarded header line: Accept-Ranges: bytes
Header line: Content-Type: application/pdf
Header line: Content-Length: 438107
Header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Discarded header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Retrieving document /library/ADJA/docs/portlet-1_0-fr-spec.pdf on host:
devserverxxx.com:443
Http version : HTTP/1.1
Server : HTTP/1.1
Status Code : 200
Reason : OK
Access Time : Wed, 09 May 2007 21:19:01 GMT
Modification Time : Mon, 07 May 2007 14:08:26 GMT
Content-type : application/pdf
Request time: 0 secs
size = 438107
pick: devserverxxx.com, # servers = 1
devserverxxx.com with a traditional HTTP connection
96:39:2:https://devserverxxx.com/library/ADJA/forms/Indexing_Form.doc:
Making HTTPS request on
https://devserverxxx.com/library/ADJA/forms/Indexing_Form.doc
Header line: HTTP/1.1 200 OK
Header line: Date: Wed, 09 May 2007 21:18:28 GMT
Header line: Server: Apache
Header line: Last-Modified: Tue, 30 Aug 2005 20:19:58 GMT
Header line: ETag: "224003-6a00-55fc3780"
Discarded header line: ETag: "224003-6a00-55fc3780"
Header line: Accept-Ranges: bytes
Discarded header line: Accept-Ranges: bytes
Header line: Content-Type: application/msword
Header line: Content-Length: 27136
Header line: Via: 1.1 ichainserver.devserverxxx.com (iChain 2.3.345)
Discarded header line: Via: 1.1 ichainserver.devserver.com (iChain 2.3.345)
Retrieving document /library/ADJA/forms/Indexing_Form.doc on host:
devserverxxx.com:443
Http version : HTTP/1.1
Server : HTTP/1.1
Status Code : 200
Reason : OK
Access Time : Wed, 09 May 2007 21:18:28 GMT
Modification Time : Tue, 30 Aug 2005 20:19:58 GMT
Content-type : application/msword
Request time: 0 secs
size = 27136
After indexing, I searched for some terms that are definitely in both the
pdf and doc documents, but got no hits!
So I tried using parse_doc.pl instead of doc2html, with the same paths to
catdoc and pdftotext/pdfinfo. That indexed fine and returned some valid hits.
Could anyone help me figure out why I get no hits with doc2html?
By the way, here is my config file:
# for doc2html
external_parsers: application/msword->text/html /path/to/doc2html.pl \
application/pdf->text/html /path/to/doc2html.pl
# for parse_doc.pl
#external_parsers: application/msword /path/to/parse_doc.cgi \
# application/pdf /path/to/parse_doc.cgi
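
One way to narrow this down might be to run the converter by hand and count
how many visible words come out, since an empty or tag-only HTML result would
index "fine" but give no hits. Below is a minimal sketch of that check, not
part of ht://Dig or doc2html. It assumes the converter is called with four
arguments (input file, content type, URL, config file), which is my
understanding of how htdig invokes external parsers. The function names
(visible_words, check_converter) and all paths in the usage comment are
hypothetical.

```python
import subprocess
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script> and <style> bodies."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip:
            self._skip -= 1

    def handle_data(self, data):
        if not self._skip:
            self.chunks.append(data)


def visible_words(html):
    """Return the whitespace-separated words a reader would see in `html`."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks).split()


def check_converter(converter, infile, mime, url, config):
    """Run the converter once and report (exit code, word count, stderr).

    Assumption: the argument order is input file, content type, URL,
    config file. Adjust it if your converter expects something else.
    """
    result = subprocess.run(
        [converter, infile, mime, url, config],
        capture_output=True, text=True, errors="replace",
    )
    return result.returncode, len(visible_words(result.stdout)), result.stderr


# Hypothetical usage (all paths are placeholders):
# code, count, err = check_converter(
#     "/path/to/doc2html.pl", "/tmp/sample.pdf", "application/pdf",
#     "http://localhost/sample.pdf", "/path/to/htdig.conf")
# print(code, count, err)
```

If the word count comes back as zero, or stderr shows the converter failing
to find catdoc or the PDF helper, the problem is probably in the doc2html
setup rather than in the external_parsers line itself.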
_______________________________________________
ht://Dig general mailing list: <[email protected]>
ht://Dig FAQ: http://htdig.sourceforge.net/FAQ.html
List information (subscribe/unsubscribe, etc.)
https://lists.sourceforge.net/lists/listinfo/htdig-general