Re: [htdig3-dev] indexing .ps, .doc plain text files

Gilles Detillieux Thu, 19 Aug 1999 08:49:41 -0700

According to Dr. Bernd Souvignier:
> first of all I think that ht://Dig is very nicely done
> and easy to install.
> I have only some short questions:
> At the moment, htdig seems to accept only html-documents and
> rejects .doc, .ps and plain (ASCII) text documents as
> "not authorized".

If htdig reports "not authorized", it's because the server returned a
401 status code when htdig tried to fetch these documents.  They must be
in a restricted directory.  If they're password protected using basic
authentication, you can provide the username and password via htdig's
-u option.
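
For anyone unsure what "basic authentication" amounts to on the wire,
here is a minimal Python sketch of the header a client sends once it has
a username and password (htdig's -u option takes them in the form
username:password).  The function name is made up for illustration and
is not part of htdig.

```python
import base64


def basic_auth_header(username: str, password: str) -> str:
    """Build the value of an HTTP 'Authorization' header for basic auth.

    Hypothetical helper for illustration only; htdig does this internally
    when credentials are supplied.
    """
    # Basic auth is just "username:password", base64-encoded.
    credentials = f"{username}:{password}".encode("utf-8")
    token = base64.b64encode(credentials).decode("ascii")
    return f"Basic {token}"


if __name__ == "__main__":
    # Example credentials taken from the HTTP authentication RFC.
    print("Authorization:", basic_auth_header("Aladdin", "open sesame"))
```

If the server still answers 401 with correct credentials, the documents
are probably protected by something other than basic authentication.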

If you're using local_urls to bypass the HTTP server, that only works
for files with a .html or .htm suffix.  Everything else is still fetched
through the server, so that the server can determine the document's
content-type.

> I installed the parse_doc.pl script and it seems to work
> (after some tiny modifications, for example in the PostScript
> identification there seem to be no blanks around the '=' in
> ENTER LANGUAGE = POSTSCRIPT),

I didn't know blanks were allowed there.  I guess they're optional.
Time for another patch, it seems.  Thanks for the info.
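
As a rough idea of what a more tolerant check could look like, here is a
sketch in Python (not the actual Perl from parse_doc.pl, whose test may
differ).  It treats the blanks around '=' as optional; the function name
and the "%!" shortcut are assumptions for illustration.

```python
import re

# Blanks around '=' are optional, so both "LANGUAGE = POSTSCRIPT" and
# "LANGUAGE=POSTSCRIPT" match.
_ENTER_POSTSCRIPT = re.compile(rb"ENTER\s+LANGUAGE\s*=\s*POSTSCRIPT")


def looks_like_postscript(head: bytes) -> bool:
    """Guess whether the first bytes of a file belong to a PostScript job.

    Hypothetical helper: accepts a plain "%!" header, or a printer job
    header that switches the language to PostScript.
    """
    if head.startswith(b"%!"):
        return True
    return _ENTER_POSTSCRIPT.search(head) is not None


if __name__ == "__main__":
    print(looks_like_postscript(b"@PJL ENTER LANGUAGE=POSTSCRIPT\n%!PS"))
    print(looks_like_postscript(b"@PJL ENTER LANGUAGE = POSTSCRIPT\n%!PS"))
```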

> but the .doc and .ps documents are still rejected.
> What do I have to do to get these?
> And how can I index plain text documents (for example .tex,
> or pure ASCII).

For any document of type text/* (other than text/html), htdig will index
it as plain text, as long as the server allows access to it, and tags
it with the right content-type.  (See your server's mime.types file,
or equivalent, for the MIME types assigned to various file name suffixes.)

The .doc, .ps and .tex suffixes may be more of a challenge.  Again, you
need to determine what content-type your server will return for them,
and define the external_parsers for these types.  My server assigns
application/x-tex to .tex files, and has no default type for .doc files.
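
The content-type rules from the last two paragraphs can be summed up in
a short Python sketch.  This is only a model of the behaviour described
here, not htdig's code; the function name, the return labels, and the
choice to check the external parser types first are assumptions.

```python
def indexing_route(content_type: str, external_parser_types: set) -> str:
    """Decide how a document would be handled, given its Content-Type.

    external_parser_types stands in for the MIME types listed in the
    external_parsers attribute (hypothetical representation).
    """
    # Drop parameters such as "; charset=iso-8859-1" and normalise case.
    mime = content_type.split(";", 1)[0].strip().lower()

    if mime in external_parser_types:
        return "external"   # handed to the configured external parser
    if mime == "text/html":
        return "html"       # parsed as HTML
    if mime.startswith("text/"):
        return "plain"      # any other text/* is indexed as plain text
    return "skip"           # no parser knows this type


if __name__ == "__main__":
    parsers = {"application/postscript", "application/x-tex"}
    for ctype in ("text/html", "text/plain", "application/x-tex",
                  "application/octet-stream"):
        print(ctype, "->", indexing_route(ctype, parsers))
```

The point is that everything hinges on the content-type the server
sends, so a suffix with no MIME type assigned on the server side never
reaches a parser at all.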

Once the server and the external_parsers attribute agree on the MIME types
for all these files, it's just a matter of adding support for any new
types to parse_doc.pl.  E.g., for .tex files, you could index them as
plain text by using "cat" as the document-to-text converter, but then
you'd index all the TeX commands as words.  It would make more sense to
convert the document to plain text using whatever TeX tool does the job.
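
To illustrate the idea of one converter per MIME type, here is a small
Python sketch (parse_doc.pl itself is Perl and is organised differently).
The table and function names are hypothetical, and the catdoc and
ps2ascii entries are assumptions about which converters are installed;
only the "cat" entry for TeX comes from the suggestion above.

```python
import subprocess

# Hypothetical converter table: MIME type -> command that writes plain
# text to standard output.  Adjust to whatever tools you have installed.
CONVERTERS = {
    "application/msword": ["catdoc"],        # assumed Word converter
    "application/postscript": ["ps2ascii"],  # assumed PostScript converter
    "application/x-tex": ["cat"],            # indexes TeX commands as words
}


def converter_command(mime_type: str, path: str) -> list:
    """Return the command line that converts the file at path to text."""
    try:
        base = CONVERTERS[mime_type]
    except KeyError:
        raise ValueError(f"no converter configured for {mime_type}") from None
    return base + [path]


def convert_to_text(mime_type: str, path: str) -> str:
    """Run the converter and return its output as text."""
    result = subprocess.run(converter_command(mime_type, path),
                            capture_output=True, check=True)
    # Latin-1 maps every byte to a character, so decoding cannot fail.
    return result.stdout.decode("latin-1")


if __name__ == "__main__":
    print(converter_command("application/x-tex", "paper.tex"))
```

Swapping "cat" for a real TeX-to-text tool would then be a one-line
change in the table.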

First, though, you need to get past the 401 errors from your server.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930
