On Friday, December 6, 2002, at 10:48 AM, Jay West wrote:

I want to index the words in a set of files, with the file names in which
certain words appear.

It would appear that htdig MUST have an index.html file, even though this
technically isn't web content. So I wrote a php script that goes through the
An index file is not a requirement. You can specify a list of individual URLs in the start_url attribute. Or you can use backticks with start_url to provide a regular file containing a list of individual URLs (e.g. start_url: `/path/to/list_file`).
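
As a rough sketch of the two forms described above (the file names a.txt and b.txt and the path /path/to/list_file are placeholders, not files from the original setup), the relevant lines of htdig.conf might look like this:

```
# Option 1: list individual URLs directly, separated by whitespace.
# (a.txt and b.txt are hypothetical file names.)
start_url: http://localhost/a.txt http://localhost/b.txt

# Option 2: read the URLs from a regular file named in backticks.
# Use this line instead of the one above, not in addition to it.
#start_url: `/path/to/list_file`
```

Either way, each URL still has to resolve through local_urls to a file that actually exists on disk.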

directory, and creates an index.html file. However, I want this index.html
file to reside OUTSIDE of the directory structure of the files being
indexed. More to the point, here's the real example.

htdig.conf:
database_dir: /u1/index/database/hdc
local_urls_only: true
local_urls: http://localhost/=/u1/index/html/
local_default_doc: hdc.index.html
start_url: http://localhost/
...
Now for the error message. Running htdig on the hdc.index.html file with -vv
gives lots of messages like this:
pick: localhost, # servers = 1
66:66:1:http://localhost/u1/xfer/hdc/cat_copy/02128.txt: Trying local files
tried local file /u1/index/html/u1/xfer/hdc/cat_copy/02128.txt
not found

So what is happening is, htdig is taking the "documentroot" file path
(/u1/index/html/) and prepending it to the absolute paths referenced in the
html file (/u1/xfer/hdc/cat_copy/.....)... thus coming up with a totally
bogus path and hence the failure.
Though not what you want, I think this is the proper behavior. Even though htdig is going directly to the file system, it still resolves URLs the way a web server would. In that context, the document root is by definition the root from which all other paths are built. In essence, you have explicitly defined / to be equivalent to /u1/index/html/.

I've beat my head against this for a long time... can someone offer a
suggestion? I could move the index.html file to sit above the *.txt files
and use relative paths in the hrefs, but for other reasons I'd prefer not to
If you want to stick with some sort of index page and keep everything where it is, the only thing I can think of is defining http://localhost/ to map to / and changing start_url accordingly. Depending on your environment, this might be a bad idea in terms of security.
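
A minimal sketch of that remapping, assuming the generated index page stays at /u1/index/html/hdc.index.html (a location inferred from the original local_urls and local_default_doc values, not confirmed in the thread):

```
local_urls_only: true
# Map the URL root to the file system root, so that
# http://localhost/u1/xfer/hdc/... resolves to /u1/xfer/hdc/...
local_urls: http://localhost/=/
# Start at the index page by its full path (assumed location).
start_url: http://localhost/u1/index/html/hdc.index.html
```

With this mapping, any file that htdig can read and that some indexed page links to becomes reachable, which is the security concern mentioned above.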

If your primary goal here is just to get full path names for the files into the database, you might also want to take a look at the following two attributes, which support manipulation of URLs.

http://www.htdig.org/attrs.html#url_rewrite_rules
http://www.htdig.org/attrs.html#url_part_aliases

Jim



_______________________________________________
htdig-general mailing list <[EMAIL PROTECTED]>
To unsubscribe, send a message to <[EMAIL PROTECTED]> with a subject of unsubscribe
FAQ: http://htdig.sourceforge.net/FAQ.html

