Re: [htdig] Problems indexing non-web content with absolute paths

Jay West Mon, 09 Dec 2002 10:37:56 -0800

Jim wrote...
> An index file is not a requirement. You can specify a list of
> individual URLs in the start_url attribute. Or you can use backticks
> with start_url to provide a regular file containing a list of
> individual URLs (e.g. start_url: `/path/to/list_file`).
Right, but the backtick form of start_url still requires the list file to
contain URLs. If I point it at the directory that holds all the text files
I want to index, htdig still looks for a startup file (index.html) to
decide "what is on the website" and thus what should be indexed. Is there
something I'm missing here?
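
One way around the missing index page is to generate the list file
rather than write it by hand. The following is only a minimal sketch:
the function names, the example prefix, and the one-URL-per-line layout
are assumptions for illustration, not anything htdig ships or documents.

```python
import os
from urllib.parse import quote


def build_url_list(content_dir, url_prefix):
    """Return one URL per regular file found under content_dir, sorted."""
    urls = []
    for root, _dirs, files in os.walk(content_dir):
        for name in files:
            # Path of the file relative to the directory being indexed.
            rel = os.path.relpath(os.path.join(root, name), content_dir)
            # Use forward slashes and percent-encode unsafe characters.
            rel_url = quote(rel.replace(os.sep, "/"))
            urls.append(url_prefix.rstrip("/") + "/" + rel_url)
    return sorted(urls)


def write_url_list(content_dir, url_prefix, list_file):
    """Write the URLs to list_file, one per line (assumed layout)."""
    urls = build_url_list(content_dir, url_prefix)
    with open(list_file, "w") as fh:
        for url in urls:
            fh.write(url + "\n")
    return urls
```

Calling write_url_list() with the content directory, whatever URL prefix
the indexer is expected to fetch from, and /path/to/list_file would give
a file that start_url could then reference with backticks.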


> Though not what you want, I think this is the proper behavior. Even
> though htdig is going directly to the file system, it is still
> happening within a web server context. In that context, the document
> root is by definition the root from which all other paths are built. In
> essence, you have explicitly defined / to be equivalent to
> /u1/index/html/.
I would disagree with that. htdig is not acting like a web server to do
its job; it's acting like a web client. The idea that all references are
relative to a "DocumentRoot" is a web server concept, not a client
concept. For example, lynx, IE, Netscape, and Mozilla will all do this
correctly if I point them at my "content" area using "file://". Only
htdig does it differently, and it does it the way a web server would, by
automatically prepending the document root. So I agree that htdig acts
"logically" if you think of it as a web server. In practice, however, I
believe it is acting like a client browser. I think htdig should let the
web server put the document root on the front, and when it is going
through local files, it shouldn't do this itself.
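
To make the client-side behavior concrete, here is a small sketch using
Python's standard urljoin, which resolves links the way a browser does.
The link target /u1/content/notes.txt is a hypothetical path chosen for
illustration; only /u1/index/html/ comes from this thread.

```python
from urllib.parse import urljoin

# Hypothetical absolute-path link as it might appear in an index page.
link = "/u1/content/notes.txt"

# Base URL when the index page is opened directly from the file system.
file_base = "file:///u1/index/html/index.html"
# Base URL when the same page is fetched through a web server.
http_base = "http://localhost/index.html"

# A client replaces the whole path of the base URL with the link's path.
resolved_file = urljoin(file_base, link)
resolved_http = urljoin(http_base, link)

print(resolved_file)  # file:///u1/content/notes.txt
print(resolved_http)  # http://localhost/u1/content/notes.txt
```

In the file:// case the client ends up at the real file system path. In
the http:// case the URL is the same kind of string, and it is only the
server that then maps it under its document root.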

> If you want to stick with some sort of index page and keep everything
> where it is, the only thing I can think of is defining
> http://localhost/ to map to / and changing start_url accordingly.
> Depending on your environment, this might be a bad idea in terms of
> security.
>
> If your primary goal here is just full path names for the files into the
> database, you might also want to take a look at the following two
> attributes that support manipulation of the URLs.
>
> http://www.htdig.org/attrs.html#url_rewrite_rules
> http://www.htdig.org/attrs.html#url_part_aliases
I'll play around with those suggestions today and see if it gets me closer
to what I want.

Thanks a million!

Jay West
