According to [EMAIL PROTECTED]:
> My start urls will be various sites belonging to separate users but
> on the same server... e.g.,
> http://my.school.edu/~user1 (that will get the index page from the
> stus public_html folder in that home dir)
> http://my.school.edu/~user2, http://my.school.edu/~user3 and so on,
> up to 50 users.
>
> I do NOT have access permissions to the server except to those public
> http pages and to run my htdig which is completely installed in my
> home dir with htsearch in my cgi-bin and using cgi-wrap to run htdig.
> So, correct me if I am wrong but I have to access/index the sites by
> http (i think that's what you guys have called it), i.e., i can't set
> it up, say, just one host url or something.
> Am i clear? and am I correct?

I think you're confusing two separate issues here. One issue is the transport that's used to get the documents from the server into htdig, and the other is the means by which htdig will figure out or be told which documents to get. Let's look at these two in isolation.

1) With htdig 3.1.5, only http URLs are allowed, and so the only transport allowed is the HTTP protocol, by which a client (e.g. htdig) requests documents from a web server over the network. However, if htdig is running on the same machine as the HTTP server, and you know how http URLs map onto directories on that machine, you can bypass the HTTP server using the local_urls and local_user_urls attributes and get files directly from the filesystem. You're still using http URLs, but this local_urls mechanism allows htdig to side-step the HTTP server for static files, which speeds things up.

The 3.2 betas complicate this a little bit, because they support other transports as well, such as file:// URLs, news: URLs, and with an external transport defined, ftp:// URLs too. However, the local_urls mechanism still works the same way, allowing htdig to side-step these transports and go to the local filesystem.

Now, with what you're doing, I get the impression that you are indeed running htdig on the web server, even though you don't have complete access to it. If that's the case, you may still be able to define local_user_urls to get at the web pages directly, provided you know where the user directories are, and they use a consistent location for home directories on this server. All you need is read access to the users' web pages, which you ought to have, as web pages normally tend to be world-readable. So, even though you're using http URLs, you might not have to use the HTTP server much to get at the files, as long as the files fit the restrictions that local_urls handling imposes (read the docs). However, htdig will fall back to HTTP if the local fetching fails, so either way it should be able to get the pages.

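To make point 1 a bit more concrete, here is a rough sketch of what the relevant lines in the htdig configuration file might look like for the URLs in the question. The /home/ prefix is only a guess at where this server keeps home directories, and page.html is a made-up file name used to show the mapping; public_html comes from the question. Check the attribute documentation for your htdig version before relying on it.

```
# Sketch only. /home/ is an assumed home directory prefix; adjust it to
# wherever the user directories really live on your server.
start_url:        http://my.school.edu/~user1/ \
                  http://my.school.edu/~user2/ \
                  http://my.school.edu/~user3/

# Format: URL prefix = path before the user name , path after the user name
# With the values below, http://my.school.edu/~user1/page.html
# would be read from /home/user1/public_html/page.html
local_user_urls:  http://my.school.edu/=/home/,/public_html/
```

If the home directories turn out not to follow one consistent pattern, leaving local_user_urls undefined simply means the ~user URLs are fetched over HTTP.
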
2) There are many means of getting URLs into htdig, and the ones that are most appropriate for you depend on what you're indexing, and how these pages are linked. This is covered at fairly great length in the FAQ (http://htdig.sourceforge.net/FAQ.html), especially questions 5.25 and 5.18, but also in other related questions (follow the links). This is independent of which transport htdig uses, but depends on how pages are linked to each other.

The more "coverage" you have in links from one page to another, the fewer individual pages you have to feed into htdig's start_url. On a well constructed site, you should only have to give htdig the site's main page as the start_url, and it'll find everything from there. Because you're indexing user pages, which frequently aren't all listed in a central index page, and aren't necessarily all that well crosslinked, you'll likely need to do more. At the very least, you'd probably need to list each user's home page in start_url, and then let htdig spider its way down to other pages linked from their home pages.

-- 
Gilles R. Detillieux              E-mail: <[EMAIL PROTECTED]>
Spinal Cord Research Centre       WWW:    http://www.scrc.umanitoba.ca/~grdetil
Dept. Physiology, U. of Manitoba  Phone:  (204)789-3766
Winnipeg, MB  R3E 3J7  (Canada)   Fax:    (204)789-3930

