Hi, thanks for the responses so far about a potential mod_sitemap or using the current sitemap_gen tool. Since there are some questions about why this is relevant to httpd or to webmasters running Apache, here are some thoughts on where we are coming from (apologies for the longer email):

1. Some of us believe that webservers have two different audiences: (a) regular users with browsers who come to a site requesting a single page or doing a small amount of browsing, and (b) webcrawlers that visit these servers to crawl through them and periodically check whether pages have changed. Current webservers are excellent at servicing the first kind -- you know a URL, and you get the page back. However, since there is no real support for crawlers that visit these sites regularly, crawlers resort to dumb things like "follow-links" like a regular random surfer and "periodically-check-if-page-changed."

2. Why not have a listing service on all webservers, so that crawlers can check in one place the list of ALL URLs that are available, along with corresponding metadata? (Metadata that can easily be computed automatically, like the lastmod date, etc.) This is clearly not a new idea -- it is what ftp servers do :)

What is a Sitemap? A text file (like robots.txt) that gets auto-computed with a listing of all known URLs and their lastmod times. The file is based on XML, which gives it some structure and an easy way to have required and optional attributes. It is also structured so that it scales from a few URLs to millions (without requiring massive downloadable files), and it has log-structured semantics to support a variety of use cases (in terms of generation and updates). Currently it is materialized as a text file on disk (using disk space) rather than being computed at run-time on each request (which could be expensive).
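To make the shape of the file concrete, here is a minimal sketch (in Python, in the spirit of sitemap_gen.py but much simplified) of auto-computing a sitemap by walking a docroot and recording each file's URL and mtime. The docroot, base URL, and output name below are just placeholders, and the XML namespace should match whatever version of the published sitemap format the crawler expects -- this is illustrative only, not a stand-in for sitemap_gen.py:

    # Sketch: auto-compute a sitemap from a document root.
    # (Illustrative only; paths and URLs below are hypothetical.)
    import os
    import time
    from xml.sax.saxutils import escape

    DOCROOT = "/usr/local/apache2/htdocs"   # hypothetical docroot
    BASE_URL = "http://www.example.com"     # hypothetical site base URL

    def iso8601(ts):
        """Format a file mtime as the W3C datetime used for lastmod."""
        return time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime(ts))

    def write_sitemap(out_path):
        with open(out_path, "w") as out:
            out.write('<?xml version="1.0" encoding="UTF-8"?>\n')
            out.write('<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">\n')
            for dirpath, dirnames, filenames in os.walk(DOCROOT):
                for name in filenames:
                    path = os.path.join(dirpath, name)
                    rel = os.path.relpath(path, DOCROOT).replace(os.sep, "/")
                    out.write("  <url>\n")
                    out.write("    <loc>%s/%s</loc>\n" % (BASE_URL, escape(rel)))
                    out.write("    <lastmod>%s</lastmod>\n" % iso8601(os.path.getmtime(path)))
                    out.write("  </url>\n")
            out.write("</urlset>\n")

    if __name__ == "__main__":
        write_sitemap(os.path.join(DOCROOT, "sitemap.xml"))

The point is that everything in the file (URL list, lastmod) can be derived mechanically from what the server already knows, so it can be regenerated on a cron job or on content changes rather than per request.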

If a webserver has an auto-computed sitemap, a crawler can know the full list of URLs that the webserver has, crawl them, and index the most relevant pages (instead of whatever random pages happen to be linked through hrefs). The crawler can also put less load on the webserver by only requesting pages that have changed (for example, by using the lastmod date).
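On the crawler side, a rough sketch of how lastmod could be used to skip unchanged URLs (again just illustrative -- a real crawler would also handle date-only lastmod values, sitemap index files, and so on; the helper names are made up):

    # Sketch: use sitemap lastmod values to decide what to re-fetch.
    import calendar
    import time
    import xml.etree.ElementTree as ET

    NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

    def parse_lastmod(value):
        """Parse a W3C datetime like 2005-10-07T12:00:00Z into epoch seconds."""
        return calendar.timegm(time.strptime(value, "%Y-%m-%dT%H:%M:%SZ"))

    def urls_to_recrawl(sitemap_path, last_crawled):
        """Yield URLs whose lastmod is newer than our recorded crawl time.

        last_crawled maps URL -> epoch seconds of the crawler's last fetch.
        """
        tree = ET.parse(sitemap_path)
        for url in tree.getroot().findall(NS + "url"):
            loc = url.findtext(NS + "loc")
            lastmod = url.findtext(NS + "lastmod")
            # No lastmod recorded, or modified since last crawl: fetch again.
            if lastmod is None or parse_lastmod(lastmod) > last_crawled.get(loc, 0):
                yield loc

The load argument is just that this check is a local lookup against one file, instead of a conditional GET (or full re-fetch) per URL against the webserver.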

A few questions about the validity of the above argument:
1. Are webcrawlers useful as an audience for a webserver? We think all search engines in the world should be comprehensive and parse through all pages. Without some sort of listing support like sitemaps, search engines will be incomplete. (And we could debate whether search engines are useful or not, in terms of getting people to a webserver in the first place.)

2. Are webcrawlers sending that much traffic to webservers, compared to regular web users? We think there is a lot of crawler activity on the web right now, and we have anecdotal evidence that crawlers account for a pretty large fraction of webserver activity. I am curious what you think from your own experience -- perhaps stats from apache.org would be useful.

comments/insults?
- shiva


On 10/7/05, Joshua Slive <[EMAIL PROTECTED]> wrote:
Greg Stein wrote:

> Ignore the mod_sitemap suggestion for now. As Shiva stated in his
> note, there is also sitemap_gen.py and its related docco [which exists
> today]. What are the group's thoughts on that?

I think the basic question is: how would this benefit our users?  It
seems like sitemap_gen.py is easy enough to grab from google.

Joshua.
