>>The author of such a page's HTML code could, however, add a list of
>>meta-information with links "hidden" in the document. When creating the
>>index, htdig would add these links to its list of pages to crawl:
>>
>><META NAME="htdig-follow" CONTENT="/home.html">
>><META NAME="htdig-follow" CONTENT="/news/overview.html">
>><META NAME="htdig-follow" CONTENT="/contact/mail.html">
>
>Nice idea, but: will people who are unaware of the drawbacks of relying
>solely on JavaScript or other nifty stuff for navigation be disciplined
>enough to implement the META info and - more importantly - maintain it?
>
>Moreover, you can have the same effect by adding those files to htdig.conf,
>which gives you even more control over the pages that are indexed (e.g. if
>you use scripting and stuff in order to keep other crawlers from indexing).
Of course, you can already do the same with the configuration file.
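For comparison, the same three pages could be listed in the configuration
file directly. A sketch, assuming ht://Dig's standard attribute syntax and a
made-up host name:

```
# Hypothetical excerpt from htdig.conf: list the JavaScript-only
# pages explicitly so the indexer still finds them.
start_url:      http://www.example.com/ \
                http://www.example.com/home.html \
                http://www.example.com/news/overview.html \
                http://www.example.com/contact/mail.html
```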
The jobs in web agencies have always been very specialized. From experience,
I can tell you that most screen designers and people doing HTML will
_never_ touch a configuration file. They don't feel comfortable doing that
- as soon as something looks like "programming", their instincts tell them
not to touch it... :-) But with this feature, they could maintain this
data without needing to understand the configuration file.
I know it's sad, and I often try to show screen designers that configuration
files are not some kind of voodoo. But that's how most of them react. So
why not help them?
Of course, it means that site designers must be disciplined enough. But in
web agencies, they have to be disciplined enough to do a lot of other
things, too. E.g., it is a common task to create "keyword" meta tags for
individual pages. And we are about to have rules for other meta tags too,
e.g. mentioning the name of the page author, the company and the creation
date.
(And yes, I know that JavaScript-only navigation is evil, but there is a
fine line between telling your clients what is good for them and being told
what to do by your clients. We explain the negative aspects of these
features to our clients, and sometimes they still go with them.)
>>[link check]
>
>Great stuff for site maintenance ;-)
>But it should be possible to turn this off/on for those who are concerned
>about the traffic that it produces. Best would be to have a configuration
>directive to set a default and a command line parameter for override.
>
>That way (and using cron), you can check broken links at longer intervals
>than you index a site.
As I proposed using an additional program for creating the error report
(and, if desired, checking external links), you don't need an override: let
cron run htdig/htmerge daily, but htcheck weekly.
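A minimal crontab sketch of that schedule. The paths are assumptions, and
"htcheck" is the program proposed here, not an existing tool:

```
# Hypothetical crontab: reindex nightly, check links weekly (Sunday).
5 2 * * *   /usr/local/bin/htdig -i && /usr/local/bin/htmerge
5 3 * * 0   /usr/local/bin/htcheck
```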
(Btw, is "htcheck" a good name? Or should this feature be added to htnotify?)
Anyway, I think you can override most configuration attributes with command
line parameters, so I guess this one would be implemented just like that.
One thing that I did not mention in my initial mail: it would be nice if
the crawler sent a "Referer" header with its HEAD requests. That way,
people wondering why their site is being visited by htdig on a daily or
weekly basis can see why.
Most crawlers do not send a Referer header. They do not have to, but I
always think it's kind of impolite.
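To illustrate what such a request would look like on the wire (a sketch,
not htdig's actual code; the host and paths are made up):

```python
def build_head_request(path, host, referer):
    """Compose a raw HTTP/1.0 HEAD request that tells the server
    which page caused this URL to be crawled."""
    return (
        f"HEAD {path} HTTP/1.0\r\n"
        f"Host: {host}\r\n"
        f"Referer: {referer}\r\n"
        f"User-Agent: htdig\r\n"
        "\r\n"
    )

request = build_head_request(
    "/news/overview.html",
    "www.example.com",
    "http://www.example.com/home.html",  # the page that linked here
)
print(request)
```

The server's access log would then show where the crawled link was found,
instead of just an unexplained stream of HEAD requests.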
Greetings,
Hanno
--
Hanno Müller, +49-40-5603170, http://www.hanno.de
"We were lookin' for an echo, an answer to our sound."
http://casa.home.pages.de
------------------------------------
To unsubscribe from the htdig3-dev mailing list, send a message to
[EMAIL PROTECTED] containing the single word "unsubscribe" in
the SUBJECT of the message.