Hi,
I wrote previously:
>I'd like to contribute a few _very small_ ideas that I found useful in my
>own application (and will try to implement most of them myself).
>These are mostly details in the search result output and a few META-Tags. I
>am currently reading the CVS source tree, to avoid making a fool of myself
>by suggesting things that are already under way.
Geoff told me that I should go ahead, anyway, so here are a few things that I
found useful with my own little (ugly) search engine and some additional
ideas. In no particular order...
* new configuration attribute "script_name"
Pro: Makes wrappers obsolete for .shtml / SSI files.
The first thing that I looked for when I installed ht://dig on one of our
servers was a way to override the SCRIPT_NAME value of the environment.
ht://dig uses this value to set its internal $(CGI) variable, which in turn
is used to create the URLs in $(PAGELIST). The .shtml / SSI instruction
"exec cgi" hands the CGI parameters that were used on the .shtml page over
to the called CGI script.
My current wrapper script (used on http://www.politik-digital.de) does only
one thing: it sets the SCRIPT_NAME environment variable to the name of the
.shtml page that calls the wrapper. With such a configuration attribute,
the wrapper script would be obsolete.
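To illustrate, here is a rough sketch of such a wrapper in Python (the
paths and names are made up; the real thing can be any CGI that does the
same two steps):

  #!/usr/bin/env python
  import os

  # Pretend to htsearch that it was called as the .shtml page, so that
  # $(CGI) - and thus the links in $(PAGELIST) - point back to the SSI
  # page instead of the wrapper.
  os.environ["SCRIPT_NAME"] = "/suche/index.shtml"

  # Hand control to the real htsearch binary; it inherits the modified
  # environment, including QUERY_STRING from the "exec cgi" call.
  os.execv("/usr/local/bin/htsearch", ["htsearch"])

With a "script_name" attribute, htsearch would simply prefer that value
over getenv("SCRIPT_NAME") when building $(CGI).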
* a different $(PAGELIST)
Pro: Easier browsing.
Con: Would break the old configuration options used to create $(PAGELIST)
such as page_number_text; not backward compatible. And some might say that
a long list of search result pages isn't worth browsing all the way down to
the final page, anyway.
I am not very happy with the $(PAGELIST). It can only display direct links
for the first 10 pages, but no link to page 11. (I know, you can change it
to display 20 links, but then you would not have a link to page 21.)
On a few sites I worked on (e.g. http://www.edgar.de), I used a different
and very small approach to page number links.
If the last page is, say, 20, you could have values of $(PAGELIST) like this:
[1] 2 3 4 5 6 7 8 9 ... 20
1 ... 5 6 7 8 [9] 10 11 12 13 ... 20
1 ... 12 13 14 [15] 16 17 18 19 20
You get the idea, I guess. The good thing is that the user can always jump
to the first or last page, and also to pages near the current page. I don't
know yet how to implement graphical page numbers with this scheme (one of
the neat features of the current page_number_text option is that you can
use images as page links).
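Here is a rough sketch of the scheme in Python (the function and parameter
names are mine, and the window size is just an example):

  def pagelist(current, last, radius=4):
      # Always show page 1, the last page, and up to `radius` pages on
      # either side of the current one; collapse each gap into "...".
      out = []
      for n in range(1, last + 1):
          if n == 1 or n == last or abs(n - current) <= radius:
              out.append("[%d]" % n if n == current else str(n))
          elif out[-1] != "...":
              out.append("...")
      return " ".join(out)

  print(pagelist(1, 20))   # [1] 2 3 4 5 ... 20
  print(pagelist(9, 20))   # 1 ... 5 6 7 8 [9] 10 11 12 13 ... 20
  print(pagelist(15, 20))  # 1 ... 11 12 13 14 [15] 16 17 18 19 20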
* a new META-tag: "htdig-follow"
Pro: Helps htdig find links that it could not "see" otherwise.
Many sites use javascript for navigation, using pull-up menus etc. Some
other sites use <select> forms that call a redirect script. These tricks
make it hard for a crawler to find a link, since these links are "hidden"
as javascript strings or as select values. I have even seen sites where the
target link was constructed by javascript, using string operations...
The author of such a page could, however, add these "hidden" links as
meta-information in the document. When creating the index, htdig would add
them to its list of pages to crawl:
<META NAME="htdig-follow" CONTENT="/home.html">
<META NAME="htdig-follow" CONTENT="/news/overview.html">
<META NAME="htdig-follow" CONTENT="/contact/mail.html">
* a full report of broken links
Pro: htdig crawls the site, anyway, so why not use these data for a few
useful extra results?
Con: Requires htdig to store referrer information, thus enlarging the
result files. But since this report is optional, you could simply turn it
off.
Since htdig crawls the entire site, it could also create a small report of
broken links that it found.
If requested, my own little perl-based search engine sends a "HEAD" request
for every external link, thus checking that all external references are
intact. The result is written as a small web page listing all broken links
and the pages that refer to them. If there were broken links, my search
engine finally sends a little e-mail to the maintainer, telling them to
visit the error report page on the server.
Since checking external links makes crawling slower (and is only necessary
for the broken link report, not for the actual search engine), a new
"htcheck" program might be useful:
- htdig crawls all internal links.
- htmerge creates the index.
- htcheck checks all external links (if requested), writes an error report
and sends out a notification if necessary.
Since authors of individual web pages can use the META-tag "htdig-email",
this proposed "htcheck" could even notify these authors if their particular
page had broken links, without telling them about the other broken links on
the site.
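The checking step itself is simple. A sketch in Python of what htcheck
could do (the data structure and names are invented for the example):

  import urllib.request

  def check_external(links):
      # `links` maps an external URL to the list of local pages that
      # refer to it (the referrer data htdig would have to store).
      broken = {}
      for url, referrers in links.items():
          req = urllib.request.Request(url, method="HEAD")
          try:
              with urllib.request.urlopen(req, timeout=10):
                  pass                      # 2xx/3xx: link is fine
          except Exception as err:          # 4xx/5xx, DNS failure, timeout
              broken[url] = (referrers, str(err))
      return broken

The report/notification step would then render the returned dictionary as
the error page and mail the maintainer - or, per page, the "htdig-email"
author.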
* A site map
Pro: Another set of results that can be extracted from the information
that htdig collects.
Con: It isn't exactly easy to do. And who needs a fully automated site map,
anyway?
Most, if not all, websites use a directory structure that reflects the
site's content. I tried to implement such a dynamic site map with my own
small search engine. This feature is still heavily beta (check out
http://www.veba.de/suchen/uebersicht.shtml?konfig=d&map=1), but it works.
Somewhat.
So "/news/index.html" with the title "Current News Reports" and
"/news/archive.html" with the title "Older Articles" etc. would be
translated to a HTML file like this
- Homepage
- News:
- Current News Reports
- Older Articles
- Products
- Hardware
- Spoon Selection
- Fork Selection
- Knife Selection
- Software
- Meat
- Fruit
- Vegetables
- Contact Information
- Mail Your Suggestions
If a new page is added to the site, the next update of the index will have
it listed in the site map automatically.
The configuration file for the site map titles looks like this - note that
the entries are in sitemap order.
/
/news News
/prod Products
/prod/hard Hardware
/prod/soft Software
/contact Contact Information
Since there are always exceptions to the rule (e.g. a CGI script that is
used to create dynamic pages of the "News" section), an "htdig-section"
meta-tag can be used to override the page's original URL information. When
this meta-tag's content is set to "", the page is not shown in the site map
at all.
What makes it difficult is that site designers usually do not want the
pages within a section sorted by their title - e.g., the index page must be
listed first. I used a small numbering scheme for that, but wasn't very
happy with it.
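To make the core of the idea concrete, here is a stripped-down sketch in
Python (the section table mirrors the config file above; all names are
mine, and pages simply keep their input order, which dodges the sorting
problem only if that order happens to be right):

  # Sections in sitemap order, taken from the configuration file above.
  SECTIONS = [("/", ""), ("/news", "News"), ("/prod", "Products"),
              ("/prod/hard", "Hardware"), ("/prod/soft", "Software"),
              ("/contact", "Contact Information")]

  def sitemap(pages):
      # `pages` is a list of (url, title) pairs from the crawl.
      for prefix, heading in SECTIONS:
          if heading:
              print("  " * (prefix.count("/") - 1) + "- " + heading)
          for url, title in pages:
              # The page sits directly in this section's directory.
              if url.rsplit("/", 1)[0] + "/" == prefix.rstrip("/") + "/":
                  print("  " * prefix.count("/") + "- " + title)

  sitemap([("/news/index.html", "Current News Reports"),
           ("/news/archive.html", "Older Articles")])
  # - News
  #   - Current News Reports
  #   - Older Articles
  # - Products
  #   - Hardware
  #   - Software
  # - Contact Information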
Thoughts? Suggestions?
Greetings,
Hanno
--
Hanno Müller, +49-40-5603170, http://www.hanno.de
"We were lookin' for an echo, an answer to our sound."
http://casa.home.pages.de