According to Hanno Mueller:
>Hi,
>
>
>I wrote previously:
>
>>I'd like to contribute a few _very small_ ideas that I found useful in my
>>own application (and will try to implement most of them myself).
>>These are mostly details in the search result output and a few META-Tags. I
>>am currently reading the CVS source tree, to avoid making a fool of myself
>>by suggesting things that are already under way.
>
>
>Geoff told me that I should go ahead, anyway, so here are a few things that I
>found useful with my own little (ugly) search engine and some additional
>ideas. In no particular order...
>
>
>
>* new configuration attribute "script_name"
>
>Pro: Makes wrappers obsolete for .shtml / SSI files.
>
>
>The first thing that I looked for when I installed ht://dig on one of our
>servers was a way to override the SCRIPT_NAME value of the environment.
>ht://dig uses this value to set its internal $(CGI) variable, which is
>used to create the URLs in $(PAGELIST).
>
>The .shtml / SSI instruction "exec cgi" hands CGI parameters that were used
>on the .shtml page over to the called CGI script.
>
>My current wrapper script (used on http://www.politik-digital.de) does only
>one thing: it sets the SCRIPT_NAME environment variable to the name of
>the .shtml page that calls the wrapper. With such a configuration
>attribute, the wrapper script would be obsolete.

Nice for tweaking, IMO.  Should be implementable easily, I think ;-)
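Just to make it concrete, the wrapper Hanno describes boils down to something
like the sketch below (Python, purely for illustration - the htsearch path,
the wrapper name and the DOCUMENT_URI fallback are my assumptions, not his
actual script).  A script_name directive would make this indirection
unnecessary.

#!/usr/bin/env python3
# Illustrative wrapper sketch; called from a .shtml page via
# <!--#exec cgi="/cgi-bin/htsearch-wrapper.py"-->  (paths are assumptions).
import os

# Apache SSI sets DOCUMENT_URI to the .shtml page that includes us;
# htsearch derives $(CGI) from SCRIPT_NAME, so point SCRIPT_NAME there.
os.environ["SCRIPT_NAME"] = os.environ.get("DOCUMENT_URI", "/search.shtml")

# Hand control over to the real htsearch binary.
os.execv("/usr/local/bin/htsearch", ["htsearch"])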


>* a different $(PAGELIST)
>
>Pro: Easier browsing.
>
>Con: Would break the old configuration options used to create $(PAGELIST)
>such as page_number_text; not backward compatible. And some might say that
>a long list of search result pages isn't worth browsing all the way down to
>the final page, anyway.

This would not necessarily break backward compatibility if you're using
a new name for that.  Sites that want to stick with the old PAGELIST can
do so (might be easier for them to configure).

I'd suggest something like DYNAMIC_PAGELIST for that.

>I am not very happy with the $(PAGELIST). It can only display direct
>links for the first 10 pages, with no link to page 11. (I know, you can
>change it to display 20 links, but again, you would not have a link to page
>21).
>
>On a few sites I worked on (e.g. http://www.edgar.de), I used a different
>and very small approach to page number links.
>
>If the last page is, say, 20, you could have values of $(PAGELIST) like this:
>
>[1] 2 3 4 5 6 7 8 9 ... 20
>
>1 ... 5 6 7 8 [9] 10 11 12 13 ... 20
>
>1 ... 12 13 14 [15] 16 17 18 19 20
>
>You get the idea, I guess. The good thing is that the user can always jump
>to the first or last page and also to pages near to the current page.
>
>I don't really know how to implement graphical page numbers with this
>scheme, yet (one of the neat features of the current page_number_text
>option is that you can use images as page links).

Easy, if you're using a configuration directive that resembles some sort
of template, e.g.

dynamic_page_number:    <img src="/gfx/pgnum%04d.gif" alt="%d">

or for dynamic images

dynamic_page_number:    <img src="/cgi-bin/pgnum.pl?%d" alt="%d">

There should be a way to get this going with sprintf, vsprintf or
regexp search-and-replace ;-)
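Just to make the windowing scheme concrete, here's a rough sketch of the
logic (Python, purely illustrative; the window size of 4 and the single-%d
template are my own assumptions):

def page_list(current, last, window=4, template="%d"):
    # template could just as well be an <img ...> snippet like the
    # dynamic_page_number examples above.
    pages = []
    lo = max(1, current - window)
    hi = min(last, current + window)
    if lo > 1:
        pages.append(template % 1)
        if lo > 2:
            pages.append("...")
    for n in range(lo, hi + 1):
        # The current page gets brackets instead of a link.
        pages.append("[%d]" % n if n == current else template % n)
    if hi < last:
        if hi < last - 1:
            pages.append("...")
        pages.append(template % last)
    return " ".join(pages)

# page_list(9, 20)  -> "1 ... 5 6 7 8 [9] 10 11 12 13 ... 20"
# page_list(15, 20) -> "1 ... 11 12 13 14 [15] 16 17 18 19 20"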


>* a new META-tag: "htdig-follow"
>
>Pro: Helps htdig find links that it could not "see" otherwise.
>
>
>Many sites use JavaScript for navigation, e.g. pull-up menus. Some other
>sites use <select> forms that call a redirect script.
>
>These tricks make it hard for a crawler to find a link, since these links
>are "hidden" as JavaScript strings or as select values. I have even seen
>sites where the target link was constructed by JavaScript, using string
>operations...
>
>The author of such a page could, however, add the "hidden" links as a list
>of meta-information in the document. When creating the index, htdig would
>add these links to its list of pages to crawl:
>
><META NAME="htdig-follow" CONTENT="/home.html">
><META NAME="htdig-follow" CONTENT="/news/overview.html">
><META NAME="htdig-follow" CONTENT="/contact/mail.html">

Nice idea, but:  Will people who are unaware of the cons of solely using
JavaScript or other nifty stuff for navigational concepts be disciplined
enough to implement the META infos and - more importantly - maintain them?

Moreover, you can get the same effect by adding those files to htdig.conf,
which gives you even more control over the pages that are indexed (e.g. if you
use scripting and the like to keep other crawlers from indexing them).
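For what it's worth, picking those META links out of a page is trivial on the
parser side; a rough sketch (Python, just to illustrate the idea - this is of
course not htdig's actual parser):

from html.parser import HTMLParser

class FollowLinks(HTMLParser):
    # Collect URLs from <META NAME="htdig-follow" CONTENT="..."> tags.
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag == "meta" and a.get("name", "").lower() == "htdig-follow":
            if a.get("content"):
                self.links.append(a["content"])

parser = FollowLinks()
parser.feed('<META NAME="htdig-follow" CONTENT="/home.html">')
print(parser.links)        # ['/home.html'] - queue these for crawling

The real work is, as said above, getting authors to maintain the tags, not
parsing them.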


>* a full report of broken links
>
>Pro: htdig crawls the site, anyway, so why not use these data for a few
>useful extra results?
>
>Con: Makes it necessary for htdig to store referrer information, thus
>enlarging the result files. But since this report is optional, you could
>simply turn this off.
>
>
>Since htdig crawls the entire site, it could also create a small report of
>broken links that it found.
>
>If requested, my own little Perl-based search engine makes a "HEAD" request
>to all external links, thus checking that all external references are
>intact. The result is written as a small web page listing all broken
>links and the pages referring to them. If there were broken links, my search
>engine finally sends a little e-mail to the maintainer, telling him to visit
>the error report page on the server.
>
>Since checking external links makes crawling slower (and is only necessary
>for the broken link report, not for the actual search engine), a new
>"htcheck" program might be useful:
>
>- htdig crawls all internal links.
>
>- htmerge creates the index.
>
>- htcheck checks all external links (if requested), writes an error report
>and sends out a notification if necessary.
>
>As authors of individual web pages can use the META tag "htdig-email", this
>proposed "htcheck" could even notify these authors if their particular
>page had broken links, without telling them about other broken links on the
>site.

Great stuff for site maintenance ;-)
But it should be possible to turn this off/on for those who are concerned
about the traffic that it produces.  Best would be to have a configuration
directive to set a default and a command line parameter for override.

That way (and using cron) you can check broken links on a longer cycle than
you index the site.
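The check itself is not much code either; a quick sketch of what htcheck
could do (Python, illustrative only - in htdig it would of course reuse the
existing HTTP code and the stored referrer information):

import urllib.request, urllib.error

def check_links(links, timeout=10):
    # links: dict mapping an external URL -> list of pages referring to it.
    # Returns only the URLs that failed, with their referrers and the error.
    broken = {}
    for url, referrers in links.items():
        req = urllib.request.Request(url, method="HEAD")
        try:
            urllib.request.urlopen(req, timeout=timeout)
        except (urllib.error.URLError, OSError) as err:
            broken[url] = (referrers, err)
    return broken

# The error report page and the notification mails (per htdig-email author,
# if present) would then be generated from the returned dict.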


>* A site map
>
>Pro: Another set of useful output that can be extracted from the information
>that htdig collects.
>
>Con: It isn't exactly easy to do. And who needs a fully automated site map,
>anyway?
>
>
>Most, if not all, websites use a directory structure that reflects the
>site's content. I tried to implement such a dynamic sitemap with my own
>small search engine. This feature is still heavily beta, check out
>http://www.veba.de/suchen/uebersicht.shtml?konfig=d&map=1. But it works.
>Somewhat.
>
>So "/news/index.html" with the title "Current News Reports" and
>"/news/archive.html" with the title "Older Articles" etc. would be
>translated to a HTML file like this
>
>- Homepage
>- News:
>       - Current News Reports
>       - Older Articles
>- Products
>       - Hardware
>               - Spoon Selection
>               - Fork Selection
>               - Knife Selection
>       - Software
>               - Meat
>               - Fruit
>               - Vegetables
>- Contact Information
>       - Mail Your Suggestions
>
>If a new page is added to the site, the next update of the index will have
>it listed in the site map automatically.
>
>The configuration file for the site map titles looks like this - note that
>it is in the order of the sitemap.
>
>/
>/news          News
>/prod          Products
>/prod/hard     Hardware
>/prod/soft     Software
>/contact       Contact Information
>
>Since there are always exceptions to the rule (e.g. a CGI script that is used
>to create dynamic pages of the "News" section), an "htdig-section" META tag
>can be used to override the page's original URL information. When this
>META tag's content is set to "", the page is not shown in the site map at
>all.
>
>The thing that makes it difficult is that site designers usually do not
>want the pages within a section simply sorted by their titles. E.g., the
>index page must be listed first. I used a small numbering scheme for that,
>but wasn't very happy with it.

Sitemap generation is much more complex, and not very useful at all, if you
want your very special nifty little map with graphics, table formatting etc.

It is fairly easy to have a nested list generated with the additions you
mention above (see the sketch below), but most people won't be happy with that.

If there's no way to make the digger work with a nifty template to generate
the map, then the implementation would IMO not be worth the effort.
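To illustrate the "fairly easy" part: generating the plain nested list from
the section file and the collected titles is roughly this (Python sketch;
section-file parsing, the htdig-section override and the "index page first"
ordering are left out, and template/graphics support is exactly the part
that's missing):

def site_map(sections, pages):
    # sections: (prefix, label) pairs in the desired output order, e.g.
    #   [("/", ""), ("/news", "News"), ("/prod", "Products"),
    #    ("/prod/hard", "Hardware")]
    # pages: (url, title) pairs as collected by the digger.  Assumes every
    # URL at least matches the "/" prefix.
    prefixes = [p for p, _ in sections]
    out = []
    for prefix, label in sections:
        depth = prefix.rstrip("/").count("/")      # nesting level
        if label:
            out.append("  " * (depth - 1) + "- " + label)
        # Ordering within a section (by URL here) is the hard part, as
        # noted above.
        for url, title in sorted(pages):
            # A page belongs to the longest configured prefix matching it.
            if max((p for p in prefixes if url.startswith(p)), key=len) == prefix:
                out.append("  " * depth + "- " + title)
    return "\n".join(out)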


cheers,
  Torsten

--
InWise - Wirtschaftlich-Wissenschaftlicher Internet Service GmbH
Waldhofstraße 14                            Tel: +49-4101-403605
D-25474 Ellerbek                            Fax: +49-4101-403606
E-Mail: [EMAIL PROTECTED]            Internet: http://www.inwise.de