Thanks to all who helped with suggestions.  Here is version 1 of my
solution.  The short shell script that does all the work is an
attachment.

-Rick

For a long time I've wanted to turn my vast email archives into HTML
and index *every single word* in those archives.  Then, I wanted to
blast those archives onto CD-ROMs for permanent archival storage.

I have been eyeing a very nice, free indexing package called "htdig"
as the search engine of choice for this purpose (http://www.htdig.org).
It will index every word in a collection of text and HTML files.

The problem with htdig is that the conventional usage is to index an
entire web site on a single machine into a single database.  I wanted
to index several collections independently, and I wanted to be able to
easily move those collections and their indexes between machines and
onto CD-ROM without having to do a lot of work to "install" the
database onto each machine.

I worked out a shell script to do what I wanted to do.  I have
attached said shell script, "digdir".

As a proof of concept, I decided to use the latest copy of the
Internet RFC collection as a test.  I started with the 2700 or so
RFCs in text form and stored them in the directory
/home/httpd/html/rfc.  I then ran the "digdir" shell script thusly:

	$ cd /home/httpd/html
	$ digdir rfc

After about 5 minutes, the shell script finishes the indexing process.
It adds a number of new files under /home/httpd/html/rfc, but does not
add or modify any other files on the computer.  These new files include
a "search.html" search form used for submitting queries, and the
indexed database generated by htdig.

It is now possible to blast this entire directory onto CD-R, and you
could mount that CD-ROM on another machine under /home/httpd/html and
it would work (assuming you have previously installed the stock htdig
RPM package).

To see the results, open this URL (will work only on the Digi
intranet):

	http://digifax.digi.com/rfc/search.html

In the search form, type "url" or anything else you'd like to search
for.

Enjoy.

-Rick
-- 
Rick Richardson  [EMAIL PROTECTED]  http://RickRichardson.freeservers.com/
My current CI is 28.  I'm 41.  I need 14 more cylinders by my next
birthday.  Two PWC's and an SUV ought to do it.  That's my new goal.
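P.S.  For the "blast onto CD-R" step, something along these lines
should do the trick.  The /tmp/rfc.iso path, the mkisofs options, and
the dev=0,0,0 burner address are only an example of one way to do it;
adjust for your own setup:

	# Build an ISO image of the indexed directory (Rock Ridge and
	# Joliet extensions keep the long filenames intact), then burn it.
	$ cd /home/httpd/html
	$ mkisofs -r -J -o /tmp/rfc.iso rfc
	$ cdrecord -v dev=0,0,0 /tmp/rfc.iso

	# On the other machine, mount the CD at the same place the index
	# was built, so the absolute paths in the htdig config still match.
	$ mkdir -p /home/httpd/html/rfc
	$ mount -t iso9660 -o ro /dev/cdrom /home/httpd/html/rfc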
#!/bin/sh
#
# digdir
#
# Creates a standalone HTDIG searchable index of a directory of web or
# text documents that can be put onto CD-ROM or traded with other
# machines.  If put on a CD-ROM, the CD-ROM can be mounted directly
# under /home/httpd/html with no further configuration required.
#
# Usage:
#
#	$ cd /home/httpd/html
#	$ mkdir documents
#	$ [ fill "documents" directory with text or HTML files ]
#	$ digdir documents
#
# At this point, the directory "documents" contains the
# original documents themselves, as well as a search.html
# search form, and all "htdig" config and database files
# (hidden under documents/.htdig).  This directory can
# be moved to any other machine.
#
# NOTE: as of htdig 3.1.x, the HTDIG database is architecture
# dependent, so the database can only be moved to machines
# of like architecture.  This is a design flaw in htdig.
#
# Requirements:
#	On the machine which is used to initially generate the index,
#	HTDIG 3.1.x must be installed.  Get it from http://www.htdig.org/
#
#	On the machine which is used to search the database, only
#	/home/httpd/cgi-bin/htsearch (from HTDIG 3.1.x) must exist.
#
# Author:
#	Rick Richardson, [EMAIL PROTECTED], November 1999.
#
#	This software is donated to the PUBLIC DOMAIN and may be used for
#	any purpose without restriction.  No warranties expressed or
#	implied.  Your mileage may vary.
#

error() {
	echo "digdir error: $*"
	exit 1
}

DIR="$1"
PATH=$PATH:/usr/sbin

[ -d "$DIR" ] || error "directory name missing or non-existent"
case "$DIR" in
/*)	error "directory name must be relative to current directory";;
esac

cd $DIR || error "can't chdir to $DIR"
HERE=`pwd`

#
# Remove the old database and copy in a fresh set of HTDIG
# distribution files.
#
rm -f search.html
rm -rf .htdig
mkdir .htdig || error "can't make directory $HERE/.htdig"
( cd /var/lib/htdig; find common ! -name 'db.*' ! -name '*.db' |
	cpio -pudm $HERE/.htdig )
mkdir .htdig/db || error "can't make directory $HERE/.htdig/db"

#
# Make a copy of the matching htsearch binary, so that
# somebody who gets a copy of this index and doesn't
# have a matching htsearch binary handy can just grab
# it from here and stash it in cgi-bin.  Also copy this
# shell script in case somebody wants to regen the index.
#
cp -a /home/httpd/cgi-bin/htsearch $0 .htdig/ ||
	error "can't find htsearch binary"

#
# Create two config files, one for htdig and one for htsearch.
#
# Using two config files allows us to eliminate any appearance
# of an absolute URL (one with a domain name, even localhost)
# in the results, thus making the database portable.
#
# We convert the output URL to ../$DIR because the browser's
# idea of the current directory will be cgi-bin.
#
DCONF=$HERE/.htdig/htdig.conf
SCONF=$HERE/.htdig/htsearch.conf
cp /etc/htdig/htdig.conf $DCONF
cp /etc/htdig/htdig.conf $SCONF

cat <<-EOF >> $DCONF
database_dir:		$HERE/.htdig/db
common_dir:		$HERE/.htdig/common
start_url:		http://localhost/$DIR/
local_urls:		http://localhost/$DIR/=/home/httpd/html/$DIR/
local_user_urls:	http:/=/home/,/public_html/
url_part_aliases:	http://localhost/$DIR *$DIR
EOF

cat <<-EOF >> $SCONF
database_dir:		$HERE/.htdig/db
common_dir:		$HERE/.htdig/common
start_url:		http://localhost/$DIR/
local_urls:		http://localhost/$DIR/=/home/httpd/html/$DIR/
local_user_urls:	http:/=/home/,/public_html/
url_part_aliases:	http:../$DIR *$DIR
EOF

#
# Generate the database using HTDIG
#
htdig -v -c $DCONF -i
htmerge -c $DCONF
htnotify -c $DCONF
htfuzzy -c $DCONF endings
htfuzzy -c $DCONF synonyms

#
# Create the initial search page
#
CGI="http:/cgi-bin/htsearch?-c$SCONF"
cat <<-EOF > search.html
<html>
<head>
<title>ht://Dig WWW Search of $DIR</title>
</head>
<body bgcolor="#eef7ff">
<h1>
<a href="http://www.htdig.org">
<IMG SRC="/htdig/htdig.gif" align=bottom alt="ht://Dig" border=0></a>
WWW Site Search</H1>
<hr noshade size=4>
This search will allow you to search the contents of
all documents under this directory.
<br>
<p>
<form method="post" action="$CGI">
<font size=-1>
Match: <select name=method>
<option value=and>All
<option value=or>Any
</select>
Format: <select name=format>
<option value=builtin-long>Long
<option value=builtin-short>Short
</select>
Sort by: <select name=sort>
<option value=score>Score
<option value=time>Time
<option value=title>Title
<option value=revscore>Reverse Score
<option value=revtime>Reverse Time
<option value=revtitle>Reverse Title
</select>
</font>
<input type=hidden name=config value=htdig>
<input type=hidden name=restrict value="">
<input type=hidden name=exclude value="">
<br>
Search: <input type="text" size="30" name="words" value="">
<input type="submit" value="Search">
</form>
<hr noshade size=4>
</body>
</html>
EOF

#
# Fix up the templates that create the refine page, etc.
#
# Change these from method GET to POST so that the
# -c$SCONF option will work.
#
for i in header nomatch syntax wrapper
do
	ex .htdig/common/$i.html <<-EOF
g#.(CGI)#s##$CGI#
g#method=.get.#s##method="post"#
w
q
EOF
done
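#
# Not part of digdir itself -- just a sketch of how a machine that has
# never had htdig installed could serve searches from the CD.  digdir
# stashed a copy of htsearch under .htdig/, so using the "rfc" example
# from the mail above (the paths assume the standard /home/httpd layout):
#
#	$ mount -t iso9660 -o ro /dev/cdrom /home/httpd/html/rfc
#	$ cp /home/httpd/html/rfc/.htdig/htsearch /home/httpd/cgi-bin/
#	$ chmod 755 /home/httpd/cgi-bin/htsearch
#
# and then point a browser at http://yourhost/rfc/search.html
#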
