Thanks to all who helped with suggestions.

Here is version 1 of my solution.  The short shell script
that does all the work is an attachment.

-Rick

For a long time I've wanted to turn my vast email archives into HTML
and index *every single word* in those archives.  Then, I wanted to
blast those archives onto CD-ROMs for permanent archival storage.

I have been eyeing a very nice, free indexing package called "htdig"
(http://www.htdig.org) as the search engine of choice for this purpose.
It will index every word in a collection of text and HTML files.

The problem with htdig is that the conventional usage is to index an
entire web site on a single machine into a single database.
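
For reference, a conventional setup is driven by one config file,
typically /etc/htdig/htdig.conf, with entries something like the
following (the values here are illustrative):

        database_dir:   /var/lib/htdig/db
        common_dir:     /var/lib/htdig/common
        start_url:      http://www.example.com/

Everything ends up in the single database under database_dir.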

I wanted to index several collections independently, and wanted to be
able to easily move those collections and their indexes between
machines and onto CD-ROM without having to do a lot of work to
"install" the database onto each machine.

I worked out a shell script to do what I wanted to do.  I have
attached said shell script "digdir".

As a proof of concept, I decided to use the latest copy of the
Internet RFC collection as a test.

I started with the 2700 or so RFCs in text form.  I stored these into
directory /home/httpd/html/rfc.
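
Reproducing that step is just a matter of copying the text files into
place; the source path below is a placeholder for wherever you keep a
copy of the RFCs:

        $ mkdir -p /home/httpd/html/rfc
        $ cp /path/to/rfc-archive/rfc*.txt /home/httpd/html/rfc/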

I then ran the "digdir" shell script like so:

        $ cd /home/httpd/html
        $ digdir rfc

After about 5 minutes, the shell script finished the indexing process.
It adds a number of new files under /home/httpd/html/rfc, but does not
add or modify any other files on the computer.  These new files
include a "search.html" search form used for submitting queries, and
the indexed database generated by htdig.
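
Roughly, the directory ends up looking like this (the exact database
file names under db/ vary with the htdig version):

        rfc/
            rfc*.txt            the original documents
            search.html         the search form
            .htdig/
                htdig.conf      config used for indexing
                htsearch.conf   config used for searching
                htsearch        a copy of the CGI binary
                digdir          a copy of the script itself
                common/         htdig templates and word lists
                db/             the generated index databases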

You can now blast this entire directory onto a CD-R, mount the CD-ROM
on another machine under /home/httpd/html, and it will work (assuming
you have previously installed the stock htdig RPM package on that
machine).
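
The blasting itself can be done with the usual mkisofs/cdrecord pair;
the device address below is a placeholder for whatever "cdrecord
-scanbus" reports for your burner:

        $ cd /home/httpd/html
        $ mkisofs -R -o rfc.iso rfc
        $ cdrecord dev=0,0,0 rfc.iso

Note that mkisofs puts the *contents* of "rfc" at the root of the
image, so on the other machine you mount the disc at the matching
spot:

        $ mount -t iso9660 /dev/cdrom /home/httpd/html/rfc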

To see the results, open this URL (it will work only on the Digi intranet):

        http://digifax.digi.com/rfc/search.html

In the search form, type "url" or anything else you'd like to search for.

Enjoy.

-Rick

-- 
Rick Richardson  [EMAIL PROTECTED]   http://RickRichardson.freeservers.com/

My current CI is 28.  I'm 41.  I need 14 more cylinders by my next
birthday.  Two PWCs and an SUV ought to do it.  That's my new goal.
#!/bin/sh

#
# digdir
#
#       Creates a standalone HTDIG searchable index of a directory of web or
#       text documents that can be put onto CD-ROM or traded with other
#       machines.  If put on a CD-ROM, the CD-ROM can be mounted directly
#       under /home/httpd/html with no further configuration required.
#
# Usage:
#
#       $ cd /home/httpd/html
#       $ mkdir documents
#       $ [fill "documents" directory with text or HTML files ]
#       $ digdir documents
#
#       At this point, the directory "documents" contains the
#       original documents themselves, as well as a search.html
#       search form, and all "htdig" config and database files
#       (hidden under documents/.htdig).  This directory can
#       be moved to any other machine.
#
#       NOTE: as of htdig 3.1.x, the HTDIG database is architecture
#       dependent, so the database can only be moved to machines
#       of like architecture.  This is a design flaw in htdig.
#
# Requirements:
#       On the machine which is used to initially generate the index,
#       HTDIG 3.1.x must be installed.  Get it from http://www.htdig.org/
#
#       On the machine which is used to search the database, only
#       /home/httpd/cgi-bin/htsearch (from HTDIG 3.1.x) must exist.
#
# Author:
#       Rick Richardson, [EMAIL PROTECTED], November 1999.
#
#       This software is donated to the PUBLIC DOMAIN and may be used for
#       any purpose without restriction.  No warranties expressed or
#       implied.  Your mileage may vary.
#

error() {
        echo "digdir error: $*"
        exit 1
}

DIR="$1"
PATH=$PATH:/usr/sbin

[ -d "$DIR" ] || error "directory name missing or non-existant"

case "$DIR" in
/*) error "directory name must be relative to current directory";;
esac

cd "$DIR" || error "can't chdir to $DIR"

HERE=`pwd`

#
#       Remove the old database and copy in a fresh set of HTDIG
#       distribution files.
#
rm -f search.html
rm -rf .htdig
mkdir .htdig || error "can't make directory $HERE/.htdig"

(
        cd /var/lib/htdig; find common ! -name 'db.*' ! -name '*.db' |
                cpio -pudm $HERE/.htdig
)
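#
#       ("find" selects everything under common/ except stale database
#       files; "cpio -pudm" copies that tree into $HERE/.htdig,
#       creating directories as needed and preserving modification times.)
#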

mkdir .htdig/db || error "can't make directory $HERE/.htdig/db"

#
#       Make a copy of the matching htsearch binary, so that
#       somebody who gets a copy of this index and doesn't
#       have a matching htsearch binary handy can just grab
#       it from here and stash it in cgi-bin.  Also copy this
#       shell script in case somebody wants to regen the index.
#
cp -a /home/httpd/cgi-bin/htsearch $0 .htdig/ ||
        error "can't find htsearch binary"

#
#       Create two config files, one for htdig and one for htsearch
#
#       Using two config files allows us to eliminate any appearance
#       of an absolute URL (one with a domain name, even localhost)
#       in the results, thus making the database portable.
#
#       We convert the output URL to ../$DIR because the browser's
#       idea of the current directory will be cgi-bin.
#
DCONF=$HERE/.htdig/htdig.conf
SCONF=$HERE/.htdig/htsearch.conf
cp /etc/htdig/htdig.conf $DCONF
cp /etc/htdig/htdig.conf $SCONF

cat <<-EOF >> $DCONF
        database_dir:           $HERE/.htdig/db
        common_dir:             $HERE/.htdig/common
        start_url:              http://localhost/$DIR/
        local_urls:             http://localhost/$DIR/=/home/httpd/html/$DIR/
        local_user_urls:        http:/=/home/,/public_html/
        url_part_aliases:       http://localhost/$DIR *$DIR
EOF

cat <<-EOF >> $SCONF
        database_dir:           $HERE/.htdig/db
        common_dir:             $HERE/.htdig/common
        start_url:              http://localhost/$DIR/
        local_urls:             http://localhost/$DIR/=/home/httpd/html/$DIR/
        local_user_urls:        http:/=/home/,/public_html/
        url_part_aliases:       http:../$DIR *$DIR
EOF
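#
#       (Net effect: htdig records every URL under $DIR as the
#       placeholder "*$DIR", and htsearch expands that placeholder to
#       the relative form "http:../$DIR", so no hostname ever appears
#       in the results.)
#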

#
#       Generate the database using HTDIG
#
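#       ("-i" forces a full, from-scratch index.  htmerge turns htdig's
#       output into the final searchable databases.  htnotify and
#       htfuzzy are optional extras: expiration notices by mail, and
#       the "endings" and "synonyms" fuzzy-match databases.)
#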
htdig -v -c $DCONF -i
htmerge -c $DCONF
htnotify -c $DCONF
htfuzzy -c $DCONF endings
htfuzzy -c $DCONF synonyms

#
#       Create the initial search page
#
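#       (Note: "http:/cgi-bin/htsearch" deliberately omits the
#       hostname; the browser resolves it against whatever server it
#       is talking to, which keeps the page portable.)
#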
CGI="http:/cgi-bin/htsearch?-c$SCONF"

cat <<-EOF > search.html
        <html>
        <head>
        <title>ht://Dig WWW Search of $DIR</title>
        </head>
        <body bgcolor="#eef7ff">
        <h1>
        <a href="http://www.htdig.org">
        <IMG SRC="/htdig/htdig.gif" align=bottom alt="ht://Dig" border=0></a>
        WWW Site Search</H1>
        <hr noshade size=4>
        This search will allow you to search the contents of
        all documents under this directory.
        <br>
        <p>
        <form method="post" action="$CGI">
        <font size=-1>
        Match: <select name=method>
        <option value=and>All
        <option value=or>Any
        </select>
        Format: <select name=format>
        <option value=builtin-long>Long
        <option value=builtin-short>Short
        </select>
        Sort by: <select name=sort>
        <option value=score>Score
        <option value=time>Time
        <option value=title>Title
        <option value=revscore>Reverse Score
        <option value=revtime>Reverse Time
        <option value=revtitle>Reverse Title
        </select>
        </font>
        <input type=hidden name=config value=htdig>
        <input type=hidden name=restrict value="">
        <input type=hidden name=exclude value="">
        <br>
        Search:
        <input type="text" size="30" name="words" value="">
        <input type="submit" value="Search">
        </form>
        <hr noshade size=4>
        </body>
        </html>
EOF

#
#       Fix up the templates that create the refine page, etc.
#
#       Change these from method GET to POST so that the
#       -c$SCONF option will work.
#
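#       (The first ex substitution below points the templates' "$(CGI)"
#       references at our relocatable htsearch URL.)
#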
for i in header nomatch syntax wrapper
do
        ex .htdig/common/$i.html <<-EOF
                g#.(CGI)#s##$CGI#
                g#method=.get.#s##method="post"#
                w
                q
        EOF
done
