Learning Common Lisp, and especially reading Paul Graham's
just-published-on-the-Web _On Lisp_, I often wish for a quick online
help program that will tell me what a particular Lisp idiom means.
Kent Pitman made this awesomely cool thing called the Common Lisp
HyperSpec, which weighs in at 18 megs, is densely hyperlinked, and is
thoroughly and meticulously indexed with some 3000 index entries, but
the indices have to be navigated by pointy-clicky web browsing.  This
is nice for when I'm looking for something I don't know about, but
it's a nuisance when I want to know the semantics of mapcan or the
argument list of map.

So here's a command-line program that looks stuff up in the index and
points your web browser at it.  One of these days I might hook it up
to the web button toolbar so I can double-click a word in my xterm and
click the appropriate button.


#!/usr/bin/python
"""Look up a term in the Common Lisp HyperSpec, version 6-0.

Requires Python 1.5.2 or newer, BSD DB support, and either a local
copy of the HyperSpec or Internet access.  Try running it with the
argument "mapcar".

The first time you run this program, assuming all goes well, it will
fetch 27 files totaling half a megabyte from xanalys.com, and from
them it will create a 370-kilobyte Berkeley DB file called
'hsindex.db' in your home directory.  On a 56K modem, this will
probably take three minutes or so.

The details of the index files will probably change again in the next
version, and this program won't work anymore unless you saved a copy
of the version 6-0 HyperSpec.

It would be nice to use the 'webbrowser' objects in Python 2.x, but
unfortunately they don't default to the 'links' browser the way I'd
like them to, and the Mozilla interface doesn't work properly when
opening a new Mozilla.  (My old Mozilla version doesn't accept URLs on
the command line.)  But using them is as simple as

>>> import webbrowser
>>> webbrowser.open('http://www.yahoo.com/')

This program builds a Berkeley DB file, which was completely
unnecessary in the situation I started with --- I had version 4-0 on
my local disk, so reading all the HTML index files took 0.4 seconds
instead of 0.08 seconds.  "What a stupid waste of coding effort," I
thought.  Then I realized it would be worthwhile if only it worked
against the online version, which would be as simple as changing
'open' to 'urllib.urlopen', and providing a better place to keep the
index db file when using a remote copy.

But the online version was 6-0, so I hacked this program to work
with 6-0, which has several times as many index entries and a
different HTML format in the index, and which also necessitated adding
subhead support.  Now the DB file is a win even for local queries, but
only because Python is so slow (and because my HTML parsing code is so
slow).

This was probably a case of throwing good money after bad, or
irrationally trying to rescue sunk costs.  But it was fun.

I am unhappy with the broken Berkeley DB iteration API, which
conflates setting a position and getting a value (leading to
error-handling and iteration-terminating code being far more
complicated than it needs to be) and conflates iterators and database
handles (meaning that finding the strings x such that both a=x and b=x
are in the database is unnecessarily difficult --- not that I was
doing that).

"""

import bsddb, sys, os, string, urlparse, urllib

# magic constants
topdir = "/home/kragen/docs/hyperspec/HyperSpec"  # where the hyperspec is
if not os.path.exists(topdir):
    topdir = "http://www.xanalys.com/software_tools/reference/HyperSpec"
browsercmd = "links %s"                            # how to launch a browser

# stuff for the index file format
wordmarker = '<B>'                                 # where an index term starts
urlmarker = '  <A REL='                            # marker for index URLs
urlstart = '../Body/'                              # how the URL starts

def startswith(seq, prefix):
    """Returns true if seq starts with prefix.

    Like string.startswith in Python 2.x."""
    return seq[:len(prefix)] == prefix

def html_unesc(ss):
    """Convert HTML to ordinary ASCII text by removing entities."""
    return string.replace(string.replace(string.replace(string.replace(
        ss,
        '&lt;', '<'),
                                                        '&gt;', '>'),
                                         '&quot;', '"'),
                          '&amp;', '&')
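(An aside, not part of the script: later Pythons grew a stdlib routine
for this.  The sketch below assumes Python 3.4 or newer, where the
'html' module exists; the function name html_unesc_modern is mine.)

```python
import html

def html_unesc_modern(ss):
    """Convert HTML entities back to plain text, via the stdlib.

    html.unescape handles the same four entities as html_unesc above,
    plus all the others, and gets the &amp;-must-go-last ordering
    right for free.
    """
    return html.unescape(ss)
```

For example, html_unesc_modern('&amp;rest') gives '&rest', which is
exactly the case the comment in build_index cares about.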

def build_index(topdir, dbfilename):
    """Munge the HTML of the HyperSpec index to make a list of index terms."""
    print "Building index of %s in %s" % (topdir, dbfilename)
    files = map(lambda x, topdir=topdir:
                '%s/Front/X_Mast_%s.htm' % (topdir, x),
                map(chr, range(ord('A'), ord('Z') + 1)) + ['9'])
    dbfile = bsddb.btopen(dbfilename, 'c')
    try:
        file = None
        for filename in files:
            file = urllib.urlopen(filename)
            try:
                word = urltail = None
                for line in file.readlines():
                    if startswith(line, wordmarker):
                        ws = len(wordmarker)
                        # this will die if there's no ending < on the line
                        # html_unesc here is so things like &rest will work.
                        word = string.lower(
                            html_unesc(line[ws:string.index(line, '<', ws)]))
                    elif startswith(line, urlmarker):
                        if word is None:
                            raise "URL before a word"
                        # this part here should *really* use a regex!
                        start = len(urlmarker)
                        # this will die if the expected strings aren't found
                        us = (string.index(line, urlstart, start) +
                              len(urlstart))
                        ue = string.index(line, '">', us)     # url end
                        urltail = line[us:ue]
                        subtermend = string.index(line, '<', ue)
                        subterm = string.lower(
                            html_unesc(line[ue+2:subtermend]))
                        wholeterm = "%s -- %s" % (word, subterm)
                        dbfile[wholeterm] = "Body/%s" % urltail
                if word is None:
                    raise "No words in %s" % filename
                elif urltail is None:
                    raise "No urltails in %s" % filename
            finally:
                file.close()
        if file is None:
            raise "Couldn't open any files"
    except:
        dbfile.close()
        os.unlink(dbfilename)
        raise
    else:
        dbfile.close()
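(An aside: the comment in build_index says the URL-line parsing should
really use a regex.  Here's a sketch of what that might look like,
assuming the line format implied by the markers above: an <A REL=...>
tag whose HREF contains '../Body/', followed by the subterm text.  The
function name parse_url_line and the exact pattern are mine, inferred
from the string-index code, not from the HyperSpec itself.)

```python
import re

# One regex replaces the three string.index calls and their
# this-will-die-if-not-found failure mode with a clean None.
url_line_re = re.compile(r'<A REL=.*?\.\./Body/(?P<urltail>[^"]+)">(?P<subterm>[^<]*)<')

def parse_url_line(line):
    """Return (urltail, subterm) from an index URL line, or None on no match."""
    m = url_line_re.search(line)
    if m is None:
        return None
    return m.group('urltail'), m.group('subterm')
```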

def get_index(topdir):
    """Return an open Berkeley DB file containing an index of topdir.

    topdir is a URL containing the HyperSpec, preferably on your local
    filesystem.

    The index is created if necessary; no check is made to see if it's
    out of date.

    """
    scheme, host, path, _, _, _ = urlparse.urlparse(topdir)
    if scheme in ['', 'file']:
        # We have a local copy, so put the index in it...
        dir = path
    else:
        # store it in home dir or, failing that, root dir
        dir = os.environ.get('HOME', '')
    dbfilename = "%s/hsindex.db" % dir
    # No up-to-date check.  Delete the index yourself if you update
    # the HyperSpec.  urllib is too primitive to tell us how old a file is...
    if not os.path.exists(dbfilename): build_index(topdir, dbfilename)
    return bsddb.btopen(dbfilename)

def getwords(term):
    """Return the list of indexed terms that match the requested term."""
    term = string.lower(term)
    dbfile = get_index(topdir)
    try:
        try:
            # stupidity: set_location('') breaks
            if term != '': key, value = dbfile.set_location(term)
            else: key, value = dbfile.first()
        except KeyError:
            # more stupidity: set_location to something past the end gives
            # a KeyError
            return []
        rv = []
        while 1:
            # special case: complete, but not unique
            # (no longer useful since addition of subheads)
            if key == term: return [(key, value)]
            if not startswith(key, term): return rv
            rv.append((key, value))
            try:
                # still more stupidity: next() off the end of the file gives
                # a KeyError
                key, value = dbfile.next()
            except KeyError: return rv
    finally:
        dbfile.close()
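(An aside: to show what the loop above would look like without the
KeyError dance, here's the same prefix scan written against a plain
sorted list of (key, value) pairs standing in for the btree.
bisect_left does what set_location does, minus the exceptions; the
name prefix_matches is mine.)

```python
import bisect

def prefix_matches(sorted_pairs, term):
    """Return the (key, value) pairs whose keys start with term.

    An exact key match wins outright, mirroring the
    complete-but-not-unique special case in getwords.
    """
    keys = [k for k, v in sorted_pairs]
    # Find the first key >= term; everything matching the prefix
    # follows contiguously, since the pairs are sorted.
    i = bisect.bisect_left(keys, term)
    rv = []
    while i < len(keys) and keys[i].startswith(term):
        if keys[i] == term:
            return [sorted_pairs[i]]
        rv.append(sorted_pairs[i])
        i = i + 1
    return rv
```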

def getwords_fancy(term):
    """Like getwords, but sometimes more selective.

    If the specified term is a main index item, return only the items
    found under that item, not all the main index items that start with it.

    This way, things like 'map' and 'handle' send you to the (single)
    appropriate item instead of giving you a list of possibilities:
    map, mapcar, mapcan, etc., or handle, handler, handler-bind, etc.

    """
    rv = getwords(term)
    exact_term_matches = []
    wanted = "%s -- " % term
    for found in rv:
        if startswith(found[0], wanted):
            exact_term_matches.append(found)
    if len(exact_term_matches) > 0:
        return exact_term_matches
    else:
        return rv

def shrepr(ss):
    """Shell representation of a string.

    Works on Unix, but probably not elsewhere.
    Returns a string which, if fed to a shell, will produce a sequence of
    arguments which, when rejoined by spaces, produces the original string.

    """
    rv = []
    needquotes = 0
    lastcharwsp = 1
    safechars = string.uppercase + string.lowercase + string.digits + ' :-,/'
    for char in str(ss):
        if char not in safechars:
            needquotes = 1
            # The only way to put "don't" in a single-quoted csh string
            # is 'don'\''t'.  sh is saner.
            rv.append("'\\%s'" % char)
        else:
            rv.append(char)
        if char in string.whitespace:
            if lastcharwsp: needquotes = 1
        lastcharwsp = (char in string.whitespace)
    rvs = string.join(rv, '')
    if needquotes: rvs = "'%s'" % rvs
    return rvs
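(An aside: later Pythons put this in the stdlib too.  The sketch below
assumes Python 3.3 or newer, where shlex.quote exists; it does the
same single-quote-and-escape dance for POSIX shells, and the name
shrepr_modern is mine.)

```python
import shlex

def shrepr_modern(ss):
    """Quote a string for a POSIX shell, via the stdlib."""
    # shlex.quote passes safe strings through untouched and wraps
    # everything else in single quotes, handling the embedded-quote
    # case the csh comment above describes.
    return shlex.quote(str(ss))
```

For example, shrepr_modern('mapcar') comes back unquoted, while
shrepr_modern("don't") gets the quote-escape-quote treatment.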
        

term = string.join(sys.argv[1:], ' ')
words = getwords_fancy(term)
me = os.path.basename(sys.argv[0])
if words == []:
    print "%s: No matches for '%s'" % (me, term)
    sys.exit(1)
elif len(words) == 1:
    sys.exit(os.system(browsercmd % "%s/%s" % (topdir, words[0][1])))
else:
    print "%s: '%s' is ambiguous; possibilities follow:" % (me, term)
    for word in words:
        # Stuff the user can cut and paste into their shell.
        print " %s   %s" % (me, shrepr(word[0]))
    sys.exit(1)


