I've just checked in a lot of changes to the parser; see below for a list.
The most visible user changes are more informative status messages,
the ability to handle WIDTH and HEIGHT attributes in IMG tags, the
inclusion of my suggested UNICODE_CHAR function code, and the ability
to get indented text paragraphs rather than vspace-separated ones.

Note that the SECTION attribute also works for images (I'm not sure it
did previously). This allows you to take just a section of an image,
rather than the whole thing, by specifying
section="WIDTHxHEIGHT+ULX+ULY".
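
As a rough illustration of the section syntax, here is a minimal sketch of how such a string could be turned into a crop box. The helper name `parse_section` is hypothetical and not taken from the parser code, and it assumes ULX and ULY mean the x and y pixel coordinates of the section's upper-left corner.

```python
import re

# Hypothetical helper, not the parser's actual code.
# Assumes the form WIDTHxHEIGHT+ULX+ULY with non-negative integers.
_SECTION_RE = re.compile(r"^(\d+)x(\d+)\+(\d+)\+(\d+)$")


def parse_section(spec):
    """Return (left, upper, right, lower) for a section spec, or None if malformed."""
    match = _SECTION_RE.match(spec.strip())
    if match is None:
        return None
    width, height, ulx, uly = (int(group) for group in match.groups())
    return (ulx, uly, ulx + width, uly + height)
```

With this sketch, section="100x50+10+20" would select the box from (10, 20) to (110, 70).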

Internally, I've removed the use of URLs to bind pages together,
replacing them with registered PluckerDocument ids. This means that
the same URL can be processed more than once if necessary, with
different attributes (think of a small image and its ALT_MAX*
version). I've removed calls to 'print' and 'sys.stderr.write',
replacing them with new functions 'message' and 'error'.

Bill

General: Changed all exceptions to class-style.
  Replaced many 'print' and 'sys.stderr.write' messages with calls to
    'UtilFns.message' and 'UtilFns.error'.
  Removed all calls to Profiling and obsoleted Profiling.py.
  Changed status messages to be more informative and regular.
  Added parameter 'status_line_length' to control "squeezing" of URLs on
    status lines -- defaults to the original 60.
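
To give an idea of what "squeezing" a URL to fit a status line might look like, here is a small sketch. The function name `squeeze` and the middle-elision strategy are my own assumptions for illustration; only the 'status_line_length' parameter and its default of 60 come from the change list, and the real code may shorten URLs differently.

```python
def squeeze(url, status_line_length=60):
    """Shorten url to at most status_line_length characters by eliding the middle.

    Hypothetical sketch; the actual squeezing rule in the parser may differ.
    """
    if len(url) <= status_line_length:
        return url
    marker = "..."
    keep = status_line_length - len(marker)
    if keep <= 0:
        # Too short to fit the marker at all; just truncate.
        return url[:status_line_length]
    head = (keep + 1) // 2
    tail = keep - head
    # Guard against tail == 0, since url[-0:] would return the whole string.
    return url[:head] + marker + (url[-tail:] if tail else "")
```

Keeping both ends of the URL means the host and the file name usually stay visible, which tends to be what you want on a status line.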
ImageParser.py: Fixed PNM header format pattern to handle monochrome images.
  Now handles valid 'nn%' patterns in WIDTH and HEIGHT attributes (badly).
  Now handles 'related images' -- alt versions of included images.
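
For the 'nn%' case, one plausible way to resolve a WIDTH or HEIGHT value is sketched below. This is not the code in ImageParser.py; the name `resolve_dimension` and the idea of scaling against an `available` pixel count are assumptions made for the example.

```python
import re

# Matches values such as "50%" or " 75 % ".
_PERCENT_RE = re.compile(r"^\s*(\d+)\s*%\s*$")


def resolve_dimension(value, available):
    """Resolve a WIDTH/HEIGHT attribute string to pixels, or None if unusable.

    Hypothetical sketch: 'nn%' is taken relative to `available` pixels,
    and a plain integer is taken as an absolute pixel count.
    """
    match = _PERCENT_RE.match(value)
    if match:
        return available * int(match.group(1)) // 100
    try:
        return int(value)
    except ValueError:
        return None
```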
PluckerDocs.py: Improved document registration -- now also used for links.
  Added support for UNICODE_CHAR function code with alternate text.
  Added get_documents() method to PluckerDocument.
Profiling.py: Removed all calls to contained methods, and emptied module of code.
Spider.py: Stopped putting bad parent attributes in child SpiderLinks.
  Added notion of page 'key' string, which combines the URL with the
    attributes.
  Added _queue_keys attribute to Spider queue to keep track of page keys.
  Now uses keys to identify pages, so that an image file may be parsed more
    than once, with different attributes.
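
The page-key idea can be sketched roughly as follows. The `make_page_key` function, the `KeyedQueue` class, and the key layout are hypothetical; only the notion of a key combining URL and attributes, and the attribute name `_queue_keys`, come from the change list. Whether the real `_queue_keys` is a set is also an assumption.

```python
def make_page_key(url, attributes):
    """Combine a URL and its attributes into one string key.

    Hypothetical layout: attributes are sorted so that the same set of
    attributes always produces the same key.
    """
    parts = ["%s=%s" % (name, attributes[name]) for name in sorted(attributes)]
    if not parts:
        return url
    return url + "|" + "&".join(parts)


class KeyedQueue:
    """Queue that rejects a page only if the same URL+attributes was already queued."""

    def __init__(self):
        self._queue = []
        self._queue_keys = set()  # assumed to be a set of key strings

    def add(self, url, attributes):
        key = make_page_key(url, attributes)
        if key in self._queue_keys:
            return False
        self._queue_keys.add(key)
        self._queue.append((url, dict(attributes)))
        return True
```

With URL-only identity, the second request for an image would be dropped as a duplicate; with keys, the same image URL can be queued once per distinct attribute set.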
TextParser.py: Moved HTML color names to hash table.
  Added support for UNICODE_CHAR function code with alternate text.
  Added support for boolean config option 'indent_paragraphs'.
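
As an illustration of the color-name hash table, here is a sketch using the sixteen color names defined by HTML 4. The names `HTML_COLORS` and `parse_color` are hypothetical, and the table in TextParser.py may hold a different set of names or store the values in another form.

```python
import re

# The sixteen HTML 4 color names and their hex values.
HTML_COLORS = {
    "aqua": "#00ffff", "black": "#000000", "blue": "#0000ff",
    "fuchsia": "#ff00ff", "gray": "#808080", "green": "#008000",
    "lime": "#00ff00", "maroon": "#800000", "navy": "#000080",
    "olive": "#808000", "purple": "#800080", "red": "#ff0000",
    "silver": "#c0c0c0", "teal": "#008080", "white": "#ffffff",
    "yellow": "#ffff00",
}

_HEX_RE = re.compile(r"#[0-9a-f]{6}")


def parse_color(value):
    """Return an (r, g, b) tuple for a color name or '#rrggbb' string, else None."""
    value = value.strip().lower()
    value = HTML_COLORS.get(value, value)  # one dictionary lookup replaces a chain of tests
    if _HEX_RE.fullmatch(value) is None:
        return None
    rgb = int(value[1:], 16)
    return ((rgb >> 16) & 0xFF, (rgb >> 8) & 0xFF, rgb & 0xFF)
```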
Writer.py: Fixed some bugs in Mapper class.
UtilFns.py: Added 'show_exception', 'message', and 'error' functions. See the
  code for doc strings.
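
For readers who have not looked at the code yet, the general shape of such helpers might be something like the sketch below. The signatures, the output streams, and the "Error: " prefix are all assumptions for illustration; the doc strings in UtilFns.py are the authority on how the real functions behave.

```python
import sys


def _format(fmt, args):
    # Apply printf-style formatting only when arguments were supplied.
    return fmt % args if args else fmt


def message(fmt, *args):
    """Write a status line to stdout (assumed stream)."""
    sys.stdout.write(_format(fmt, args) + "\n")


def error(fmt, *args):
    """Write an error line to stderr, with an assumed 'Error: ' prefix."""
    sys.stderr.write("Error: " + _format(fmt, args) + "\n")
```

Routing all output through two functions like these makes it possible to change where status and error text goes in one place, instead of hunting down scattered 'print' and 'sys.stderr.write' calls.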