Re: sgmlop: malformed charrefs?
Magnus Lie Hetland wrote:

> According to The Sgmlop Module Handbook [1], the handle_entityref()
> callback is called for malformed character entities. What does that
> mean, exactly? What is a malformed character entity? I've tried
> misspelling them (e.g., dropping the semicolon), but then they're
> (quite naturally) treated as text/data, with handle_data(). I've
> tried using a number that is too great, or (equivalently, it turns
> out) using names instead of numbers, such as &#foo;. In these cases,
> I only get an exception, because the number is too high...
>
> So -- how can I produce a malformed character entity?

with sgmlop 1.1, the following script

    import sgmlop

    class entity_handler:
        def handle_entityref(self, entityref):
            print "ENTITY", repr(entityref)

    parser = sgmlop.XMLParser()
    parser.register(entity_handler())
    parser.feed("&-10;&/()=?;")

prints:

    ENTITY '-10'
    ENTITY '/()=?'

> And another thing... For the case where a numeric reference is too
> high (i.e. it can't be translated into a Unicode character) -- is it
> possible to ignore it (or replace it, as with encode/decode)?

if you don't do anything, it is ignored.

if you specify a handle_charref hook, the part between &# and ; is
passed to that method.

if you have a handle_entityref hook, but no handle_charref, the part
between & and ; is passed to handle_entityref.

/F

--
http://mail.python.org/mailman/listinfo/python-list
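The three rules at the end of the message above can be modelled in a few lines. This is only a sketch of the hook selection as described in the thread, written in Python 3 so that it runs without sgmlop; it is not sgmlop code. The names `dispatch_references` and `_REF` are hypothetical, and the sketch deliberately leaves out the default conversion of well-formed character references to text.

```python
import re

# Hypothetical model of the hook selection described above.
# Not sgmlop code; sgmlop is not imported or required.
_REF = re.compile(r"&(#?)([^;&]*);")


def dispatch_references(text, handler):
    """Call the handler's hooks for each '&...;' reference in text."""
    for match in _REF.finditer(text):
        hash_mark, body = match.group(1), match.group(2)
        if hash_mark and hasattr(handler, "handle_charref"):
            # Rule 2: the part between '&#' and ';' goes to handle_charref.
            handler.handle_charref(body)
        elif hasattr(handler, "handle_entityref"):
            # Rule 3: the part between '&' and ';' goes to handle_entityref
            # (for a character reference this includes the leading '#').
            handler.handle_entityref(hash_mark + body)
        # Rule 1: with no suitable hook, the reference is ignored.


class EntityOnly:
    """Handler with handle_entityref but no handle_charref."""

    def __init__(self):
        self.seen = []

    def handle_entityref(self, entityref):
        self.seen.append(entityref)


handler = EntityOnly()
dispatch_references("&-10;&/()=?;", handler)
print(handler.seen)
```

Under this model a handler that defines only handle_entityref would receive the whole `#...` part of an out-of-range character reference, which is the behaviour the message describes for the effbot.org edition.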
Re: sgmlop: malformed charrefs?
Fredrik Lundh wrote:

> Magnus Lie Hetland wrote:
[snip]
> with sgmlop 1.1, the following script
>
>     import sgmlop
>
>     class entity_handler:
>         def handle_entityref(self, entityref):
>             print "ENTITY", repr(entityref)
>
>     parser = sgmlop.XMLParser()
>     parser.register(entity_handler())
>     parser.feed("&-10;&/()=?;")
>
> prints:
>
>     ENTITY '-10'
>     ENTITY '/()=?'

OK, thanks. I guess I just wasn't creative enough in my entity
naming :)

>> And another thing... For the case where a numeric reference is too
>> high (i.e. it can't be translated into a Unicode character) -- is it
>> possible to ignore it (or replace it, as with encode/decode)?
>
> if you don't do anything, it is ignored.
>
> if you specify a handle_charref hook, the part between &# and ; is
> passed to that method.

I see -- it's just if the default behaviour of transforming it to text
kicks in that there is trouble? (That makes sense, of course.)

> if you have a handle_entityref hook, but no handle_charref, the part
> between & and ; is passed to handle_entityref.

Strange. It doesn't seem to work that way for me... Here is an
example:

    from xml.parsers.sgmlop import SGMLParser, XMLParser, XMLUnicodeParser

    class Handler:

        def handle_data(self, data):
            print 'DATA', data

        def handle_entityref(self, data):
            print 'ENTITY', data

    for parser in [SGMLParser(), XMLParser(), XMLUnicodeParser()]:
        parser.register(Handler())
        try:
            parser.feed('&#99;')
        except Exception, e:
            print e

When I run this, I get:

    character reference #x540be3ff; exceeds ASCII range
    character reference #x540be3ff; exceeds ASCII range
    character reference #x540be3ff; exceeds sys.maxunicode (0x)

If I remove the handle_data, nothing happens.

> /F

--
Magnus Lie Hetland
http://hetland.org
Time flies like the wind. Fruit flies like bananas. -- Groucho Marx
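On the "ignore it or replace it, as with encode/decode" question: one way to get that behaviour is to do the translation yourself in a handle_charref hook. The following is a sketch under assumptions, in Python 3 so that it runs on its own. The function name `charref_to_text` and its `errors` parameter are hypothetical and only modelled on the codec error modes; it also assumes the hook receives exactly the text between &# and ;, as described in the thread, and that hexadecimal references arrive with a leading `x`.

```python
import sys

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def charref_to_text(ref, errors="replace"):
    """Translate the text between '&#' and ';' into a character.

    errors="replace" returns U+FFFD for an unusable reference,
    errors="ignore" returns an empty string, and
    errors="strict" re-raises the ValueError.
    """
    try:
        if ref[:1] in ("x", "X"):
            code = int(ref[1:], 16)   # hexadecimal form, e.g. "x63"
        else:
            code = int(ref, 10)       # decimal form, e.g. "99"
        if not 0 <= code <= sys.maxunicode:
            raise ValueError("character reference out of range: %r" % ref)
        return chr(code)
    except ValueError:
        if errors == "strict":
            raise
        return REPLACEMENT if errors == "replace" else ""


print(repr(charref_to_text("99")))
print(repr(charref_to_text("9999999999")))
print(repr(charref_to_text("foo", errors="ignore")))
```

A handler could call such a function from its own handle_charref method and pass the result on to whatever collects the text, so that the parser's default conversion is never reached.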
Re: sgmlop: malformed charrefs?
Magnus Lie Hetland wrote:

>> if you have a handle_entityref hook, but no handle_charref, the part
>> between & and ; is passed to handle_entityref.
>
> Strange. It doesn't seem to work that way for me... Here is an
> example:
>
>     from xml.parsers.sgmlop import SGMLParser, XMLParser, XMLUnicodeParser

are the PyXML folks shipping the latest sgmlop?

I'm pretty sure they've forked the code (there's no UnicodeParser in
the effbot.org edition), and I have no idea how things work in the
fork.

/F
Re: sgmlop: malformed charrefs?
Fredrik Lundh wrote:
[snip]
> are the PyXML folks shipping the latest sgmlop?

I don't know. The last history entry marked fl is from 2000-07-05...
Perhaps I should just get the effbot version. (And perhaps file a bug
report about this behaviour in PyXML.)

> I'm pretty sure they've forked the code (there's no UnicodeParser in
> the effbot.org edition),

Does it deal with Unicode at all? I.e., can I, for example, feed it a
Unicode object?

> and I have no idea how things work in the fork.

I see.

--
Magnus Lie Hetland
http://hetland.org
Time flies like the wind. Fruit flies like bananas. -- Groucho Marx
Re: sgmlop: malformed charrefs?
Fredrik Lundh wrote:

> are the PyXML folks shipping the latest sgmlop?
>
> I'm pretty sure they've forked the code (there's no UnicodeParser in
> the effbot.org edition), and I have no idea how things work in the
> fork.

As we've forked the code, the answer is a clear yes :-) It certainly
is the latest release of the fork.

Regards,
Martin
Re: sgmlop: malformed charrefs?
Martin v. Löwis wrote:

>> are the PyXML folks shipping the latest sgmlop?
>>
>> I'm pretty sure they've forked the code (there's no UnicodeParser in
>> the effbot.org edition), and I have no idea how things work in the
>> fork.
>
> As we've forked the code, the answer is a clear yes :-) It certainly
> is the latest release of the fork.

if the 2000-07-05 date is correct, there have been at least eight
public releases of the original sgmlop distribution since the fork.

/F