Re: sgmlop: malformed charrefs?

2005-03-17 Thread Fredrik Lundh
Magnus Lie Hetland wrote:
 According to The Sgmlop Module Handbook [1], the handle_entityref()
 callback is called for malformed character entities. What does that
 mean, exactly? What is a malformed character entity? I've tried
 mis-spelling them (e.g., dropping the semicolon), but then they're
 (quite naturally) treated as text/data, with handle_data(). I've tried
 to use a number that is too great, or (equivalently, it turns out) to
 use names instead of numbers, such as &#foo;. In these cases, I only
 get an exception, because the number is too high...

 So -- how can I produce a malformed character entity?

with sgmlop 1.1, the following script

import sgmlop

class entity_handler:
    # called with the text between "&" and ";" of each entity reference
    def handle_entityref(self, entityref):
        print "ENTITY", repr(entityref)

parser = sgmlop.XMLParser()
parser.register(entity_handler())
parser.feed("&-10;&/()=?;")

prints:

ENTITY '-10'
ENTITY '/()=?'

 And another thing... For the case where a numeric reference is too
 high (i.e. it can't be translated into a Unicode character) -- is it
 possible to ignore it (or replace it, as with encode/decode)?

if you don't do anything, it is ignored.

if you specify a handle_charref hook, the part between # and ; is passed
to that method.

if you have a handle_entityref hook, but no handle_charref, the part between
& and ; is passed to handle_entityref.
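
The three rules above can be sketched in plain Python 3. This does not use sgmlop; it only models the dispatch order as described in this message, and the names dispatch_charref and Recorder are hypothetical.

```python
class Recorder:
    """Hypothetical handler that defines handle_entityref only."""

    def __init__(self):
        self.seen = []

    def handle_entityref(self, name):
        self.seen.append(name)


def dispatch_charref(handler, body):
    """Route `body`, the text between '&#' and ';', per the rules above."""
    if hasattr(handler, "handle_charref"):
        # Rule 2: the part between '#' and ';' goes to handle_charref.
        handler.handle_charref(body)
        return "charref"
    if hasattr(handler, "handle_entityref"):
        # Rule 3: the part between '&' and ';' (so including the '#')
        # goes to handle_entityref.
        handler.handle_entityref("#" + body)
        return "entityref"
    # Rule 1: no suitable hook, so the reference is ignored.
    return "ignored"


recorder = Recorder()
result = dispatch_charref(recorder, "9999999999")
print(result, recorder.seen)
```

Whether a particular sgmlop build follows these rules is a separate question; the sketch only restates them.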

/F 



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: sgmlop: malformed charrefs?

2005-03-17 Thread Magnus Lie Hetland
In article [EMAIL PROTECTED],
Fredrik Lundh wrote:
Magnus Lie Hetland wrote:
[snip]
with sgmlop 1.1, the following script

import sgmlop

class entity_handler:
    # called with the text between "&" and ";" of each entity reference
    def handle_entityref(self, entityref):
        print "ENTITY", repr(entityref)

parser = sgmlop.XMLParser()
parser.register(entity_handler())
parser.feed("&-10;&/()=?;")

prints:

ENTITY '-10'
ENTITY '/()=?'

OK, thanks. I guess I just wasn't creative enough in my entity naming
:)

 And another thing... For the case where a numeric reference is too
 high (i.e. it can't be translated into a Unicode character) -- is it
 possible to ignore it (or replace it, as with encode/decode)?

if you don't do anything, it is ignored.

if you specify a handle_charref hook, the part between # and ; is passed
to that method.

I see -- it's just if the default behaviour of transforming it to text
kicks in that there is trouble? (That makes sense, of course.)
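
A handle_charref hook could do the conversion itself and substitute a replacement character instead of raising. A minimal sketch in Python 3, independent of sgmlop: the function name charref_to_text and the U+FFFD fallback are illustrative choices, and it assumes hexadecimal references arrive with a leading "x".

```python
import sys


def charref_to_text(ref, replacement="\ufffd"):
    """Convert the text between '&#' and ';' to a character.

    Returns `replacement` when the reference is not a number or lies
    outside the range of valid code points.
    """
    try:
        if ref[:1] in ("x", "X"):
            code = int(ref[1:], 16)   # hexadecimal form, e.g. 'x63'
        else:
            code = int(ref)           # decimal form, e.g. '99'
    except ValueError:
        return replacement            # not a number at all, e.g. 'foo'
    if 0 <= code <= sys.maxunicode:
        return chr(code)
    return replacement                # too large (or negative)


print(repr(charref_to_text("99")))
print(repr(charref_to_text("9999999999")))
```

A handler's handle_charref method could call this and pass the result on as ordinary data.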

if you have a handle_entityref hook, but no handle_charref, the part between
& and ; is passed to handle_entityref.

Strange. It doesn't seem to work that way for me... Here is an example:

..
from xml.parsers.sgmlop import SGMLParser, XMLParser, XMLUnicodeParser

class Handler:

    def handle_data(self, data):
        print 'DATA', data

    def handle_entityref(self, data):
        print 'ENTITY', data

for parser in [SGMLParser(), XMLParser(), XMLUnicodeParser()]:
    parser.register(Handler())
    try:
        # far beyond sys.maxunicode; wraps to 0x540be3ff in 32 bits
        parser.feed('&#9999999999;')
    except Exception, e:
        print e
..

When I run this, I get:

character reference #x540be3ff; exceeds ASCII range
character reference #x540be3ff; exceeds ASCII range
character reference #x540be3ff; exceeds sys.maxunicode (0x)

If I remove the handle_data, nothing happens.
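
The 0x540be3ff in these messages is consistent with a decimal reference such as 9999999999 overflowing a 32-bit integer before the range check. That input value is an assumption, but the arithmetic can be checked in plain Python 3:

```python
value = 9999999999            # assumed ten-digit decimal reference
wrapped = value & 0xFFFFFFFF  # keep only the low 32 bits
print(hex(wrapped))           # 0x540be3ff
```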

/F 

-- 
Magnus Lie Hetland   Time flies like the wind. Fruit flies
http://hetland.org   like bananas. -- Groucho Marx


Re: sgmlop: malformed charrefs?

2005-03-17 Thread Fredrik Lundh
Magnus Lie Hetland wrote:

if you have a handle_entityref hook, but no handle_charref, the part between
& and ; is passed to handle_entityref.

 Strange. It doesn't seem to work that way for me... Here is an example:

 from xml.parsers.sgmlop import SGMLParser, XMLParser, XMLUnicodeParser

are the PyXML folks shipping the latest sgmlop?  I'm pretty sure they've
forked the code (there's no UnicodeParser in the effbot.org edition), and
I have no idea how things work in the fork.

/F 





Re: sgmlop: malformed charrefs?

2005-03-17 Thread Magnus Lie Hetland
In article [EMAIL PROTECTED],
Fredrik Lundh wrote:
[snip]
are the PyXML folks shipping the latest sgmlop?

I don't know. The last history entry marked fl is from 2000-07-05...

Perhaps I should just get the effbot version. (And perhaps file a bug
report about this behaviour in PyXML.)

 I'm pretty sure they've forked the code (there's no UnicodeParser in
 the effbot.org edition),

Does it deal with Unicode at all? I.e., can I, for example, feed it a
Unicode object?

 and I have no idea how things work in the fork.

I see.

-- 
Magnus Lie Hetland   Time flies like the wind. Fruit flies
http://hetland.org   like bananas. -- Groucho Marx


Re: sgmlop: malformed charrefs?

2005-03-17 Thread Martin v. Löwis
Fredrik Lundh wrote:
 are the PyXML folks shipping the latest sgmlop?  I'm pretty sure they've
 forked the code (there's no UnicodeParser in the effbot.org edition), and
 I have no idea how things work in the fork.

As we've forked the code, the answer is a clear yes :-) It certainly
is the latest release of the fork.

Regards,
Martin


Re: sgmlop: malformed charrefs?

2005-03-17 Thread Fredrik Lundh
Martin v. Löwis wrote:

 are the PyXML folks shipping the latest sgmlop?  I'm pretty sure they've
 forked the code (there's no UnicodeParser in the effbot.org edition), and
 I have no idea how things work in the fork.

 As we've forked the code, the answer is a clear yes :-) It certainly
 is the latest release of the fork.

if the 2000-07-05 date is correct, there have been at least eight public releases
of the original sgmlop distribution since the fork.

/F 


