Re: sgmllib parser keeps old tag data?
On Fri, Feb 13, 2009 at 03:41:52PM +, MRAB wrote:
> Berend van Berkum wrote:
> > Yes.. tested that and SGMLParser won't let me override __init__,
> > (SGMLParser vars are uninitialized even with sgmllib.SGMLParser(self) call).
>
> OK, so SGMLParser needs to be initialised:
>
>     def __init__(self):
>         sgmllib.SGMLParser.__init__(self)
>         self.content = ''
>         self.markup = []
>         self.span_stack = []

Argh.. I forgot about the old-style supercall syntax. I used sgmllib.SGMLParser(self) and did not investigate the error. Gonna be ashamed for a while now..

--
web, http://dotmpe.com
email, berend.van.ber...@gmail.com
icq, 26727647; irc, berend/mpe at irc.oftc.net

-- http://mail.python.org/mailman/listinfo/python-list
sgmllib parser keeps old tag data?
Hi everyone,

I read the source and made numerous tests, but SGMLParser keeps returning *tag* data from previous parser instances. I'm totally confused why.. The content data it returns is ok.

E.g.::

    sp = MyParser()
    sp.feed('<test><t />Test</test>')
    print sp.content, sp.markup
    sp.close()

    sp = MyParser()
    sp.feed('<xml>\n</xml>\r\n')
    print sp.content, sp.markup
    sp.close()

gives::

    ('Test', [{'t': ({}, (0, 0))}, {'test': ({}, (0, 4))}])
    ('\n\r\n', [{'t': ({}, (0, 0))}, {'test': ({}, (0, 4))}, {'xml': ({}, (0, 1))}])

It keeps the tags from the previous session, while I'm sure the stack etc. should be clean.. Any ideas?

regards, Berend

----

    import sgmllib


    class MyParser(sgmllib.SGMLParser):

        content = ''
        markup = []
        span_stack = []

        def handle_data(self, data):
            self.content += data

        def unknown_starttag(self, tag, attr):
            stack = { tag: ( dict(attr), ( len(self.content), ) ) }
            self.span_stack.append(stack)

        def unknown_endtag(self, tag):
            prev_tag, ( attr, ( offset, ) ) = self.span_stack.pop().items()[0]

            if tag:
                # close all tags on stack until it finds a matching end tag
                # XXX: need to return to LEVEL, not same tag name
                while tag != prev_tag:
                    span = { prev_tag: ( attr, ( offset, 0 ) ) }
                    self.markup.append( span )

                    prev_tag, ( attr, ( offset, ) ) = self.span_stack.pop().items()[0]

                length = len( self.content ) - offset
                span = { tag: ( attr, ( offset, length ) ) }
                self.markup.append( span )

        def do_unknown_tag(self, tag, attr):
            assert not tag and not attr, "do_unknown_tag %s, %s" % (tag, attr)

        def close(self):
            sgmllib.SGMLParser.close(self)
            self.content = ''
            self.markup = []
            self.span_stack = []


    def parse_data(data):
        sp = MyParser()
        sp.feed(data)
        r = sp.content, sp.markup
        sp.close()
        return r

    print parse_data('<test><t />Test</test>')
    print parse_data('<xml>\n</xml>\r\n')
    print parse_data('<sgml><s>Test 3</s></sgml>')
Re: sgmllib parser keeps old tag data?
Berend van Berkum wrote:
> import sgmllib
>
> class MyParser(sgmllib.SGMLParser):
>
>     content = ''
>     markup = []
>     span_stack = []

These are in the _class_ itself, so they will be shared by all its instances. You should do something like this instead:

    def __init__(self):
        self.content = ''
        self.markup = []
        self.span_stack = []
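The sharing described above can be reproduced without sgmllib. The following is a minimal sketch in Python 3 (sgmllib itself is Python 2 only); the class names `SharedState` and `PerInstanceState` and the helper `run` are made up for illustration and do not come from the thread. It also suggests why the content looked fine while the tags leaked: `+=` on a string binds a new object on the instance, whereas `append` changes the one class-level list in place.

```python
class SharedState:
    # Class-level attributes: one object each, shared by every instance.
    content = ''
    markup = []

    def add(self, text, tag):
        # str is immutable: += builds a new string and binds it on the instance.
        self.content += text
        # list is mutable: append() changes the single class-level list in place.
        self.markup.append(tag)


class PerInstanceState:
    def __init__(self):
        # Fresh objects are created for each instance.
        self.content = ''
        self.markup = []

    def add(self, text, tag):
        self.content += text
        self.markup.append(tag)


def run(cls):
    """Feed two separate instances and return the second one's state."""
    first = cls()
    first.add('Test', 'test')
    second = cls()
    second.add('\n', 'xml')
    return second.content, second.markup


shared_result = run(SharedState)
fixed_result = run(PerInstanceState)
print(shared_result)  # the second instance still sees the first one's tag
print(fixed_result)   # the second instance sees only its own tag
```

With the class-level list, the second instance reports both tags; with the list created in `__init__`, it reports only its own.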
Re: sgmllib parser keeps old tag data?
On Fri, Feb 13, 2009 at 02:31:40PM +, MRAB wrote:
> Berend van Berkum wrote:
> > import sgmllib
> >
> > class MyParser(sgmllib.SGMLParser):
> >
> >     content = ''
> >     markup = []
> >     span_stack = []
>
> These are in the _class_ itself, so they will be shared by all its
> instances. You should so something like this instead:
>
>     def __init__(self):
>         self.content = ''
>         self.markup = []
>         self.span_stack = []

Yes.. tested that and SGMLParser won't let me override __init__ (SGMLParser vars are uninitialized even with an sgmllib.SGMLParser(self) call).

I tried a few things, but not the following until now: with a differently named init function and one boolean class var 'initialized', each handler can check 'if self.initialized' first. Does the trick.

Confusion dissolved :) thanks.
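The message does not show the workaround's code, so the following is only a guess at its shape, written in Python 3 without sgmllib. The class name `LazyStateHandler`, the helper `_ensure_state`, and the handler `handle_tag` are hypothetical. The idea sketched here is that a boolean flag is safe to keep at class level because assigning to it rebinds on the instance, and each handler builds per-instance state on first use.

```python
class LazyStateHandler:
    # Class-level flag only; assigning to it later binds on the instance.
    initialized = False

    def _ensure_state(self):
        # Hypothetical helper: build per-instance state on first use.
        if not self.initialized:
            self.content = ''
            self.markup = []
            self.initialized = True  # instance attribute, class flag unchanged

    def handle_data(self, data):
        self._ensure_state()
        self.content += data

    def handle_tag(self, tag):
        self._ensure_state()
        self.markup.append(tag)


a = LazyStateHandler()
a.handle_tag('test')
a.handle_data('Test')

b = LazyStateHandler()
b.handle_tag('xml')

print(a.content, a.markup)
print(b.content, b.markup)
```

This avoids overriding `__init__` at all, at the cost of a check in every handler; calling the base initialiser explicitly is the more conventional route.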
Re: sgmllib parser keeps old tag data?
Berend van Berkum wrote:
> On Fri, Feb 13, 2009 at 02:31:40PM +, MRAB wrote:
> > Berend van Berkum wrote:
> > > import sgmllib
> > >
> > > class MyParser(sgmllib.SGMLParser):
> > >
> > >     content = ''
> > >     markup = []
> > >     span_stack = []
> >
> > These are in the _class_ itself, so they will be shared by all its
> > instances. You should so something like this instead:
> >
> >     def __init__(self):
> >         self.content = ''
> >         self.markup = []
> >         self.span_stack = []
>
> Yes.. tested that and SGMLParser won't let me override __init__,
> (SGMLParser vars are uninitialized even with sgmllib.SGMLParser(self) call).

OK, so SGMLParser needs to be initialised:

    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.content = ''
        self.markup = []
        self.span_stack = []

> Tried some but not the following:
> with a differently named init function and one boolean class var 'initialized'
> it can check 'if self.initialized' in front of each handler. Does the trick.
>
> Confusion dissolved :)
> thanks.
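The difference between `sgmllib.SGMLParser(self)` and `sgmllib.SGMLParser.__init__(self)` can be illustrated with a stand-in base class. This is a sketch in Python 3, not sgmllib itself: `Base`, `Wrong`, and `Right` are invented names, and the `verbose=0` parameter and `rawdata` attribute are assumptions chosen to resemble a parser base class.

```python
class Base:
    """Hypothetical stand-in for a parser base class (not sgmllib itself)."""

    def __init__(self, verbose=0):
        self.verbose = verbose
        self.rawdata = ''

    def feed(self, data):
        self.rawdata += data


class Wrong(Base):
    def __init__(self):
        # Creates and discards a *new* Base object, with self passed as
        # the verbose argument; self itself is never initialised.
        Base(self)
        self.markup = []


class Right(Base):
    def __init__(self):
        # Runs the base initialiser on this very instance.
        Base.__init__(self)
        self.markup = []


right = Right()
right.feed('<xml>')

try:
    Wrong().feed('<xml>')
    wrong_error = None
except AttributeError as exc:
    # rawdata was never set on the Wrong instance.
    wrong_error = exc

print(right.rawdata)
print(type(wrong_error).__name__)
```

In Python 3 the same call is usually spelled `super().__init__()`; the explicit `Base.__init__(self)` form is kept here because it is the one used in the thread.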
Re: sgmllib parser keeps old tag data?
You are declaring class variables, not instance variables. You need to declare these in an __init__ method. RTFM. http://docs.python.org/tutorial/classes.html#a-first-look-at-classes

Berend van Berkum wrote:
> class MyParser(sgmllib.SGMLParser):
>
>     content = ''
>     markup = []
>     span_stack = []
Re: sgmllib parser keeps old tag data?
Sorry, this reply was delayed (trying to use usenet...) and so now seems (even more) bad-tempered than needed.

Andrew

andrew cooke wrote:
> you are declaring class variables, not instance variables. you need to
> declare these in an __init__ method. RTFM.
> http://docs.python.org/tutorial/classes.html#a-first-look-at-classes