Re: sgmllib parser keeps old tag data?

2009-02-14 Thread Berend van Berkum

On Fri, Feb 13, 2009 at 03:41:52PM +, MRAB wrote:
> Berend van Berkum wrote:
> > Yes.. tested that and SGMLParser won't let me override __init__,
> > (SGMLParser vars are uninitialized even with sgmllib.SGMLParser(self)
> > call).
>
> OK, so SGMLParser needs to be initialised:
>
>     def __init__(self):
>         sgmllib.SGMLParser.__init__(self)
>         self.content = ''
>         self.markup = []
>         self.span_stack = []

Argh.. I forgot about the old-style superclass call syntax.
I used sgmllib.SGMLParser(self) and did not investigate the error.
Gonna be ashamed for a while now..
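
For readers hitting the same thing, here is a minimal sketch of the difference between the two calls. It uses a plain stand-in class instead of sgmllib; the names Base, Wrong and Right are hypothetical, and the optional `verbose` argument is an assumption meant to mirror how SGMLParser's initialiser is understood to accept one optional argument.

```python
class Base(object):
    """Hypothetical stand-in for a library base class such as SGMLParser."""

    def __init__(self, verbose=0):
        self.verbose = verbose
        self.rawdata = ''          # state that later methods rely on

    def feed(self, data):
        self.rawdata += data


class Wrong(Base):
    def __init__(self):
        # Builds a separate, throwaway Base object (with verbose=self);
        # this instance itself is never initialised by the base class.
        Base(self)
        self.content = ''


class Right(Base):
    def __init__(self):
        # Runs the base initialiser on this very instance.
        Base.__init__(self)
        self.content = ''


wrong = Wrong()
right = Right()

print(hasattr(wrong, 'rawdata'))   # False
print(hasattr(right, 'rawdata'))   # True
right.feed('abc')
```

Under these assumptions the wrong form raises nothing at construction time, which would fit the "vars are uninitialized" symptom: the failure only shows up later, when a method touches state the base initialiser was supposed to create.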

-- 
 web, http://dotmpe.com  ()ASCII Ribbon
 email, berend.van.ber...@gmail.com  /\
 icq, 26727647;  irc, berend/mpe at irc.oftc.net

--
http://mail.python.org/mailman/listinfo/python-list


sgmllib parser keeps old tag data?

2009-02-13 Thread Berend van Berkum


Hi everyone,

I read the source and ran numerous tests, but SGMLParser keeps returning *tag*
data from previous parser instances. I'm totally confused as to why.. The
content data it returns is OK.

E.g.::

    sp = MyParser()
    sp.feed('<test><t />Test</test>')
    print sp.content, sp.markup
    sp.close()

    sp = MyParser()
    sp.feed('<xml>\n</xml>\r\n')
    print sp.content, sp.markup
    sp.close()

gives::

    ('Test', [{'t': ({}, (0, 0))}, {'test': ({}, (0, 4))}])
    ('\n\r\n', [{'t': ({}, (0, 0))}, {'test': ({}, (0, 4))}, {'xml': ({}, (0, 1))}])

It keeps the tags from the previous session, while I'm sure the stack etc.
should be clean..

Any ideas?


regards, Berend

- 

import sgmllib


class MyParser(sgmllib.SGMLParser):

    content = ''
    markup = []
    span_stack = []

    def handle_data(self, data):
        self.content += data

    def unknown_starttag(self, tag, attr):
        stack = { tag: ( dict(attr), ( len(self.content), ) ) }
        self.span_stack.append(stack)

    def unknown_endtag(self, tag):
        prev_tag, ( attr, ( offset, ) ) = self.span_stack.pop().items()[0]

        if tag:
            # close all tags on stack until it finds a matching end tag
            # XXX: need to return to LEVEL, not same tag name
            while tag != prev_tag:
                span = { prev_tag: ( attr, ( offset, 0 ) ) }
                self.markup.append( span )

                prev_tag, ( attr, ( offset, ) ) = self.span_stack.pop().items()[0]

            length = len( self.content ) - offset
            span = { tag: ( attr, ( offset, length ) ) }
            self.markup.append( span )

    def do_unknown_tag(self, tag, attr):
        assert not tag and not attr, "do_unknown_tag %s, %s" % (tag, attr)

    def close(self):
        sgmllib.SGMLParser.close(self)
        self.content = ''
        self.markup = []
        self.span_stack = []


def parse_data(data):
    sp = MyParser()
    sp.feed(data)
    r = sp.content, sp.markup
    sp.close()
    return r

print parse_data('<test><t />Test</test>')
print parse_data('<xml>\n</xml>\r\n')
print parse_data('<sgml><s>Test 3</s></sgml>')





Re: sgmllib parser keeps old tag data?

2009-02-13 Thread MRAB

Berend van Berkum wrote:

> import sgmllib
>
>
> class MyParser(sgmllib.SGMLParser):
>
>     content = ''
>     markup = []
>     span_stack = []


These are in the _class_ itself, so they will be shared by all its
instances. You should do something like this instead:

    def __init__(self):
        self.content = ''
        self.markup = []
        self.span_stack = []
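
A minimal sketch of the effect described above, using plain classes rather than sgmllib (the names Shared, PerInstance and add are made up for illustration). It also suggests why the content looked fine while the markup leaked: `+=` on a string binds a new attribute on the instance, whereas `append` mutates the single list that lives on the class.

```python
class Shared(object):
    markup = []            # one list object, owned by the class
    content = ''           # immutable string; += rebinds on the instance

    def add(self, tag, data):
        self.markup.append(tag)   # mutates the class-level list in place
        self.content += data      # creates/updates an instance attribute


class PerInstance(object):
    def __init__(self):
        self.markup = []   # a fresh list for every instance
        self.content = ''

    def add(self, tag, data):
        self.markup.append(tag)
        self.content += data


a, b = Shared(), Shared()
a.add('test', 'Test')
b.add('xml', '\n')
print(b.markup)            # ['test', 'xml']

c, d = PerInstance(), PerInstance()
c.add('test', 'Test')
d.add('xml', '\n')
print(d.markup)            # ['xml']
```

Note that rebinding `self.markup = []` in a close() method does not undo the sharing: it only gives that one, already finished instance its own list, while the class-level list keeps its accumulated entries.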




Re: sgmllib parser keeps old tag data?

2009-02-13 Thread Berend van Berkum

On Fri, Feb 13, 2009 at 02:31:40PM +, MRAB wrote:
> Berend van Berkum wrote:
> > import sgmllib
> >
> >
> > class MyParser(sgmllib.SGMLParser):
> >
> >     content = ''
> >     markup = []
> >     span_stack = []
>
> These are in the _class_ itself, so they will be shared by all its
> instances. You should do something like this instead:
>
>     def __init__(self):
>         self.content = ''
>         self.markup = []
>         self.span_stack = []
 

Yes.. I tested that, and SGMLParser won't let me override __init__
(the SGMLParser vars are uninitialized even with a sgmllib.SGMLParser(self)
call). I had tried a few things, but not yet the following: with a
differently named init function and one boolean class var 'initialized',
each handler can check 'if self.initialized' first. That does the trick.
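
One possible reading of the workaround just described, sketched with a plain class instead of an SGMLParser subclass. The names LazyParser and setup are hypothetical; the post only mentions "a differently named init function" and a class-level boolean, so the details here are assumptions.

```python
class LazyParser(object):
    initialized = False          # class-level flag, False for every new instance

    def setup(self):             # hypothetical name for the separate init function
        self.content = ''
        self.markup = []
        self.initialized = True  # sets the flag on this instance only

    def handle_data(self, data):
        if not self.initialized:
            self.setup()
        self.content += data

    def unknown_starttag(self, tag, attr):
        if not self.initialized:
            self.setup()
        self.markup.append(tag)


p1 = LazyParser()
p1.handle_data('Test')
p1.unknown_starttag('t', [])

p2 = LazyParser()
p2.unknown_starttag('xml', [])

print(p1.markup)   # ['t']
print(p2.markup)   # ['xml']
```

This keeps state per instance, but every handler has to repeat the check, and reading `content` or `markup` before any handler has run raises AttributeError. Calling the base class initialiser from an overridden __init__, as suggested elsewhere in this thread, avoids both drawbacks.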

Confusion dissolved :)
thanks.



Re: sgmllib parser keeps old tag data?

2009-02-13 Thread MRAB

Berend van Berkum wrote:
> On Fri, Feb 13, 2009 at 02:31:40PM +, MRAB wrote:
> > Berend van Berkum wrote:
> > > import sgmllib
> > >
> > >
> > > class MyParser(sgmllib.SGMLParser):
> > >
> > >     content = ''
> > >     markup = []
> > >     span_stack = []
> >
> > These are in the _class_ itself, so they will be shared by all its
> > instances. You should do something like this instead:
> >
> >     def __init__(self):
> >         self.content = ''
> >         self.markup = []
> >         self.span_stack = []
>
> Yes.. tested that and SGMLParser won't let me override __init__,
> (SGMLParser vars are uninitialized even with sgmllib.SGMLParser(self) call).


OK, so SGMLParser needs to be initialised:

    def __init__(self):
        sgmllib.SGMLParser.__init__(self)
        self.content = ''
        self.markup = []
        self.span_stack = []


> Tried some but not the following:
> with a differently named init function and one boolean class var 'initialized'
> it can check 'if self.initialized' in front of each handler. Does the trick.
>
> Confusion dissolved :)
> thanks.




Re: sgmllib parser keeps old tag data?

2009-02-13 Thread andrew cooke

you are declaring class variables, not instance variables.  you need to
declare these in an __init__ method.  RTFM. 
http://docs.python.org/tutorial/classes.html#a-first-look-at-classes

Berend van Berkum wrote:
> class MyParser(sgmllib.SGMLParser):
>
>     content = ''
>     markup = []
>     span_stack = []




Re: sgmllib parser keeps old tag data?

2009-02-13 Thread andrew cooke

Sorry, this reply was delayed (trying to use usenet...) and so now seems
(even more) bad tempered than needed.  Andrew

andrew cooke wrote:
> you are declaring class variables, not instance variables.  you need to
> declare these in an __init__ method.  RTFM.
> http://docs.python.org/tutorial/classes.html#a-first-look-at-classes
>
> Berend van Berkum wrote:
> > class MyParser(sgmllib.SGMLParser):
> >
> >     content = ''
> >     markup = []
> >     span_stack = []
 
 