On Mar 24, 2005, at 8:35 AM, [EMAIL PROTECTED] wrote:
David Reed wrote:
There's probably a better mailing list with XML parsing experts. I'm certainly not an expert, but I have done a little XML parsing. I've always followed the pattern of using startElement, characters, and endElement to grab all the data. In the startElement method you set an instance variable to keep track of the tag you are currently processing. You use the characters method to build up the values, and then in the endElement method you store the data in your data structure. See the PyXML HOWTO for an example, specifically this section: http://pyxml.sourceforge.net/topics/howto/node14.html
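A minimal sketch of that startElement/characters/endElement pattern, using the standard library's xml.sax and Python 3 syntax. The class name TitleHandler, the "title" tag, and the sample document are made up for illustration and are not taken from the HOWTO.

```python
import xml.sax


class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> element (hypothetical tag name)."""

    def __init__(self):
        super().__init__()
        self._current = None  # name of the tag currently being processed
        self._buffer = []     # text chunks delivered by characters()
        self.titles = []      # the data structure filled in endElement()

    def startElement(self, name, attrs):
        self._current = name
        self._buffer = []

    def characters(self, content):
        # The parser may deliver one text node in several chunks,
        # so accumulate instead of assigning.
        if self._current == "title":
            self._buffer.append(content)

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer).strip())
        # Simplification: text that follows a nested child element is ignored.
        self._current = None


handler = TitleHandler()
xml.sax.parseString(
    b"<doc><title>First</title><title>Second</title></doc>", handler
)
print(handler.titles)
```

Nothing is printed inside the handler itself; the caller reads handler.titles once parsing has finished.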
Yes, sure. Thanks, but that's not what I wanted to know. Perhaps I wasn't clear enough. It's not really so much XML related...
def startElement(self, name, attrs):
    self._queue.append(name)  # keep the order of processed tags
    handler = str('_start_'+name)
    if hasattr(self, handler):
        self.__class__.__dict__[handler](self, attrs)
Is there a better syntax for self.__class__.__dict__[handler]?
You should be able to use getattr to get the method and then call it. That's a little cleaner IMO.
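Roughly what the getattr version could look like, as a sketch in Python 3 syntax. The _start_page method, the "page" tag, and the "no" attribute are assumptions made up for the example. getattr returns a bound method, so self no longer has to be passed explicitly, and methods inherited from a base class are found too, which the __class__.__dict__ lookup would miss.

```python
import xml.sax


class DispatchingHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self._queue = []  # keeps the order of processed tags
        self.pages = []

    def startElement(self, name, attrs):
        self._queue.append(name)
        # Look up a bound method named _start_<tag>; None if there is none.
        method = getattr(self, "_start_" + name, None)
        if method is not None:
            method(attrs)

    def _start_page(self, attrs):
        # Hypothetical per-tag handler: remember the page number attribute.
        self.pages.append(attrs.get("no"))


dispatcher = DispatchingHandler()
xml.sax.parseString(b'<doc><page no="1"/><page no="2"/></doc>', dispatcher)
print(dispatcher.pages)
```

Passing None as the default to getattr replaces the separate hasattr check.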
And where should the "output" go to? All examples use print statements in the element handlers.
I'm not certain we are clear. Instead of using output statements, you store the data in some instance variable; in your case it appears self.pages is the instance variable containing the data. So your endElement method would set something in self.pages based on the tag indicated, the data built up by the characters method, and any of the attrs from the start tag. If all your data is in the attrs that you get in the startElement method, then there's no need to do anything in the characters or endElement methods. If you want to use the startElement/characters/endElement approach, I can try to find a small example I've written and send it to you off-list.
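For the attrs-only case, a sketch could look like the following (Python 3 syntax). The "page" tag, its "no" and "title" attributes, and the plain dict used for self.pages are assumptions for illustration; the real self.pages in your code is evidently some other type.

```python
import io
import xml.sax
import xml.sax.handler


class PageHandler(xml.sax.ContentHandler):
    def __init__(self):
        super().__init__()
        self.pages = {}  # page number -> title; filled while parsing

    def startElement(self, name, attrs):
        # Everything needed is in the attributes, so characters() and
        # endElement() are left at their do-nothing defaults.
        if name == "page":
            self.pages[int(attrs["no"])] = attrs.get("title", "")


# An in-memory stand-in for the file that would normally be parsed.
sample = io.BytesIO(
    b'<doc><page no="2" title="B"/><page no="1" title="A"/></doc>'
)

parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, 0)
pxh = PageHandler()
parser.setContentHandler(pxh)
parser.parse(sample)

# The caller, not the handler, decides how to present the collected data.
sorted_pages = [pxh.pages[n] for n in sorted(pxh.pages)]
print(sorted_pages)
```

Here the handler only collects; sorting and lookup happen outside it, on the plain data it leaves behind.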
I wrote those get... methods - but I guess they don't belong in the XML handler, but perhaps in the parser or somewhere else.
It works, but I don't think it's good design.
def getPages(self):
    return self.pages.getSortedArray()

def getPage(self, no):
    return self.pages[no]
parser = xml.sax.make_parser()
parser.setFeature(xml.sax.handler.feature_namespaces, 0)
pxh = MyHandler()
parser.setContentHandler(pxh)
parser.parse(dateiname)
for p in pxh.getPages():
    ...
I should ask the last question on the twisted ML, I guess:
Further, if I'd like to use it in a twisted driven asynchronous app, would I let the parser run in a thread? (Or how can I make the parser non-blocking?)
I've never looked into Twisted, so I can't answer this.
Dave
_______________________________________________ Pythonmac-SIG maillist - Pythonmac-SIG@python.org http://mail.python.org/mailman/listinfo/pythonmac-sig