expat having problems with entities (&amp;)
I need expat to parse this block of XML:

    <datafield tag="991">
      <subfield code="b">c-P&amp;P</subfield>
      <subfield code="h">LOT 3677</subfield>
      <subfield code="m">(F)</subfield>
    </datafield>

I need to parse the XML and return a dictionary that follows roughly the same layout as the XML. Currently the code for the class handling this is:

    class XML2Map():

        def __init__(self):
            self.parser = expat.ParserCreate()

            self.parser.StartElementHandler = self.start_element
            self.parser.EndElementHandler = self.end_element
            self.parser.CharacterDataHandler = self.char_data

            self.map = []  # not a dictionary

            self.current_tag = ''
            self.current_subfields = []
            self.current_sub = ''
            self.current_data = ''

        def parse_xml(self, xml_text):
            self.parser.Parse(xml_text, 1)

        def start_element(self, name, attrs):
            if name == 'datafield':
                self.current_tag = attrs['tag']
            elif name == 'subfield':
                self.current_sub = attrs['code']

        def char_data(self, data):
            self.current_data = data

        def end_element(self, name):
            if name == 'subfield':
                self.current_subfields.append([self.current_sub, self.current_data])
            elif name == 'datafield':
                self.map.append({'tag': self.current_tag, 'subfields': self.current_subfields})
                self.current_subfields = []  # resetting the values for next subfields

Right now my problem is that when it's parsing the subfield element with the data "c-P&amp;P", it's not taking the whole data, but instead it's breaking it into "c-P", "&", "P". I'm not an expert with expat, and I couldn't find a lot of information on how it handles specific entities.

In the resulting map, instead of:

    {'tag': u'991', 'subfields': [[u'b', u'c-P&P'], [u'h', u'LOT 3677'], [u'm', u'(F)']], 'inds': [u' ', u' ']}

I get this:

    {'tag': u'991', 'subfields': [[u'b', u'P'], [u'h', u'LOT 3677'], [u'm', u'(F)']], 'inds': [u' ', u' ']}

In the debugger, I can see that current_data gets assigned "c-P", then "&", and then "P".

Any ideas on any expat tricks I'm missing out on?
I'm also inclined to try another parser that can keep the string together when there are entities, or at least ampersands.

--
http://mail.python.org/mailman/listinfo/python-list
Re: expat having problems with entities (&amp;)
On Dec 11, 4:23 pm, nnguyen nguy...@gmail.com wrote:
> In the resulting map, instead of:
>
>     {'tag': u'991', 'subfields': [[u'b', u'c-P&P'], [u'h', u'LOT 3677'], [u'm', u'(F)']], 'inds': [u' ', u' ']}
>
> I get this:
>
>     {'tag': u'991', 'subfields': [[u'b', u'P'], [u'h', u'LOT 3677'], [u'm', u'(F)']], 'inds': [u' ', u' ']}

I forgot to mention: ignore the 'inds': ... entry in the output above. It comes from another part of the XML I had to parse and isn't important to this discussion.
Re: expat having problems with entities (&amp;)
On Dec 11, 4:39 pm, Rami Chowdhury rami.chowdh...@gmail.com wrote:
> On Fri, Dec 11, 2009 at 13:23, nnguyen nguy...@gmail.com wrote:
>> Any ideas on any expat tricks I'm missing out on? I'm also inclined to
>> try another parser that can keep the string together when there are
>> entities, or at least ampersands.
>
> IIRC expat explicitly does not guarantee that character data will be
> handed to the CharacterDataHandler in complete blocks. If you're certain
> you want to stay at such a low level, I would just modify your char_data
> method to append character data to self.current_data rather than
> replacing it.
>
> Personally, if I had the option (e.g. Python 2.5+) I'd use ElementTree...

Well, the appending trick worked. From some logging I figured out that it was reading through those bits of current_data before getting to the subfield ending element (which is kinda obvious when you think about it). So I just used a += and made sure to clear out current_data when it hits a subfield ending element.

Thanks!
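
For anyone finding this thread later, the fix described above can be sketched roughly as follows. This is only a sketch under a few assumptions, not the poster's final code: it targets Python 3 (the thread used Python 2), SAMPLE is a reconstruction of the XML from the first message without the 'inds' part, parse_xml returning the list is an addition for convenience, and parse_with_elementtree is a hypothetical helper name showing the ElementTree route that was suggested.

```python
from xml.parsers import expat
import xml.etree.ElementTree as ET

# Assumption: reconstructed from the first message; the real record may
# carry extra attributes (the 'inds' part) that are left out here.
SAMPLE = (
    '<datafield tag="991">'
    '<subfield code="b">c-P&amp;P</subfield>'
    '<subfield code="h">LOT 3677</subfield>'
    '<subfield code="m">(F)</subfield>'
    '</datafield>'
)


class XML2Map(object):
    def __init__(self):
        self.parser = expat.ParserCreate()
        self.parser.StartElementHandler = self.start_element
        self.parser.EndElementHandler = self.end_element
        self.parser.CharacterDataHandler = self.char_data

        self.map = []  # list of dicts, one per datafield
        self.current_tag = ''
        self.current_subfields = []
        self.current_sub = ''
        self.current_data = ''

    def parse_xml(self, xml_text):
        self.parser.Parse(xml_text, True)
        return self.map  # returning the list is an addition in this sketch

    def start_element(self, name, attrs):
        if name == 'datafield':
            self.current_tag = attrs['tag']
        elif name == 'subfield':
            self.current_sub = attrs['code']
            self.current_data = ''  # drop whitespace seen between elements

    def char_data(self, data):
        # expat may deliver one text node in several pieces (for example
        # around an entity), so accumulate instead of overwriting.
        self.current_data += data

    def end_element(self, name):
        if name == 'subfield':
            self.current_subfields.append([self.current_sub, self.current_data])
            self.current_data = ''  # clear for the next subfield
        elif name == 'datafield':
            self.map.append({'tag': self.current_tag,
                             'subfields': self.current_subfields})
            self.current_subfields = []


def parse_with_elementtree(xml_text):
    # Hypothetical helper: same output shape, built with ElementTree,
    # which hands back each element's text as a single string.
    root = ET.fromstring(xml_text)
    return [
        {'tag': df.get('tag'),
         'subfields': [[sf.get('code'), sf.text or '']
                       for sf in df.findall('subfield')]}
        for df in root.iter('datafield')  # iter() includes root itself
    ]


print(XML2Map().parse_xml(SAMPLE))
print(parse_with_elementtree(SAMPLE))
```

Both calls should produce the same list, with the 'b' subfield holding the complete string 'c-P&P'. Resetting current_data in start_element as well as in end_element is a choice made in this sketch so that whitespace between elements in pretty-printed input does not leak into the next subfield.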