Re: [Tutor] module to parse XMLish text?
Hello,

I had not looked at the sample text in detail before I posted the ElementTree code. In fact, with unclosed tags you will get an error at parse time, and it will give you the location of the error.

You could use the module from Stefan (lxml), which is far superior to ElementTree: it can validate against a DTD or XSD and has many other features (speed, etc.).

Regards
Karim

On 01/15/2011 07:53 AM, Stefan Behnel wrote:
> Wayne Werner, 15.01.2011 03:25:
>> On Fri, Jan 14, 2011 at 4:42 PM, Terry Carroll wrote:
>>> On Fri, 14 Jan 2011, Karim wrote:
>>>> from xml.etree.ElementTree import ElementTree
>>>
>>> I don't think straight XML parsing will work on this, as it's not valid
>>> XML; it just looks XML-like enough to cause confusion.
>>
>> It's worth trying out - most (good) parsers do the right thing even when
>> they don't have strictly valid code. I don't know if xml.etree is one, but
>> I'm fairly sure both lxml and BeautifulSoup would probably parse it
>> correctly.
>
> They wouldn't. For the first tags, the text values would either not come
> out at all or they would be read as attributes and thus lose their order
> and potentially their whitespace as well. The other tags would likely get
> parsed properly, but the parser may end up nesting them as it hasn't found
> a closing tag for the previous tags yet.
>
> So, in any case, you'd end up with data loss and/or a structure that would
> be much harder to handle than the (relatively) simple file structure.
>
> Stefan

___
Tutor maillist - Tutor@python.org
To unsubscribe or change subscription options:
http://mail.python.org/mailman/listinfo/tutor
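As a rough illustration of the parse-time error Karim mentions, the sketch below feeds text in the thread's format to xml.etree.ElementTree and reports where parsing stops. It assumes the sample's tags were written with angle brackets (for example <BING ZEBRA>); the SAMPLE string and the try_parse helper are made up for this example.

```python
import xml.etree.ElementTree as ET

# Assumed reconstruction of the sample text (angle brackets are an assumption).
SAMPLE = """<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
"""


def try_parse(text):
    """Hypothetical helper: return (line, column) of the first parse error,
    or None if the text is well-formed XML."""
    try:
        ET.fromstring(text)
    except ET.ParseError as err:
        # ParseError.position is a (line, column) tuple.
        return err.position
    return None


position = try_parse(SAMPLE)
print("parse error at (line, column):", position)
```

The parser stops at the first operand-style tag, since a bare word after the tag name is not a valid XML attribute, so only one error location is reported per attempt.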
Re: [Tutor] module to parse XMLish text?
Hello,

    from xml.etree.ElementTree import ElementTree

    # Parsing:
    doc = ElementTree()
    doc.parse(xmlFile)

    # Find tag element:
    doc.find('mytag')

    # Iteration over tag element:
    lname = []
    for lib in doc.iter('LibTag'):
        libName = lib.attrib['name']
        lname.append(libName)

Regards
Karim

On 01/14/2011 03:55 AM, Terry Carroll wrote:
> Does anyone know of a module that can parse out text with XML-like tags as
> in the example below? I emphasize the "-like" in "XML-like". I don't think
> I can parse this as XML (can I?).
>
> Sample text between the dashed lines:
>
> ----------
> Blah, blah, blah
> <AAA>
> <BING ZEBRA>
> <BANG ROOSTER>
> <BOOM GARBONZO BEAN>
> <BLIP>SOMETHING ELSE</BLIP>
> <BASH>SOMETHING DIFFERENT</BASH>
> </AAA>
> ----------
>
> I'd like to be able to have a dictionary (or any other structure, really;
> as long as I can get to the parsed-out pieces) that would look something
> like:
>
> {BING : ZEBRA,
>  BANG : ROOSTER,
>  BOOM : GARBONZO BEAN,
>  BLIP : SOMETHING ELSE,
>  BASH : SOMETHING DIFFERENT}
>
> The "Blah, blah, blah" can be tossed away, for all I care.
>
> The basic rule is that the tag either has an operand (e.g., <BING ZEBRA>),
> in which case the name is the first word and the content is everything
> else that follows in the tag; or else the tag has no operand, in which
> case it is matched to a corresponding closing tag (e.g.,
> <BLIP>SOMETHING ELSE</BLIP>), and the content is the material between the
> two tags.
>
> I think I can assume there are no nested tags.
>
> I could write a state machine to do this, I suppose, but life's short, and
> I'd rather not re-invent the wheel, if there's a wheel lying around
> somewhere.
Re: [Tutor] module to parse XMLish text?
Terry Carroll, 14.01.2011 03:55:
> Does anyone know of a module that can parse out text with XML-like tags as
> in the example below? I emphasize the "-like" in "XML-like". I don't think
> I can parse this as XML (can I?).
>
> Sample text between the dashed lines:
>
> ----------
> Blah, blah, blah
> <AAA>
> <BING ZEBRA>
> <BANG ROOSTER>
> <BOOM GARBONZO BEAN>
> <BLIP>SOMETHING ELSE</BLIP>
> <BASH>SOMETHING DIFFERENT</BASH>
> </AAA>
> ----------

You can't parse this as XML because it's not XML. The three initial child tags are not properly closed.

If the format is really as you describe, i.e. one line per tag, regular expressions will work nicely. Something like (untested)

    import re

    parse_tag_and_text = re.compile(
        # accept a tag name and then either space+operand+'>' or '>'+text+'</...'
        r'^<([^> ]+)(?: ([^>]+)>\s*|>([^<]+)</.*)$').match

    special_tags = set(['AAA'])

    result = {}
    for line in the_file:
        match = parse_tag_and_text(line)
        if match:
            if match.group(1) in special_tags:
                pass  # do something special?
            else:
                # don't care which format, take whatever text group matched
                result[match.group(1)] = match.group(2) or match.group(3)

Stefan
Re: [Tutor] module to parse XMLish text?
On Fri, 14 Jan 2011, Stefan Behnel wrote:
> Terry Carroll, 14.01.2011 03:55:
>> Does anyone know of a module that can parse out text with XML-like tags
>> as in the example below? I emphasize the "-like" in "XML-like". I don't
>> think I can parse this as XML (can I?).
>>
>> Sample text between the dashed lines:
>>
>> ----------
>> Blah, blah, blah
>> <AAA>
>> <BING ZEBRA>
>> <BANG ROOSTER>
>> <BOOM GARBONZO BEAN>
>> <BLIP>SOMETHING ELSE</BLIP>
>> <BASH>SOMETHING DIFFERENT</BASH>
>> </AAA>
>> ----------
>
> You can't parse this as XML because it's not XML. The three initial child
> tags are not properly closed.

Yeah, that's what I figured.

> If the format is really as you describe, i.e. one line per tag, regular
> expressions will work nicely.

Now there's an idea! I hadn't thought of using regexes, probably because I'm terrible at all but the most simple ones.

As it happens, I'm only interested in four of the tags' contents, so I could probably manage to write a series of regexes that even I could maintain, one for each of the pieces of data I want to extract; if I try to write a grand unified regex, I'm bound to shoot myself in the foot.

Thanks very much.

On Fri, 14 Jan 2011, Karim wrote:
> from xml.etree.ElementTree import ElementTree

I don't think straight XML parsing will work on this, as it's not valid XML; it just looks XML-like enough to cause confusion.
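A minimal sketch of the one-regex-per-field idea Terry describes, assuming angle-bracket tags, one tag per line, and no nesting. The tag names come from the sample in the thread, but which four tags Terry actually needs is not stated, so the selection here is arbitrary; PATTERNS and extract are hypothetical names chosen for illustration.

```python
import re

# One small pattern per field of interest (arbitrary illustrative selection).
# Operand-style tags look like <NAME content>;
# paired tags look like <NAME>content</NAME>.
PATTERNS = {
    "BING": re.compile(r"<BING\s+([^>]+)>"),
    "BOOM": re.compile(r"<BOOM\s+([^>]+)>"),
    "BLIP": re.compile(r"<BLIP>(.*?)</BLIP>"),
    "BASH": re.compile(r"<BASH>(.*?)</BASH>"),
}


def extract(text):
    """Return a dict mapping each tag name to its content, if found."""
    result = {}
    for name, pattern in PATTERNS.items():
        match = pattern.search(text)
        if match:
            result[name] = match.group(1).strip()
    return result


# Assumed reconstruction of the sample text from the thread.
SAMPLE = """Blah, blah, blah
<AAA>
<BING ZEBRA>
<BANG ROOSTER>
<BOOM GARBONZO BEAN>
<BLIP>SOMETHING ELSE</BLIP>
<BASH>SOMETHING DIFFERENT</BASH>
</AAA>
"""

fields = extract(SAMPLE)
print(fields)
```

Keeping each pattern separate means a change to one field's format only touches one line, which is the maintainability trade-off Terry is after; the cost is one scan of the text per field.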
Re: [Tutor] module to parse XMLish text?
On Fri, Jan 14, 2011 at 4:42 PM, Terry Carroll <carr...@tjc.com> wrote:
> <snip>
>
> On Fri, 14 Jan 2011, Karim wrote:
>> from xml.etree.ElementTree import ElementTree
>
> I don't think straight XML parsing will work on this, as it's not valid
> XML; it just looks XML-like enough to cause confusion.

It's worth trying out - most (good) parsers do the right thing even when they don't have strictly valid code. I don't know if xml.etree is one, but I'm fairly sure both lxml and BeautifulSoup would probably parse it correctly.

Only one way to find out ;)

-Wayne
Re: [Tutor] module to parse XMLish text?
Wayne Werner, 15.01.2011 03:25:
> On Fri, Jan 14, 2011 at 4:42 PM, Terry Carroll wrote:
>> On Fri, 14 Jan 2011, Karim wrote:
>>> from xml.etree.ElementTree import ElementTree
>>
>> I don't think straight XML parsing will work on this, as it's not valid
>> XML; it just looks XML-like enough to cause confusion.
>
> It's worth trying out - most (good) parsers do the right thing even when
> they don't have strictly valid code. I don't know if xml.etree is one, but
> I'm fairly sure both lxml and BeautifulSoup would probably parse it
> correctly.

They wouldn't. For the first tags, the text values would either not come out at all or they would be read as attributes and thus lose their order and potentially their whitespace as well. The other tags would likely get parsed properly, but the parser may end up nesting them as it hasn't found a closing tag for the previous tags yet.

So, in any case, you'd end up with data loss and/or a structure that would be much harder to handle than the (relatively) simple file structure.

Stefan
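To make the "read as attributes" point concrete, here is a small sketch using the standard library's lenient html.parser. This is neither lxml nor BeautifulSoup, so their exact output may differ; it only shows the general effect a forgiving parser has on an operand-style tag. The input assumes angle-bracket tags, and TagLogger is a made-up class name.

```python
from html.parser import HTMLParser


class TagLogger(HTMLParser):
    """Hypothetical helper that records what the parser reports."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value is None when absent.
        self.events.append(("start", tag, attrs))

    def handle_data(self, data):
        # Ignore whitespace-only text between tags.
        if data.strip():
            self.events.append(("data", data.strip()))


logger = TagLogger()
logger.feed("<BOOM GARBONZO BEAN>\n<BLIP>SOMETHING ELSE</BLIP>\n")
logger.close()

for event in logger.events:
    print(event)
```

With this parser the operand words show up as valueless, lowercased attribute names rather than as text, while the content of the paired tag arrives as ordinary character data. That split is what makes the resulting structure awkward compared with a line-based regex.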