On Dec 14, 3:06 am, "Gabriel Genellina" <[EMAIL PROTECTED]> wrote:
> On Fri, 14 Dec 2007 06:06:21 -0300, Sean DiZazzo <[EMAIL PROTECTED]>
> wrote:
>
> > On Dec 14, 12:04 am, Marc 'BlackJack' Rintsch <[EMAIL PROTECTED]> wrote:
> >> On Thu, 13 Dec 2007 17:49:20 -0800, Sean DiZazzo wrote:
> >> > I'm wrapping up a command line util that returns xml in Python. The
> >> > util is flaky, and gives me back poorly formed xml with different
> >> > problems in different cases. Anyway I'm making progress. I'm not
> >> > very good at regular expressions though and was wondering if someone
> >> > could help with initially splitting the tags from the stdout returned
> >> > from the util.
> >
> >> Flaky XML is often produced by programs that treat XML as ordinary text
> >> files. If you are starting to parse XML with regular expressions you are
> >> making the very same mistake. XML may look somewhat simple but
> >> producing correct XML and parsing it isn't. Sooner or later you stumble
> >> across something that breaks producing or parsing the "naive" way.
> >
> > It's not really complicated xml so far, just tags with attributes.
> > Still, using different queries against the program sometimes offers
> > differing results...a few examples:
> >
> > <id 123456 />
> > <tag name="foo" />
> > <tag2 name="foo" moreattrs="..." /tag2>
> > <tag3 name="foo" moreattrs="..." tag3/>
>
> Ouch... only the second is valid xml. Most tools require at least a well
> formed document. You may try using BeautifulStoneSoup, included with
> BeautifulSoup http://crummy.com/software/BeautifulSoup/
>
> > I found something that works, although I couldn't tell you why it
> > works. :)
> > retag = re.compile(r'<.+?>', re.DOTALL)
> > tags = retag.findall(retag)
> > Why does that work?
>
> That means: "look for a less-than sign (<), followed by the shortest
> sequence of (?) one or more (+) arbitrary characters (.), followed by a
> greater-than sign (>)"
>
> If you never get nested tags, and never have a ">" inside an attribute,
> that expression *might* work. But please try BeautifulStoneSoup, it uses a
> lot of heuristics trying to guess the right structure. Doesn't work
> always, but given your input, there isn't much one can do...
>
> --
> Gabriel Genellina

Thanks! I'll take a look at BeautifulStoneSoup today and see what I get.

~Sean
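
For reference, here is a minimal sketch of the non-greedy pattern explained in the quoted reply, run against the four example tags from the thread. The names `sample_output` and `broken` are made up for illustration; the real text would come from the util's stdout. Note that `findall` has to be given the text to search. The quoted snippet passes the compiled pattern itself, which is presumably a typo.

```python
import re

# Hypothetical stand-in for the util's stdout; the tags themselves are the
# examples quoted in the thread.
sample_output = '''<id 123456 />
<tag name="foo" />
<tag2 name="foo" moreattrs="..." /tag2>
<tag3 name="foo" moreattrs="..." tag3/>'''

# "<", then the shortest run of one or more arbitrary characters, then ">".
# re.DOTALL lets "." match newlines too, so a tag may span several lines.
retag = re.compile(r'<.+?>', re.DOTALL)

# Search the text, not the pattern object.
tags = retag.findall(sample_output)
for tag in tags:
    print(tag)

# The caveat from the reply: a ">" inside an attribute value ends the
# match early, so the tag comes back truncated.
broken = retag.findall('<tag name="a>b" />')
print(broken)
```

On this sample the pattern returns each of the four tags whole, but only because none of them contains a ">" before its end. The second call shows how quickly that assumption breaks.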