On Jan 22, 10:57 am, Mike Driscoll <[EMAIL PROTECTED]> wrote:
> Hi,
>
> I need to parse a fairly complex HTML page that has XML embedded in
> it. I've done parsing before with the xml.dom.minidom module on just
> plain XML, but I cannot get it to work with this HTML page.
>
> The XML looks like this:
> ...

Once again (this IS HTML Day!), instead of parsing the HTML, pyparsing
can help lift the interesting bits and leave the rest alone. Try this
program out:

from pyparsing import makeXMLTags,Word,nums,Combine,oneOf,SkipTo,withAttribute

htmlWithEmbeddedXml = """
<HTML>
<Body>
<p>
<b>Hey! this is really bold!</b>
<Row status="o">
    <Relationship>Owner</Relationship>
    <Priority>1</Priority>
    <StartDate>07/16/2007</StartDate>
    <StopsExist>No</StopsExist>
    <Name>Doe, John</Name>
    <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>
<Row status="o">
    <Relationship>Owner</Relationship>
    <Priority>2</Priority>
    <StartDate>07/16/2007</StartDate>
    <StopsExist>No</StopsExist>
    <Name>Doe, Jane</Name>
    <Address>1905 S 3rd Ave , Hicksville IA 99999</Address>
</Row>
<table>
<tr><Td>this is in a table, woo-hoo!</td>
more HTML
blah blah blah...
"""

# define pyparsing expressions for XML tags
rowStart,rowEnd = makeXMLTags("Row")
relationshipStart,relationshipEnd = makeXMLTags("Relationship")
priorityStart,priorityEnd = makeXMLTags("Priority")
startDateStart,startDateEnd = makeXMLTags("StartDate")
stopsExistStart,stopsExistEnd = makeXMLTags("StopsExist")
nameStart,nameEnd = makeXMLTags("Name")
addressStart,addressEnd = makeXMLTags("Address")

# define some useful expressions for data of specific types
integer = Word(nums)
date = Combine(Word(nums,exact=2)+"/"+
               Word(nums,exact=2)+"/"+Word(nums,exact=4))
yesOrNo = oneOf("Yes No")

# conversion parse actions
integer.setParseAction(lambda t: int(t[0]))
yesOrNo.setParseAction(lambda t: t[0]=='Yes')
# could also define a conversion for date if you really wanted to

# define format of a <Row>, plus assign results names for each data field
rowRec = rowStart + \
    relationshipStart + SkipTo(relationshipEnd)("relationship") + relationshipEnd + \
    priorityStart + integer("priority") + priorityEnd + \
    startDateStart + date("startdate") + startDateEnd + \
    stopsExistStart + yesOrNo("stopsexist") + stopsExistEnd + \
    nameStart + SkipTo(nameEnd)("name") + nameEnd + \
    addressStart + SkipTo(addressEnd)("address") + addressEnd + \
    rowEnd

# set filtering parse action
rowRec.setParseAction(withAttribute(relationship="Owner",priority=1))

# find all matching rows, matching grammar and filtering parse action
rows = rowRec.searchString(htmlWithEmbeddedXml)

# print the results (uncomment r.dump() statement to see full
# result for each row)
for r in rows:
    # print(r.dump())
    print(r.relationship)
    print(r.priority)
    print(r.startdate)
    print(r.stopsexist)
    print(r.name)
    print(r.address)

This prints:

Owner
1
07/16/2007
False
Doe, John
1905 S 3rd Ave , Hicksville IA 99999

In addition to parsing this data, some conversions were done at parse
time, too - "1" was converted to the value 1, and "No" was converted
to False. These were done by the conversion parse actions. The
filtering for just those Rows containing Relationship="Owner" and
Priority=1 was done in a more global parse action, built with
withAttribute. If you comment this line out, you will see that both
rows get retrieved.

-- Paul
(Find out more about pyparsing at http://pyparsing.wikispaces.com.)
--
http://mail.python.org/mailman/listinfo/python-list
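
The program's comments mention that a conversion for the date field could also be defined. A minimal sketch of what such a parse action might look like is below. It assumes the dates are always in MM/DD/YYYY form, as in the sample rows, and the name convertDate is hypothetical (it is not part of pyparsing or of the program above). To keep the sketch runnable without pyparsing installed, the function is called directly with a plain list standing in for the parsed tokens.

```python
from datetime import datetime

def convertDate(tokens):
    """Parse-action-style function: convert the first token,
    assumed to be an 'MM/DD/YYYY' string, into a datetime.date.
    Raises ValueError if the string is not a valid date."""
    return datetime.strptime(tokens[0], "%m/%d/%Y").date()

# A plain list stands in for the tokens pyparsing would pass in.
startDate = convertDate(["07/16/2007"])
print(startDate)        # 2007-07-16
print(startDate.year)   # 2007
```

In the program above, this would presumably be attached the same way as the other conversions, with date.setParseAction(convertDate), after which r.startdate would hold a date object instead of a string.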