Martin, I can't thank you enough for taking the time to help and providing the detailed examples of how to get started. Now I know exactly how to proceed.
Thanks again, Roger

-----Original Message-----
From: Martin Morgan [mailto:[EMAIL PROTECTED]]
Sent: Friday, May 02, 2008 12:02 PM
To: Bos, Roger
Cc: r-help@r-project.org
Subject: Re: [R] How to parse XML

Hi Roger --

"Bos, Roger" <[EMAIL PROTECTED]> writes:

> I would like to learn how to parse a mixed text/xml document I
> downloaded from the sec.gov website (see example below).  I would like

I'm not sure of a more robust way to extract the XML, but from
inspection I wrote

  > ftp <- "ftp://anonymous:[EMAIL PROTECTED]/edgar/data/1317493/0001144204-08-021221.txt"
  > txt <- readLines(ftp)
  > xmlInside <- grep("</*XML", txt)
  > xmlTxt <- txt[seq(xmlInside[1]+1, xmlInside[2]-1)]

so that xmlTxt contains the part of the message that is XML.

> to parse this to get the value for each xml tag and then access it
> within R, but I don't know much about xml so I don't even know where
> to

There are several ways to proceed. I personally like the xpath query
language. To do this, one might

  > library(XML)
  > xml <- xmlTreeParse(xmlTxt, useInternal=TRUE)
  > head(unlist(xpathApply(xml, "//*", xmlName)))
  [1] "ownershipDocument"     "schemaVersion"         "documentType"
  [4] "periodOfReport"        "notSubjectToSection16" "issuer"

xpathApply takes an xml document and performs a query. The query '//*'
says find all nodes matching any name (that's the *) that are located
anywhere (that's the //) below the current (in this case root) node.
This gives a list of nodes; xmlName extracts the name of each node.
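To see what '//' and '*' each contribute without downloading the
filing, here is a minimal sketch on a toy document. The document, the
value X0202 and the name "Example Corp" are invented for illustration;
only the tag names echo the output above. It assumes the XML package is
installed.

```r
library(XML)  # assumption: the XML package is installed

# Toy document, invented for illustration; not the real SEC filing.
toy <- "<ownershipDocument>
  <schemaVersion>X0202</schemaVersion>
  <issuer>
    <issuerName>Example Corp</issuerName>
  </issuer>
</ownershipDocument>"

doc <- xmlTreeParse(toy, asText = TRUE, useInternalNodes = TRUE)

# '//*' matches every element at any depth, in document order.
allNames <- unlist(xpathApply(doc, "//*", xmlName))
print(allNames)

# '/ownershipDocument/*' matches only the direct children of the root.
childNames <- unlist(xpathApply(doc, "/ownershipDocument/*", xmlName))
print(childNames)
```

The first query returns all four element names, including the nested
issuerName; the second returns only schemaVersion and issuer, because a
single '/' descends exactly one level.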
If I wanted all nodes not subject to section 16 (sounds ominous) I'd
extract all the nodes (a list)

  > node <- xpathApply(xml, "//notSubjectToSection16")

and then do something with them, e.g., look at them

  > lapply(node, saveXML)
  [[1]]
  [1] "<notSubjectToSection16>0</notSubjectToSection16>"

(not so bad, looks like nothing is not subject to section 16, that's a
relief) and extract their value

  > lapply(node, xmlValue)

In one step:

  > xpathApply(xml, "//notSubjectToSection16", xmlValue)

?xpathApply is a good starting place, as is http://www.w3.org/TR/xpath,
especially http://www.w3.org/TR/xpath#path-abbrev

Martin

> start debugging the errors I am getting in this example code.  Can
> anyone help me get started?
>
> Thanks, Roger
>
> ftp <-
> "ftp://anonymous:[EMAIL PROTECTED]/edgar/data/1317493/0001144204-08-021221.txt"
> download.file(url=ftp, destfile="test2.txt")
> xmlTreeParse("test2.txt")
>
>
> **********************************************************************
> * This message is for the named person's use only. It may contain
> confidential, proprietary or legally privileged information. No right
> to confidential or privileged treatment of this message is waived or
> lost by any error in transmission. If you have received this message
> in error, please immediately notify the sender by e-mail, delete the
> message and all copies from your system and destroy any hard copies.
> You must not, directly or indirectly, use, disclose, distribute, print
> or copy any part of this message if you are not the intended
> recipient.
>
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

--
Martin Morgan
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M2 B169
Phone: (206) 667-2793
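The extraction step in Martin's reply (find the <XML> and </XML>
marker lines, keep what lies between, parse it, query it) can be tried
offline with the sketch below. The helper name extractXmlBlock, the
sample lines in txt, and the symbol XMPL are all hypothetical; the real
filing's layout may differ, so treat this as a starting point, not a
robust parser. It assumes the XML package is installed.

```r
library(XML)  # assumption: the XML package is installed

# Hypothetical helper: return the lines strictly between the first <XML>
# marker and the first </XML> marker that follows it.
extractXmlBlock <- function(txt) {
  open <- grep("^<XML>", txt)
  if (length(open) == 0) stop("no <XML> marker found")
  close <- grep("^</XML>", txt)
  close <- close[close > open[1]]
  if (length(close) == 0) stop("no </XML> marker after <XML>")
  start <- open[1] + 1
  end <- close[1] - 1
  if (end < start) character(0) else txt[start:end]
}

# Invented stand-in for the downloaded mixed text/XML file.
txt <- c(
  "<SEC-DOCUMENT>toy header",
  "<TEXT>",
  "<XML>",
  "<?xml version=\"1.0\"?>",
  "<ownershipDocument>",
  "  <notSubjectToSection16>0</notSubjectToSection16>",
  "  <issuer><issuerTradingSymbol>XMPL</issuerTradingSymbol></issuer>",
  "</ownershipDocument>",
  "</XML>",
  "</TEXT>"
)

xmlTxt <- extractXmlBlock(txt)

# Collapse to one string and tell the parser it is text, not a file name.
doc <- xmlTreeParse(paste(xmlTxt, collapse = "\n"),
                    asText = TRUE, useInternalNodes = TRUE)

# xpathSApply simplifies the result to a character vector.
flag <- xpathSApply(doc, "//notSubjectToSection16", xmlValue)
symbol <- xpathSApply(doc, "//issuerTradingSymbol", xmlValue)
print(flag)
print(symbol)
```

Checking for the markers before indexing avoids the confusing
subscript errors that a bare xmlInside[2] gives when a file has no XML
section. It also suggests why xmlTreeParse("test2.txt") on the raw
download fails: the file as a whole is not well-formed XML.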