On 08/10/2012 03:46 PM, Frederic Fournier wrote:
Hello everyone,
I would like to parse very large xml files from MS/MS experiments and
create R objects from their content. (By very large, I mean going up to
5-10Gb, although I am using a 'small' 40M file to test my code.)
I'm not 100% sure of its relevance, but see the MSnbase package:
http://bioconductor.org/packages/2.10/bioc/html/MSnbase.html
There is a vignette here, for instance
http://bioconductor.org/packages/2.10/bioc/vignettes/MSnbase/inst/doc/MSnbase-io.pdf
If this is useful, then further questions might be directed to the
Bioconductor mailing list.
http://bioconductor.org/help/mailing-list/
Martin
My first attempt at parsing the 40M file, using the XML package, took more
than 2200 seconds and left me quite disappointed.
I managed to cut that down to around 40 seconds by:
-using the 'useInternalNodes' option of the XML package when parsing
the xml tree;
-vectorizing the parsing (i.e., replacing loops like "for(node in
group.of.nodes) {...}" with "sapply(group.of.nodes, function(node){...})")
I gained another 5 seconds by making small changes to the functions used
(like replacing 'getNodeSet' with 'xmlElementsByTagName' when I don't need to
navigate to the children nodes).
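As a rough illustration of the two changes listed above, here is a minimal sketch on a toy document. The XML string and the 'id' attribute are made up for the example and do not come from the real files.

```r
library(XML)

## Toy document (hypothetical content, for illustration only)
txt <- "<bioml><model><protein id='p1'/><protein id='p2'/></model><model><protein id='p3'/></model></bioml>"

## Internal (C-level) nodes: the tree is not converted to nested R lists
doc <- xmlTreeParse(txt, asText = TRUE, getDTD = FALSE, useInternalNodes = TRUE)
nodes <- getNodeSet(doc, "//model/protein")

## Loop version: grows the result one element at a time
ids.loop <- character(0)
for (node in nodes) {
  ids.loop <- c(ids.loop, xmlGetAttr(node, "id"))
}

## Vectorized version: a single sapply call over the node set
ids.vec <- sapply(nodes, function(node) xmlGetAttr(node, "id"))

free(doc)
```

Both versions give the same character vector; on a toy input like this the timing difference is negligible, so any comparison should be done on a realistic file.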
Now I am stuck at around 35 seconds and I would still like to cut this
time by a factor of 5, but I have no clue how to achieve this gain. I'll try
to describe as briefly as possible the relevant structure of the xml file I
am parsing, the structure of the R object I want to create, and the type of
functions I am using to do it. I hope that one of you will be able to point
me towards a better and quicker way of doing the parsing!
Here is the (simplified) structure of the relevant nodes of the xml file:
<model> (many many nodes)
  <protein> (a couple of proteins per model node)
    <peptide> (1 per protein node)
      <domain> (1 or more per peptide node)
        <aa> (0 or more per domain node)
        </aa>
      </domain>
    </peptide>
  </protein>
</model>
Here is the basic structure of the R object that I want to create:
'result' object that contains:
  -various attributes
  -a list of 'protein' objects, each containing:
    -various attributes
    -a list of 'peptide' objects, each containing:
      -various attributes
      -a list of 'aa' objects, each consisting of a couple of
       attributes.
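For concreteness, the nesting above could be declared roughly as follows. This is only a sketch: the class names and the 'proteins', 'peptides' and 'aas' slots match the code further down, while every other slot name and all the values are hypothetical placeholders.

```r
library(methods)

## Slots other than proteins/peptides/aas are hypothetical placeholders
setClass("S4_aa_class",      representation(type = "character", at = "integer"))
setClass("S4_peptide_class", representation(sequence = "character", aas = "list"))
setClass("S4_protein_class", representation(label = "character", peptides = "list"))
setClass("S4_result_class",  representation(source = "character", proteins = "list"))

## Build one object of each class, innermost first
aa      <- new("S4_aa_class", type = "M", at = 12L)
peptide <- new("S4_peptide_class", sequence = "MKV", aas = list(aa))
protein <- new("S4_protein_class", label = "prot1", peptides = list(peptide))
result  <- new("S4_result_class", source = "file", proteins = list(protein))
```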
Here is the basic structure of the code:
library(XML)
xml.doc <- xmlTreeParse("file", getDTD=FALSE, useInternalNodes=TRUE)
result <- new('S4_result_class')
result@proteins <- xpathApply(xml.doc, "//model/protein",
  function(protein.node) {
    protein <- new('S4_protein_class')
    ## fill in a couple of attributes of the protein object using xmlValue
    ## and xmlAttrs(protein.node)
    protein@peptides <- xpathApply(protein.node, "./peptide",
      function(peptide.node) {
        peptide <- new('S4_peptide_class')
        ## fill in a couple of attributes of the peptide object using xmlValue
        ## and xmlAttrs(peptide.node)
        peptide@aas <- sapply(xmlElementsByTagName(peptide.node, name="aa"),
          function(aa.node) {
            aa <- new('S4_aa_class')
            ## fill in a couple of attributes of the 'aa' object using xmlValue
            ## and xmlAttrs(aa.node)
            aa       # return the filled-in object
          })
        peptide      # return the object, not the value of the last assignment
      })
    protein          # return the object, not the value of the last assignment
  })
free(xml.doc)
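A variation that may be worth timing, sketched here under assumptions: collect each attribute into a flat vector with xpathSApply and assemble a data frame, so that no S4 object is created per node. The toy document and the 'label' and 'seq' attribute names are hypothetical, and the row alignment relies on there being exactly one peptide per protein, as in the structure described above.

```r
library(XML)

## Toy document; element names follow the structure above,
## attribute names and values are hypothetical
txt <- "<bioml>
  <model><protein label='A'><peptide seq='MKV'><domain id='1'><aa type='M' at='1'/></domain></peptide></protein></model>
  <model><protein label='B'><peptide seq='GLS'><domain id='2'/></peptide></protein></model>
</bioml>"
doc <- xmlParse(txt, asText = TRUE)

## Collect each attribute into a flat character vector
labels <- xpathSApply(doc, "//model/protein", xmlGetAttr, "label")
seqs   <- xpathSApply(doc, "//model/protein/peptide", xmlGetAttr, "seq")

## One row per protein (valid only because each protein has one peptide)
proteins <- data.frame(label = labels, seq = seqs, stringsAsFactors = FALSE)
free(doc)
```

Whether this is actually faster than the nested xpathApply calls would have to be measured on the real 40M file.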
Does anyone know a better and quicker way of doing this?
Sorry for the very long message and thank you very much for your time and
help!
Frederic
______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
Location: Arnold Building M1 B861
Phone: (206) 667-2793