Hello everyone,

I would like to parse very large XML files from MS/MS experiments and create R objects from their content. (By very large, I mean up to 5-10 GB, although I am using a 'small' 40 MB file to test my code.)

My first attempt at parsing the 40 MB file with the XML package took more than 2200 seconds, which left me quite disappointed. I managed to cut that down to around 40 seconds by:

- using the 'useInternalNodes' option of the XML package when parsing the XML tree;
- vectorizing the parsing, i.e., replacing loops like "for (node in group.of.nodes) {...}" with "sapply(group.of.nodes, function(node) {...})".

I gained another 5 seconds by making small changes to the functions used (like replacing 'getNodeSet' with 'xmlElementsByTagName' when I don't need to navigate to the children nodes).

Now I am stuck at around 35 seconds. I would still like to cut this time by a factor of 5, but I have no clue how to achieve this gain.

I'll try to describe as briefly as possible the relevant structure of the XML file I am parsing, the structure of the R object I want to create, and the type of functions I am using to do it. I hope that one of you will be able to point me towards a better and quicker way of doing the parsing!

Here is the (simplified) structure of the relevant nodes of the XML file:

    <model>                (many, many nodes)
      <protein>            (a couple of proteins per model node)
        <peptide>          (1 per protein node)
          <domain>         (1 or more per peptide node)
            <aa>           (0 or more per domain node)
            </aa>
          </domain>
        </peptide>
      </protein>
    </model>

Here is the basic structure of the R object that I want to create:

'result' object that contains:
- various attributes
- a list of 'protein' objects, each of which contains:
  - various attributes
  - a list of 'peptide' objects, each of which contains:
    - various attributes
    - a list of 'aa' objects, each of which consists of a couple of attributes.

Here is the basic structure of the code:

    library(XML)

    xml.doc <- xmlTreeParse("file", getDTD = FALSE, useInternalNodes = TRUE)
    result <- new('S4_result_class')
    result@proteins <- xpathApply(xml.doc, "//model/protein", function(protein.node) {
        protein <- new('S4_protein_class')
        ## fill in a couple of attributes of the protein object
        ## using xmlValue and xmlAttrs(protein.node)
        protein@peptides <- xpathApply(protein.node, "./peptide", function(peptide.node) {
            peptide <- new('S4_peptide_class')
            ## fill in a couple of attributes of the peptide object
            ## using xmlValue and xmlAttrs(peptide.node)
            ## the 'aa' nodes sit under <domain>, so search below the direct children
            peptide@aas <- sapply(xmlElementsByTagName(peptide.node, name = "aa", recursive = TRUE),
                                  function(aa.node) {
                aa <- new('S4_aa_class')
                ## fill in a couple of attributes of the 'aa' object
                ## using xmlValue and xmlAttrs(aa.node)
                aa        # return the filled-in object
            })
            peptide       # return the filled-in object
        })
        protein           # return the filled-in object
    })
    free(xml.doc)

Does anyone know a better and quicker way of doing this?

Sorry for the very long message, and thank you very much for your time and help!

Frederic

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.