Re: [R] Parsing large XML documents in R - how to optimize the speed?
If this is an option for you: an XML database can handle very large XML files and lets you query nodes very efficiently. You could then query the XML database from R (using REST) to do your statistics. There are some open-source XQuery/XML databases available.

2012/8/11 Frederic Fournier wrote:
> Hello everyone,
>
> I would like to parse very large xml files from MS/MS experiments and
> create R objects from their content. (By very large, I mean going up to
> 5-10Gb, although I am using a 'small' 40M file to test my code.)
> [...]
__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
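As a rough illustration of the REST idea in the reply above, the sketch below only builds the request URL for an XQuery expression. Everything specific in it is an assumption, not something stated in the thread: the helper name build_query_url, the server address, the database name "msms", and the "<base>/<database>?query=<xquery>" URL layout (some open-source XML databases expose a layout like this, but check the documentation of whichever one you use).

```r
# Hypothetical helper: build the URL for an XQuery sent to a REST endpoint.
# The URL layout is an assumption; adapt it to your database's REST API.
build_query_url <- function(base.url, database, xquery) {
  # reserved = TRUE percent-encodes characters such as "/" and "(" so the
  # XQuery expression survives as a single query-string value.
  paste0(base.url, "/", database, "?query=", URLencode(xquery, reserved = TRUE))
}

# Made-up server address and database name, for illustration only.
query.url <- build_query_url("http://localhost:8984/rest", "msms",
                             "count(//protein)")
print(query.url)

# With a server actually running, the answer could be fetched with base R:
# answer <- readLines(query.url, warn = FALSE)
```

The point of the design is that the database does the tree navigation and R only receives the (much smaller) query result, so the 5-10Gb file never has to be loaded into the R session.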
Re: [R] Parsing large XML documents in R - how to optimize the speed?
Hi Frederic,

You definitely want to be using xmlParse() (or, equivalently, xmlTreeParse( , useInternalNodes = TRUE)). This then allows use of getNodeSet().

I would suggest you use Rprof() to find out where the bottlenecks arise, e.g. in the XML functions, in the S4 code, or in your code that assembles the R objects from the XML.

I'm happy to take a look at speeding it up if you can make the test file available and show me your code.

D.

On 8/10/12 3:46 PM, Frederic Fournier wrote:
> Hello everyone,
>
> I would like to parse very large xml files from MS/MS experiments and
> create R objects from their content. (By very large, I mean going up to
> 5-10Gb, although I am using a 'small' 40M file to test my code.)
> [...]
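A minimal sketch of the Rprof() suggestion above, using only base R. The two functions extract_values and build_objects are hypothetical stand-ins for "pull values out of the XML" and "assemble R objects"; they are not from the original code, and the real parsing call would take their place.

```r
# Hypothetical stand-ins for the two stages of the real parser.
extract_values <- function(n) vapply(seq_len(n), function(i) sqrt(i), numeric(1))
build_objects  <- function(values) lapply(values, function(v) list(value = v))

profile.file <- tempfile(fileext = ".out")
Rprof(profile.file, interval = 0.01)   # start sampling the call stack

start <- proc.time()[["elapsed"]]
# Repeat the work for about half a second so the sampler collects data.
while (proc.time()[["elapsed"]] - start < 0.5) {
  objects <- build_objects(extract_values(20000))
}

Rprof(NULL)                            # stop profiling

# by.self lists the functions in which the most time was spent directly.
profile <- summaryRprof(profile.file)
print(head(profile$by.self, 10))
```

Reading the by.self table for the real parser should show whether the time goes to the XML accessors, to S4 object creation (new() and initialize()), or to the surrounding glue code, which is the distinction the reply asks about.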
Re: [R] Parsing large XML documents in R - how to optimize the speed?
On 08/10/2012 03:46 PM, Frederic Fournier wrote:
> Hello everyone,
>
> I would like to parse very large xml files from MS/MS experiments and
> create R objects from their content. (By very large, I mean going up to
> 5-10Gb, although I am using a 'small' 40M file to test my code.)

I'm not 100% sure of its relevance, but see MSnbase:

http://bioconductor.org/packages/2.10/bioc/html/MSnbase.html

There is a vignette here, for instance:

http://bioconductor.org/packages/2.10/bioc/vignettes/MSnbase/inst/doc/MSnbase-io.pdf

If this is useful, then further questions might be directed to the Bioconductor mailing list:

http://bioconductor.org/help/mailing-list/

Martin

> [...]

--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
[R] Parsing large XML documents in R - how to optimize the speed?
Hello everyone,

I would like to parse very large XML files from MS/MS experiments and create R objects from their content. (By very large, I mean going up to 5-10 GB, although I am using a 'small' 40 MB file to test my code.)

My first attempt at parsing the 40 MB file, using the XML package, took more than 2200 seconds and left me quite disappointed. I managed to cut that down to around 40 seconds by:
  - using the 'useInternalNodes' option of the XML package when parsing the XML tree;
  - vectorizing the parsing (i.e., replacing loops like "for(node in group.of.nodes) {...}" by "sapply(group.of.nodes, function(node) {...})").

I gained another 5 seconds by making small changes to the functions used (like replacing 'getNodeSet' by 'xmlElementsByTagName' when I don't need to navigate to the children nodes).

Now I am stuck at around 35 seconds and I would still like to cut this time by a factor of 5, but I have no clue how to achieve this gain. I'll try to lay out as briefly as possible the relevant structure of the XML file I am parsing, the structure of the R object I want to create, and the type of functions I am using to do it. I hope that one of you will be able to point me towards a better and quicker way of doing the parsing!

Here is the (simplified) structure of the relevant nodes of the XML file:

  (many many nodes)
    (a couple of proteins per model node)
      (1 per protein node)
        (1 or more per peptide node)
          (0 or more per domain node)

Here is the basic structure of the R object that I want to create:

  'result' object that contains:
    - various attributes
    - a list of 'protein' objects, each of which contains:
        - various attributes
        - a list of 'peptide' objects, each of which contains:
            - various attributes
            - a list of 'aa' objects, each of which consists of a couple of attributes.
Here is the basic structure of the code:

    library(XML)

    xml.doc <- xmlTreeParse("file", getDTD = FALSE, useInternalNodes = TRUE)
    result <- new('S4_result_class')
    result@proteins <- xpathApply(xml.doc, "//model/protein", function(protein.node) {
      protein <- new('S4_protein_class')
      ## fill in a couple of attributes of the protein object using
      ## xmlValue and xmlAttrs(protein.node)
      protein@peptides <- xpathApply(protein.node, "./peptide", function(peptide.node) {
        peptide <- new('S4_peptide_class')
        ## fill in a couple of attributes of the peptide object using
        ## xmlValue and xmlAttrs(peptide.node)
        peptide@aas <- sapply(xmlElementsByTagName(peptide.node, name = "aa"), function(aa.node) {
          aa <- new('S4_aa_class')
          ## fill in a couple of attributes of the 'aa' object using
          ## xmlValue and xmlAttrs(aa.node)
          aa        # return the finished 'aa' object
        })
        peptide     # return the finished peptide object
      })
      protein       # return the finished protein object
    })
    free(xml.doc)

Does anyone know a better and quicker way of doing this?

Sorry for the very long message and thank you very much for your time and help!

Frederic
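For readers who want something to run, here is a small self-contained toy version of the skeleton above. It is only a sketch under several assumptions: the XML snippet, its attribute names (label, start, end, type, at) and the three Toy* classes are all made up for illustration and are not taken from the real MS/MS files or from the original S4 classes. It also uses lapply() over getNodeSet() instead of xpathApply(), purely so that every level is a plain R list that fits a "list" slot.

```r
library(XML)

# Made-up document following the model/protein/peptide/aa nesting.
xml.text <- '<results>
  <model id="m1">
    <protein label="P1">
      <peptide start="1" end="5">
        <aa type="M" at="3"/>
        <aa type="C" at="4"/>
      </peptide>
    </protein>
    <protein label="P2">
      <peptide start="7" end="12"/>
    </protein>
  </model>
</results>'

# Hypothetical stand-ins for the S4 classes in the original message.
setClass("ToyAA", representation(type = "character", at = "integer"))
setClass("ToyPeptide",
         representation(start = "integer", end = "integer", aas = "list"))
setClass("ToyProtein", representation(label = "character", peptides = "list"))

xml.doc <- xmlParse(xml.text, asText = TRUE)

proteins <- lapply(getNodeSet(xml.doc, "//model/protein"), function(protein.node) {
  peptides <- lapply(getNodeSet(protein.node, "./peptide"), function(peptide.node) {
    aas <- lapply(getNodeSet(peptide.node, "./aa"), function(aa.node) {
      new("ToyAA",
          type = xmlGetAttr(aa.node, "type"),
          at   = as.integer(xmlGetAttr(aa.node, "at")))
    })
    new("ToyPeptide",
        start = as.integer(xmlGetAttr(peptide.node, "start")),
        end   = as.integer(xmlGetAttr(peptide.node, "end")),
        aas   = aas)
  })
  new("ToyProtein", label = xmlGetAttr(protein.node, "label"), peptides = peptides)
})

free(xml.doc)  # release the C-level document once the R objects exist
```

Each anonymous function ends with the object it has built, so the outer list really contains protein objects rather than the value of the last assignment.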