Re: [R] Parsing large XML documents in R - how to optimize the speed?

2012-08-11 Thread Duncan Temple Lang

Hi Frederic

  You definitely want to be using xmlParse() (or equivalently
  xmlTreeParse( , useInternalNodes = TRUE)).

  This then allows the use of getNodeSet() to query the tree with XPath.
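
  As a minimal sketch of that combination (the inline document, its element
  names and the "label" attribute are made up for illustration and are not
  taken from the real MS/MS files):

```r
library(XML)

# A tiny stand-in document (hypothetical content); the real files are far larger.
xml.text <- '<bioml>
  <model id="1">
    <protein label="P1"><peptide start="1" end="9"/></protein>
    <protein label="P2"><peptide start="4" end="12"/></protein>
  </model>
</bioml>'

# xmlParse() builds the tree as internal (C-level) nodes, so XPath can be used.
doc <- xmlParse(xml.text, asText = TRUE)

# getNodeSet() returns the nodes matching the XPath expression as a list.
protein.nodes <- getNodeSet(doc, "//model/protein")

# Pull one attribute out of every matched node.
labels <- sapply(protein.nodes, xmlGetAttr, "label")
print(labels)
```

  The XPath query is evaluated in C, so the R-level work is limited to the
  nodes that actually match.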

  I would suggest you use Rprof() to find out where the bottlenecks arise,
  e.g. in the XML functions, in the S4 code, or in your code that assembles
  the R objects from the XML.
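
  A rough sketch of such a profiling run follows. parse.models() is a
  hypothetical stand-in for the real parsing routine and only burns some
  time; the Rprof() and summaryRprof() calls are the part that carries over.

```r
# Hypothetical stand-in for the real parsing code; replace it with the
# function that builds the S4 objects from the XML.
parse.models <- function(n) {
  out <- vector("list", n)
  for (i in seq_len(n)) {
    out[[i]] <- paste(sample(letters, 20, TRUE), collapse = "")
  }
  out
}

prof.file <- tempfile(fileext = ".out")
Rprof(prof.file, interval = 0.01)  # start sampling the call stack
res <- parse.models(2e5)
Rprof(NULL)                        # stop profiling

# by.self ranks functions by the time spent in their own code;
# by.total also counts the time spent in the functions they call.
prof <- summaryRprof(prof.file)
print(head(prof$by.self))
```

  If most of the self time lands in new() or initialize(), the S4 object
  construction is the likely bottleneck; if it lands in the XML functions,
  the node access is.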

  I'm happy to take a look at speeding it up if you can make the test file
  available and show me your code.

D.
On 8/10/12 3:46 PM, Frederic Fournier wrote:
 Hello everyone,
 
 I would like to parse very large XML files from MS/MS experiments and
 create R objects from their content. (By very large, I mean up to
 5-10 GB, although I am using a 'small' 40 MB file to test my code.)
 
 My first attempt at parsing the 40 MB file, using the XML package, took more
 than 2200 seconds and left me quite disappointed.
 I managed to cut that down to around 40 seconds by:
 -using the 'useInternalNodes' option of the XML package when parsing
  the XML tree;
 -vectorizing the parsing (i.e., replacing loops like
  for(node in group.of.nodes) {...} with
  sapply(group.of.nodes, function(node) {...})).
 I gained another 5 seconds by making small changes to the functions used
 (like replacing 'getNodeSet' with 'xmlElementsByTagName' when I don't need to
 navigate to the children nodes).
 Now I am stuck at around 35 seconds and I would still like to cut this
 time by a factor of 5, but I have no clue how to achieve this gain. I'll try
 to describe as briefly as possible the relevant structure of the XML file I
 am parsing, the structure of the R object I want to create, and the type of
 functions I am using to do it. I hope that one of you will be able to point
 me towards a better and quicker way of doing the parsing!
 
 
 Here is the (simplified) structure of the relevant nodes of the xml file:
 
 <model> (many many nodes)
   <protein> (a couple of proteins per model node)
     <peptide> (1 per protein node)
       <domain> (1 or more per peptide node)
         <aa> (0 or more per domain node)
         </aa>
       </domain>
     </peptide>
   </protein>
 </model>
 
 Here is the basic structure of the R object that I want to create:
 
 'result' object that contains:
   -various attributes
   -a list of 'protein' objects, each of which contains:
     -various attributes
     -a list of 'peptide' objects, each of which contains:
       -various attributes
       -a list of 'aa' objects, each of which consists of a couple of
        attributes.
 
 Here is the basic structure of the code:
 
 xml.doc <- xmlTreeParse(file, getDTD=FALSE, useInternalNodes=TRUE)
 result <- new('S4_result_class')
 result@proteins <- xpathApply(xml.doc, "//model/protein",
                               function(protein.node) {
   protein <- new('S4_protein_class')
   ## fill in a couple of attributes of the protein object using xmlValue
   ## and xmlAttrs(protein.node)
   protein@peptides <- xpathApply(protein.node, "./peptide",
                                  function(peptide.node) {
     peptide <- new('S4_peptide_class')
     ## fill in a couple of attributes of the peptide object using xmlValue
     ## and xmlAttrs(peptide.node)
     peptide@aas <- sapply(xmlElementsByTagName(peptide.node, name="aa"),
                           function(aa.node) {
       aa <- new('S4_aa_class')
       ## fill in a couple of attributes of the 'aa' object using xmlValue
       ## and xmlAttrs(aa.node)
     })
   })
 })
 free(xml.doc)
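 
 For readers who want to run something close to this outline, here is a
 small self-contained sketch. The class names, slot names, attribute names
 and the inline document are all hypothetical, and the sketch differs from
 the outline in two ways that are choices of the sketch, not the poster's
 code: each callback returns its finished object explicitly, and the <aa>
 nodes are reached with the relative XPath "./domain/aa" because the
 structure above nests them inside <domain>.

```r
library(XML)
library(methods)

# Hypothetical minimal classes; the real ones would carry more slots.
setClass("Peptide", representation(start = "integer", aas = "list"))
setClass("Protein", representation(label = "character", peptides = "list"))

# Hypothetical miniature of the file structure described above.
xml.text <- '<bioml><model>
  <protein label="P1">
    <peptide start="1">
      <domain id="d1"><aa type="M" at="3"/><aa type="C" at="7"/></domain>
    </peptide>
  </protein>
</model></bioml>'

doc <- xmlParse(xml.text, asText = TRUE)

proteins <- xpathApply(doc, "//model/protein", function(protein.node) {
  # lapply() over getNodeSet() yields a plain list for the 'list' slots.
  peptides <- lapply(getNodeSet(protein.node, "./peptide"), function(peptide.node) {
    # Each <aa> is kept as its named character vector of attributes.
    aas <- lapply(getNodeSet(peptide.node, "./domain/aa"), xmlAttrs)
    new("Peptide",
        start = as.integer(xmlGetAttr(peptide.node, "start")),
        aas = aas)
  })
  # Return the finished object so that xpathApply() collects it.
  new("Protein", label = xmlGetAttr(protein.node, "label"), peptides = peptides)
})

print(proteins[[1]]@label)
```

 Storing the 'aa' entries as plain attribute vectors instead of S4 objects
 is only done to keep the sketch short; whether that is acceptable depends
 on what the downstream code expects.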
 
 
 Does anyone know a better and quicker way of doing this?
 
 Sorry for the very long message and thank you very much for your time and
 help!
 
 Frederic
 
   [[alternative HTML version deleted]]
 
 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.
 




Re: [R] Parsing large XML documents in R - how to optimize the speed?

2012-08-11 Thread Erdal Karaca
If this is an option for you: an XML database can handle very large XML
files and lets you query nodes very efficiently.
Then, you could query the XML database from R (using REST) to do your
statistics.

There are some open source XQuery/XML databases available.



Re: [R] Parsing large XML documents in R - how to optimize the speed?

2012-08-10 Thread Martin Morgan

On 08/10/2012 03:46 PM, Frederic Fournier wrote:

Hello everyone,

I would like to parse very large XML files from MS/MS experiments and
create R objects from their content. (By very large, I mean up to
5-10 GB, although I am using a 'small' 40 MB file to test my code.)


I'm not 100% sure of its relevance, but see

  http://bioconductor.org/packages/2.10/bioc/html/MSnbase.html

There is a vignette here, for instance


http://bioconductor.org/packages/2.10/bioc/vignettes/MSnbase/inst/doc/MSnbase-io.pdf

If this is useful, then further questions might be directed to the 
Bioconductor mailing list.


  http://bioconductor.org/help/mailing-list/

Martin







--
Computational Biology / Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109

Location: Arnold Building M1 B861
Phone: (206) 667-2793
