I have a modest-sized XML file (52 MB) in a format suited to xmlToDataFrame
(package XML).
I have successfully read it into R by splitting the file 10 ways, running
xmlToDataFrame on each part, and then combining the results with rbind.fill
(package plyr). This takes about 530 s in total and yields a data.frame with
71k rows and an object.size of 21 MB.
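For concreteness, here is a miniature sketch of that split-and-combine
approach, using two tiny in-memory "parts" in place of the 10 split files
(the <rows>/<row> tag names are made up for illustration):

```r
library(XML)
library(plyr)

## Two tiny in-memory "parts" standing in for the 10 split files;
## the tag names here are hypothetical.
part1 <- xmlParse("<rows><row><a>1</a><b>x</b></row></rows>", asText = TRUE)
part2 <- xmlParse("<rows><row><a>2</a><c>y</c></row></rows>", asText = TRUE)

## xmlToDataFrame on each part, then rbind.fill to combine
parts    <- lapply(list(part1, part2), xmlToDataFrame)
combined <- rbind.fill(parts)  # fills columns missing from a part with NA
```

rbind.fill (rather than rbind) matters here because different parts of the
file may not contain exactly the same set of fields.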
But running xmlToDataFrame on the whole file has not finished (> 10000 s
so far), whereas xmlParse of the same file takes only 0.8 s.
To investigate, I ran xmlToDataFrame on the first 10% of the file, then on
that 10% duplicated to make two copies, then three copies (with the outer
tags adjusted, of course). Timings:
1 copy:   111 s = 111 s per copy
2 copies: 311 s = 155 s per copy
3 copies: 626 s = 209 s per copy
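The experiment above can be reproduced in miniature along these lines (a
sketch with a small synthetic chunk, not my actual data; whether the
superlinear growth shows up at this scale may vary):

```r
library(XML)

## A fixed chunk of records, repeated k times inside one root element,
## mimicking "the first 10% of the file, duplicated k times".
chunk <- paste(rep("<row><a>1</a><b>2</b></row>", 200), collapse = "")

time_k_copies <- function(k) {
  xml <- paste0("<rows>", paste(rep(chunk, k), collapse = ""), "</rows>")
  doc <- xmlParse(xml, asText = TRUE)
  unname(system.time(xmlToDataFrame(doc))["elapsed"])
}

elapsed <- sapply(1:3, time_k_copies)
elapsed / (1:3)  # seconds per copy; on my 52 MB file this grew with k
```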
The runtime is superlinear. What is going on here? Is there a better
approach?
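For what it's worth, one workaround I am considering (sketch only; the tag
names are made up) is to parse once and extract each field with a single
XPath query per column, bypassing xmlToDataFrame's per-record walk:

```r
library(XML)

doc <- xmlParse("<rows>
                   <row><a>1</a><b>x</b></row>
                   <row><a>2</a><b>y</b></row>
                 </rows>", asText = TRUE)

## One XPath query per column. Assumes every <row> contains every
## field; if fields can be missing, the columns would misalign.
df <- data.frame(
  a = xpathSApply(doc, "//row/a", xmlValue),
  b = xpathSApply(doc, "//row/b", xmlValue),
  stringsAsFactors = FALSE
)
```

I don't know yet whether this avoids the superlinear behavior, but it keeps
the fast xmlParse step and replaces the slow per-record conversion.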
Thanks,
-s
______________________________________________
[email protected] mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.