On Tue, 13 Jan 2015, Mike Miller wrote:
> I have many pairs of data frames, each frame with about 15 million records
> and each pair with about 10 million records in common. They are sorted by
> two of their fields and will be merged by those same fields.
> The fact that the data are sorted could be used to greatly speed up a merge,
> but I have the impression that merge() cannot "know" in advance that the
> fields are already sorted.
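The idea of exploiting an existing sort order can be sketched as a single-pass
merge join. What follows is only an illustration, written in Python rather
than R, and it is not a description of how merge() or any R package works
internally. It assumes both inputs are already sorted by the same two key
fields and that each key pair occurs at most once per input. The names
merge_sorted, two_keys, k1, k2, x, and y are hypothetical.

```python
def merge_sorted(left, right, key):
    """Inner-join two lists of dicts that are already sorted by `key`.

    Assumption: each key value occurs at most once in each input.
    """
    i, j = 0, 0
    out = []
    while i < len(left) and j < len(right):
        kl, kr = key(left[i]), key(right[j])
        if kl < kr:
            i += 1      # left record has no partner on the right; skip it
        elif kl > kr:
            j += 1      # right record has no partner on the left; skip it
        else:
            # Keys match: combine the two records and advance both sides.
            out.append({**left[i], **right[j]})
            i += 1
            j += 1
    return out


def two_keys(rec):
    # Tuples compare field by field, which matches a two-field sort order.
    return (rec["k1"], rec["k2"])


# Hypothetical sample data, sorted by (k1, k2).
left = [
    {"k1": 1, "k2": "a", "x": 10},
    {"k1": 1, "k2": "b", "x": 11},
    {"k1": 2, "k2": "a", "x": 12},
]
right = [
    {"k1": 1, "k2": "b", "y": 20},
    {"k1": 2, "k2": "a", "y": 21},
    {"k1": 3, "k2": "a", "y": 22},
]

merged = merge_sorted(left, right, two_keys)
```

Each input is walked once from start to end, so the work grows linearly with
the total number of records and neither input has to be sorted or hashed
again.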
There are different versions of "merge". This sounds like a job for the
data.table package, which has its own way of doing merges that is likely
to be useful here. However, be warned that data.table takes some getting
used to, and if it can't figure out from your usage how to apply its
fast techniques, it will often fall back on the slower data.frame
approaches. [1] covers the single-column case, but extending it to
multiple columns is quite doable.
You might also find sqldf helpful if you are more comfortable with SQL
than data.table's way of doing things.
[1] http://stackoverflow.com/questions/17331684/fast-exists-in-data-table
> I'm sure that I can use merge(), but I suspect that it is doing a lot of
> unnecessary work and that it will take much more time than the job really
> should require. Is that correct? Can anything be done about it?
> The inspiration for my question comes partly from the way GNU comm works.
Not familiar with that.
> If you have any ideas about this, I'd love to hear them.
> Thanks in advance.
> Mike
> --
> Michael B. Miller, Ph.D.
> University of Minnesota
> http://scholar.google.com/citations?user=EV_phq4AAAAJ
______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
---------------------------------------------------------------------------
Jeff Newmiller                        The     .....       .....  Go Live...
DCN:<jdnew...@dcn.davis.ca.us>        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
/Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k