Glad the post elicited some discussion. Haven't played with feather. I've used data.table, and it is indeed appreciably faster than base approaches for getting big CSVs into R. I also find dplyr (with, say, MonetDB) to be a workable approach for data sets too large to fit in memory. But for native R files, I've found RDS to be fastest.
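For anyone who wants to check the RDS point on their own machine, something along these lines is a minimal base-R sketch, not a proper benchmark. It times readRDS() on the same data saved with the default compression and with compress = FALSE. The data frame, its size, and the file names are made up for illustration, and timings will vary with disk and data.

```r
# Sketch only: compare readRDS() on compressed vs. uncompressed RDS files.
# The data frame below is hypothetical, numeric-only test data.
set.seed(1)
n  <- 5e5
df <- data.frame(id = seq_len(n), x = rnorm(n), y = runif(n))

f_gz  <- tempfile(fileext = ".rds")
f_raw <- tempfile(fileext = ".rds")

saveRDS(df, f_gz)                     # default: gzip-compressed
saveRDS(df, f_raw, compress = FALSE)  # uncompressed: larger file, no decompression on read

# Record file sizes in bytes before cleaning up
sizes <- c(compressed = file.size(f_gz), uncompressed = file.size(f_raw))

# Time a single read of each file (one run, so treat the numbers as rough)
t_gz  <- system.time(df_gz  <- readRDS(f_gz))
t_raw <- system.time(df_raw <- readRDS(f_raw))

print(sizes)
print(c(compressed = t_gz[["elapsed"]], uncompressed = t_raw[["elapsed"]]))

# Both files should round-trip to the same object
stopifnot(identical(df, df_gz), identical(df, df_raw))

unlink(c(f_gz, f_raw))
```

A single system.time() call is noisy, so repeating the reads a few times (and watching for disk caching) gives a fairer picture.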
Cheers

Charles

On May 6, 2016, at 9:01 PM, Simon Urbanek <simon.urba...@r-project.org> wrote:

>
> On May 6, 2016, at 6:03 PM, Brandon Hurr <bhiv...@gmail.com> wrote:
>
>> Simon,
>>
>> Absolutely was about RDS, but R is all about choices and the
>> underlying issue was time to read in data which fread and feather are
>> quite fast at. I assume when you say efficient you are referring to
>> disk space?
>>
>
> No, parsing data is always slower than native formats. Really fastest is
> readBin (and similar direct I/O approaches), followed by feather and RDS (the
> only reason RDS is not the fastest is that there is an extra copy in-memory)
> -- unless you have slow disk, of course.
>
>
>> I put together a script to look at this further with and without
>> compression*. If speed is a priority over disk space then Feather and
>> data.table (CSV) are good options**. CSV is portable to any system and
>> feather can be used by python/Julia. RDS/RDA saves a lot of space and,
>> but are slower to write and read due to compression.
>>
>
> That's why I said uncompressed RDS [compress=FALSE] - you compress only if
> you want to save space, not speed :).
>
> FWIW according to our benchmarks iotools is the fastest for reading CSV if
> you want to get into that arena, but that's whole another story - my point
> was that the question was NOT about CSV or anything parsed - and neither
> about writing - which is why this is getting really OT.
>
> Cheers,
> Simon
>
>
>
>> I hope that's helpful to those thinking about their priorities for
>> file IO in R.
>>
>> Brandon
>>
>> * http://rpubs.com/bhive01/fileioinr
>> ** writing a CSV with data.table is freaky fast if you can get OpenMP
>> working on your machine
>> https://github.com/Rdatatable/data.table/issues/1692 Reading that same
>> CSV is comparable to RDS.
>>
>>
>> On Fri, May 6, 2016 at 6:07 AM, Simon Urbanek
>> <simon.urba...@r-project.org> wrote:
>>> Brandon,
>>> note that the post was about RDS which is more efficient than all the
>>> options you list (in particular when not compressed). General advice is to
>>> avoid strings. Numeric vectors are several orders of magnitude faster than
>>> strings to load/save.
>>> Cheers,
>>> Simon
>>>
>>>
>>>> On May 5, 2016, at 6:49 PM, Brandon Hurr <bhiv...@gmail.com> wrote:
>>>>
>>>> You might be interested in the speed wars that are happening in the
>>>> file reading/writing space currently.
>>>>
>>>> Matt Dowle/Arun Srinivasan's data.table and Hadley Wickham/Wes
>>>> McKinney's Feather have made huge speed advances in reading/writing
>>>> large datasets from disks (mostly csv).
>>>>
>>>> Data Table fread()/fwrite():
>>>> https://github.com/Rdatatable/data.table
>>>> https://stackoverflow.com/questions/35763574/fastest-way-to-read-in-100-000-dat-gz-files
>>>> http://blog.h2o.ai/2016/04/fast-csv-writing-for-r/
>>>>
>>>>
>>>> Feather read_feather()/write_feather()
>>>> https://github.com/wesm/feather
>>>>
>>>> I don't often have big datasets (10s of MBs) so I don't see the
>>>> benefits of these much, but you might.
>>>>
>>>> HTH,
>>>> B
>>>>
>>>> On Thu, May 5, 2016 at 3:16 PM, Charles DiMaggio
>>>> <charles.dimag...@gmail.com> wrote:
>>>>> Been a while, but wanted to close the page on a previous post describing
>>>>> R hanging on readRDS() and load() for largish (say 500MB or larger)
>>>>> files. Tried again with recent release (3.3.0). Am able to read in large
>>>>> files under El Cap. While the file is reading in, I get a disconcerting
>>>>> spinning pinwheel of death and a check under Force Quit reports R is not
>>>>> responding. But if I wait it out, it eventually reads in. Odd. But I
>>>>> can live with it.
>>>>>
>>>>> Cheers
>>>>>
>>>>> Charles
>>>>>
>>>>> Charles DiMaggio, PhD, MPH
>>>>> Professor of Surgery and Population Health
>>>>> Director of Injury Research
>>>>> Department of Surgery
>>>>> New York University School of Medicine
>>>>> 462 First Avenue, NBV 15
>>>>> New York, NY 10016-9196
>>>>> charles.dimag...@nyumc.org
>>>>> Office: 212.263.3202
>>>>> Mobile: 516.308.6426
>>>>>
>>>>> _______________________________________________
>>>>> R-SIG-Mac mailing list
>>>>> R-SIG-Mac@r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac