Glad the post elicited some discussion.  I haven't played with feather.  I've 
used data.table, and it is indeed appreciably faster than base approaches for 
getting big CSVs into R.  I also find dplyr (with, say, MonetDB) to be a workable 
solution when data sets are too large to fit in memory.  But for native R 
files, I've found RDS to be the fastest.
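
For anyone who wants to check this on their own machine, here is a minimal 
sketch in R that times readRDS() on a compressed and an uncompressed copy of 
the same data.  The data frame, its size, and the temporary file names are made 
up for illustration; timings depend on disk and data, so none are quoted here.

```r
# Minimal sketch: compare readRDS() on compressed vs. uncompressed RDS files.
# The data frame and file names are hypothetical, for illustration only.
set.seed(1)
n  <- 1e6
df <- data.frame(id = seq_len(n), x = rnorm(n), y = runif(n))  # numeric columns only

f_gz  <- tempfile(fileext = ".rds")
f_raw <- tempfile(fileext = ".rds")

saveRDS(df, f_gz)                     # default: gzip-compressed
saveRDS(df, f_raw, compress = FALSE)  # uncompressed: larger file, no decompression on read

# Time each read; "elapsed" is wall-clock seconds.
t_gz  <- system.time(a <- readRDS(f_gz))
t_raw <- system.time(b <- readRDS(f_raw))

cat("compressed:   ", t_gz[["elapsed"]],  "s,", file.size(f_gz),  "bytes\n")
cat("uncompressed: ", t_raw[["elapsed"]], "s,", file.size(f_raw), "bytes\n")

# Both files should round-trip to the same object.
stopifnot(identical(a, df), identical(b, df))

unlink(c(f_gz, f_raw))  # clean up the temporary files
```

Running each read several times gives steadier numbers, since a single timing 
can be skewed by the operating system's file cache.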


Cheers

Charles






On May 6, 2016, at 9:01 PM, Simon Urbanek <simon.urba...@r-project.org> wrote:

> 
> On May 6, 2016, at 6:03 PM, Brandon Hurr <bhiv...@gmail.com> wrote:
> 
>> Simon,
>> 
>> Absolutely was about RDS, but R is all about choices and the
>> underlying issue was time to read in data which fread and feather are
>> quite fast at. I assume when you say efficient you are referring to
>> disk space?
>> 
> 
> No, parsing data is always slower than reading native formats. The fastest is 
> readBin (and similar direct I/O approaches), followed by feather and RDS (the 
> only reason RDS is not the fastest is that there is an extra copy in memory) 
> -- unless you have a slow disk, of course.
> 
> 
>> I put together a script to look at this further with and without
>> compression*. If speed is a priority over disk space then Feather and
>> data.table (CSV) are good options**. CSV is portable to any system and
>> feather can be used by python/Julia. RDS/RDA save a lot of space
>> but are slower to write and read due to compression.
>> 
> 
> That's why I said uncompressed RDS [compress=FALSE] - you compress only if 
> you want to save space, not speed :).
> 
> FWIW, according to our benchmarks iotools is the fastest for reading CSV if 
> you want to get into that arena, but that's a whole other story - my point 
> was that the question was NOT about CSV or anything parsed - nor was it 
> about writing - which is why this is getting really OT.
> 
> Cheers,
> Simon
> 
> 
> 
>> I hope that's helpful to those thinking about their priorities for
>> file IO in R.
>> 
>> Brandon
>> 
>> * http://rpubs.com/bhive01/fileioinr
>> **  Writing a CSV with data.table is freaky fast if you can get OpenMP
>> working on your machine:
>> https://github.com/Rdatatable/data.table/issues/1692
>> Reading that same CSV is comparable to RDS.
>> 
>> 
>> On Fri, May 6, 2016 at 6:07 AM, Simon Urbanek
>> <simon.urba...@r-project.org> wrote:
>>> Brandon,
>>> note that the post was about RDS which is more efficient than all the 
>>> options you list (in particular when not compressed). General advice is to 
>>> avoid strings. Numeric vectors are several orders of magnitude faster than 
>>> strings to load/save.
>>> Cheers,
>>> Simon
>>> 
>>> 
>>>> On May 5, 2016, at 6:49 PM, Brandon Hurr <bhiv...@gmail.com> wrote:
>>>> 
>>>> You might be interested in the speed wars that are happening in the
>>>> file reading/writing space currently.
>>>> 
>>>> Matt Dowle/Arun Srinivasan's data.table and Hadley Wickham/Wes
>>>> McKinney's Feather have made huge speed advances in reading/writing
>>>> large datasets from disks (mostly csv).
>>>> 
>>>> Data Table fread()/fwrite():
>>>> https://github.com/Rdatatable/data.table
>>>> https://stackoverflow.com/questions/35763574/fastest-way-to-read-in-100-000-dat-gz-files
>>>> http://blog.h2o.ai/2016/04/fast-csv-writing-for-r/
>>>> 
>>>> 
>>>> Feather read_feather()/write_feather()
>>>> https://github.com/wesm/feather
>>>> 
>>>> I don't often have big datasets (10s of MBs) so I don't see the
>>>> benefits of these much, but you might.
>>>> 
>>>> HTH,
>>>> B
>>>> 
>>>> On Thu, May 5, 2016 at 3:16 PM, Charles DiMaggio
>>>> <charles.dimag...@gmail.com> wrote:
>>>>> Been a while, but wanted to close the page on a previous post describing 
>>>>> R hanging on readRDS() and load() for largish (say 500MB or larger) 
>>>>> files. Tried again with recent release (3.3.0).  Am able to read in large 
>>>>> files under El Cap.  While the file is reading in, I get a disconcerting 
>>>>> spinning pinwheel of death and a check under Force Quit reports R is not 
>>>>> responding.  But if I wait it out, it eventually reads in.  Odd.  But I 
>>>>> can live with it.
>>>>> 
>>>>> Cheers
>>>>> 
>>>>> Charles
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> Charles DiMaggio, PhD, MPH
>>>>> Professor of Surgery and Population Health
>>>>> Director of Injury Research
>>>>> Department of Surgery
>>>>> New York University School of Medicine
>>>>> 462 First Avenue, NBV 15
>>>>> New York, NY 10016-9196
>>>>> charles.dimag...@nyumc.org
>>>>> Office: 212.263.3202
>>>>> Mobile: 516.308.6426
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> 
>>>>> _______________________________________________
>>>>> R-SIG-Mac mailing list
>>>>> R-SIG-Mac@r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-mac
>>>> 
>>> 
>> 
> 

