>
>  Is it possible to do an automatic fall back onto, say, read.csv if
> data.table or plyr is not installed?


It could certainly be done...  There are the apply functions, and
especially tapply. There's also "by", but it's pretty slow.  But I don't
think any of these is quite a drop-in replacement.
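
For the read.csv side, the skeleton could be as simple as this (just a
sketch: `files` stands in for the vector of per-job output files, and I'm
guessing at fread for the data.table path; neither comes from the actual
code):

    ## Fall back to plain read.csv when data.table isn't installed.
    ## Assumes every file in `files` is non-empty and has the same columns.
    if (requireNamespace("data.table", quietly = TRUE)) {
      raw <- data.table::rbindlist(lapply(files, data.table::fread))
    } else {
      # Slower, and loses fread's type handling, but needs only base R
      raw <- do.call(rbind, lapply(files, read.csv, stringsAsFactors = FALSE))
    }

The branch itself is cheap; it's replacing what plyr does that has no
obvious drop-in.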

Another approach, which would probably be fast but wouldn't have the nice
automatic column type casting, error protection, and so on, would be to
read a single non-empty file to figure out how many columns there are, and
then just split all the strings on the separator characters and re-form
them into matrices by row.  Combine this with something like the approach
used for the newline version to generate the final matrix.  I think
something like that could work.
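
Roughly this sort of thing (a quick sketch; `files` and `sep` are
placeholders for the file list and the field separator, and it assumes
every line has the same number of fields with no empty trailing field):

    files  <- files[file.info(files)$size > 0]     # drop empty outputs
    first  <- readLines(files[1], n = 1)           # peek at one non-empty file
    ncols  <- length(strsplit(first, sep, fixed = TRUE)[[1]])
    lines  <- unlist(lapply(files, readLines), use.names = FALSE)
    fields <- unlist(strsplit(lines, sep, fixed = TRUE), use.names = FALSE)
    raw    <- matrix(fields, ncol = ncols, byrow = TRUE)  # character matrix, no casting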



>
> Why:
>   rownames(raw) = 1:nrow(raw)
>
> Why not:
>   rownames(raw) = NULL
>

That line can be removed entirely.  It's a remnant from another version of
the function.



> > 2) When stdout is empty, I don't include any entries.  Another possibility
> > would be to include NAs, but that would take a few more lines of code.
>
> I am not sure what the correct R approach is. The UNIX approach would be
> no entries. So only if there is an R tradition of returning NAs should
> you consider changing that.
>

Eh... if you want to represent missing data, you typically use NAs.  But I
don't think it's necessary in this case.  It wouldn't be hard to put the
NAs back in after the fact.
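
For example, something along these lines would do it (purely illustrative;
`res`, `all_files`, and the `file` column are made-up names, not anything
in the actual code):

    ## `res` is the combined result with a `file` column naming the input
    ## each row came from; `all_files` is every input, including the ones
    ## whose stdout was empty.
    empty <- setdiff(all_files, res$file)
    if (length(empty) > 0) {
      na_rows <- res[rep(NA_integer_, length(empty)), ]  # all-NA rows, same columns
      na_rows$file <- empty
      res <- rbind(res, na_rows)
    }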


>
> One of the things that convince me is reproducible
> measurements/timings. I have too many times been tricked by the common
> wisdom that used to be true, but which no longer is true (Recent
> example UUOC: http://oletange.blogspot.dk/2013/10/useless-use-of-cat.html
> ).
>
Yes, I've run that experiment as well.  I've also yet to see any speedup
from writing a big temporary file to "ram disk" rather than /tmp and then
reading it back in, even when the file is pretty big.  So I agree that
testing is the only way to go.


> My gut feeling is that if the data is not in disk cache, then disk I/O
> will be the limiting factor, but I would love to see numbers to
> (dis)prove this.
>

Maybe I'll have some spare time for this later in the week...  The initial
'out-of-R' approach I had in mind would start an awk program for every
file.  awk starts pretty fast, but if this has to scale to a million files,
launching a process per file is probably not a great approach.  So the awk
script gets more complicated...



> I have commented the code and checked it in:
>
>   git clone git://git.savannah.gnu.org/parallel.git


Great.


David


>
>
> /Ole
>
