The load_parallel_results_split_on_newline  you sent didn't seem to work
for me... In any case, here's my first approach.  I'm returning a matrix
instead of a data.table, since everything's the same type.


load_parallel_results_split_on_newline <- function(filenametable) {
  raw <- load_parallel_results_raw(filenametable)
  varnames <- setdiff(colnames(raw), c("stdout", "stderr"))
  header_cols <- which(colnames(raw) %in% varnames)
  ## Split each stdout cell into its individual lines.
  splits <- strsplit(raw[, "stdout"], "\n")
  lens <- sapply(splits, length)
  ## Replicate each header row once per line of its stdout.
  reps <- rep(seq_len(nrow(raw)), lens)
  ## drop = FALSE keeps the matrix shape (and column names) even
  ## when there is only a single header column.
  m <- cbind(raw[reps, header_cols, drop = FALSE], stdout = unlist(splits))
  return(m)
}
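Here is a toy demonstration of the split-and-replicate trick, with a hand-built matrix standing in for the output of load_parallel_results_raw() (that helper is not shown here, so its exact column layout is an assumption):

```r
## Two jobs, with 2 and 3 lines of stdout respectively.
raw <- cbind(foo    = c("1", "2"),
             bar    = c("10", "20"),
             stdout = c("a\nb", "c\nd\ne"))
splits <- strsplit(raw[, "stdout"], "\n")        # one vector of lines per job
lens   <- sapply(splits, length)                 # 2, 3
reps   <- rep(seq_len(nrow(raw)), lens)          # 1 1 2 2 2
m <- cbind(raw[reps, c("foo", "bar"), drop = FALSE],
           stdout = unlist(splits))
m   # 5 rows: each stdout line paired with its job's header values
```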


>>   load_parallel_results_split_to_columns(filenametable)
>>
>> I'm happy to write these, though I'm limited on time.  Could you
>> write a generator for test data?
>
> parallel --results my/results/dir --header : echo FOO={foo}
> BAR={bar}';'seq {bar} :::: <(echo foo; seq 1000) <(echo bar; seq 10)
>
> Can we do it with multiple columns?


> I do not like the idea of shelling out simply to read a file. If we
> are talking tons of small files then spawning a shell will slow it
> down tremendously.
>

Very good point.  I think the fastest approach would be to do all the data
processing in a single shell, with an (automatically generated) awk script
(run from parallel) that outputs the data in such a way that a single
read.table gives us the results we want. But let's see what we can do
with R alone.
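As a rough sketch of that idea (the awk program, file layout, and tab separator here are all illustrative assumptions, not parallel's guaranteed structure): have awk tag every stdout line with something identifying its job, and pipe the combined stream into one read.table call.

```r
## Fake a results directory with one stdout file.
dir <- tempfile("results")
dir.create(dir)
writeLines(c("a", "b"), file.path(dir, "stdout"))

## Generated awk program: prefix each line with the file it came from,
## tab-separated, so a single read.table parses the whole stream.
awk_cmd <- sprintf("awk -v OFS='\t' '{print FILENAME, $0}' %s/stdout", dir)
m <- read.table(pipe(awk_cmd), sep = "\t", stringsAsFactors = FALSE)
m   # column V1 = source file, column V2 = the stdout line
```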

>
> I read that anything you can do on a connection (i.e. R's filehandle)
> you can also do on a string using textConnection. So I would suggest
> we make an efficient raw reader and use that and then use
> a=sub(newlinesep,"\n",a) to replace newline/tab and finally use R's
> builtin reader on a textConnection.
>

Sounds good.
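For concreteness, the suggested pipeline might look like the sketch below. The separator byte is a made-up placeholder, and I've used gsub rather than sub so that every occurrence of the separator is replaced, not just the first:

```r
## Assumed placeholder byte the raw reader used to join multi-line
## stdout into one physical line.
newlinesep <- "\x01"
a <- paste("FOO=1\tBAR=10", "FOO=1\tBAR=20", sep = newlinesep)
a <- gsub(newlinesep, "\n", a, fixed = TRUE)   # restore real newlines
d <- read.table(textConnection(a), sep = "\t", stringsAsFactors = FALSE)
d   # two rows, parsed by R's builtin reader without touching the disk
```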

Does every directory generated by parallel always have both a stdout and
a stderr file?


David
