[R] How to filter data using sets generated by flattening with dcast, when I can't store those sets in a data frame

Jocelyn Ireson-Paine Thu, 12 Mar 2015 00:58:51 -0700

This is a fairly long question. It's about a problem that's easy to specify in terms of sets, but that I found hard to solve in R by using them, because of the strange design of R data structures. In explaining it, I'm going to touch on the reshape2 library, dcast, sets, and the non-orthogonality of R.

My problem stems from some drug-trial data that I've been analysing for the Oxford Pain Research Unit. Here's an example. Imagine a data frame representing patients in a trial of pain-relief drugs. The trial lasts for ten days. Each patient's pain is measured once a day, and the values are recorded in a data frame, one row per patient per day. Like this:


  ID  Day  Pain
   1    1  10
   1    2   9
   1    4   7
   1    7   2
   2    2   8
   2    3   7
   3    1  10
   3    3   6
   3    4   6
   3    8   2

Unfortunately, many patients have measurements missing. Thus, in the example above, patient 1 was only observed on days 1, 2, 4, and 7, rather than on the full ten days. But a patient's measurements are only useful to us if that patient has a certain minimum set of days, so I need to check for patients who lack those days. Let's assume that these days are numbers 1, 4, and 9.

Such a question is trivial to state in terms of sets. Let D(i) denote the set of days on which patient i was measured: then I want to find out which patients p, or how many patients p, have a D(p) that contains the set {1,4,9}.

The obvious way to solve this is to write a function that tells me whether one set is a superset of another. Then flatten my data frame so that it looks like this:


  ID  Days
   1  {1,2,4,7}
   2  {2,3}
   3  {1,3,4,8}

And finally, filter it by some R translation of

  flattened[ includes( flattened$Days, {1,4,9} ), ]

I started with the built-in functions that operate on sets represented as vectors. These are described in

 https://stat.ethz.ch/R-manual/R-devel/library/base/html/sets.html ,
"Set Operations". For example:

  > union( c(1,2,3), c(2,4,6) )
  [1] 1 2 3 4 6
  > intersect( c(1,2,3), c(2,4,6) )
  [1] 2

So I first wrote a set-inclusion function:

  # True if vector a is a superset of vector b.
  #
  includes <- function( a, b )
  {
    return( setequal( union( a, b ), a ) )
  }

Here are some sample calls:

  > includes( c(1), c() )
  [1] TRUE
  > includes( c(1), c(1) )
  [1] TRUE
  > includes( c(1), c(1,2) )
  [1] FALSE
  > includes( c(2,1), c(1,2) )
  [1] TRUE
  > includes( c(2,1,3), c(1,2) )
  [1] TRUE
  > includes( c(2,1,3), c(4,1,2) )
  [1] FALSE
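
The same superset test can also be phrased with %in%, which some readers may find easier to check by eye. This is only a sketch for comparison: includes_alt is a hypothetical name, not a base R function, and it is assumed that both arguments are plain vectors.

```r
# Hypothetical alternative to 'includes' (name invented for illustration):
# TRUE when every element of b also occurs in a.
includes_alt <- function(a, b) {
  all(b %in% a)
}

# The same kinds of calls as in the samples above.
print(includes_alt(c(2, 1, 3), c(1, 2)))
print(includes_alt(c(2, 1, 3), c(4, 1, 2)))
print(includes_alt(c(1), c()))
```

An empty b gives all() of an empty logical vector, which R defines as TRUE, so the empty set counts as included in every set, as with 'includes'.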

I then made myself a variable holding my sample data frame:

  df <- data.frame( ID = c( 1, 1, 1, 1, 2, 2, 3, 3, 3, 3 )
                  , Day = c( 1, 2, 4, 7, 2, 3, 1, 3, 4, 8 )
                  )

And I tried flattening it, using dcast and an aggregator function as described in (amongst many other places) http://seananderson.ca/2013/10/19/reshape.html , "An Introduction to reshape2" by Sean C. Anderson.

The idea behind this is that (for my data) dcast will call the aggregator function once per patient ID, passing it all the Day values for the patient. The aggregator must combine them in some way, and dcast puts its results into a new column. For example, here's an aggregator that merely sums its arguments:


  aggregator_making_sum <- function( ... )
  {
    return( sum( ... ) )
  }

If I call it, I get this:

  >  dcast( df, ID~. , fun.aggregate=aggregator_making_sum )
  Using Day as value column: use value.var to override.
    ID  .
  1  1 14
  2  2  5
  3  3 16

And here's an aggregator that converts the argument list to a string:

  aggregator_making_string <- function( ... )
  {
    return( toString( ... ) )
  }

Calling it gives this:

  >  dcast( df, ID~. , fun.aggregate=aggregator_making_string )
  Using Day as value column: use value.var to override.
    ID          .
  1  1 1, 2, 4, 7
  2  2       2, 3
  3  3 1, 3, 4, 8

In both of these, the three dots denote all arguments to the aggregator, as explained in Burns Statistics's http://www.burns-stat.com/the-three-dots-construct-in-r/ . My first aggregator sums them; my second converts them to a string. Both uses of dcast generate a data frame with a column named "." , which contains the aggregates. In the second data frame, that may not be so clear: the first column of numbers is row numbers; the second column of numbers are the IDs; and the remaining columns form the strings, belonging to "." .
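
For comparison, the per-patient string aggregation can be sketched in base R with aggregate, which needs no extra package. This is a side-by-side illustration rather than a description of what dcast does internally; the variable name agg is invented here, and the aggregate column keeps the name Day instead of "." .

```r
# The sample data from above: one row per patient per observed day.
df <- data.frame(ID  = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
                 Day = c(1, 2, 4, 7, 2, 3, 1, 3, 4, 8))

# aggregate() passes each patient's Day values to FUN as one vector;
# toString() collapses that vector into a single comma-separated string.
agg <- aggregate(Day ~ ID, data = df, FUN = toString)
print(agg)
```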

But what I want is neither a sum nor a string but a set. Specifically, a set that's compatible with the R set operations I called in my 'includes' function. Since these sets are vectors, my aggregator should just pack its arguments into a vector:


  aggregator_making_set <- function( ... )
  {
    return( c( ... ) )
  }

But when I tried it, I got an error:

  > dcast( df, ID~. , fun.aggregate=aggregator_making_set )
  Using Day as value column: use value.var to override.
  Error in vapply(indices, fun, .default) : values must be length 0,
   but FUN(X[[1]]) result is length 4
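
One possible reading of that message, sketched in base R. It rests on an assumption about dcast's internals that is not confirmed here: that the expected shape of each aggregate is taken from calling the aggregator on a zero-length input, and that the per-group results are then collected with vapply. Under that assumption, c() on empty input yields a length-0 template, and any group with more than zero values fails to match it.

```r
df <- data.frame(ID  = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
                 Day = c(1, 2, 4, 7, 2, 3, 1, 3, 4, 8))

# One vector of days per patient.
groups <- split(df$Day, df$ID)

# Assumed (not verified) mechanism: the template is the aggregator applied
# to empty input. For c() that is a zero-length numeric vector.
template <- c(df$Day[0])

# vapply insists that every result has the template's length.
msg <- tryCatch(
  {
    vapply(groups, function(x) c(x), template)
    "no error"
  },
  error = function(e) conditionMessage(e)
)
print(msg)

# An aggregator that returns one value per group fits a length-1 template.
sums <- vapply(groups, sum, numeric(1))
print(sums)
```

If this reading is right, the complaint is about the aggregator returning more than one value per group, not about 'c' or the three dots.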

It's not an informative error message, because it expects me to know how dcast is coded. And I'm surprised that values need to be length 0: length 1 would seem more appropriate. But perhaps it's trying to say that 'c' doesn't work on three-dots argument lists. Let's test that hypothesis:


  test_c_on_three_dots <- function( ... )
  {
    return( c( ... ) )
  }

  >   test_c_on_three_dots( 1 )
  [1] 1
  >   test_c_on_three_dots( 1, 2 )
  [1] 1 2
  >   test_c_on_three_dots( 1, 2, 3 )
  [1] 1 2 3

So 'c' does indeed work on three-dots argument lists. The error must have been caused by something else. Let's try making a set and putting it into a data frame directly:


  > df <- data.frame( col1=c(1,2), col2=c(3,4) )
  > df
    col1 col2
  1    1    3
  2    2    4
  > set <- union( c(5,6), c(6,7) )
  > set
  [1] 5 6 7
  > df[ 1, ]$col1 <- set
  Error in `$<-.data.frame`(`*tmp*`, "col1", value = c(5, 6, 7)) :
    replacement has 3 rows, data has 1

So that's the problem. Already in 1968, there was a language named Algol68 which had arrays and, in order to make things easy for its programmers, allowed you to create arrays of every data type the language provided. You could have arrays of Booleans, arrays of integers, arrays of records, arrays of discriminated unions, arrays of procedures, arrays of I/O formats, arrays of pointers, and arrays of arrays. The idea was "orthogonality" (see for example http://stackoverflow.com/questions/1527393/what-is-orthogonality ): that the programmer does not have to think about unexpected interactions between the concept of array and the concept of the element type, because there are none. If you have a data type, you can make arrays of that type. Pop-2 (1970), Snobol4 (1966), and Lisp (1958) were similarly generous. But R (1993) isn't. It wants to make life hard by forcing me to use different kinds of container for different kinds of element. And by providing a nice implementation of sets and then not letting me store them.

So I thought about the kinds of data that I _can_ store in a data frame and generate by flattening. Strings! So I decided to use my aggregator_making_string function to make a string representation of the set of days, and to write a set-inclusion function that compared these sets against sets represented as vectors:


  includes2 <- function( a_as_string, b )
  {
    a <- as.numeric( unlist( strsplit( a_as_string, split="," ) ) )
    return( setequal( union( a, b ), a ) )
  }

Here are some example calls:

  > includes2( '1,2,3', c(1) )
  [1] TRUE
  > includes2( '1,2,3', c(1,2) )
  [1] TRUE
  > includes2( '1,2,3', c(1,2,4) )
  [1] FALSE
  > includes2( '1,2,3', c(3) )
  [1] TRUE
  > includes2( '1,2,3', c(0,3) )
  [1] FALSE
  >

I then tried using it:

  df <- data.frame( ID = c( 1, 1, 1, 1, 2, 2, 3, 3, 3, 3 )
                  , Day = c( 1, 2, 4, 7, 2, 3, 1, 3, 4, 8 )
                  )

  aggregator_making_string <- function( ... )
  {
    return( toString( ... ) )
  }

  flattened <- dcast( df, ID~. , fun.aggregate=aggregator_making_string )

  # Which patients have a day 1?
  flattened[ includes2( flattened$. , c(1) ), ]

Unfortunately, that didn't work. The final statement selected every row of 'flattened'. I eventually realised that I had to vectorise 'includes2':


  includes3 <- Vectorize( includes2, "a_as_string" )

And that did work:

  >   flattened[ includes3( flattened$. , c(1) ), ]
    ID          .
  1  1 1, 2, 4, 7
  3  3 1, 3, 4, 8
  >   flattened[ includes3( flattened$. , c(1,2) ), ]
    ID          .
  1  1 1, 2, 4, 7
  >   flattened[ includes3( flattened$. , c(1,3) ), ]
    ID          .
  3  3 1, 3, 4, 8
  >   flattened[ includes3( flattened$. , c(2) ), ]
    ID          .
  1  1 1, 2, 4, 7
  2  2       2, 3
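
Why the unvectorised call selected every row can be sketched in base R. The strings below are typed in by hand to mirror the "." column shown above, so the sketch does not depend on reshape2; pooled and per_row are names invented for illustration.

```r
# Hand-typed copies of the three strings in the "." column.
days <- c("1, 2, 4, 7", "2, 3", "1, 3, 4, 8")

# Given a whole column, strsplit() returns one piece per row, and unlist()
# then pools every patient's days into a single vector.
pooled <- as.numeric(unlist(strsplit(days, split = ",")))
print(pooled)

# The test on the pooled days yields ONE logical value. Used as a row index,
# a single TRUE is recycled across all rows, so every row is selected.
print(all(c(1) %in% pooled))

# Testing each row's own string separately gives one answer per row,
# which is what Vectorize arranges for includes2.
per_row <- vapply(
  strsplit(days, split = ","),
  function(parts) all(c(1) %in% as.numeric(parts)),
  logical(1)
)
print(per_row)
```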

The moral of this email tale is that sets are really useful for filtering data, and dcast ought to be really useful for generating sets, but R refuses to let me store them in the data frame that dcast generates. I can fudge it by representing the sets as strings, but is there a cleaner way to solve the problem?
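
One candidate worth testing against that question is a list column: a data frame column may itself be a list, with one vector per row. The sketch below uses only base R (split and vapply) rather than dcast, so it sidesteps the aggregator instead of fixing it. The names days_by_id, required and has_required are invented for illustration, and days 1 and 4 stand in for the required set because no patient in the sample data has a day 9.

```r
df <- data.frame(ID  = c(1, 1, 1, 1, 2, 2, 3, 3, 3, 3),
                 Day = c(1, 2, 4, 7, 2, 3, 1, 3, 4, 8))

# One vector of days per patient, as a list named by ID.
days_by_id <- split(df$Day, df$ID)

# Build the flattened frame, then attach the list as a column:
# a list whose length equals the number of rows is accepted by $<-.
flattened <- data.frame(ID = as.numeric(names(days_by_id)))
flattened$Days <- unname(days_by_id)

# Keep the patients whose set of days contains every required day.
required <- c(1, 4)
has_required <- vapply(
  flattened$Days,
  function(d) all(required %in% d),
  logical(1)
)
print(flattened[has_required, ])
```

Each element of flattened$Days stays an ordinary numeric vector, so the base set functions (union, intersect, setequal) can be applied to it directly.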


Cheers,

Jocelyn Ireson-Paine
07768 534 091
http://www.jocelyns-cartoons.uk
http://www.j-paine.org

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
