[R] row by row similarity

2008-04-06 Thread Grant Gillis
Hello all, and thanks in advance for any advice.
I am very new to R and have searched for an answer to my question, but have
not come up with anything quite like what I would like to do.

My problem is:

I have a data set for individuals (rows) and values for behaviours
(columns).  I would like to know the proportion of shared behaviours for all
possible pairs of individuals.  The sum of shared behaviours divided by the
total.  There are zeros in the data that I would like treated as meaning
that the behaviour does not exist.


example data format:

ind  B1  B2  B3  B4  B5  B6
w    2   1   5   3   4   4
x    1   2   3   4   5   6
y    1   3   5   2   7   6
z    2   3   2   4   2   6


Desired output:

w  x   0
w  y   0.17
w  z   0
x   y   0.3
x   z   0.3
etc.


Thanks

Grant


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] row by row similarity

2008-04-06 Thread Simon Anders
Hi Grant,

Grant Gillis wrote:
> My problem is:
> 
> I have a data set for individuals (rows) and values for behaviours
> (columns).  I would like to know the proportion of shared behaviours for all
> possible pairs of individuals.  The sum of shared behaviours divided by the
> total.  There are zeros in the data that I would like treated as meaning
> that the behaviour does not exist.
> 
> example data format:
> 
> ind  B1  B2  B3  B4  B5  B6
> w    2   1   5   3   4   4
> x    1   2   3   4   5   6
> y    1   3   5   2   7   6
> z    2   3   2   4   2   6

I hope I understand correctly that the numbers label different
behaviours, hence e.g. individuals 'y' and 'z' have the same level of
behaviour, namely level '3', for the behaviour B2. You may want to look
at R's 'factor's, which allow you to give the levels descriptive names
instead of just numbers.
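
As a small illustration of the factor suggestion, the sketch below relabels
the B1 column. The behaviour names "grooming" and "foraging" are invented for
this example; the thread does not say what the codes stand for.

```r
# Codes for behaviour B1 of individuals w, x, y, z, as in the example data.
b1_codes <- c(2, 1, 1, 2)

# The labels are hypothetical placeholders, not taken from the original data.
b1 <- factor(b1_codes, levels = c(1, 2), labels = c("grooming", "foraging"))
print(b1)

# Factors that share the same levels can be compared with '==',
# just like the plain numeric codes.
w_equals_z <- b1[1] == b1[4]
print(w_equals_z)
```

The comparisons shown further down work the same way on such factors, as long
as both sides use the same set of levels.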

Let us first make a dataframe out of your example:

t <- data.frame(
B1 = c(2,1,1,2),
B2 = c(1,2,3,3),
B3 = c(5,2,5,3),
B4 = c(3,4,2,4),
B5 = c(4,5,7,2),
B6 = c(4,6,6,6) )
rownames(t) = c("w","x","y","z")

> t
  B1 B2 B3 B4 B5 B6
w  2  1  5  3  4  4
x  1  2  2  4  5  6
y  1  3  5  2  7  6
z  2  3  3  4  2  6

If you now test two rows for equality, this happens element-wise:

> t["w",] == t["y",]
     B1    B2   B3    B4    B5    B6
w FALSE FALSE TRUE FALSE FALSE FALSE

You can call 'sum' on this output to get the number of TRUE values.

> sum( t["w",] == t["y",] )
[1] 1

As you want to do this with all pairings, we need a nested 'sapply':

> sapply( rownames(t), function(ind1)
+    sapply( rownames(t), function(ind2)
+       sum( t[ind1,] == t[ind2,] ) ) )
  w x y z
w 6 0 1 1
x 0 6 2 2
y 1 2 6 2
z 1 2 2 6

This table now contains the desired information. Of course, you have to
divide by the number of behaviours, i.e. by 6, and the format is a bit
different from your suggestion, but I hope that does not matter.

> Desired output:
> 
> w  x   0
> w  y   0.17
> w  z   0
> x   y   0.3
> x   z   0.3
> etc.
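
If the one-pair-per-line layout is preferred, something along the following
lines might do. It is only a sketch: the names 'behav', 'pairs' and 'result'
are chosen here for illustration, and it uses 'combn' to enumerate the pairs
instead of the nested 'sapply' above. 'mean' of a logical vector gives the
proportion of TRUE values, i.e. the count divided by the number of behaviours.

```r
# Rebuild the example data; 'behav' is a name picked for this sketch.
behav <- data.frame(
  B1 = c(2, 1, 1, 2),
  B2 = c(1, 2, 3, 3),
  B3 = c(5, 2, 5, 3),
  B4 = c(3, 4, 2, 4),
  B5 = c(4, 5, 7, 2),
  B6 = c(4, 6, 6, 6),
  row.names = c("w", "x", "y", "z")
)
m <- as.matrix(behav)

# All unordered pairs of individuals, one pair per row.
pairs <- t(combn(rownames(m), 2))

# Proportion of behaviours on which the two individuals agree.
prop <- apply(pairs, 1, function(p) mean(m[p[1], ] == m[p[2], ]))

result <- data.frame(ind1 = pairs[, 1], ind2 = pairs[, 2],
                     shared = round(prop, 2))
print(result)
```

This version does not yet handle missing behaviours; every column is counted
in the denominator.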

To deal with the missing behaviour, it is better to use 'NA' instead of
0. R can then help you with it, as it treats NAs, i.e. values marked as
missing, in a special way.

Assume, for example, that you compare the rows

> r1 <- c( 2, 3, NA, 1, 5 )
> r2 <- c( 1, 3, 4, NA, 4 )

Calling '==' as above on such data yields:

> r1==r2
[1] FALSE  TRUE    NA    NA FALSE

As you can see, any comparison involving a missing behaviour yields NA,
because the two values cannot be compared. To get the number of TRUE values, use

> sum( r1==r2, na.rm=TRUE )
[1] 1

And to get the number of comparable observations, i.e. those without NA,
use e.g.

> length( na.omit( r1==r2 ) )
[1] 3
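
Putting the two counts together, a helper along these lines could recode the
zeros and return the proportion. This is a sketch under an assumption: it
takes "the total" to mean the number of comparable behaviours, which may not
be what was intended. The function name 'prop_shared' is made up here.

```r
# Proportion of shared behaviours between two rows, where 0 means "missing".
# Assumption: the denominator is the number of positions where neither row
# is missing, not the total number of columns.
prop_shared <- function(a, b) {
  a[a == 0] <- NA              # recode zeros as missing
  b[b == 0] <- NA
  eq <- a == b                 # NA wherever either value is missing
  shared <- sum(eq, na.rm = TRUE)
  comparable <- sum(!is.na(eq))
  shared / comparable          # NaN if no position is comparable
}

# Same rows as above, but with 0 in place of NA.
p <- prop_shared(c(2, 3, 0, 1, 5), c(1, 3, 4, 0, 4))
print(p)
```

To divide by the total number of behaviours instead, replace 'comparable' by
'length(eq)'.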

I hope this helps you to work out your own solution. Otherwise, ask again.

Best
   Simon


+---
| Dr. Simon Anders, Dipl. Phys.
| European Bioinformatics Institute, Hinxton, Cambridgeshire, UK
| preferred (permanent) e-mail: [EMAIL PROTECTED]
