On Thu, 2007-08-16 at 12:33 -0400, AbouEl-Makarim Aboueissa wrote:
> Dear All:
> Urgent help is needed.
> I have a data set in matrix format with three columns: X, Y and an index
> of four groups (1,2,3,4). What I need to do is the following:
> 1- How can I subtract the sample mean of each group indexed 1,2,3,4
>    from the corresponding data values of this group and create new
>    columns, say X-sample mean and Y-sample mean? I tried to use "tapply"
>    but had difficulty restoring the new data.
> 2- How can I use "tapply", if possible, or any other R function to
>    find the correlation coefficient between the X and Y columns for
>    each group indexed 1,2,3,4? I could not use "tapply".
> I attached part of the data as txt file.
> Thank you so much for your attention to this matter, and I look
> forward to hearing from you soon.
> Regards,
> Abou
> Data:
> ====
> x         y        index
> 15807.24  12.5     4
> 15752.51  33.5     4
> 12893.76  01.5     3
> 8426.88   22.2     3
> 5706.24   333      3
> 3982.08   560      2
> 3642.62   670      2
> 295.68    124      1
> 215.40    104      1
> 195.40    204      1
> 4240.21   22.4     2
> 1222.72   45.9     2
> 1142.26   23.6     2
> 63.00     90.1     1
> 1216.00   82.4     2
> 2769.60   111      2
> 1790.46   34.7     2
> 26.10     26.10    1
> 19676.83  0.99     4
> 10920.60  203      3
> 6144.00   46       3
> 4534.48   4534.48  3
> 40000.00  65       4
> 29500.00  56       4
> 17100.00  77       4
> 9000.00   435      3
> 6300.00   84       3
> 3962.88   334      2
> 5690.00   653      3
> 3736.00   233      2
> 2750.00   22       2
> 1316.00   345      2
> 4595.00   4595.00  3
> 5928.00   45       3
> 2645.70   0.00     2
> 2580.24   454      2
> 6547.34   6547.34  3
> 1615.68   5        2
> 194.06    55       1
> 184.80    6        1
> 82.94     44       1
> 16649.00  56       4
> 4500.00   74       3
> 1600.00   744      2
> =================

I might be tempted to take the following approach:

If your data is a matrix, coerce it to a data frame first. Let's call
that 'DF'.

> str(DF)
'data.frame':   44 obs. of  3 variables:
 $ x    : num  15807 15753 12894  8427  5706 ...
 $ y    : num  12.5 33.5 1.5 22.2 333 560 670 124 104 204 ...
 $ index: int  4 4 3 3 3 2 2 1 1 1 ...
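
The coercion step itself might look like the sketch below. The names
'm.demo' and 'DF.demo' are hypothetical, and the matrix holds only the
first five rows of the posted data, so this is an illustration rather
than the exact object used above:

```r
# Hypothetical stand-in for the poster's matrix: the first five rows of
# the posted data, with columns x, y and index.
m.demo <- matrix(c(15807.24, 12.5, 4,
                   15752.51, 33.5, 4,
                   12893.76,  1.5, 3,
                    8426.88, 22.2, 3,
                    5706.24, 333,  3),
                 ncol = 3, byrow = TRUE,
                 dimnames = list(NULL, c("x", "y", "index")))

# as.data.frame() turns each matrix column into a data frame column.
DF.demo <- as.data.frame(m.demo)

# A numeric matrix stores every column as double, so convert 'index'
# to integer to match the str() output shown above.
DF.demo$index <- as.integer(DF.demo$index)

str(DF.demo)
```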

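Now use split() to break the data frame into a list of 4 sub-dataframes,
based upon the index value.  We can then call scale() on each one via
lapply() to center the 'x' and 'y' columns.  With scale = FALSE, scale()
subtracts the column means without dividing by the standard deviations,
and it returns a matrix: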

DF.ctr <- lapply(split(DF[, -3], DF$index), scale, scale = FALSE)

> str(DF.ctr)
List of 4
 $ 1: num [1:8, 1:2]  138.5   58.2   38.2  -94.2 -131.1 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:8] "8" "9" "10" "14" ...
  .. ..$ : chr [1:2] "x" "y"
  ..- attr(*, "scaled:center")= Named num [1:2] 157.2  81.7
  .. ..- attr(*, "names")= chr [1:2] "x" "y"
 $ 2: num [1:16, 1:2]  1469  1129  1727 -1291 -1371 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:16] "6" "7" "11" "12" ...
  .. ..$ : chr [1:2] "x" "y"
  ..- attr(*, "scaled:center")= Named num [1:2] 2513  230
  .. ..- attr(*, "names")= chr [1:2] "x" "y"
 $ 3: num [1:13, 1:2]  5879  1413 -1308  3906  -870 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:13] "3" "4" "5" "20" ...
  .. ..$ : chr [1:2] "x" "y"
  ..- attr(*, "scaled:center")= Named num [1:2] 7014 1352
  .. ..- attr(*, "names")= chr [1:2] "x" "y"
 $ 4: num [1:7, 1:2] -6262 -6317 -2393 17931  7431 ...
  ..- attr(*, "dimnames")=List of 2
  .. ..$ : chr [1:7] "1" "2" "19" "23" ...
  .. ..$ : chr [1:2] "x" "y"
  ..- attr(*, "scaled:center")= Named num [1:2] 22069    43
  .. ..- attr(*, "names")= chr [1:2] "x" "y"
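
As a cross-check on the centering step, the same per-group centering
can probably be obtained with ave(), which returns the group mean for
every row in the original row order. This is only a sketch: 'DF.demo',
'x.ctr' and 'y.ctr' are hypothetical names, and the data are five rows
taken from the posted data set (groups 4 and 1 only):

```r
# Hypothetical five-row subset of the posted data (groups 4 and 1).
DF.demo <- data.frame(
  x     = c(15807.24, 15752.51, 295.68, 215.40, 195.40),
  y     = c(12.5, 33.5, 124, 104, 204),
  index = c(4L, 4L, 1L, 1L, 1L)
)

# ave(v, g) returns, for each element of v, the mean of v within its
# group g (FUN = mean is the default), so subtracting it centers by group.
DF.demo$x.ctr <- DF.demo$x - ave(DF.demo$x, DF.demo$index)
DF.demo$y.ctr <- DF.demo$y - ave(DF.demo$y, DF.demo$index)

DF.demo
```

Because ave() keeps the original row order, the centered columns can be
attached directly to the data frame without a later join.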

Now stack the centered pieces in DF.ctr into a single object with
rbind().  Because scale() returns matrices, DF.new is a numeric matrix
rather than a data frame:

DF.new <- do.call(rbind, DF.ctr)

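Define colnames.  These are not syntactic R names, so in a data frame
they have to be quoted with backticks, as in DF.final$`x-mean`: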

colnames(DF.new) <- c("x-mean", "y-mean")

> str(DF.new)
 num [1:44, 1:2]  138.5   58.2   38.2  -94.2 -131.1 ...
 - attr(*, "dimnames")=List of 2
  ..$ : chr [1:44] "8" "9" "10" "14" ...
  ..$ : chr [1:2] "x-mean" "y-mean"

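Now use merge() to join DF and DF.new by their row names.  merge() sorts
the result on the key, and row names are character strings, so the rows
come back in the order "1", "10", "11", ... rather than 1, 2, 3, ...: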

DF.final <- merge(DF, DF.new, by = "row.names")

> DF.final
   Row.names        x       y index      x-mean       y-mean
1          1 15807.24   12.50     4 -6262.12857   -30.498571
2         10   195.40  204.00     1    38.22750   122.350000
3         11  4240.21   22.40     2  1726.93188  -208.037500
4         12  1222.72   45.90     2 -1290.55812  -184.537500
5         13  1142.26   23.60     2 -1371.01812  -206.837500
6         14    63.00   90.10     1   -94.17250     8.450000
7         15  1216.00   82.40     2 -1297.27812  -148.037500
8         16  2769.60  111.00     2   256.32188  -119.437500
9         17  1790.46   34.70     2  -722.81812  -195.737500
10        18    26.10   26.10     1  -131.07250   -55.550000
11        19 19676.83    0.99     4 -2392.53857   -42.008571
12         2 15752.51   33.50     4 -6316.85857    -9.498571
13        20 10920.60  203.00     3  3906.26923 -1148.809231
14        21  6144.00   46.00     3  -870.33077 -1305.809231
15        22  4534.48 4534.48     3 -2479.85077  3182.670769
16        23 40000.00   65.00     4 17930.63143    22.001429
17        24 29500.00   56.00     4  7430.63143    13.001429
18        25 17100.00   77.00     4 -4969.36857    34.001429
19        26  9000.00  435.00     3  1985.66923  -916.809231
20        27  6300.00   84.00     3  -714.33077 -1267.809231
21        28  3962.88  334.00     2  1449.60188   103.562500
22        29  5690.00  653.00     3 -1324.33077  -698.809231
23         3 12893.76    1.50     3  5879.42923 -1350.309231
24        30  3736.00  233.00     2  1222.72188     2.562500
25        31  2750.00   22.00     2   236.72188  -208.437500
26        32  1316.00  345.00     2 -1197.27812   114.562500
27        33  4595.00 4595.00     3 -2419.33077  3243.190769
28        34  5928.00   45.00     3 -1086.33077 -1306.809231
29        35  2645.70    0.00     2   132.42188  -230.437500
30        36  2580.24  454.00     2    66.96187   223.562500
31        37  6547.34 6547.34     3  -466.99077  5195.530769
32        38  1615.68    5.00     2  -897.59812  -225.437500
33        39   194.06   55.00     1    36.88750   -26.650000
34         4  8426.88   22.20     3  1412.54923 -1329.609231
35        40   184.80    6.00     1    27.62750   -75.650000
36        41    82.94   44.00     1   -74.23250   -37.650000
37        42 16649.00   56.00     4 -5420.36857    13.001429
38        43  4500.00   74.00     3 -2514.33077 -1277.809231
39        44  1600.00  744.00     2  -913.27812   513.562500
40         5  5706.24  333.00     3 -1308.09077 -1018.809231
41         6  3982.08  560.00     2  1468.80188   329.562500
42         7  3642.62  670.00     2  1129.34188   439.562500
43         8   295.68  124.00     1   138.50750    42.350000
44         9   215.40  104.00     1    58.22750    22.350000
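
If the original row order matters, one way to restore it is to sort the
merged result on the numeric value of Row.names. The sketch below
assumes the row names are the default 1, 2, 3, ... and uses a
hypothetical 'DF.demo' made of rows 10 to 21 of the posted data; the
object names ending in '.demo' are illustrative only:

```r
# Hypothetical 12-row subset: rows 10 to 21 of the posted data.
DF.demo <- data.frame(
  x     = c(195.40, 4240.21, 1222.72, 1142.26, 63.00, 1216.00,
            2769.60, 1790.46, 26.10, 19676.83, 10920.60, 6144.00),
  y     = c(204, 22.4, 45.9, 23.6, 90.1, 82.4,
            111, 34.7, 26.10, 0.99, 203, 46),
  index = c(1L, 2L, 2L, 2L, 1L, 2L, 2L, 2L, 1L, 4L, 3L, 3L)
)

# Same steps as above: center within each group, stack, rename, merge.
ctr.demo <- do.call(rbind,
                    lapply(split(DF.demo[, -3], DF.demo$index),
                           scale, scale = FALSE))
colnames(ctr.demo) <- c("x-mean", "y-mean")
DF.final.demo <- merge(DF.demo, ctr.demo, by = "row.names")

# Row.names is character, so convert it to integer before ordering.
ord <- order(as.integer(as.character(DF.final.demo$Row.names)))
DF.final.demo <- DF.final.demo[ord, ]
rownames(DF.final.demo) <- NULL

DF.final.demo
```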

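To get the correlation coefficient for each sub-group, note that cor()
applied to a two-column data frame returns a 2 x 2 correlation matrix,
whose [1, 2] element is the correlation between 'x' and 'y':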

> unlist(lapply(split(DF[, -3], DF$index), function(x) cor(x)[1, 2]))
         1          2          3          4 
 0.4468744  0.2619220 -0.3608070  0.3848641
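
On the tapply() part of the question: tapply() hands FUN a single vector
per group, which is why a two-column statistic such as cor() is awkward
with it. Two possible variations are sketched below, one with sapply()
over the split data frame and one that applies tapply() to the row
numbers instead of the data. 'DF.demo', 'r.sapply' and 'r.tapply' are
hypothetical names, and the data are six rows taken from the posted data
set, so the values will differ from the full-data results above:

```r
# Hypothetical six-row subset of the posted data (groups 4 and 1).
DF.demo <- data.frame(
  x     = c(15807.24, 15752.51, 19676.83, 295.68, 215.40, 195.40),
  y     = c(12.5, 33.5, 0.99, 124, 104, 204),
  index = c(4L, 4L, 4L, 1L, 1L, 1L)
)

# Variation 1: split the whole data frame, then correlate x and y
# inside each piece; sapply() simplifies the result to a named vector.
r.sapply <- sapply(split(DF.demo, DF.demo$index),
                   function(d) cor(d$x, d$y))

# Variation 2: let tapply() group the row numbers, and use them to
# pick the matching x and y values for each group.
r.tapply <- tapply(seq_len(nrow(DF.demo)), DF.demo$index,
                   function(i) cor(DF.demo$x[i], DF.demo$y[i]))

r.sapply
r.tapply
```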

See ?split, ?lapply, ?scale, ?do.call, ?rbind, ?unlist, ?merge and ?cor


Marc Schwartz

