[R] Conditional editing of rows in a data frame

2010-01-28 Thread Irene Gallego Romero
Dear R users,

I have a dataframe (main.table) with ~30,000 rows and 6 columns, of
which here are a few rows:

      id chr window    gene    xp.norm xp.top
129 1_32   1     32  TAS1R1 1.28882115  FALSE
130 1_32   1     32  ZBTB48 1.28882115  FALSE
131 1_32   1     32  KLHL21 1.28882115  FALSE
132 1_32   1     32   PHF13 1.28882115  FALSE
133 1_33   1     33   PHF13 1.02727430  FALSE
134 1_33   1     33   THAP3 1.02727430  FALSE
135 1_33   1     33 DNAJC11 1.02727430  FALSE
136 1_33   1     33  CAMTA1 1.02727430  FALSE
137 1_34   1     34  CAMTA1 1.40312732   TRUE
138 1_35   1     35  CAMTA1 1.52104538  FALSE
139 1_36   1     36  CAMTA1 1.04853732  FALSE
140 1_37   1     37  CAMTA1 0.64794094  FALSE
141 1_38   1     38  CAMTA1 1.23026086   TRUE
142 1_38   1     38   VAMP3 1.23026086   TRUE
143 1_38   1     38    PER3 1.23026086   TRUE
144 1_39   1     39    PER3 1.18154967   TRUE
145 1_39   1     39    UTS2 1.18154967   TRUE
146 1_39   1     39 TNFRSF9 1.18154967   TRUE
147 1_39   1     39   PARK7 1.18154967   TRUE
148 1_39   1     39  ERRFI1 1.18154967   TRUE
149 1_40   1     40 no_gene 1.79796879  FALSE
150 1_41   1     41 SLC45A1 0.20193560  FALSE

I want to create two new columns, xp.bg and xp.n.top, using the
following criteria:

If gene is the same in consecutive rows, xp.bg is the minimum value of
xp.norm in those rows; if gene is not the same, xp.bg is simply the
value of xp.norm for that row;

Likewise, if there's a run of contiguous xp.top = TRUE values,
xp.n.top is the minimum value in that range, and if xp.top is false or
NA, xp.n.top is NA, or 0 (I don't care).

So, in the above example,
xp.bg for rows 136:141 should be 0.64794094, and is equal to xp.norm
for all other rows,
xp.n.top for row 137 is 1.40312732, 1.18154967 for rows 141:148, and
0/NA for all other rows.

Is there a way to combine indexing and if statements or some such to
accomplish this? I want to do this without using split(main.table,
main.table$gene), because there's about 20,000 unique entries for
gene, and one of the entries, no_gene, is repeated throughout. I
thought briefly of subsetting the rows where xp.top is TRUE, but I
then don't know how to set the range for min, so that it only looks at
what would originally have been consecutive rows, and searching the
help has not proved particularly useful.

Thanks in advance,
Irene Gallego Romero


-- 
Irene Gallego Romero
Leverhulme Centre for Human Evolutionary Studies
University of Cambridge
Fitzwilliam St
Cambridge
CB1 3QH
UK
email: ig...@cam.ac.uk

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
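
One possible approach to the question above, offered as a minimal sketch rather than a definitive answer: label each run of consecutive identical values with rle(), then take per-run minima with ave(). The helper run_id() and the toy data frame (a few rows around 136:141 of the excerpt) are illustrative names, not part of the original post. Note that, read literally, the stated rule also makes rows 132:133 (PHF13) and 143:144 (PER3) runs of their own.

```r
# Toy data: rows 135:142 of the excerpt (illustrative subset)
main.table <- data.frame(
  gene    = c("DNAJC11", "CAMTA1", "CAMTA1", "CAMTA1",
              "CAMTA1", "CAMTA1", "CAMTA1", "VAMP3"),
  xp.norm = c(1.02727430, 1.02727430, 1.40312732, 1.52104538,
              1.04853732, 0.64794094, 1.23026086, 1.23026086),
  xp.top  = c(FALSE, FALSE, TRUE, FALSE, FALSE, FALSE, TRUE, TRUE),
  stringsAsFactors = FALSE
)

# Hypothetical helper: give every run of consecutive equal values its own id
run_id <- function(x) {
  r <- rle(x)
  rep(seq_along(r$lengths), r$lengths)
}

# xp.bg: minimum of xp.norm within each run of consecutive identical genes
# (as.character() because rle() does not accept factors)
gene.run <- run_id(as.character(main.table$gene))
main.table$xp.bg <- ave(main.table$xp.norm, gene.run, FUN = min)

# xp.n.top: minimum of xp.norm within each run of xp.top == TRUE, NA elsewhere
top <- main.table$xp.top %in% TRUE          # treats NA as FALSE
top.run <- run_id(top)
main.table$xp.n.top <- ifelse(top,
                              ave(main.table$xp.norm, top.run, FUN = min),
                              NA)
```

Because the grouping is by run rather than by gene name, a value such as no_gene that recurs throughout the table is only pooled where its rows are actually adjacent.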


[R] Sliding window over irregular intervals

2009-03-30 Thread Irene Gallego Romero

Dear all,

I have some very big data files that look something like this:

id chr pos ihh1 ihh2 xpehh
rs5748748 22 15795572 0.0230222 0.0268394 -0.153413
rs5748755 22 15806401 0.0186084 0.0268672 -0.367296
rs2385785 22 15807037 0.0198204 0.0186616 0.0602451
rs1981707 22 15809384 0.0299685 0.0176768 0.527892
rs1981708 22 15809434 0.0305465 0.0187227 0.489512
rs11914222 22 15810040 0.0307183 0.0172399 0.577633
rs4819923 22 15813210 0.02707 0.0159736 0.527491
rs5994105 22 15813888 0.025202 0.0141296 0.578651
rs5748760 22 15814084 0.0242894 0.0146486 0.505691
rs2385786 22 15816846 0.0173057 0.0107816 0.473199
rs1990483 22 15817310 0.0176641 0.0130525 0.302555
rs5994110 22 15821524 0.0178411 0.0129001 0.324267
rs17733785 22 15822154 0.0201797 0.0182093 0.102746
rs7287116 22 15823131 0.0201993 0.0179028 0.12069
rs5748765 22 15825502 0.0193195 0.0176513 0.090302

I'm trying to extract the maximum and minimum xpehh (last column) values 
within a non-overlapping sliding window of width 1 (calculated 
relative to pos, the third column). However, as you can tell from the brief 
excerpt here, although all possible intervals will probably be covered 
by at least one data point, the number of data points will be variable 
(incidentally, if anyone knows of a way to obtain this number, that 
would be lovely), as will the spacing between them. Furthermore, values 
of chr (second column) will range from 1 to 22, and values of pos will 
be overlapping across them; I want to evaluate the window separately for 
each value of chr.


I've looked at the help and FAQ on sliding windows, but I'm a relative 
newcomer to R and cannot find a way to do what I need to do. Everything 
I've managed to unearth so far seems geared towards smoother time 
series. Any help on this problem would be vastly appreciated.


Thanks,
Irene

--
Irene Gallego Romero
Leverhulme Centre for Human Evolutionary Studies
University of Cambridge
Fitzwilliam St
Cambridge
CB2 1QH
UK

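
A minimal sketch of one way to approach the windowing question above, under stated assumptions: each position is assigned to a non-overlapping window by integer division of pos, and aggregate() then summarises xpehh per chr and window, including the number of data points. The data frame name snps, the window column, and the value width <- 10000 are hypothetical choices for illustration; the window width and its units should be set to whatever the analysis actually requires.

```r
# First six rows of the excerpt (illustrative subset)
snps <- data.frame(
  id    = c("rs5748748", "rs5748755", "rs2385785",
            "rs1981707", "rs1981708", "rs11914222"),
  chr   = 22,
  pos   = c(15795572, 15806401, 15807037, 15809384, 15809434, 15810040),
  xpehh = c(-0.153413, -0.367296, 0.0602451, 0.527892, 0.489512, 0.577633),
  stringsAsFactors = FALSE
)

width <- 10000                      # assumed window width, in units of pos

# Non-overlapping window index; positions in [k*width, (k+1)*width) share k
snps$window <- floor(snps$pos / width)

# Min, max and number of data points per chromosome and window
res <- aggregate(xpehh ~ chr + window, data = snps,
                 FUN = function(v) c(min = min(v), max = max(v), n = length(v)))

# aggregate() returns a matrix column here; flatten it into ordinary columns
# named xpehh.min, xpehh.max and xpehh.n
res <- do.call(data.frame, res)
```

Grouping on chr as well as window keeps chromosomes separate even though their pos values overlap. Windows that contain no data points simply do not appear in the result.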