Re: [R] Extract subsets of different and unknown lengths from huge dataset

Phil Spector Sun, 30 Jan 2011 12:18:42 -0800

A reproducible example would be nice, but if I understand you,
you want to find the index of values which are preceded by at
least 24 zeroes. The rle (run length encoding) function is very
handy for problems like these.

Suppose the vector of interest is called "vec". To create a
vector called "start" whose value is "NA" except for those
positions immediately after at least 24 zeroes, you could try
something like this:

start = rep("NA",length(vec))
rls = rle(vec == 0)
# position immediately after each run of at least 24 zeroes
ind = cumsum(rls$lengths)[rls$lengths >= 24 & rls$values] + 1
# a qualifying run at the very end of vec has no position after it
ind = ind[ind <= length(vec)]
start[ind] = 'start'
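
To see what rle returns and how the indices come out, here is a small sketch on a made-up vector. The data and the threshold of 3 zeroes (standing in for 24) are assumptions chosen only to keep the example short.

```r
# Toy vector (made up); a threshold of 3 zeroes stands in for 24.
vec = c(0, 0, 0, 5, 2, 0, 7, 0, 0, 0, 0, 1)

rls = rle(vec == 0)
rls$values                      # TRUE FALSE TRUE FALSE TRUE FALSE
rls$lengths                     # 3 2 1 1 4 1

# cumsum of the run lengths gives the last position of each run.
ends = cumsum(rls$lengths)      # 3 5 6 7 11 12

# Keep only the runs of zeroes that are long enough, then step one
# position past the end of each of them.
ind = ends[rls$values & rls$lengths >= 3] + 1
ind = ind[ind <= length(vec)]   # drop a start that would fall past the end
ind                             # 4 12
```

The single zero at position 6 is too short to qualify, so it does not produce a start.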

To number the starts, you could use something like

num = rep(0,length(vec))
num[start == 'start'] = seq_len(sum(start == 'start'))
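
Since the stated goal was the maximum within each event, one possible way to carry the same rle idea through is sketched below. It is only a sketch under assumptions: the toy data, the threshold of 3 (standing in for 24), and the names min_zeros, event_id and event_max are all made up for illustration, and an event is taken to be everything between two qualifying runs of zeroes.

```r
# Toy data (made up); min_zeros = 3 stands in for the 24 in the question.
min_zeros = 3
vec = c(0, 0, 0, 5, 2, 0, 7, 0, 0, 0, 0, 1, 9, 0, 0, 0)

rls = rle(vec == 0)
# TRUE for each run that is a long enough stretch of zeroes.
long_zero = rls$values & rls$lengths >= min_zeros

# Expand the per-run information back to one value per element of vec.
in_gap   = rep(long_zero, rls$lengths)
event_id = rep(cumsum(long_zero), rls$lengths)  # each gap opens a new event
event_id[in_gap] = 0                            # the gaps belong to no event

# Maximum per event; id 0 (gaps and anything before the first gap) is dropped.
keep = event_id > 0
event_max = tapply(vec[keep], event_id[keep], max)
event_max
```

With this toy vector the two events are positions 4 to 7 and 12 to 13, so event_max holds 7 and 9. Short runs of zeroes inside an event (position 6 here) stay part of that event.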


                                        - Phil Spector
                                         Statistical Computing Facility
                                         Department of Statistics
                                         UC Berkeley
                                         spec...@stat.berkeley.edu




On Sun, 30 Jan 2011, Dustin wrote:


Dear prospective reader,


I apologize for posting my problem but I've just no idea how to go on by
processing this huge (over 70 MB) dataset. Thank you in advance for any help
or comment! I do appreciate it!

My textfile contains 1 column of interest (numbers/values only). The overall
issue is to extract 'events', starting points of which are defined by at
least 24 preceding values being equal to 0. Then, if the 25th value is
greater than 0, this is the start of an event of unknown length (unknown
number of values). And the end of an event again is defined by at least 24
values being equal to 0. I want to subset the single events for the purpose
of examining the maximum value within each event.

I tried:

xx1 <- read.table(pipe("cut -f2 corrected_data.txt"),header=T)
nrow(xx1)

[1] 2500000

start1 <- data.frame(start=rep("NA",length.out=nrow(xx1)))
stop1 <- data.frame(stop=rep("NA",length.out=nrow(xx1)))
max.xx1 <- data.frame(max.xx=rep("NA",length.out=nrow(xx1)))
XXframe <- data.frame(XX=xx1, start=start1, stop=stop1, max.xx=max.xx1)
attach(XXframe)
for(i in 1:(nrow(XX)-25)){

+       start[i+24] <- ifelse(XX[i:(i+23)]==0 && XX[i+24]>0, "start", "NA")
+ }

But this doesn't work - and every time I try it again, after changing the
'start' and the 'NA' within 'ifelse', e.g. into integers, a different error
appears (after hours). But this is only to set starts and stops; for the
original issue I further would try to number the starts and then maybe to
subset the single events using subset(). Do you think this could work, or
does anyone know a way to number the events? This would help me a lot!

Thanks again,
Dustin
--
View this message in context: 
http://r.789695.n4.nabble.com/Extract-subsets-of-different-and-unknown-lengths-from-huge-dataset-tp3247511p3247511.html
Sent from the R help mailing list archive at Nabble.com.

______________________________________________
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.

