Thank you for the help everyone, it has been a very helpful but steep learning curve for me. I have ended up doing a loop as suggested by David as I could understand this a bit better and seems can apply more generally. I have set out my solution below in case that helps anyone.
I am interested though if the loop can be vectorized, as from my reading this is the preferred (faster) R approach but I can't get my head around how it works. Regards Adam # # I could either 1) loop through stops and check onoff, or 2) loop through onoff and check stops. # # I chose option 2 because the first option might miss data if a stop occurs more than once in the sequence and the # first occurance of stop_onoff does not relate to the first occurance (e.g for stop_sequence A,B,C,D,B,A if stop_onoff # records are B, A then on first iteration option 1 will link the first stop_sequence A to stop_onoff A and subsequent # iterations will ignore stop_onoff B because occurs before A). # bus_load_profile <- function(stop_onoff, stop_sequence) { # Add additional columns to stop_sequence to hold matched on/off/load data stop_sequence <- cbind(stop_sequence, on=0, off=0, load=0) # Start stop_sequence index at 1 (i.e. over all records) idx_stop_sequence <- 1 # Loop through stop_onoff records and assign to stop_sequence for (idx_stop_onoff in 1:nrow(stop_onoff)) { # Get stop_onoff ref ref_onoff <- as.character(stop_onoff$ref[idx_stop_onoff]) # Get candidate refs from stop_sequence (i.e only those occuring after any previous matches) ref_sequence_canidates <- as.character(stop_sequence$ref[idx_stop_sequence:nrow(stop_sequence)]) # Match the first occurance on ref_onoff in ref_sequence_canidates match_stop_sequence <- which(ref_onoff == ref_sequence_canidates)[1] # Update stop_sequence index to match current record idx_stop_sequence <- idx_stop_sequence + match_stop_sequence # Add stop_onoff data to the stop_sequence () stop_sequence$on[idx_stop_sequence - 1] <- stop_onoff$on[idx_stop_onoff] stop_sequence$off[idx_stop_sequence - 1] <- stop_onoff$off[idx_stop_onoff] } # Generate load profile stop_sequence$load <- cumsum(stop_sequence$on - stop_sequence$off) # return data.frame with load profile return(stop_sequence) } # Original test data: # Test data, note use of data.frames because data is pulled from database, noting real stop_equence data has # additional columns. NB I have removed the seq column from stop_sequence as the stops are by definition in sequence stop_sequence <- data.frame(ref=c('A','B','C','D','B','A')) stop_onoff <- data.frame(ref=c('A','D','B','A'),on=c(5,0,10,0),off=c(0,2,2,6)) # Get bus load profile bus_load_profile(stop_onoff, stop_sequence) #Aditional test: stop_sequence <- data.frame(ref=c('A','B','C','D','B','A')) stop_onoff <- data.frame(ref=c('B','A'),on=c(10,0),off=c(4,6)) bus_load_profile(stop_onoff, stop_sequence) On 31 August 2014 01:08, Gabor Grothendieck <ggrothendi...@gmail.com> wrote: > Try dtw. First convert ref to numeric since dtw does not handle > character input. Then align using dtw and NA out repeated values in > the alignment. Finally zap ugly row names and calculate loading: > > library(dtw) > s1 <- as.numeric(stop_sequence$ref) > s2 <- as.numeric(factor(as.character(stop_onoff$ref), > levels(stop_sequence$ref))) > a <- dtw(s1, s2) > DF <- cbind(stop_sequence, > stop_onoff[replace(a$index2, c(FALSE, diff(a$index2) == 0), NA), > ])[-3] > rownames(DF) <- NULL > transform(DF, loading = cumsum(ifelse(is.na(on), 0, on)) - > cumsum(ifelse(is.na(off), 0, off))) > > giving: > > seq ref on off loading > 1 10 A 5 0 5 > 2 20 B NA NA 5 > 3 30 C NA NA 5 > 4 40 D 0 2 3 > 5 50 B 10 2 11 > 6 60 A 0 6 5 > > You will need to test this with more data and tweak it if necessary > via the various dtw arguments. > > > On Fri, Aug 29, 2014 at 8:46 PM, Adam Lawrence <alaw...@gmail.com> wrote: > > I am hoping someone can help me with a bus stop sequencing problem in R, > > where I need to match counts of people getting on and off a bus to the > > correct stop in the bus route stop sequence. I have tried looking > > online/forums for sequence matching but seems to refer to numeric > sequences > > or DNA matching and over my head. I am after a simple example if anyone > can > > please help. > > > > I have two data series as per below (from database), that I want to > > combine. In this example “stop_sequence” includes the equence (seq) of > bus > > stops and “stop_onoff” is a count of people getting on and off at certain > > stops (there is no entry if noone gets on or off). > > > > stop_sequence <- data.frame(seq=c(10,20,30,40,50,60), > > ref=c('A','B','C','D','B','A')) > > ## seq ref > > ## 1 10 A > > ## 2 20 B > > ## 3 30 C > > ## 4 40 D > > ## 5 50 B > > ## 6 60 A > > stop_onoff <- > > data.frame(ref=c('A','D','B','A'),on=c(5,0,10,0),off=c(0,2,2,6)) > > ## ref on off > > ## 1 A 5 0 > > ## 2 D 0 2 > > ## 3 B 10 2 > > ## 4 A 0 6 > > > > I need to match the stop_onoff numbers in the right sto sequence, with > the > > correctly matched output as follows (load is a cumulative count of on and > > off) > > > > desired_output <- data.frame(seq=c(10,20,30,40,50,60), > > ref=c('A','B','C','D','B','A'), > > on=c(5,'-','-',0,10,0),off=c(0,'-','-',2,2,6), load=c(5,0,0,3,11,5)) > > ## seq ref on off load > > ## 1 10 A 5 0 5 > > ## 2 20 B - - 0 > > ## 3 30 C - - 0 > > ## 4 40 D 0 2 3 > > ## 5 50 B 10 2 11 > > ## 6 60 A 0 6 5 > > > > In this example the stop “B” is matched to the second stop “B” in the > stop > > sequence and not the first because the onoff data is after stop “D”. > > > > Any guidance much appreciated. > > > > Regards > > Adam > > > > [[alternative HTML version deleted]] > > > > ______________________________________________ > > R-help@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-help > > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > > and provide commented, minimal, self-contained, reproducible code. > > > > -- > Statistics & Software Consulting > GKX Group, GKX Associates Inc. > tel: 1-877-GKX-GROUP > email: ggrothendieck at gmail.com > [[alternative HTML version deleted]] ______________________________________________ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.