Christos,

At least on my system, this does not appear to increase timing:

DF.X <- data.frame(X = 35000:1, Y = runif(35000))
DF.Y <- data.frame(X = 35000:1, Y = runif(35000))

> system.time(DF.XY <- merge(DF.X, DF.Y, by = "X", all = TRUE))
[1] 0.238 0.012 0.256 0.000 0.000

compared to:

DF.list <- list(DF.X, DF.Y)

> str(DF.list)
List of 2
 $ :'data.frame': 35000 obs. of 2 variables:
  ..$ X: int [1:35000] 35000 34999 34998 34997 34996 34995 34994 34993 34992 34991 ...
  ..$ Y: num [1:35000] 0.720 0.855 0.216 0.817 0.534 ...
 $ :'data.frame': 35000 obs. of 2 variables:
  ..$ X: int [1:35000] 35000 34999 34998 34997 34996 34995 34994 34993 34992 34991 ...
  ..$ Y: num [1:35000] 0.68090 0.00694 0.64235 0.15728 0.27436 ...

> system.time(DF.XY.L <- merge(DF.list[[1]], DF.list[[2]], by = "X", all = TRUE))
[1] 0.251 0.005 0.262 0.000 0.000

So I am still confuzzled as to why it is taking 13 seconds on your
system. I am missing something here.

However, I did note that using merge.zoo() appears to be helpful.

Regards,

Marc

On Thu, 2007-02-01 at 23:36 -0500, Christos Hatzis wrote:
> Marc,
>
> The data structure is a list of data frames generated from read.table:
>
> > class(nmr.spectra.serum)
> [1] "list"
> > class(nmr.spectra.serum[[1]])
> [1] "data.frame"
> > dim(nmr.spectra.serum[[1]])
> [1] 32768 2
>
> Converting the data.frames to matrices does not have much of an effect on
> timing.
>
> -Christos
>
> -----Original Message-----
> From: Marc Schwartz [mailto:[EMAIL PROTECTED]
> Sent: Thursday, February 01, 2007 11:06 PM
> To: [EMAIL PROTECTED]
> Cc: 'Prof Brian Ripley'; r-help@stat.math.ethz.ch
> Subject: Re: [R] Lining up x-y datasets based on values of x
>
> On Thu, 2007-02-01 at 22:46 -0500, Christos Hatzis wrote:
> > Marc,
> >
> > I don't think the issue is duplicates in the matching columns. The
> > data were generated by an instrument (NMR spectrometer), processed by
> > the instrument's software through an FFT transform and other
> > transformations and finally reported as a sequence of chemical shift (x)
> > vs intensity (y) pairs.
> > So all x values are unique. For the example that I reported earlier:
> >
> > > length(nmr.spectra.serum[[1]]$V1)
> > [1] 32768
> > > length(unique(nmr.spectra.serum[[1]]$V1))
> > [1] 32768
> > > length(nmr.spectra.serum[[2]]$V1)
> > [1] 32768
> > > length(unique(nmr.spectra.serum[[2]]$V1))
> > [1] 32768
> >
> > And most of the x-values are common
> > > sum(nmr.spectra.serum[[1]]$V1 %in% nmr.spectra.serum[[2]]$V1)
> > [1] 32625
> >
> > For this reason, merge is probably an overkill for this problem and my
> > initial thought was to align the datasets through some simple
> > index-shifting operation.
> >
> > Profiling of the merge code in my case shows that most of the time is
> > spent on data frame subsetting operations and on internal merge and
> > rbind calls secondarily (if I read the summary output correctly). So
> > even if most of the time in the internal merge function is spent on
> > sorting (haven't checked the source code), this is in the worst case a
> > rather minor effect, as suggested by Prof. Ripley.
> >
> > > Rprof("merge.out")
> > > zz <- merge(nmr.spectra.serum[[1]], nmr.spectra.serum[[2]], by="V1",
> > all=T, sort=T)
> > > Rprof(NULL)
> > > summaryRprof("merge.out")
> >

<snip>

______________________________________________
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.