(forgot to cc the list)
---------- Forwarded message ----------
From: Pierre Neuvial <pie...@stat.berkeley.edu>
Date: Wed, Nov 16, 2011 at 10:16 PM
Subject: Re: [aroma.affymetrix] Processing multiple Affymetrix arrays
To: Anguraj Sadanandam <anguraj.sadanan...@epfl.ch>

Hi Anguraj,

On Wed, Nov 16, 2011 at 4:13 PM, Anguraj Sadanandam
<anguraj.sadanan...@epfl.ch> wrote:
>
> Hi Henrik,
>
> I am interested in processing about 228 Affymetrix SNP 6.0 array samples.
> I used the code attached below. It was working well, but I was trying to
> extract the raw data through a loop (as you can see in the code), running
> the loop 228 times, once per sample (maybe not a good idea). The code
> stopped somewhere around iteration 162. This happened twice, though not
> exactly at 162 each time. Maybe the loop is not the best approach; could
> you please suggest a better way? I would also like to increase the number
> of samples to about 1070. Is it a good idea to process them in batches,
> or all at once? I have normal samples available for normalization.
>
> My sessionInfo() output is below.
>
> Thanks,
>
> Anguraj
>
>
> dataSet <- "Panc";
> chipType <- "GenomeWideSNP_6,Full";
> expNbrOfArrays <- 228;
>
> library("aroma.affymetrix");
> verbose <- Arguments$getVerbose(-8, timestamp=TRUE);
>
> cdf <- AffymetrixCdfFile$byChipType(chipType);
> print(cdf);
>
> # Assert that all annotation data files exist
> gi <- getGenomeInformation(cdf);
> print(gi);
>
> si <- getSnpInformation(cdf);
> print(si);
>
> acs <- AromaCellSequenceFile$byChipType(getChipType(cdf, fullname=FALSE));
> print(acs);
>
> # Setup data set
> cs <- AffymetrixCelSet$byName(dataSet, cdf=cdf);
> print(cs);
> # Sanity check
> stopifnot(nbrOfArrays(cs) == expNbrOfArrays);
>
> # Step 1: Crosstalk calibration
> acc <- AllelicCrosstalkCalibration(cs);
> print(acc);
> csC <- process(acc, verbose=verbose);
> print(csC);
>
> # Step 2 - Normalization for nucleotide-position probe sequence effects
> bpn <- BasePositionNormalization(csC, target="zero");
> print(bpn);
> csN <- process(bpn, verbose=verbose);
> print(csN);
>
> # Step 3 - Probe summarization
> plm <- AvgCnPlm(csN, mergeStrands=TRUE, combineAlleles=TRUE);
> print(plm);
>
> if (length(findUnitsTodo(plm)) > 0) {
>   # Fit CN probes quickly (~5-10s/array + some overhead)
>   units <- fitCnProbes(plm, verbose=verbose);
>   str(units);
>   # int [1:945826] 935590 935591 935592 935593 935594 935595 ...
>
>   # Fit remaining units, i.e. SNPs (~5-10min/array)
>   units <- fit(plm, verbose=verbose);
>   str(units);
> }
>
> # Get the estimated chip effects
> ces <- getChipEffectSet(plm);
> print(ces);
>
> # Step 4 - Normalization for PCR fragment-length effects
> fln <- FragmentLengthNormalization(ces);
> print(fln);
> cesN <- process(fln, verbose=verbose);
> print(cesN);

FYI, you could replace Steps 1 through 4 by

  cesN <- doCRMAv2(cs)

see http://www.aroma-project.org/blocks/doCRMAv2

> # Sanity check
> stopifnot(nbrOfArrays(cesN) == expNbrOfArrays);
>
> # Identify the indices of all normal samples by mapping to names in a
> # text file (or whatever you wish)
> normalNames <- readLines("normal.txt");
> #normals <- indexOf(cesN, normalNames);
> normals <- match(normalNames, getNames(cesN))
> str(normals);
> stopifnot(all(is.finite(normals)));  # Sanity check
>
> # Extract pool of reference samples (normals only)
> cesR <- extract(cesN, normals);
> print(cesR);
>
> # Calculate the robust average across this pool
> ceR <- getAverageFile(cesR, verbose=verbose);
> print(ceR);
>
> #cesR <- cesNHapMap;
> #print(cesR);
> ceR <- getAverageFile(cesR, verbose=verbose);
> print(ceR);
>
> # Set up the CBS model with ceR as the reference for all samples
> cbs <- CbsModel(cesN, ceR);
> print(cbs);
>
> # Fit the segmentation
> #fit(cbs, verbose=verbose);
>
> gi <- getGenomeInformation(cdf)
> print(gi)
>
> l <- NULL
> for (i in 1:23) {
>   units <- getUnitsOnChromosome(gi, chromosome=i);
>   pos <- getPositions(gi, units=units);
>   l[i] <- length(pos)
> }
>
> length(l)
>
> l_sum <- sum(l)
>
> raw <- NULL
>
> for (i in 1:228) {
>   cat("\n", i, "\n")
>   for (j in 1:23) {
>     rawCNs <- extractRawCopyNumbers(cbs, array=i, chromosome=j)
>     raw <- rbind(raw, as.data.frame(rawCNs))
>     cat(j, "\n")
>   }
> }

Here you are trying to concatenate the copy number estimates for all samples
into one huge data.frame. Remember that the SNP 6.0 array has roughly 2
million loci! After 162 iterations this data.frame has about 324 million
rows, so it probably breaks down when your computer runs out of memory.

I see two problems here:

1) I don't think it's a good idea to try to create a single text file with
228 * 2e6 = 456 million lines. Why don't you create one file per sample
instead? That would also be much easier to manipulate afterwards.

2) In case you really do want a single file (again, I don't see why with
that many samples), you should write the results to file for each chromosome
directly, using write.table(..., append=TRUE) within the loop over index
"j". That way you avoid using so much memory, and you also avoid losing a
lot of time in the "rbind", which re-creates a bigger and bigger data.frame
at each iteration.
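For instance, something along these lines (an untested sketch; the output
file names are placeholders, and it reuses the "cbs" and "cesN" objects from
your script above):

  # One file per sample; append one chromosome at a time instead of rbind()
  for (i in seq_len(nbrOfArrays(cesN))) {
    sampleName <- getNames(cesN)[i];
    pathname <- sprintf("rawCNs,%s.txt", sampleName);
    for (j in 1:23) {
      rawCNs <- extractRawCopyNumbers(cbs, array=i, chromosome=j);
      df <- as.data.frame(rawCNs);
      write.table(df, file=pathname, sep="\t", quote=FALSE,
                  row.names=FALSE,
                  col.names=(j == 1),   # write the header only once per file
                  append=(j > 1));      # then append the remaining chromosomes
    }
  }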
And my suggestion is: look at writeDataFrame(); it should give you what you
need. See:

  http://www.aroma-project.org/howtos/writeDataFrame

Cheers,
Pierre

> write.table(raw, "all-data-PANC-40-normal.txt", sep="\t")
>
> raw_1 <- matrix(raw[,2], nrow=l_sum, ncol=228)
> raw_2 <- cbind(raw[1:l_sum,1], raw_1)
>
> write.table(raw_1, "all-data-matrix-Panc-40-normal.txt", sep="\t")
> write.table(raw_2, "all-data-matrix-probes-PANC-40-normal.txt", sep="\t")
>
>
> R version 2.13.0 (2011-04-13)
> Platform: x86_64-redhat-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
>
> On Jun 17, 2011, at 8:03 PM, Henrik Bengtsson wrote:
>
> > Hi.
> >
> > On Thu, Jun 16, 2011 at 10:58 AM, Anguraj Sadanandam
> > <anguraj.sadanan...@epfl.ch> wrote:
> >> Hello,
> >>
> >> I am trying to process several (~800) HG-U133A Affymetrix .CEL files
> >> and am running into memory errors. I looked through the
> >> aroma.affymetrix code for a way to do this, but without success.
> >
> > Exactly what did you try and what did not work?
> >
> > The Aroma framework can process any number of arrays. You'll find
> > vignettes at:
> >
> >   http://aroma-project.org/vignettes/
> >
> > and you may also be interested in the one-line doRMA() command:
> >
> >   http://aroma-project.org/blocks/doRMA
> >
> > /Henrik
> >
> >> Could you please point me to the link to memory-efficient RMA analysis?
> >>
> >> Thanks,
> >>
> >> Anguraj
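A minimal sketch of the one-line doRMA() approach Henrik points to above
(untested; "MyDataSet" is a placeholder name, the chip type is assumed to be
HG-U133A, and the CEL files are assumed to be organized under
rawData/<dataSet>/<chipType>/ as in the aroma vignettes):

  library("aroma.affymetrix");
  verbose <- Arguments$getVerbose(-8, timestamp=TRUE);

  # Set up the CEL set (expects CEL files under rawData/MyDataSet/HG-U133A/)
  csR <- AffymetrixCelSet$byName("MyDataSet", chipType="HG-U133A");
  print(csR);

  # RMA on all arrays; intermediate results are written to disk, so an
  # interrupted run can be restarted and will resume where it stopped
  ces <- doRMA(csR, verbose=verbose);
  print(ces);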