(forgot to cc the list)
---------- Forwarded message ----------
From: Pierre Neuvial <pie...@stat.berkeley.edu>
Date: Wed, Nov 16, 2011 at 10:16 PM
Subject: Re: [aroma.affymetrix] Processing multiple Affymetrix arrays
To: Anguraj Sadanandam <anguraj.sadanan...@epfl.ch>

Hi Anguraj,

On Wed, Nov 16, 2011 at 4:13 PM, Anguraj Sadanandam
<anguraj.sadanan...@epfl.ch> wrote:
>
> Hi Henrik,
>
> I am interested in processing about 228 Affymetrix SNP 6.0 array samples.
> I used the code attached below. It was working well, but I was trying to
> extract the raw data through a loop (as you can see in the code), running
> the loop 228 times, once per sample (maybe not a good idea). The code
> stopped somewhere around iteration 162. This happened twice, though not
> exactly at 162 each time. Maybe the loop is not the best approach; could
> you please suggest a better way? I would also like to increase the number
> of samples to about 1070. Is it a good idea to process them in batches,
> or all at once? I have normal samples available for normalization.
>
> My sessionInfo() output is below.
>
> Thanks,
>
> Anguraj
>
>
> dataSet <- "Panc";
> chipType <- "GenomeWideSNP_6,Full";
> expNbrOfArrays <- 228;
>
> library("aroma.affymetrix");
> verbose <- Arguments$getVerbose(-8, timestamp=TRUE);
>
> cdf <- AffymetrixCdfFile$byChipType(chipType);
> print(cdf);
>
> # Assert that all annotation data files exist
> gi <- getGenomeInformation(cdf);
> print(gi);
>
> si <- getSnpInformation(cdf);
> print(si);
>
> acs <- AromaCellSequenceFile$byChipType(getChipType(cdf, fullname=FALSE));
> print(acs);
>
> # Setup data set
> cs <- AffymetrixCelSet$byName(dataSet, cdf=cdf);
> print(cs);
> # Sanity check
> stopifnot(nbrOfArrays(cs) == expNbrOfArrays);
>
> # Step 1: Crosstalk calibration
> acc <- AllelicCrosstalkCalibration(cs);
> print(acc);
> csC <- process(acc, verbose=verbose);
> print(csC);
>
> # Step 2 - Normalization for nucleotide-position probe sequence effects
> bpn <- BasePositionNormalization(csC, target="zero");
> print(bpn);
> csN <- process(bpn, verbose=verbose);
> print(csN);
>
> # Step 3 - Probe summarization
> plm <- AvgCnPlm(csN, mergeStrands=TRUE, combineAlleles=TRUE);
> print(plm);
>
> if (length(findUnitsTodo(plm)) > 0) {
>   # Fit CN probes quickly (~5-10s/array + some overhead)
>   units <- fitCnProbes(plm, verbose=verbose);
>   str(units);
>   # int [1:945826] 935590 935591 935592 935593 935594 935595 ...
>
>   # Fit remaining units, i.e. SNPs (~5-10min/array)
>   units <- fit(plm, verbose=verbose);
>   str(units);
> }
>
> # Get the estimated chip effects
> ces <- getChipEffectSet(plm);
> print(ces);
>
> # Step 4 - Normalization for PCR fragment-length effects
> fln <- FragmentLengthNormalization(ces);
> print(fln);
> cesN <- process(fln, verbose=verbose);
> print(cesN);

FYI, you could replace Steps 1 through 4 by

  cesN <- doCRMAv2(cs)

see http://www.aroma-project.org/blocks/doCRMAv2

> # Sanity check
> stopifnot(nbrOfArrays(cesN) == expNbrOfArrays);
>
> # Identify the indices of all normal samples by mapping to names in a
> # text file (or whatever you wish)
> normalNames <- readLines("normal.txt");
> #normals <- indexOf(cesN, normalNames);
> normals <- match(normalNames, getNames(cesN))
> str(normals);
> stopifnot(all(is.finite(normals)));  # Sanity check
>
> # Extract pool of reference samples (normals only)
> cesR <- extract(cesN, normals);
> print(cesR);
>
> # Calculate the robust average across this pool
> ceR <- getAverageFile(cesR, verbose=verbose);
> print(ceR);
>
> #cesR <- cesNHapMap;
> #print(cesR);
> ceR <- getAverageFile(cesR, verbose=verbose);
> print(ceR);
>
> # Set up the CBS model with ceR as the reference for all samples
> cbs <- CbsModel(cesN, ceR);
> print(cbs);
>
> # Fit the segmentation
> #fit(cbs, verbose=verbose);
>
> gi <- getGenomeInformation(cdf)
> print(gi)
>
> l <- NULL
> for (i in 1:23) {
>   units <- getUnitsOnChromosome(gi, chromosome=i);
>   pos <- getPositions(gi, units=units);
>   l[i] <- length(pos)
> }
>
> length(l)
>
> l_sum <- sum(l)
>
> raw <- NULL
>
> for (i in 1:228) {
>   cat("\n", i, "\n")
>   for (j in 1:23) {
>     rawCNs <- extractRawCopyNumbers(cbs, array=i, chromosome=j)
>     raw <- rbind(raw, as.data.frame(rawCNs))
>     cat(j, "\n")
>   }
> }

Here you are trying to concatenate the copy number estimates for all samples
into one huge data.frame. Remember that the SNP 6.0 array has roughly 2
million loci! After 162 iterations this data.frame has about 324 million
rows, so it probably breaks down when your computer runs out of memory.

I see two problems here:

1) I don't think it's a good idea to try to create a single text file with
228 * 2e6 = 456 million lines. Why don't you create one file per sample
instead? That would also be much easier to manipulate afterwards.

2) In case you really do want a single file (again, I don't see why with
that many samples), you should write the results to file for each chromosome
directly, using write.table(..., append=TRUE) within the loop over index
"j". That way you avoid using so much memory, and you also avoid losing a
lot of time in the "rbind", which re-creates a bigger and bigger data.frame
at each iteration.
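For instance, something along these lines (an untested sketch; the output
file names are placeholders, and it reuses the "cbs" and "cesN" objects from
your script above):

  # One file per sample; append one chromosome at a time instead of rbind()
  for (i in seq_len(nbrOfArrays(cesN))) {
    sampleName <- getNames(cesN)[i];
    pathname <- sprintf("rawCNs,%s.txt", sampleName);
    for (j in 1:23) {
      rawCNs <- extractRawCopyNumbers(cbs, array=i, chromosome=j);
      df <- as.data.frame(rawCNs);
      write.table(df, file=pathname, sep="\t", quote=FALSE,
                  row.names=FALSE,
                  col.names=(j == 1),   # write the header only once per file
                  append=(j > 1));      # then append the remaining chromosomes
    }
  }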
And my suggestion is: look at writeDataFrame(); it should give you what you
need. See:

  http://www.aroma-project.org/howtos/writeDataFrame

Cheers,
Pierre

> write.table(raw, "all-data-PANC-40-normal.txt", sep="\t")
>
> raw_1 <- matrix(raw[,2], nrow=l_sum, ncol=228)
> raw_2 <- cbind(raw[1:l_sum,1], raw_1)
>
> write.table(raw_1, "all-data-matrix-Panc-40-normal.txt", sep="\t")
> write.table(raw_2, "all-data-matrix-probes-PANC-40-normal.txt", sep="\t")
>
>
> R version 2.13.0 (2011-04-13)
> Platform: x86_64-redhat-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
>
> On Jun 17, 2011, at 8:03 PM, Henrik Bengtsson wrote:
>
> > Hi.
> >
> > On Thu, Jun 16, 2011 at 10:58 AM, Anguraj Sadanandam
> > <anguraj.sadanan...@epfl.ch> wrote:
> >> Hello,
> >>
> >> I am trying to process several (~800) HG-U133A Affymetrix .CEL files
> >> and am running into memory errors. I looked through the
> >> aroma.affymetrix code for a way to do this, but without success.
> >
> > Exactly what did you try and what did not work?
> >
> > The Aroma framework can process any number of arrays. You'll find
> > vignettes at:
> >
> >   http://aroma-project.org/vignettes/
> >
> > and you may also be interested in the one-line doRMA() command:
> >
> >   http://aroma-project.org/blocks/doRMA
> >
> > /Henrik
> >
> >> Could you please point me to the link to memory-efficient RMA analysis?
> >>
> >> Thanks,
> >>
> >> Anguraj
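A minimal sketch of the one-line doRMA() approach Henrik points to above
(untested; "MyDataSet" is a placeholder name, the chip type is assumed to be
HG-U133A, and the CEL files are assumed to be organized under
rawData/<dataSet>/<chipType>/ as in the aroma vignettes):

  library("aroma.affymetrix");
  verbose <- Arguments$getVerbose(-8, timestamp=TRUE);

  # Set up the CEL set (expects CEL files under rawData/MyDataSet/HG-U133A/)
  csR <- AffymetrixCelSet$byName("MyDataSet", chipType="HG-U133A");
  print(csR);

  # RMA on all arrays; intermediate results are written to disk, so an
  # interrupted run can be restarted and will resume where it stopped
  ces <- doRMA(csR, verbose=verbose);
  print(ces);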