[aroma.affymetrix] Re: Total Copy Number Analysis - Hardware requirements
Hi. On Wed, Jan 14, 2009 at 7:02 AM, connys wrote: > > HI, > I have read the documentation of this tool, but could not find > anything regarding the hardware requirements (computational need). I > want to analyze about 1000 samples affy 6.0 and I would also be > interested to know of how this requirement scales up, let's say to the > 10,000 samples. > Does anyone have any idea/suggestions? The short answer is that you can get by with ~1.5GB of RAM using any operating system (this is pointed out on the front page). Since aroma.affymetrix access the file system quite a bit, you wish to have a fast file system. It is faster to work with data on a local HDD than on a network-based file system. Having a fast local drive will help; don't know of any benchmarking but I guess the greater the HDD's cache is, the better. Of course, with more RAM, things will go a bit faster (up to a certain level). You might also want to get a 64-bit OS for the future, although this is not (at all) required by aroma.affymetrix. See also Page 'Improving processing time' in the User Guide: http://groups.google.com/group/aroma-affymetrix/web/improving-processing-time More details: Currently, (I believe) there is not a single method implemented in aroma.affymetrix that is not bounded in memory regardless of chiptype and number of samples. In other words, whatever method you apply, your memory usage will stay below a certain upper limit. People have reported that they've used aroma.affymetrix to process 5000+ Affymetrix gene expression arrays without problems. Note that we do this not by providing approximate algorithms, but more clever designs of algorithms, so you will still get the same results as if everything would be kept in memory. The only exception to the latter is our recent CrlmmModel for estimating genotypes according to CRLMM; the CRLMM algorithm is hierarchical by nature and it is therefore hard to design an exact estimator that is bounded in memory. To achieve bounded memory runs, SNPs are processed in chunks. The greater number of arrays modeled, the fewer number SNPs we have per chunk, in order to keep the overall memory usage bounded. Not sure what kind of GWS6 analysis you are planning, but the CRMA v2 method for estimating full-resolution raw CNs is a single-array method by design, so it will definitely scale infinitely. Even multi-array methods such as CRMA (v1) should scale quite well, because of the bounded algorithms used. Hope this helps /Henrik > > > > > --~--~-~--~~~---~--~~ When reporting problems on aroma.affymetrix, make sure 1) to run the latest version of the package, 2) to report the output of sessionInfo() and traceback(), and 3) to post a complete code example. You received this message because you are subscribed to the Google Groups "aroma.affymetrix" group. To post to this group, send email to aroma-affymetrix@googlegroups.com To unsubscribe from this group, send email to aroma-affymetrix-unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/aroma-affymetrix?hl=en -~--~~~~--~~--~--~---
[aroma.affymetrix] Re: Wrong layout in plotAllelePairs
Hi again, Henrik pointed to me that plotAllelePairs.AllelicCrosstalkCalibration has an argument 'pairs' which can be used to specify the indexes of the pairs that one wants to plot. In my case plotAllelePairs(acc, array=array, pairs=1:6, what="input", xlim=xlim/3) does the job. Cheers, Pierre On Wed, Jan 14, 2009 at 11:56 AM, Pierre Neuvial wrote: > Hi, > > I have followed the steps of the CRMAv2 vignette: > > http://groups.google.com/group/aroma-affymetrix/web/estimation-of-total-copy-numbers-using-the-crma-v2-method > > using aroma.affymetrix version 1.0.0 and I got the following warning: > >> plotAllelePairs(acc, array=array, what="input", xlim=xlim/3) > Warning message: > In matrix(seq(length = nbrOfPairs), nrow = floor(sqrt(nbrOfPairs))) : > data length [7] is not a sub-multiple or multiple of the number of rows [2] > > This is because there are now *7* allele probe pair groups, not 6: > >> names(getSetsOfProbes(acc)$snp) > [1] "A/C" "A/G" "A/T" "C/G" "C/T" "G/T" "missing" > > As a result the plot layout is wrong: > >> nbrOfPairs <- length(getSetsOfProbes(acc)$snp) >> matrix(seq(length = nbrOfPairs), nrow = floor(sqrt(nbrOfPairs))) > [,1] [,2] [,3] [,4] > [1,]1357 > [2,]2461 > Warning message: > In matrix(seq(length = nbrOfPairs), nrow = floor(sqrt(nbrOfPairs))) : > data length [7] is not a sub-multiple or multiple of the number of rows [2] > > Maybe the easiest fix is to plot only the non "missing" allele pairs. > > Note that if one wants to use several central nucleotides instead of one, as > in > > alpha <- c(0.1, 0.075, 0.05, 0.03, 0.01, 0.0025, 1e-3, 1e-4); > acc <- AllelicCrosstalkCalibration(csR, alpha=alpha, pairBy="sequence", B=3) > > the number of allele pairs grows quickly: in this particular case > there are 96 allele pairs + 1 missing. > So maybe we it would make sense to hardcode a not to complex layout, > e. g. layout(matrix(seq(length=6, nrow=2))) ? > > Cheers, > > Pierre > >> sessionInfo() > R version 2.8.1 (2008-12-22) > i386-apple-darwin8.11.1 > > locale: > C > > attached base packages: > [1] stats graphics grDevices datasets utils methods base > > other attached packages: > [1] sfit_0.1.5 aroma.affymetrix_1.0.0 aroma.apd_0.1.3 > [4] R.huge_0.1.6 affxparser_1.14.0 aroma.core_1.0.0 > [7] aroma.light_1.9.2 digest_0.3.1 matrixStats_0.1.3 > [10] R.rsp_0.3.4R.cache_0.1.7 R.utils_1.1.3 > [13] R.oo_1.4.6 R.methodsS3_1.0.3 > --~--~-~--~~~---~--~~ When reporting problems on aroma.affymetrix, make sure 1) to run the latest version of the package, 2) to report the output of sessionInfo() and traceback(), and 3) to post a complete code example. You received this message because you are subscribed to the Google Groups "aroma.affymetrix" group. To post to this group, send email to aroma-affymetrix@googlegroups.com To unsubscribe from this group, send email to aroma-affymetrix-unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/aroma-affymetrix?hl=en -~--~~~~--~~--~--~---
[aroma.affymetrix] Re: ufl and ugp files for na27?
Hi Christian, I'm quite swamped but I've added links the scripts that I used to generate the existing NetAffx 26 (na26) UFL and UGP to each of the different SNP & CN chip type pages, e.g. http://groups.google.com/group/aroma-affymetrix/web/mapping250k-nsp-mapping250k-sty More comments below. On Mon, Jan 12, 2009 at 6:24 AM, cstratowa wrote: > > Dear Henrik > > Meanwhile I have created ufl and ugp files for both 100K and 500K > arrays but not for GenomeWideSNP_6 aray. The above scripts will show you how to do it for GenomeWideSNP_6. Slightly more complicated since two NetAffx CSV files are involved. > > Can you confirm that the following code, which I use for both 100K and > 500K arrays, is correct: > > # retrieving annotation files > chiptypes <- c("Mapping50K_Hind240", "Mapping50K_Xba240") > cdfs <- lapply(chiptypes, FUN=function(x){AffymetrixCdfFile$byChipType > (x)}) > names(cdfs) <- chiptypes > print(cdfs) > > # importing data from NetAffx CSV files > csvs <- lapply(cdfs, FUN=function(cdf){AffymetrixNetAffxCsvFile > $byChipType(getChipType(cdf), tags=".na27")}) > print(csvs) > > # allocating empty UFL (Unit Fragment Length) files > ufls <- lapply(cdfs, FUN=function(cdf){AromaUflFile$allocateFromCdf > (cdf, tags="na27,CS20090112")}) > print(ufls) > > # import SNP data > units <- list(); > for (chipType in names(ufls)) { > ufl <- ufls[[chipType]]; > csv <- csvs[[chipType]]; > units[[chipType]] <- importFrom(ufl, csv, verbose=-50); > } > str(units) > > # allocating empty UGP (Unit Genome Position) files > ugps <- lapply(cdfs, FUN=function(cdf){AromaUgpFile$allocateFromCdf > (cdf, tags="na27,CS20090112")}) > print(ugps) > > # import SNP data > units <- list(); > for (chipType in names(ugps)) { > ugp <- ugps[[chipType]]; > csv <- csvs[[chipType]]; > units[[chipType]] <- importFrom(ugp, csv, verbose=-50); > } > str(units) This looks alright to me. You might want to check toward the posted NA26 scripts as well, because they are more recent. > > > Here is the summary for the 100K arrays: > # Summary 50K chips >> str(units) > List of 2 > $ Mapping50K_Hind240: int [1:57244] 18632 18677 1631 18713 1630 18712 > 18619 1639 18722 18608 ... > $ Mapping50K_Xba240 : int [1:58960] 29181 18239 31302 19831 47750 > 45114 19103 39711 19772 37811 ... >> >> ufl <- AromaUflFile$byChipType(chiptypes[1], tags="na27,CS20090112"); >> print(summaryOfUnits(ufl, enzymes="HindIII")) > snp cnp affxSnp other total > enzyme1-only 56933 0 0 0 56933 > missing311 0 055 366 > total57244 0 055 57299 The NA26 version gives: snp cnp affxSnp other total enzyme1-only 56933 0 0 0 56933 missing311 0 055 366 total57244 0 055 57299 >> ufl <- AromaUflFile$byChipType(chiptypes[2], tags="na27,CS20090112"); >> print(summaryOfUnits(ufl, enzymes="XbaI")) > snp cnp affxSnp other total > enzyme1-only 58616 0 0 0 58616 > missing344 0 055 399 > total58960 0 055 59015 NA26: snp cnp affxSnp other total enzyme1-only 58616 0 0 0 58616 missing344 0 055 399 total58960 0 055 59015 >> >> ugp <- AromaUgpFile$byChipType(chiptypes[1], tags="na27,CS20090112"); >> print(summary(ugp, enzymes="HindIII")) > chromosomeposition > Min. : 1.000 Min. :48603 > 1st Qu.: 4.000 1st Qu.: 34667112 > Median : 7.000 Median : 72677620 > Mean : 8.402 Mean : 80405004 > 3rd Qu.: 12.000 3rd Qu.:114826216 > Max. : 23.000 Max. :246727435 > NA's :363.000 NA's : 363 NA26: > print(summary(ugp)); chromosomeposition Min. : 1.000 Min. :48603 1st Qu.: 4.000 1st Qu.: 34667112 Median : 7.000 Median : 72677621 Mean : 8.402 Mean : 80405004 3rd Qu.: 12.000 3rd Qu.:114826216 Max. : 23.000 Max. :246727435 NA's :363.000 NA's : 363 > print(table(ugp[,1])); 123456789 10 11 12 13 14 15 16 4541 5072 3962 4342 4215 3968 3444 3549 2357 2743 2466 2592 2661 1931 1440 1145 17 18 19 20 21 22 23 985 1731 326 993 883 433 1157 >> ugp <- AromaUgpFile$byChipType(chiptypes[2], tags="na27,CS20090112"); >> print(summary(ugp, enzymes="XbaI")) > chromosomeposition > Min. : 1.000 Min. :93683 > 1st Qu.: 4.000 1st Qu.: 34636629 > Median : 7.000 Median : 72249739 > Mean : 8.507 Mean : 80010574 > 3rd Qu.: 12.000 3rd Qu.:114666170 > Max. : 24.000 Max. :246885089 > NA's :390.000 NA's : 390 NA26: > print(summary(ugp)); chromosomeposition Min. : 1.000 Min. :93683 1st Qu.: 4.000 1st Qu.: 34636629 Median : 7.000 Median : 72249739 Mean : 8.507 Mean : 80010574 3rd Qu.: 12.000 3rd Qu.:114666170 Max. : 24.000 Max. :246885089 NA's :390.000 NA's
[aroma.affymetrix] Wrong layout in plotAllelePairs
Hi, I have followed the steps of the CRMAv2 vignette: http://groups.google.com/group/aroma-affymetrix/web/estimation-of-total-copy-numbers-using-the-crma-v2-method using aroma.affymetrix version 1.0.0 and I got the following warning: > plotAllelePairs(acc, array=array, what="input", xlim=xlim/3) Warning message: In matrix(seq(length = nbrOfPairs), nrow = floor(sqrt(nbrOfPairs))) : data length [7] is not a sub-multiple or multiple of the number of rows [2] This is because there are now *7* allele probe pair groups, not 6: > names(getSetsOfProbes(acc)$snp) [1] "A/C" "A/G" "A/T" "C/G" "C/T" "G/T" "missing" As a result the plot layout is wrong: > nbrOfPairs <- length(getSetsOfProbes(acc)$snp) > matrix(seq(length = nbrOfPairs), nrow = floor(sqrt(nbrOfPairs))) [,1] [,2] [,3] [,4] [1,]1357 [2,]2461 Warning message: In matrix(seq(length = nbrOfPairs), nrow = floor(sqrt(nbrOfPairs))) : data length [7] is not a sub-multiple or multiple of the number of rows [2] Maybe the easiest fix is to plot only the non "missing" allele pairs. Note that if one wants to use several central nucleotides instead of one, as in alpha <- c(0.1, 0.075, 0.05, 0.03, 0.01, 0.0025, 1e-3, 1e-4); acc <- AllelicCrosstalkCalibration(csR, alpha=alpha, pairBy="sequence", B=3) the number of allele pairs grows quickly: in this particular case there are 96 allele pairs + 1 missing. So maybe we it would make sense to hardcode a not to complex layout, e. g. layout(matrix(seq(length=6, nrow=2))) ? Cheers, Pierre > sessionInfo() R version 2.8.1 (2008-12-22) i386-apple-darwin8.11.1 locale: C attached base packages: [1] stats graphics grDevices datasets utils methods base other attached packages: [1] sfit_0.1.5 aroma.affymetrix_1.0.0 aroma.apd_0.1.3 [4] R.huge_0.1.6 affxparser_1.14.0 aroma.core_1.0.0 [7] aroma.light_1.9.2 digest_0.3.1 matrixStats_0.1.3 [10] R.rsp_0.3.4R.cache_0.1.7 R.utils_1.1.3 [13] R.oo_1.4.6 R.methodsS3_1.0.3 --~--~-~--~~~---~--~~ When reporting problems on aroma.affymetrix, make sure 1) to run the latest version of the package, 2) to report the output of sessionInfo() and traceback(), and 3) to post a complete code example. You received this message because you are subscribed to the Google Groups "aroma.affymetrix" group. To post to this group, send email to aroma-affymetrix@googlegroups.com To unsubscribe from this group, send email to aroma-affymetrix-unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/aroma-affymetrix?hl=en -~--~~~~--~~--~--~---
[aroma.affymetrix] Total Copy Number Analysis - Hardware requirements
HI, I have read the documentation of this tool, but could not find anything regarding the hardware requirements (computational need). I want to analyze about 1000 samples affy 6.0 and I would also be interested to know of how this requirement scales up, let's say to the 10,000 samples. Does anyone have any idea/suggestions? --~--~-~--~~~---~--~~ When reporting problems on aroma.affymetrix, make sure 1) to run the latest version of the package, 2) to report the output of sessionInfo() and traceback(), and 3) to post a complete code example. You received this message because you are subscribed to the Google Groups "aroma.affymetrix" group. To post to this group, send email to aroma-affymetrix@googlegroups.com To unsubscribe from this group, send email to aroma-affymetrix-unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/aroma-affymetrix?hl=en -~--~~~~--~~--~--~---