Dear Henrik Thank you very much for your reply and your scripts.
First, the differences between na26 and na27 are as follows: Header: - line 11: changed annotation date format from "July 21 2008" to "2008-12-01" Data: - line 17: added one new column, i.e. last column is now %GC - the data for "Probe Set ID","Chromosome","Physical Position" are identical for na26 and na27. This means that in principle I can still use na26. Second, when I compare your results for HindIII, XbaI and NspI for na26 to my results with na26 and na27, I get identical summary results for both ufl and ugp files,as you may have realized when adding your results for na26. In contrast, for StyI I my results for both na26 and na27 are still: snp cnp affxSnp other total enzyme1-only 144868 0 0 0 144868 missing 93436 0 0 74 93510 total 238304 0 0 74 238378 This means I get 93436 missing snps vs your 607 missing snps. (See the partial output for "importFrom" below!!!) Even downloading the na26 file and the cdf file from Affymetrix again did not change the result, which is strange since for the other 3 chiptypes the results agree with your results. BTW, I have downloaded "250k_sty_libraryfile_rev4.zip". However, there exists also an older version "250k_sty_libraryfile_rev3.zip". Which version have you used for creating the ufl file? Third, trying to run your script "Mapping250K_Sty,UFL,na26.R" gave the following error at line units <- importFrom(ufl, csv, enzymes=enzyme, verbose=log); Error in list("importFrom(ufl, csv, enzymes = enzyme, verbose = -50)" = <environment>, : [2009-01-15 12:33:30] Exception: Argument 'enzymes' contains 1 NA value (s). at throw(Exception(...)) at throw.default(sprintf("Argument '%s' contains %d NA value (s).", .name, sum( at throw(sprintf("Argument '%s' contains %d NA value(s).", .name, sum (is.na(x) at getNumerics.Arguments(static, ..., asMode = "integer", disallow = disallow) at getNumerics(static, ..., asMode = "integer", disallow = disallow) at getIntegers.Arguments(static, ..., range = range) at getIntegers(static, ..., range = range) at method(static, ...) at Arguments$getIndices(enzymes, range = c(1, 10)) at readDataUnitFragmentLength.AffymetrixNetAffxCsvFile(csv, enzymes = enzymes, at readDataUnitFragmentLength(csv, enzymes = enzymes, rows = keep, ..., verbos at importFromAffymetrixNetAffxCsvFile.AromaUflFile(this, src, ...) at importFromAffymetrixNetAffxCsvFile(this, src, ...) at importFrom.AromaUnitTabularBinaryFile(ufl, csv, enzymes = enzyme, verbose = at importFrom(ufl, csv, In addition: Warning message: In eval(expr, envir, enclos) : NAs introduced by coercion Importing (unit name, fragment length+) data from AffymetrixNetAffxCsvFile...done This is strange since I have tested that argument enzymes returns correctly "StyI". Maybe this has to do with my version of aroma.affymetrix_0.9.4. Here are the partial outputs for "importFrom(ufl, csv, verbose=-50);" for NspI and StyI: 1. Running "importFrom(ufl, csv, verbose=-50)" for NspI has the following partial output (see "==>"): Reading AffymetrixNetAffxCsvFile...done Extracting fragment lengths from ([enzyme], lengths, start, stop)... Inferring if enzyme names are specified... Has enzyme names: TRUE Inferring if enzyme names are specified...done Identifying number of enzymes... nbrOfEnzymes 1 2 3 ==> 131564 130698 2 Max number of enzymes: 3 Identifying number of enzymes...done Splitting into subunits and padding with NAs... Splitting into subunits and padding with NAs...done Extracting enzyme names... Identified enzymes: NspI, StyI Extracting enzyme names...done Identifying the location of the fragment lengths... Offset: 3 Identifying the location of the fragment lengths...done Extracting fragment lengths... Summary of *all* fragment lengths: Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 19.0 526.0 755.0 842.9 1049.0 2000.0 394527.0 Extracting fragment lengths...done Sorting data by enzyme... V1 V2 V3 Min. : 100.0 Min. : 19 Min. : NA 1st Qu.: 476.0 1st Qu.: 937 1st Qu.: NA Median : 644.0 Median : 1280 Median : NA Mean : 648.9 Mean : 1231 Mean : NaN 3rd Qu.: 816.0 3rd Qu.: 1614 3rd Qu.: NA Max. :1480.0 Max. : 2000 Max. : NA ==> NA's : 701.0 NA's :131564 NA's :262264 Sorting data by enzyme...done int [1:262264, 1] 574 700 580 631 666 1060 798 842 822 608 ... Extracting fragment lengths from ([enzyme], lengths, start, stop)...done Reading (unitName, fragmentLength+) from file...done 2. Running "importFrom(ufl, csv, verbose=-50)" for StyI has the following partial output: Reading AffymetrixNetAffxCsvFile...done Extracting fragment lengths from ([enzyme], lengths, start, stop)... Inferring if enzyme names are specified... Has enzyme names: TRUE Inferring if enzyme names are specified...done Identifying number of enzymes... nbrOfEnzymes 1 2 3 ==> 93436 144865 3 Max number of enzymes: 3 Identifying number of enzymes...done Splitting into subunits and padding with NAs... Splitting into subunits and padding with NAs...done Extracting enzyme names... Identified enzymes: NspI, StyI Extracting enzyme names...done Identifying the location of the fragment lengths... Offset: 3 Identifying the location of the fragment lengths...done Extracting fragment lengths... Summary of *all* fragment lengths: Min. 1st Qu. Median Mean 3rd Qu. Max. NA's 22 481 695 788 966 2000 332341 Extracting fragment lengths...done Sorting data by enzyme... V1 V2 V3 Min. : 22 Min. : 115 Min. : NA 1st Qu.: 672 1st Qu.: 440 1st Qu.: NA Median : 1078 Median : 598 Median : NA Mean : 1082 Mean : 609 Mean : NaN 3rd Qu.: 1499 3rd Qu.: 771 3rd Qu.: NA Max. : 2000 Max. :1266 Max. : NA ==> NA's :93436 NA's : 607 NA's :238304 Sorting data by enzyme...done int [1:238304, 1] NA 1139 599 1872 1857 1465 382 1651 NA 1833 ... Extracting fragment lengths from ([enzyme], lengths, start, stop)...done Reading (unitName, fragmentLength+) from file...done As you see the higlighted lines for NspI and STyI are: V1 V2 V3 ==> NA's : 701.0 NA's :131564 NA's :262264 ==> NA's :93436 NA's : 607 NA's :238304 This means missing = 701 appears in V1 for NspI, but missing = 607 appears in V2 for StyI. What does this mean? Best regards Christian On Jan 15, 1:50 am, Henrik Bengtsson <h...@stat.berkeley.edu> wrote: > Hi Christian, > > I'm quite swamped but I've added links the scripts that I used to > generate the existing NetAffx 26 (na26) UFL and UGP to each of the > different SNP & CN chip type pages, e.g. > > http://groups.google.com/group/aroma-affymetrix/web/mapping250k-nsp-m... > > More comments below. > > On Mon, Jan 12, 2009 at 6:24 AM, cstratowa > > <christian.strat...@vie.boehringer-ingelheim.com> wrote: > > > Dear Henrik > > > Meanwhile I have created ufl and ugp files for both 100K and 500K > > arrays but not for GenomeWideSNP_6 aray. > > The above scripts will show you how to do it for GenomeWideSNP_6. > Slightly more complicated since two NetAffx CSV files are involved. > > > > > > > Can you confirm that the following code, which I use for both 100K and > > 500K arrays, is correct: > > > # retrieving annotation files > > chiptypes <- c("Mapping50K_Hind240", "Mapping50K_Xba240") > > cdfs <- lapply(chiptypes, FUN=function(x){AffymetrixCdfFile$byChipType > > (x)}) > > names(cdfs) <- chiptypes > > print(cdfs) > > > # importing data from NetAffx CSV files > > csvs <- lapply(cdfs, FUN=function(cdf){AffymetrixNetAffxCsvFile > > $byChipType(getChipType(cdf), tags=".na27")}) > > print(csvs) > > > # allocating empty UFL (Unit Fragment Length) files > > ufls <- lapply(cdfs, FUN=function(cdf){AromaUflFile$allocateFromCdf > > (cdf, tags="na27,CS20090112")}) > > print(ufls) > > > # import SNP data > > units <- list(); > > for (chipType in names(ufls)) { > > ufl <- ufls[[chipType]]; > > csv <- csvs[[chipType]]; > > units[[chipType]] <- importFrom(ufl, csv, verbose=-50); > > } > > str(units) > > > # allocating empty UGP (Unit Genome Position) files > > ugps <- lapply(cdfs, FUN=function(cdf){AromaUgpFile$allocateFromCdf > > (cdf, tags="na27,CS20090112")}) > > print(ugps) > > > # import SNP data > > units <- list(); > > for (chipType in names(ugps)) { > > ugp <- ugps[[chipType]]; > > csv <- csvs[[chipType]]; > > units[[chipType]] <- importFrom(ugp, csv, verbose=-50); > > } > > str(units) > > This looks alright to me. You might want to check toward the posted > NA26 scripts as well, because they are more recent. > > > > > > > Here is the summary for the 100K arrays: > > # Summary 50K chips > >> str(units) > > List of 2 > > $ Mapping50K_Hind240: int [1:57244] 18632 18677 1631 18713 1630 18712 > > 18619 1639 18722 18608 ... > > $ Mapping50K_Xba240 : int [1:58960] 29181 18239 31302 19831 47750 > > 45114 19103 39711 19772 37811 ... > > >> ufl <- AromaUflFile$byChipType(chiptypes[1], tags="na27,CS20090112"); > >> print(summaryOfUnits(ufl, enzymes="HindIII")) > > snp cnp affxSnp other total > > enzyme1-only 56933 0 0 0 56933 > > missing 311 0 0 55 366 > > total 57244 0 0 55 57299 > > The NA26 version gives: > snp cnp affxSnp other total > enzyme1-only 56933 0 0 0 56933 > missing 311 0 0 55 366 > total 57244 0 0 55 57299 > > >> ufl <- AromaUflFile$byChipType(chiptypes[2], tags="na27,CS20090112"); > >> print(summaryOfUnits(ufl, enzymes="XbaI")) > > snp cnp affxSnp other total > > enzyme1-only 58616 0 0 0 58616 > > missing 344 0 0 55 399 > > total 58960 0 0 55 59015 > > NA26: > snp cnp affxSnp other total > enzyme1-only 58616 0 0 0 58616 > missing 344 0 0 55 399 > total 58960 0 0 55 59015 > > > > >> ugp <- AromaUgpFile$byChipType(chiptypes[1], tags="na27,CS20090112"); > >> print(summary(ugp, enzymes="HindIII")) > > chromosome position > > Min. : 1.000 Min. : 48603 > > 1st Qu.: 4.000 1st Qu.: 34667112 > > Median : 7.000 Median : 72677620 > > Mean : 8.402 Mean : 80405004 > > 3rd Qu.: 12.000 3rd Qu.:114826216 > > Max. : 23.000 Max. :246727435 > > NA's :363.000 NA's : 363 > > NA26: > > > print(summary(ugp)); > > chromosome position > Min. : 1.000 Min. : 48603 > 1st Qu.: 4.000 1st Qu.: 34667112 > Median : 7.000 Median : 72677621 > Mean : 8.402 Mean : 80405004 > 3rd Qu.: 12.000 3rd Qu.:114826216 > Max. : 23.000 Max. :246727435 > NA's :363.000 NA's : 363 > > > print(table(ugp[,1])); > > 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > 16 > > 4541 5072 3962 4342 4215 3968 3444 3549 2357 2743 2466 2592 2661 1931 1440 > 1145 > > 17 18 19 20 21 22 23 > 985 1731 326 993 883 433 1157 > > >> ugp <- AromaUgpFile$byChipType(chiptypes[2], tags="na27,CS20090112"); > >> print(summary(ugp, enzymes="XbaI")) > > chromosome position > > Min. : 1.000 Min. : 93683 > > 1st Qu.: 4.000 1st Qu.: 34636629 > > Median : 7.000 Median : 72249739 > > Mean : 8.507 Mean : 80010574 > > 3rd Qu.: 12.000 3rd Qu.:114666170 > > Max. : 24.000 Max. :246885089 > > NA's :390.000 NA's : 390 > > NA26: > > > print(summary(ugp)); > > chromosome position > Min. : 1.000 Min. : 93683 > 1st Qu.: 4.000 1st Qu.: 34636629 > Median : 7.000 Median : 72249739 > Mean : 8.507 Mean : 80010574 > 3rd Qu.: 12.000 3rd Qu.:114666170 > Max. : 24.000 Max. :246885089 > NA's :390.000 NA's : 390 > > > print(table(ugp[,1])); > > 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 > 16 > > 4669 5274 3864 4231 4149 4111 3612 3422 2439 2953 2896 2685 2567 2083 1590 > 1245 > > 17 18 19 20 21 22 23 24 > 971 1837 364 1101 1027 330 1204 1 > > > > > > > Here is the summary for the 500K arrays: > > # Summary 500K chips > >> str(units) > > List of 2 > > $ Mapping250K_Sty: int [1:238304] 15133 175423 164715 237140 112643 > > 189587 162193 79611 196992 73555 ... > > $ Mapping250K_Nsp: int [1:262264] 34952 76 74370 232354 3677 72977 > > 73533 176215 161345 238482 ... > > >> ufl <- AromaUflFile$byChipType(chiptypes[1], tags="na27,CS20090112"); > >> print(summaryOfUnits(ufl, enzymes="StyI")) > > snp cnp affxSnp other total > > enzyme1-only 144868 0 0 0 144868 > > missing 93436 0 0 74 93510 > > total 238304 0 0 74 238378 > > NA26: > snp cnp affxSnp other total > enzyme1-only 237697 0 0 0 237697 > missing 607 0 0 74 681 > total 238304 0 0 74 238378 > > >> ufl <- AromaUflFile$byChipType(chiptypes[2], tags="na27,CS20090112"); > >> print(summaryOfUnits(ufl, enzymes="NspI")) > > snp cnp affxSnp other total > > enzyme1-only 261563 0 0 0 261563 > > missing 701 0 0 74 775 > > total 262264 0 0 74 262338 > > NA26: > snp cnp affxSnp other total > enzyme1-only 261563 0 0 0 261563 > missing 701 0 0 74 775 > total 262264 0 0 74 262338 > > > > >> ugp <- AromaUgpFile$byChipType(chiptypes[1], tags="na27,CS20090112"); > >> print(summary(ugp, enzymes="StyI")) > > chromosome position > > Min. : 1.000 Min. : 2994 > > 1st Qu.: 4.000 1st Qu.: 31306881 > > Median : 8.000 Median : 67082398 > > Mean : 9.117 Mean : 77333484 > > 3rd Qu.: 13.000 3rd Qu.:114799352 > > Max. : 23.000 Max. :247135059 > > NA's :677.000 NA's : 677 > > NA26: > > > print(summary(ugp)); > > chromosome position > Min. : 1.000 Min. : 2994 > 1st Qu.: 4.000 1st Qu.: 31306881 > Median : 8.000 Median : 67082398 > Mean : 9.117 Mean : 77333484 > 3rd Qu.: 13.000 3rd Qu.:114799352 > Max. : 23.000 Max. :247135059 > NA's :677.000 NA's : 677 > > > print(table(ugp[,1])); > > 1 2 3 4 5 6 7 8 9 10 11 12 13 > 20273 19147 15396 13236 14854 14296 11822 12601 10900 14190 12928 11863 8052 > 14 15 16 17 18 19 20 21 22 23 > 7531 7323 8272 6403 6726 3670 6555 3182 3669 4812 > > >> ugp <- AromaUgpFile$byChipType(chiptypes[2], tags="na27,CS20090112"); > >> print(summary(ugp, enzymes="NspI")) > > chromosome position > > Min. : 1.000 Min. : 17408 > > 1st Qu.: 4.000 1st Qu.: 32574796 > > Median : 8.000 Median : 70596240 > > Mean : 8.758 Mean : 79224244 > > 3rd Qu.: 13.000 3rd Qu.:114776300 > > Max. : 23.000 Max. :247110269 > > NA's :775.000 NA's : 775 > > NA26: > > > print(summary(ugp)); > > chromosome position > Min. : 1.000 Min. : 17408 > 1st Qu.: 4.000 1st Qu.: 32574797 > Median : 8.000 Median : 70596240 > Mean : 8.758 Mean : 79224244 > 3rd Qu.: 13.000 3rd Qu.:114776301 > Max. : 23.000 Max. :247110269 > NA's :775.000 NA's : 775 > > > print(table(ugp[,1])); > > 1 2 3 4 5 6 7 8 9 10 11 12 13 > 19810 22178 18347 19016 17133 17097 13912 14820 11899 14241 13254 13026 11094 > 14 15 16 17 18 19 20 21 22 23 > 8165 6982 7005 4830 8136 2661 5823 3927 2511 5696 > > > > > Could it be that function summaryOfUnits() does not work as expected? > > For ufl it prints "enzyme1-only" instead of e.g. "NspI-only" > > I've noticed that too; the enzyme name is not reported. But don't > worry, the files will contain the same information regardless. > > > For ugp it prints an error: no applicable method > > summaryOfUnits() is only available for AromaUflFile objects, not > AromaUgpFile objects. > > > > > Is it correct that for na27 there are now 93436 missing SNPs for StyI > > compared to only 75 missing SNPs for na24 (as shown on your page about > > building UFL files)? > > No, that does not seem to be correct. This could be an error in the > NetAffx CSV file, e.g. once they [Affymetrix] added Windows newlines > in funny place. However, since the same CSV file is used to import > the UGP data and that seems to be alright, I don't think this is the > case here. It could also be that they've changed to column names so > that importFrom() does no longer recognize the format, but this sounds ... > > read more » --~--~---------~--~----~------------~-------~--~----~ When reporting problems on aroma.affymetrix, make sure 1) to run the latest version of the package, 2) to report the output of sessionInfo() and traceback(), and 3) to post a complete code example. You received this message because you are subscribed to the Google Groups "aroma.affymetrix" group. To post to this group, send email to aroma-affymetrix@googlegroups.com To unsubscribe from this group, send email to aroma-affymetrix-unsubscr...@googlegroups.com For more options, visit this group at http://groups.google.com/group/aroma-affymetrix?hl=en -~----------~----~----~----~------~----~------~--~---