[aroma.affymetrix] Re: ufl and ugp files for na27?

cstratowa Thu, 15 Jan 2009 04:42:23 -0800

Dear Henrik

Thank you very much for your reply and your scripts.



First, the differences between na26 and na27 are as follows:
Header:
- line 11: changed annotation date format from "July 21 2008" to
"2008-12-01"
Data:
- line 17: added one new column, i.e. last column is now %GC
- the data for "Probe Set ID","Chromosome","Physical Position" are
identical for na26 and na27.
This means that in principle I can still use na26.


Second, when I compare your results for HindIII, XbaI and NspI for
na26 to my results with na26 and na27, I get identical summary results
for both ufl and ugp files,as you may have realized when adding your
results for na26.
In contrast, for StyI I my results for both na26 and na27 are still:

                snp cnp affxSnp other  total
enzyme1-only 144868   0       0     0 144868
missing       93436   0       0    74  93510
total        238304   0       0    74 238378

This means I get 93436 missing snps vs your 607 missing snps. (See the
partial output for "importFrom" below!!!)
Even downloading the na26 file and the cdf file from Affymetrix again
did not change the result, which is strange since for the other 3
chiptypes the results agree with your results.

BTW, I have downloaded "250k_sty_libraryfile_rev4.zip". However, there
exists also an older version "250k_sty_libraryfile_rev3.zip". Which
version have you used for creating the ufl file?


Third, trying to run your script "Mapping250K_Sty,UFL,na26.R" gave the
following error at line
units <- importFrom(ufl, csv, enzymes=enzyme, verbose=log);

Error in list("importFrom(ufl, csv, enzymes = enzyme, verbose = -50)"
= <environment>,  :

[2009-01-15 12:33:30] Exception: Argument 'enzymes' contains 1 NA value
(s).
  at throw(Exception(...))
  at throw.default(sprintf("Argument '%s' contains %d NA value
(s).", .name, sum(
  at throw(sprintf("Argument '%s' contains %d NA value(s).", .name, sum
(is.na(x)
  at getNumerics.Arguments(static, ..., asMode = "integer", disallow =
disallow)
  at getNumerics(static, ..., asMode = "integer", disallow = disallow)
  at getIntegers.Arguments(static, ..., range = range)
  at getIntegers(static, ..., range = range)
  at method(static, ...)
  at Arguments$getIndices(enzymes, range = c(1, 10))
  at readDataUnitFragmentLength.AffymetrixNetAffxCsvFile(csv, enzymes
= enzymes,
  at readDataUnitFragmentLength(csv, enzymes = enzymes, rows =
keep, ..., verbos
  at importFromAffymetrixNetAffxCsvFile.AromaUflFile(this, src, ...)
  at importFromAffymetrixNetAffxCsvFile(this, src, ...)
  at importFrom.AromaUnitTabularBinaryFile(ufl, csv, enzymes = enzyme,
verbose =
  at importFrom(ufl, csv,
In addition: Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion
Importing (unit name, fragment length+) data from
AffymetrixNetAffxCsvFile...done

This is strange since I have tested that argument enzymes returns
correctly "StyI".
Maybe this has to do with my version of aroma.affymetrix_0.9.4.


Here are the partial outputs for "importFrom(ufl, csv, verbose=-50);"
for NspI and StyI:

1. Running "importFrom(ufl, csv, verbose=-50)" for NspI has the
following partial output (see "==>"):

  Reading AffymetrixNetAffxCsvFile...done
  Extracting fragment lengths from ([enzyme], lengths, start, stop)...
   Inferring if enzyme names are specified...
    Has enzyme names: TRUE
   Inferring if enzyme names are specified...done
   Identifying number of enzymes...
    nbrOfEnzymes
         1      2      3
==>    131564 130698      2
    Max number of enzymes: 3
   Identifying number of enzymes...done
   Splitting into subunits and padding with NAs...
   Splitting into subunits and padding with NAs...done
   Extracting enzyme names...
    Identified enzymes: NspI, StyI
   Extracting enzyme names...done
   Identifying the location of the fragment lengths...
    Offset: 3
   Identifying the location of the fragment lengths...done
   Extracting fragment lengths...
    Summary of *all* fragment lengths:
        Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
        19.0    526.0    755.0    842.9   1049.0   2000.0 394527.0
   Extracting fragment lengths...done
   Sorting data by enzyme...
           V1               V2               V3
     Min.   : 100.0   Min.   :    19   Min.   :    NA
     1st Qu.: 476.0   1st Qu.:   937   1st Qu.:    NA
     Median : 644.0   Median :  1280   Median :    NA
     Mean   : 648.9   Mean   :  1231   Mean   :   NaN
     3rd Qu.: 816.0   3rd Qu.:  1614   3rd Qu.:    NA
     Max.   :1480.0   Max.   :  2000   Max.   :    NA
==>     NA's   : 701.0   NA's   :131564   NA's   :262264
   Sorting data by enzyme...done
    int [1:262264, 1] 574 700 580 631 666 1060 798 842 822 608 ...
  Extracting fragment lengths from ([enzyme], lengths, start,
stop)...done
 Reading (unitName, fragmentLength+) from file...done


2. Running "importFrom(ufl, csv, verbose=-50)" for StyI has the
following partial output:

  Reading AffymetrixNetAffxCsvFile...done
  Extracting fragment lengths from ([enzyme], lengths, start, stop)...
   Inferring if enzyme names are specified...
    Has enzyme names: TRUE
   Inferring if enzyme names are specified...done
   Identifying number of enzymes...
    nbrOfEnzymes
         1      2      3
==>     93436 144865      3
    Max number of enzymes: 3
   Identifying number of enzymes...done
   Splitting into subunits and padding with NAs...
   Splitting into subunits and padding with NAs...done
   Extracting enzyme names...
    Identified enzymes: NspI, StyI
   Extracting enzyme names...done
   Identifying the location of the fragment lengths...
    Offset: 3
   Identifying the location of the fragment lengths...done
   Extracting fragment lengths...
    Summary of *all* fragment lengths:
       Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
         22     481     695     788     966    2000  332341
   Extracting fragment lengths...done
   Sorting data by enzyme...
           V1              V2             V3
     Min.   :   22   Min.   : 115   Min.   :    NA
     1st Qu.:  672   1st Qu.: 440   1st Qu.:    NA
     Median : 1078   Median : 598   Median :    NA
     Mean   : 1082   Mean   : 609   Mean   :   NaN
     3rd Qu.: 1499   3rd Qu.: 771   3rd Qu.:    NA
     Max.   : 2000   Max.   :1266   Max.   :    NA
==>     NA's   :93436   NA's   : 607   NA's   :238304
   Sorting data by enzyme...done
    int [1:238304, 1] NA 1139 599 1872 1857 1465 382 1651 NA 1833 ...
  Extracting fragment lengths from ([enzyme], lengths, start,
stop)...done
 Reading (unitName, fragmentLength+) from file...done

As you see the higlighted lines for NspI and STyI are:
           V1              V2             V3
==>     NA's   : 701.0   NA's   :131564   NA's   :262264
==>     NA's   :93436   NA's   : 607   NA's   :238304

This means missing = 701 appears in V1 for NspI, but missing = 607
appears in V2 for StyI.
What does this mean?

Best regards
Christian


On Jan 15, 1:50 am, Henrik Bengtsson <h...@stat.berkeley.edu> wrote:
> Hi Christian,
>
> I'm quite swamped but I've added links the scripts that I used to
> generate the existing NetAffx 26 (na26) UFL and UGP to each of the
> different SNP & CN chip type pages, e.g.
>
>  http://groups.google.com/group/aroma-affymetrix/web/mapping250k-nsp-m...
>
> More comments below.
>
> On Mon, Jan 12, 2009 at 6:24 AM, cstratowa
>
> <christian.strat...@vie.boehringer-ingelheim.com> wrote:
>
> > Dear Henrik
>
> > Meanwhile I have created ufl and ugp files for both 100K and 500K
> > arrays but not for GenomeWideSNP_6 aray.
>
> The above scripts will show you how to do it for GenomeWideSNP_6.
> Slightly more complicated since two NetAffx CSV files are involved.
>
>
>
>
>
> > Can you confirm that the following code, which I use for both 100K and
> > 500K arrays, is correct:
>
> > # retrieving annotation files
> > chiptypes <- c("Mapping50K_Hind240", "Mapping50K_Xba240")
> > cdfs <- lapply(chiptypes, FUN=function(x){AffymetrixCdfFile$byChipType
> > (x)})
> > names(cdfs) <- chiptypes
> > print(cdfs)
>
> > # importing data from NetAffx CSV files
> > csvs <- lapply(cdfs, FUN=function(cdf){AffymetrixNetAffxCsvFile
> > $byChipType(getChipType(cdf), tags=".na27")})
> > print(csvs)
>
> > # allocating empty UFL (Unit Fragment Length) files
> > ufls <- lapply(cdfs, FUN=function(cdf){AromaUflFile$allocateFromCdf
> > (cdf, tags="na27,CS20090112")})
> > print(ufls)
>
> > # import SNP data
> > units <- list();
> > for (chipType in names(ufls)) {
> >   ufl <- ufls[[chipType]];
> >   csv <- csvs[[chipType]];
> >   units[[chipType]] <- importFrom(ufl, csv, verbose=-50);
> > }
> > str(units)
>
> > # allocating empty UGP (Unit Genome Position) files
> > ugps <- lapply(cdfs, FUN=function(cdf){AromaUgpFile$allocateFromCdf
> > (cdf, tags="na27,CS20090112")})
> > print(ugps)
>
> > # import SNP data
> > units <- list();
> > for (chipType in names(ugps)) {
> >   ugp <- ugps[[chipType]];
> >   csv <- csvs[[chipType]];
> >   units[[chipType]] <- importFrom(ugp, csv, verbose=-50);
> > }
> > str(units)
>
> This looks alright to me.  You might want to check toward the posted
> NA26 scripts as well, because they are more recent.
>
>
>
>
>
> > Here is the summary for the 100K arrays:
> > # Summary 50K chips
> >> str(units)
> > List of 2
> >  $ Mapping50K_Hind240: int [1:57244] 18632 18677 1631 18713 1630 18712
> > 18619 1639 18722 18608 ...
> >  $ Mapping50K_Xba240 : int [1:58960] 29181 18239 31302 19831 47750
> > 45114 19103 39711 19772 37811 ...
>
> >> ufl <- AromaUflFile$byChipType(chiptypes[1], tags="na27,CS20090112");
> >> print(summaryOfUnits(ufl, enzymes="HindIII"))
> >               snp cnp affxSnp other total
> > enzyme1-only 56933   0       0     0 56933
> > missing        311   0       0    55   366
> > total        57244   0       0    55 57299
>
> The NA26 version gives:
>                snp cnp affxSnp other total
> enzyme1-only 56933   0       0     0 56933
> missing        311   0       0    55   366
> total        57244   0       0    55 57299
>
> >> ufl <- AromaUflFile$byChipType(chiptypes[2], tags="na27,CS20090112");
> >> print(summaryOfUnits(ufl, enzymes="XbaI"))
> >               snp cnp affxSnp other total
> > enzyme1-only 58616   0       0     0 58616
> > missing        344   0       0    55   399
> > total        58960   0       0    55 59015
>
> NA26:
>                snp cnp affxSnp other total
> enzyme1-only 58616   0       0     0 58616
> missing        344   0       0    55   399
> total        58960   0       0    55 59015
>
>
>
> >> ugp <- AromaUgpFile$byChipType(chiptypes[1], tags="na27,CS20090112");
> >> print(summary(ugp, enzymes="HindIII"))
> >  chromosome        position
> >  Min.   :  1.000   Min.   :    48603
> >  1st Qu.:  4.000   1st Qu.: 34667112
> >  Median :  7.000   Median : 72677620
> >  Mean   :  8.402   Mean   : 80405004
> >  3rd Qu.: 12.000   3rd Qu.:114826216
> >  Max.   : 23.000   Max.   :246727435
> >  NA's   :363.000   NA's   :      363
>
> NA26:
>
> > print(summary(ugp));
>
>  chromosome        position
>  Min.   :  1.000   Min.   :    48603
>  1st Qu.:  4.000   1st Qu.: 34667112
>  Median :  7.000   Median : 72677621
>  Mean   :  8.402   Mean   : 80405004
>  3rd Qu.: 12.000   3rd Qu.:114826216
>  Max.   : 23.000   Max.   :246727435
>  NA's   :363.000   NA's   :      363
>
> > print(table(ugp[,1]));
>
>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   
> 16
>
> 4541 5072 3962 4342 4215 3968 3444 3549 2357 2743 2466 2592 2661 1931 1440 
> 1145
>
>   17   18   19   20   21   22   23
>  985 1731  326  993  883  433 1157
>
> >> ugp <- AromaUgpFile$byChipType(chiptypes[2], tags="na27,CS20090112");
> >> print(summary(ugp, enzymes="XbaI"))
> >  chromosome        position
> >  Min.   :  1.000   Min.   :    93683
> >  1st Qu.:  4.000   1st Qu.: 34636629
> >  Median :  7.000   Median : 72249739
> >  Mean   :  8.507   Mean   : 80010574
> >  3rd Qu.: 12.000   3rd Qu.:114666170
> >  Max.   : 24.000   Max.   :246885089
> >  NA's   :390.000   NA's   :      390
>
> NA26:
>
> > print(summary(ugp));
>
>  chromosome        position
>  Min.   :  1.000   Min.   :    93683
>  1st Qu.:  4.000   1st Qu.: 34636629
>  Median :  7.000   Median : 72249739
>  Mean   :  8.507   Mean   : 80010574
>  3rd Qu.: 12.000   3rd Qu.:114666170
>  Max.   : 24.000   Max.   :246885089
>  NA's   :390.000   NA's   :      390
>
> > print(table(ugp[,1]));
>
>    1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   
> 16
>
> 4669 5274 3864 4231 4149 4111 3612 3422 2439 2953 2896 2685 2567 2083 1590 
> 1245
>
>   17   18   19   20   21   22   23   24
>  971 1837  364 1101 1027  330 1204    1
>
>
>
>
>
> > Here is the summary for the 500K arrays:
> > # Summary 500K chips
> >> str(units)
> > List of 2
> >  $ Mapping250K_Sty: int [1:238304] 15133 175423 164715 237140 112643
> > 189587 162193 79611 196992 73555 ...
> >  $ Mapping250K_Nsp: int [1:262264] 34952 76 74370 232354 3677 72977
> > 73533 176215 161345 238482 ...
>
> >> ufl <- AromaUflFile$byChipType(chiptypes[1], tags="na27,CS20090112");
> >> print(summaryOfUnits(ufl, enzymes="StyI"))
> >                snp cnp affxSnp other  total
> > enzyme1-only 144868   0       0     0 144868
> > missing       93436   0       0    74  93510
> > total        238304   0       0    74 238378
>
> NA26:
>                 snp cnp affxSnp other  total
> enzyme1-only 237697   0       0     0 237697
> missing         607   0       0    74    681
> total        238304   0       0    74 238378
>
> >> ufl <- AromaUflFile$byChipType(chiptypes[2], tags="na27,CS20090112");
> >> print(summaryOfUnits(ufl, enzymes="NspI"))
> >                snp cnp affxSnp other  total
> > enzyme1-only 261563   0       0     0 261563
> > missing         701   0       0    74    775
> > total        262264   0       0    74 262338
>
> NA26:
>                 snp cnp affxSnp other  total
> enzyme1-only 261563   0       0     0 261563
> missing         701   0       0    74    775
> total        262264   0       0    74 262338
>
>
>
> >> ugp <- AromaUgpFile$byChipType(chiptypes[1], tags="na27,CS20090112");
> >> print(summary(ugp, enzymes="StyI"))
> >  chromosome        position
> >  Min.   :  1.000   Min.   :     2994
> >  1st Qu.:  4.000   1st Qu.: 31306881
> >  Median :  8.000   Median : 67082398
> >  Mean   :  9.117   Mean   : 77333484
> >  3rd Qu.: 13.000   3rd Qu.:114799352
> >  Max.   : 23.000   Max.   :247135059
> >  NA's   :677.000   NA's   :      677
>
> NA26:
>
> > print(summary(ugp));
>
>  chromosome        position
>  Min.   :  1.000   Min.   :     2994
>  1st Qu.:  4.000   1st Qu.: 31306881
>  Median :  8.000   Median : 67082398
>  Mean   :  9.117   Mean   : 77333484
>  3rd Qu.: 13.000   3rd Qu.:114799352
>  Max.   : 23.000   Max.   :247135059
>  NA's   :677.000   NA's   :      677
>
> > print(table(ugp[,1]));
>
>     1     2     3     4     5     6     7     8     9    10    11    12    13
> 20273 19147 15396 13236 14854 14296 11822 12601 10900 14190 12928 11863  8052
>    14    15    16    17    18    19    20    21    22    23
>  7531  7323  8272  6403  6726  3670  6555  3182  3669  4812
>
> >> ugp <- AromaUgpFile$byChipType(chiptypes[2], tags="na27,CS20090112");
> >> print(summary(ugp, enzymes="NspI"))
> >  chromosome        position
> >  Min.   :  1.000   Min.   :    17408
> >  1st Qu.:  4.000   1st Qu.: 32574796
> >  Median :  8.000   Median : 70596240
> >  Mean   :  8.758   Mean   : 79224244
> >  3rd Qu.: 13.000   3rd Qu.:114776300
> >  Max.   : 23.000   Max.   :247110269
> >  NA's   :775.000   NA's   :      775
>
> NA26:
>
> > print(summary(ugp));
>
>  chromosome        position
>  Min.   :  1.000   Min.   :    17408
>  1st Qu.:  4.000   1st Qu.: 32574797
>  Median :  8.000   Median : 70596240
>  Mean   :  8.758   Mean   : 79224244
>  3rd Qu.: 13.000   3rd Qu.:114776301
>  Max.   : 23.000   Max.   :247110269
>  NA's   :775.000   NA's   :      775
>
> > print(table(ugp[,1]));
>
>     1     2     3     4     5     6     7     8     9    10    11    12    13
> 19810 22178 18347 19016 17133 17097 13912 14820 11899 14241 13254 13026 11094
>    14    15    16    17    18    19    20    21    22    23
>  8165  6982  7005  4830  8136  2661  5823  3927  2511  5696
>
>
>
> > Could it be that function summaryOfUnits() does not work as expected?
> > For ufl it prints "enzyme1-only" instead  of e.g. "NspI-only"
>
> I've noticed that too; the enzyme name is not reported.  But don't
> worry, the files will contain the same information regardless.
>
> > For ugp it prints an error: no applicable method
>
> summaryOfUnits() is only available for AromaUflFile objects, not
> AromaUgpFile objects.
>
>
>
> > Is it correct that for na27 there are now 93436 missing SNPs for StyI
> > compared to only 75 missing SNPs for na24 (as shown on your page about
> > building UFL files)?
>
> No, that does not seem to be correct.  This could be an error in the
> NetAffx CSV file, e.g. once they [Affymetrix] added Windows newlines
> in funny place.  However, since the same CSV file is used to import
> the UGP data and that seems to be alright, I don't think this is the
> case here.  It could also be that they've changed to column names so
> that importFrom() does no longer recognize the format, but this sounds ...
>
> read more »
--~--~---------~--~----~------------~-------~--~----~
When reporting problems on aroma.affymetrix, make sure 1) to run the latest 
version of the package, 2) to report the output of sessionInfo() and 
traceback(), and 3) to post a complete code example.


You received this message because you are subscribed to the Google Groups 
"aroma.affymetrix" group.
To post to this group, send email to aroma-affymetrix@googlegroups.com
To unsubscribe from this group, send email to 
aroma-affymetrix-unsubscr...@googlegroups.com
For more options, visit this group at 
http://groups.google.com/group/aroma-affymetrix?hl=en
-~----------~----~----~----~------~----~------~--~---

[aroma.affymetrix] Re: ufl and ugp files for na27?

Reply via email to