[aroma.affymetrix] Re: ufl and ugp files for na27?

2009-01-15 Thread cstratowa

Dear Henrik

Thank you very much for your reply and your scripts.


First, the differences between na26 and na27 are as follows:
Header:
- line 11: changed annotation date format from July 21 2008 to
2008-12-01
Data:
- line 17: added one new column, i.e. last column is now %GC
- the data for Probe Set ID,Chromosome,Physical Position are
identical for na26 and na27.
This means that in principle I can still use na26.
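
The column comparison described above could be reproduced with a few lines of base R. This is only a sketch: compareAnnotationColumns() is a hypothetical helper (not part of aroma.affymetrix), and it assumes that the header lines of the NetAffx CSV files start with '#' and that the column names are exactly those quoted above.

```r
# Hypothetical helper (not part of aroma.affymetrix): check whether selected
# columns are identical in two NetAffx annotation CSV files.
compareAnnotationColumns <- function(fileA, fileB,
    columns = c("Probe Set ID", "Chromosome", "Physical Position")) {
  readCols <- function(path) {
    # Assumption: header/metadata lines start with '#'
    data <- read.csv(path, comment.char = "#", check.names = FALSE,
                     colClasses = "character", stringsAsFactors = FALSE)
    data <- data[, columns, drop = FALSE]
    # Sort by the first column so that row order does not matter
    data <- data[order(data[[columns[1]]]), , drop = FALSE]
    rownames(data) <- NULL
    data
  }
  identical(readCols(fileA), readCols(fileB))
}
```

It would be called with the paths of the two releases, e.g. compareAnnotationColumns(pathToNa26Csv, pathToNa27Csv), where both path variables are placeholders.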


Second, when I compare your results for HindIII, XbaI and NspI for
na26 to my results with na26 and na27, I get identical summary results
for both ufl and ugp files, as you may have realized when adding your
results for na26.
In contrast, for StyI my results for both na26 and na27 are still:

                snp cnp affxSnp other  total
enzyme1-only 144868   0       0     0 144868
missing       93436   0       0    74  93510
total        238304   0       0    74 238378

This means I get 93436 missing snps vs your 607 missing snps. (See the
partial output for importFrom below!!!)
Even downloading the na26 file and the cdf file from Affymetrix again
did not change the result, which is strange since for the other 3
chiptypes the results agree with your results.

BTW, I have downloaded 250k_sty_libraryfile_rev4.zip. However, there
is also an older version, 250k_sty_libraryfile_rev3.zip. Which
version did you use for creating the ufl file?


Third, trying to run your script Mapping250K_Sty,UFL,na26.R gave the
following error at the line
units <- importFrom(ufl, csv, enzymes=enzyme, verbose=log);

Error in list(importFrom(ufl, csv, enzymes = enzyme, verbose = -50)
= environment,  :

[2009-01-15 12:33:30] Exception: Argument 'enzymes' contains 1 NA value
(s).
  at throw(Exception(...))
  at throw.default(sprintf(Argument '%s' contains %d NA value
(s)., .name, sum(
  at throw(sprintf(Argument '%s' contains %d NA value(s)., .name, sum
(is.na(x)
  at getNumerics.Arguments(static, ..., asMode = integer, disallow =
disallow)
  at getNumerics(static, ..., asMode = integer, disallow = disallow)
  at getIntegers.Arguments(static, ..., range = range)
  at getIntegers(static, ..., range = range)
  at method(static, ...)
  at Arguments$getIndices(enzymes, range = c(1, 10))
  at readDataUnitFragmentLength.AffymetrixNetAffxCsvFile(csv, enzymes
= enzymes,
  at readDataUnitFragmentLength(csv, enzymes = enzymes, rows =
keep, ..., verbos
  at importFromAffymetrixNetAffxCsvFile.AromaUflFile(this, src, ...)
  at importFromAffymetrixNetAffxCsvFile(this, src, ...)
  at importFrom.AromaUnitTabularBinaryFile(ufl, csv, enzymes = enzyme,
verbose =
  at importFrom(ufl, csv,
In addition: Warning message:
In eval(expr, envir, enclos) : NAs introduced by coercion
Importing (unit name, fragment length+) data from
AffymetrixNetAffxCsvFile...done

This is strange, since I have checked that the argument enzymes
correctly evaluates to StyI.
Maybe this has to do with my version, aroma.affymetrix_0.9.4.
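
The traceback above passes through Arguments$getIndices(enzymes, range = c(1, 10)) and ends with the warning "NAs introduced by coercion". That pattern is what base R produces when a name such as StyI is coerced to an integer index. The sketch below only illustrates the coercion. Whether importFrom() in this version expects an index rather than a name is an assumption, and the vector enzymeNames is hypothetical (its order is taken from the "Identified enzymes: NspI, StyI" line in the NspI output).

```r
enzyme <- "StyI"

# Coercing an enzyme *name* to an integer index gives NA (plus the
# "NAs introduced by coercion" warning, suppressed here)
idx <- suppressWarnings(as.integer(enzyme))
print(is.na(idx))    # TRUE

# Hypothetical lookup: translate the name into its position instead
enzymeNames <- c("NspI", "StyI")
enzymeIndex <- match(enzyme, enzymeNames)
print(enzymeIndex)   # 2
```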


Here are the partial outputs for importFrom(ufl, csv, verbose=-50);
for NspI and StyI:

1. Running importFrom(ufl, csv, verbose=-50) for NspI has the
following partial output (see ==):

  Reading AffymetrixNetAffxCsvFile...done
  Extracting fragment lengths from ([enzyme], lengths, start, stop)...
   Inferring if enzyme names are specified...
Has enzyme names: TRUE
   Inferring if enzyme names are specified...done
   Identifying number of enzymes...
nbrOfEnzymes
        1      2      3
==  131564 130698      2
Max number of enzymes: 3
   Identifying number of enzymes...done
   Splitting into subunits and padding with NAs...
   Splitting into subunits and padding with NAs...done
   Extracting enzyme names...
Identified enzymes: NspI, StyI
   Extracting enzyme names...done
   Identifying the location of the fragment lengths...
Offset: 3
   Identifying the location of the fragment lengths...done
   Extracting fragment lengths...
Summary of *all* fragment lengths:
    Min.  1st Qu.   Median     Mean  3rd Qu.     Max.     NA's
    19.0    526.0    755.0    842.9   1049.0   2000.0 394527.0
   Extracting fragment lengths...done
   Sorting data by enzyme...
       V1                V2                V3
 Min.   : 100.0   Min.   :    19   Min.   :    NA
 1st Qu.: 476.0   1st Qu.:   937   1st Qu.:    NA
 Median : 644.0   Median :  1280   Median :    NA
 Mean   : 648.9   Mean   :  1231   Mean   :   NaN
 3rd Qu.: 816.0   3rd Qu.:  1614   3rd Qu.:    NA
 Max.   :1480.0   Max.   :  2000   Max.   :    NA
== NA's   : 701.0   NA's   :131564   NA's   :262264
   Sorting data by enzyme...done
int [1:262264, 1] 574 700 580 631 666 1060 798 842 822 608 ...
  Extracting fragment lengths from ([enzyme], lengths, start,
stop)...done
 Reading (unitName, fragmentLength+) from file...done


2. Running importFrom(ufl, csv, verbose=-50) for StyI has the
following partial output:

  Reading AffymetrixNetAffxCsvFile...done
  Extracting fragment lengths from ([enzyme], lengths, start, stop)...
   Inferring 

[aroma.affymetrix] Re: ufl and ugp files for na27?

2009-01-14 Thread Henrik Bengtsson

Hi Christian,

I'm quite swamped, but I've added links to the scripts that I used to
generate the existing NetAffx 26 (na26) UFL and UGP files to each of
the different SNP & CN chip type pages, e.g.

  
http://groups.google.com/group/aroma-affymetrix/web/mapping250k-nsp-mapping250k-sty

More comments below.

On Mon, Jan 12, 2009 at 6:24 AM, cstratowa
<christian.strat...@vie.boehringer-ingelheim.com> wrote:

 Dear Henrik

 Meanwhile I have created ufl and ugp files for both 100K and 500K
 arrays but not for the GenomeWideSNP_6 array.

The above scripts will show you how to do it for GenomeWideSNP_6.
It is slightly more complicated since two NetAffx CSV files are involved.


 Can you confirm that the following code, which I use for both 100K and
 500K arrays, is correct:

 # retrieving annotation files
 chiptypes <- c("Mapping50K_Hind240", "Mapping50K_Xba240")
 cdfs <- lapply(chiptypes, FUN=function(x) {
   AffymetrixCdfFile$byChipType(x)
 })
 names(cdfs) <- chiptypes
 print(cdfs)

 # importing data from NetAffx CSV files
 csvs <- lapply(cdfs, FUN=function(cdf) {
   AffymetrixNetAffxCsvFile$byChipType(getChipType(cdf), tags=".na27")
 })
 print(csvs)

 # allocating empty UFL (Unit Fragment Length) files
 ufls <- lapply(cdfs, FUN=function(cdf) {
   AromaUflFile$allocateFromCdf(cdf, tags="na27,CS20090112")
 })
 print(ufls)

 # import SNP data
 units <- list();
 for (chipType in names(ufls)) {
   ufl <- ufls[[chipType]];
   csv <- csvs[[chipType]];
   units[[chipType]] <- importFrom(ufl, csv, verbose=-50);
 }
 str(units)

 # allocating empty UGP (Unit Genome Position) files
 ugps <- lapply(cdfs, FUN=function(cdf) {
   AromaUgpFile$allocateFromCdf(cdf, tags="na27,CS20090112")
 })
 print(ugps)

 # import SNP data
 units <- list();
 for (chipType in names(ugps)) {
   ugp <- ugps[[chipType]];
   csv <- csvs[[chipType]];
   units[[chipType]] <- importFrom(ugp, csv, verbose=-50);
 }
 str(units)

This looks alright to me.  You might want to check it against the
posted NA26 scripts as well, because they are more recent.



 Here is the summary for the 100K arrays:
 # Summary 50K chips
 str(units)
 List of 2
  $ Mapping50K_Hind240: int [1:57244] 18632 18677 1631 18713 1630 18712
 18619 1639 18722 18608 ...
  $ Mapping50K_Xba240 : int [1:58960] 29181 18239 31302 19831 47750
 45114 19103 39711 19772 37811 ...

 ufl <- AromaUflFile$byChipType(chiptypes[1], tags="na27,CS20090112");
 print(summaryOfUnits(ufl, enzymes="HindIII"))
                snp cnp affxSnp other total
 enzyme1-only 56933   0       0     0 56933
 missing        311   0       0    55   366
 total        57244   0       0    55 57299

The NA26 version gives:
               snp cnp affxSnp other total
enzyme1-only 56933   0       0     0 56933
missing        311   0       0    55   366
total        57244   0       0    55 57299


 ufl <- AromaUflFile$byChipType(chiptypes[2], tags="na27,CS20090112");
 print(summaryOfUnits(ufl, enzymes="XbaI"))
                snp cnp affxSnp other total
 enzyme1-only 58616   0       0     0 58616
 missing        344   0       0    55   399
 total        58960   0       0    55 59015

NA26:
               snp cnp affxSnp other total
enzyme1-only 58616   0       0     0 58616
missing        344   0       0    55   399
total        58960   0       0    55 59015


 ugp <- AromaUgpFile$byChipType(chiptypes[1], tags="na27,CS20090112");
 print(summary(ugp, enzymes="HindIII"))
    chromosome         position
  Min.   :  1.000   Min.   :48603
  1st Qu.:  4.000   1st Qu.: 34667112
  Median :  7.000   Median : 72677620
  Mean   :  8.402   Mean   : 80405004
  3rd Qu.: 12.000   3rd Qu.:114826216
  Max.   : 23.000   Max.   :246727435
  NA's   :363.000   NA's   :  363

NA26:

 print(summary(ugp));
   chromosome         position
 Min.   :  1.000   Min.   :48603
 1st Qu.:  4.000   1st Qu.: 34667112
 Median :  7.000   Median : 72677621
 Mean   :  8.402   Mean   : 80405004
 3rd Qu.: 12.000   3rd Qu.:114826216
 Max.   : 23.000   Max.   :246727435
 NA's   :363.000   NA's   :  363
 print(table(ugp[,1]));

   1    2    3    4    5    6    7    8    9   10   11   12   13   14   15   16
4541 5072 3962 4342 4215 3968 3444 3549 2357 2743 2466 2592 2661 1931 1440 1145
  17   18   19   20   21   22   23
 985 1731  326  993  883  433 1157

 ugp <- AromaUgpFile$byChipType(chiptypes[2], tags="na27,CS20090112");
 print(summary(ugp, enzymes="XbaI"))
    chromosome         position
  Min.   :  1.000   Min.   :93683
  1st Qu.:  4.000   1st Qu.: 34636629
  Median :  7.000   Median : 72249739
  Mean   :  8.507   Mean   : 80010574
  3rd Qu.: 12.000   3rd Qu.:114666170
  Max.   : 24.000   Max.   :246885089
  NA's   :390.000   NA's   :  390

NA26:

 print(summary(ugp));
   chromosome         position
 Min.   :  1.000   Min.   :93683
 1st Qu.:  4.000   1st Qu.: 34636629
 Median :  7.000   Median : 72249739
 Mean   :  8.507   Mean   : 80010574
 3rd Qu.: 12.000   3rd Qu.:114666170
 Max.   : 24.000   Max.   :246885089
 NA's   :390.000   NA's   :  390
 print(table(ugp[,1]));

   1    2    3    4    5    6    7    8    9   10   11   12   13