Re: [aroma.affymetrix] Re: Mouse diversity array --building the required files for aroma.affymetrix UGP, UFL

Henrik Bengtsson Wed, 09 Jun 2010 05:20:57 -0700

Hi.

On Wed, Jun 9, 2010 at 11:57 AM, Ivanek, Robert <robert.iva...@fmi.ch> wrote:
> Hi Henrik,
>
> one more question to this array.
> During the analysis (Step 4 - Normalization for PCR fragment-length
> effect) I am getting the following error:
>
> [2010-06-09 11:51:49] Exception: Cannot fit normalization function to
> enzyme, because there are no (finite) data points that are unique to
> this enzyme: 1
>  at throw(Exception(...))
>  at throw.default("Cannot fit normalization function to enzyme,
> because there are no (finite) data points that are unique to this
> enzyme: ", ee)
>  at throw("Cannot fit normalization function to enzyme, because there
> are no (finite) data points that are unique to this enzyme: ", ee)
>  at normalizeFragmentLength.default(y, fragmentLengths = fl,
> targetFcns = targetFcns, subsetToFit = subset, onMissing =
> onMissing, ...)
>  at normalizeFragmentLength(y, fragmentLengths = fl, targetFcns =
> targetFcns, subsetToFit = subset, onMissing = onMissing, ...)
>  at process.FragmentLengthNormalization(fln, verbose = verbose)
>
> I think that this is caused by the fact that the UFL file contains
> also "theoretical" values for all fragments and therefore there is no
> single SNP/CN probe annoatated as produced by only one enzyme.


You are completely right here; the algorithm for estimating PCR
fragment-length normalization effects assumes that for each
restriction enzyme (if more than one is used as here), there exists at
least some loci (SNPs or CN units) that are only for that enzyme.   It
is these data points that are used to estimate the fragment-length
effect per enzyme, which is then used for correcting for the effect.
More precisely, this is Step (1) in the algorithm described in Section
2.2.4 'Normalization for fragment-length effects' in the CRMAv2 paper
(Bengtsson, Wirapati et al. 2009;
http://www.aroma-project.org/publications/).

Now looking at the UFL information imported from the NetAffx CSV files:

library("aroma.affymetrix");
cdf <- AffymetrixCdfFile$byChipType("MOUSEDIVm520650");
ufl <- getAromaUflFile(cdf);
print(ufl);

AromaUflFile:
Name: MOUSEDIVm520650
Tags: na30,mm9,HB20190603
Full name: MOUSEDIVm520650,na30,mm9,HB20190603
Pathname: 
annotationData/chipTypes/MOUSEDIVm520650/MOUSEDIVm520650,na30,mm9,HB20190603.ufl
File size: 9.38 MB (9835569 bytes)
RAM: 0.00 MB
Number of data rows: 2458697
File format: v1
Dimensions: 2458697x2
Column classes: integer, integer
Number of bytes per column: 2, 2
Footer: <createdOn>20100603 05:11:33 PDT</createdOn><platform>Affymetrix</platfo
rm><chipType>MOUSEDIVm520650</chipType><createdBy><fullname>Henrik Bengtsson</fu
llname><email>h...@aroma-project.org</email></createdBy><srcFiles><srcFile1><filen
ame>MOUSEDIVm520650.CDF</filename><filesize>668362285</filesize><checksum>1d83d7
dde5e4816f4be315f6863f6a7c</checksum></srcFile1><srcFile2><filename>MOUSEDIVm520
650.na30.annot.csv</filename><filesize>358437027</filesize><checksum>e3ef3524595
4d5a08ae299b95666fc3a</checksum></srcFile2><srcFile3><filename>MOUSEDIVm520650.c
n.na30.annot.csv</filename><filesize>795251151</filesize><checksum>5b5bbdb60ed44
3b2616358beea06ddaf</checksum></srcFile3></srcFiles>
Chip type: MOUSEDIVm520650
Platform: Affymetrix

print(getChecksum(ufl));
[1] "31cc2be0708c34a977f456dfb4b4d3b9"

hasEnzyme1 <- is.finite(ufl[,1,drop=TRUE]);
hasEnzyme2 <- is.finite(ufl[,2,drop=TRUE]);

hasEnzyme1Only <- (hasEnzyme1 & !hasEnzyme2);
hasEnzyme2Only <- (!hasEnzyme1 & hasEnzyme2);

hasEither <- (hasEnzyme1 | hasEnzyme2);
print(summary(hasEither));

   Mode   FALSE    TRUE    NA's
logical  230775 2227922       0

print(summary(hasEnzyme1Only));

   Mode   FALSE    NA's
logical 2458697       0

print(summary(hasEnzyme2Only));

   Mode   FALSE    NA's
logical 2458697       0

So, yes, neither enzyme has data points that are unique to that
enzyme.   This is a problem with the annotation data - not the CEL
file data.

Querying say probesets "JAX00289207" the online Affymetrix NetAffx
Analysis Center:

http://www.affymetrix.com/analysis/netaffx/xmlquery.affx?netaffx=mapping&hightlight=true&rootCategoryId=34005

and the click on the small 'Details' label (not the check box) you get to:

https://www.affymetrix.com/analysis/netaffx/mappingfullrecord.affx?pk=MOUSEDIVm520650:JAX00289207

and confirms that the fragment-length information is:

Enzyme Name, Length
NspI, 11635
StyI, 406

Compare to:
unit <- indexOf(cdf, "JAX00289207");
print(ufl[unit,]);
  length length.02
1  11635       406


> What do you think about filtering out fragments bigger than 5kb (or
> even smaller)? They should not be amplified anyway during the PCR

That might be a solution - though a bit ad hoc.  In order to get this
correct, I would spend the time to carefully do the following:

1) Check what the Affymetrix's Assay documentation says about what
range of fragment lengths the PCR amplification is supposed to keep
and what are supposed to be filtered out.  You should be able to fine
that documentation via the Affymetrix Product Page for
MOUSEDIVm520650.  You find links to the latter at
http://www.aroma-project.org/chipTypes/MOUSEDIVm520650

2) Ask Affymetrix about the fragment-length information available in
the NetAffx CSV files (and the above).  The NetAffx group are quite
responsive and are happy to get feedback so they can correct any
mistakes.  If you are quick enough, they might be able to correct it
before the NetAffx Release 31 scheduled for "June/July".  You can post
a message to one of the forums at Affymetrix Scientific Community
Forums [https://www.affymetrix.com/community/forums/index.jspa].  You
may also want send a message to Affymetrix Support
[www.affymetrix.com/support/] - in my experience they are also very
responsive.  I would point them also to the above NetAffx query
results.

It's worth getting this right once and for all.  Feel free to update
us here on the list.

Cheers,

Henrik

>
> Best Regards
>
> Robert
>
>
>
> On Jun 3, 3:22 pm, Henrik Bengtsson <h...@stat.berkeley.edu> wrote:
>> On Thu, Jun 3, 2010 at 12:21 PM, Ivanek, Robert <robert.iva...@fmi.ch> wrote:
>> > Hi Henrik,
>>
>> > I think you are right, the fragment sizes are theoretical ones. I
>> > would guess that the reason why also the long fragments are reported
>> > is because the same SNP is present in short fragment produced by the
>> > other enzyme.
>>
>> > Thank you very much for the patch.
>>
>> > Would you mind to update the MOUSEDIVm520650 chipType page and add
>> > there the UGP and UFL files?
>>
>> Ideally users contribute with UGP and UFL too, though this time I've
>> done it since I've already done most of the work.  Please compare to
>> what you got when you did.
>>
>> /Henrik
>>
>>
>>
>> > Best Regards
>>
>> > Robert
>>
>> > On Jun 2, 6:47 pm, Henrik Bengtsson <h...@stat.berkeley.edu> wrote:
>> >> Hi.
>>
>> >> On Wed, Jun 2, 2010 at 11:16 AM, Ivanek, Robert <robert.iva...@fmi.ch> 
>> >> wrote:
>> >> > HI Henrik,
>>
>> >> > I was a little bit investigating the error and I found out that some
>> >> > of the fragments reported in NetAffx files are really long.
>> >> > Why they got a negative value of -32768 and not a positive one?
>>
>> >> Thanks for reporting.  It turns out to be a bug in aroma.core causing
>> >> it to censor values into [-32767,32768], whereas it should have been
>> >> [-32768,32767].  Thus, the fragment lengths that are too large where
>> >> written as 32768, which when read back became -32768 (that's how
>> >> signed integers loops around when output of range).  That should have
>> >> been written as 32767.
>>
>> >> I have fixed this in the next release of aroma.core.  Until that is
>> >> released, you can install a patch as explained in:
>>
>> >>  http://aroma-project.org/howtos/updateOrPatch
>>
>> >> With the patch, you will get correct censoring and more informative
>> >> warnings, e.g.
>>
>> >> Warning messages:
>> >> 1: In updateDataColumn.AromaTabularBinaryFile(this, .con = con, .hdr = 
>> >> hdr,  :
>> >>   33 values to be assigned were out of range [-32768,32767] and
>> >> therefore censored to fit the range. Of these, 33 values in
>> >> [35102,655381] were too large.
>> >> 2: In updateDataColumn.AromaTabularBinaryFile(this, .con = con, .hdr = 
>> >> hdr,  :
>> >>   21 values to be assigned were out of range [-32768,32767] and
>> >> therefore censored to fit the range. Of these, 21 values in
>> >> [50496,56758] were too large.
>>
>> >> About the very large fragment lengths:  My guess is that they are
>> >> "theoretical" fragments lengths.  After running the PCR in the assay,
>> >> very long fragments are not amplified and hence filtered out.  For the
>> >> specific enzyme, you should not get any hybrization signal for very
>> >> long fragments.  It is possible that you have signal from the cuts of
>> >> the other enzyme.   Maybe someone else has a better explanation of why
>> >> they are so long and still on the array?   You could also drop a
>> >> message on the Affymetrix forums and ask.
>>
>> >> /Henrik
>>
>> >> > Robert
>>
>> >> > On Jun 1, 7:16 pm, "Ivanek, Robert" <robert.iva...@fmi.ch> wrote:
>> >> >> Hi Henrik,
>>
>> >> >> Thanks for the answer and also the ACS file.
>> >> >> I have one more question regarding the UFL file generation.
>>
>> >> >> I tried it by using the NettAffx and I got the following error:
>>
>> >> >> R> ufl <- AromaUflFile$allocateFromCdf(cdf, nbrOfEnzymes=2,
>> >> >> tags=c("na30", "RI20100601"))
>> >> >> R> csv <- AffymetrixNetAffxCsvFile$byChipType(chipType, tags=".na30");
>> >> >> R> units <- importFrom(ufl, csv);
>> >> >> Warning messages:
>> >> >> 1: In updateDataColumn.AromaTabularBinaryFile(this, .con = con, .hdr =
>> >> >> hdr,  :
>> >> >>   Values to be assigned were out of range [-32767,32768] and therefore
>> >> >> censored to fit the range.
>> >> >> 2: In updateDataColumn.AromaTabularBinaryFile(this, .con = con, .hdr =
>> >> >> hdr,  :
>> >> >>   Values to be assigned were out of range [-32767,32768] and therefore
>> >> >> censored to fit the range.
>>
>> >> >> R> csv <- AffymetrixNetAffxCsvFile$byChipType(chipType,
>> >> >> tags=".cn.na30");
>> >> >> R> units <- importFrom(ufl, csv);
>> >> >> Warning messages:
>> >> >> 1: In updateDataColumn.AromaTabularBinaryFile(this, .con = con, .hdr =
>> >> >> hdr,  :
>> >> >>   Values to be assigned were out of range [-32767,32768] and therefore
>> >> >> censored to fit the range.
>> >> >> 2: In updateDataColumn.AromaTabularBinaryFile(this, .con = con, .hdr =
>> >> >> hdr,  :
>> >> >>   Values to be assigned were out of range [-32767,32768] and therefore
>> >> >> censored to fit the range.
>>
>> >> >> And the summary produce the following
>> >> >> R> summary(ufl)
>> >> >>  length           length.02
>> >> >>  Min.   :-32768   Min.   :-32768
>> >> >>  1st Qu.:   614   1st Qu.:   541
>> >> >>  Median :  1146   Median :   997
>> >> >>  Mean   :  1601   Mean   :  1466
>> >> >>  3rd Qu.:  2195   3rd Qu.:  2000
>> >> >>  Max.   : 22095   Max.   : 30002
>> >> >>  NA's   :230775   NA's   :230775
>>
>> >> >> Would you be so kind and build also the UFL and UGP files?
>>
>> >> >> Best Regards
>>
>> >> >> Robert
>>
>> >> >> On May 30, 7:27 pm, Henrik Bengtsson <h...@stat.berkeley.edu> wrote:
>>
>> >> >> > Hi.
>>
>> >> >> > On Wed, May 26, 2010 at 3:24 PM, Ivanek, Robert 
>> >> >> > <robert.iva...@fmi.ch> wrote:
>> >> >> > > Dear Sir or Madam,
>>
>> >> >> > > I would like to analyse the copy number variation data from 
>> >> >> > > Affymetrix
>> >> >> > > Mouse Diversity Array. I have not found any information on your 
>> >> >> > > website
>> >> >> > > about this particular array.
>>
>> >> >> > I have created page for this:
>>
>> >> >> >http://aroma-project.org/chipTypes/MOUSEDIVm520650
>>
>> >> >> > > I have tried to build the annotation files
>> >> >> > > which are required by aroma but without success. I have few 
>> >> >> > > questions
>> >> >> > > regarding that:
>>
>> >> >> > > 1: Is aroma.affymetrix able to analyse the "Mouse Diversity Array" 
>> >> >> > > ?
>>
>> >> >> > Yes, because there should be no reason why it shouldn't - it uses a
>> >> >> > standard CDF etc.  As you've noted, UGP (and UFL) files have not been
>> >> >> > created by anyone yet.
>>
>> >> >> > For CN analysis, at least the UGP (genome positions) annotation data
>> >> >> > file needs to be there.
>>
>> >> >> > > 2: I tried to build the "UGP" file directly from NetAffx annotation
>> >> >> > > files using the code on your website, however I am getting the 
>> >> >> > > following
>> >> >> > > error.
>>
>> >> >> > > ##
>> >> >> > > library("aroma.affymetrix")
>> >> >> > > ##
>> >> >> > > ## create UGP from NetAffx files
>> >> >> > > cdf <- AffymetrixCdfFile$byChipType("MOUSEDIVm520650")
>> >> >> > > ##
>> >> >> > > ## Creates an empty UGP file for the CDF, if missing.
>> >> >> > > ugp <- AromaUgpFile$allocateFromCdf(cdf, tags=c("na30", 
>> >> >> > > "RI20100526"))
>> >> >> > > ##
>> >> >> > > ## Import NetAffx unit position data
>> >> >> > > csv <- AffymetrixNetAffxCsvFile$byChipType("MOUSEDIVm520650",
>> >> >> > > otags=".na30")
>>
>> >> >> > > Error in 
>> >> >> > > list(`AffymetrixNetAffxCsvFile$byChipType("MOUSEDIVm520650",
>> >> >> > > tags = ".na30")` = <environment>,  :
>>
>> >> >> > > [2010-05-26 15:11:00] Exception: File format error of the tabular 
>> >> >> > > file
>> >> >> > > ('annotationData/chipTypes/MOUSEDIVm520650/NetAffx/MOUSEDIVm520650.na30.annot.csv'):
>> >> >> > >  \
>> >> >> > > line 1 did not have 12 elements
>> >> >> > >  at throw(Exception(...))
>> >> >> > >  at throw.default("File format error of the tabular file ('",
>> >> >> > > getPathname(this), "'): ", ex$message)
>> >> >> > >  at throw("File format error of the tabular file ('",
>> >> >> > > getPathname(this), "'): ", ex$message)
>> >> >> > >  at value[[3]](cond)
>> >> >> > >  at tryCatchOne(expr, names, parentenv, handlers[[1]])
>> >> >> > >  at tryCatchList(expr, classes, parentenv, handlers)
>> >> >> > >  at tryCatch({
>> >> >> > >  at verify.TabularTextFile(this, ...)
>> >> >> > >  at verify(this, ...)
>> >> >> > >  at this(...)
>> >> >> > >  at newInstance.Class(clazz, ...)
>> >> >> > >  at newInstance(clazz, ...)
>> >> >> > >  at newInstance.Object(static, pathname)
>> >> >> > >  at newInstance(static, pathname)
>> >> >> > >  at method(static, ...)
>> >> >> > >  at AffymetrixNetAffxCsvFile$byChipType("MOUSEDIVm520650", tags =
>> >> >> > > ".na30")
>> >> >> > > In addition: Warning message:
>> >> >> > > In read.table(3L, header = TRUE, colClasses = c(NA_character_,
>> >> >> > > NA_character_,  :
>> >> >> > >  not all columns named in 'colClasses' exist
>>
>> >> >> > I had a look at the MOUSEDIVm520650.na30.annot.csv file.  The line
>> >> >> > containing column names, that is:
>>
>> >> >> > "Probe Set ID","dbSNP RS ID","Chromosome","Physical
>> >> >> > Position","Strand","Cytoband","Allele A","Allele B","Associated
>> >> >> > Gene","Genetic Map","Fragment Enzyme Type Length Start Stop",
>>
>> >> >> > contains a trailing comma (,) that shouldn't be there ("file format
>> >> >> > error").  This cause R to think there should be 12 and not 11 columns
>> >> >> > in the data set.  Open the file in an editor and remove that trailing
>> >> >> > comma and any whitespace after "Fragment Enzyme Type Length Start
>> >> >> > Stop".  Then save the file.  That should solve the problem.
>>
>> >> >> > The other CSV file - MOUSEDIVm520650.cn.na30.annot.csv - does not 
>> >> >> > have
>> >> >> > this problem.
>>
>> >> >> > > 3. I tried it also by using the "manual" approach using the
>> >> >> > > tab=delimited file, however it seems to me that the mitochondria 
>> >> >> > > probes
>> >> >> > > are skipped  (NA values in ugp[,1] but valid values in ugp[,2]).
>>
>> >> >> > The Affymetrix NetAffx CSV files use s "M" for the mitochondria
>> >> >> > chromosome.  In aroma we encode this by integer 25.
>>
>> >> >> > > Another
>> >> >> > > problem is that some positions for other chromosomes are not 
>> >> >> > > loaded in
>> >> >> > > properly (valid values in ugp[,1] but NA values in ugp[,2]).
>>
>> >> >> > You don't show how you read the data "manually", so it is hard to say
>> >> >> > what you are doing wrong here.  But note that there are quite a few
>> >> >> > arguments in read.table() that you need to set correctly in order to
>> >> >> > read Affymetrix NetAffx CSV files (it doesn't make easier that
>> >> >> > Affymetrix changes the file format once in a while and have stray
>> >> >> > erroneous symbols such as the above comma).
>>
>> >> >> > Also, search our forum for 'MOUSEDIVm520650', because about a year 
>> >> >> > ago
>> >> >> > David Rosenberg disscussed this chip type and I think he did create
>> >> >> > various annotation data files for the
>>
>> ...
>>
>> read more »
>
> --
> When reporting problems on aroma.affymetrix, make sure 1) to run the latest 
> version of the package, 2) to report the output of sessionInfo() and 
> traceback(), and 3) to post a complete code example.
>
>
> You received this message because you are subscribed to the Google Groups 
> "aroma.affymetrix" group with website http://www.aroma-project.org/.
> To post to this group, send email to aroma-affymetrix@googlegroups.com
> To unsubscribe and other options, go to http://www.aroma-project.org/forum/
>

-- 
When reporting problems on aroma.affymetrix, make sure 1) to run the latest 
version of the package, 2) to report the output of sessionInfo() and 
traceback(), and 3) to post a complete code example.


You received this message because you are subscribed to the Google Groups 
"aroma.affymetrix" group with website http://www.aroma-project.org/.
To post to this group, send email to aroma-affymetrix@googlegroups.com
To unsubscribe and other options, go to http://www.aroma-project.org/forum/

Re: [aroma.affymetrix] Re: Mouse diversity array --building the required files for aroma.affymetrix UGP, UFL

Reply via email to