Re: [aroma.affymetrix] Create binary data files containing BAF data

Henrik Bengtsson Tue, 31 Jan 2012 11:49:47 -0800

On Mon, Jan 30, 2012 at 9:45 PM, Kai <wangz...@gmail.com> wrote:
> Hi Henrik,
>
> Thanks for the detailed explanations. They make a lot of sense. I just had
> one follow-up question on the format of data stored in an
> AromaUnitTotalCnBinaryFile when the "log2ratio" tag is not supplied. In your
> response, you mentioned that "it holds CN ratios on the intensity scale,
> i.e. C=2*theta/thetaR". I am a bit confused by why you multiply
> "theta/thetaR" by a factor of 2?

The "multiplication by 2" is based on the assumption that the
reference sample/array ("thetaR") is truly diploid at every position.
Thus, when your test sample ("theta") has the same amount of
hybridization signal as the reference, that is, when theta/thetaR is
close to one, we believe that the test sample is also diploid at this
position.  By multiplying by 2, the quantity C is then close to two
whenever the test sample is diploid.  This assumption obviously does
not hold for all organisms/genomes.
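
As a minimal numeric sketch of the above (plain Python, not aroma
code; the function name cn_ratio and the signal values are made up
for illustration):

```python
def cn_ratio(theta, theta_ref, ploidy=2.0):
    """Return C = ploidy * theta / thetaR for a single locus.

    The default factor 2 encodes the assumption that the reference
    is diploid at this locus.
    """
    return ploidy * theta / theta_ref


# Hypothetical hybridization signals at three loci.
theta_ref = [1000.0, 1200.0, 800.0]
theta = [1000.0, 600.0, 1200.0]

for t, r in zip(theta, theta_ref):
    # Equal signals give C = 2; half the signal gives C = 1, and so on.
    print(cn_ratio(t, r))
```

With ploidy=1.0 the same function gives the plain ratio theta/thetaR.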

The important thing is not to analyze CN estimates as if they are on
the non-log scale when they are in fact on the log scale, and vice
versa.  Getting the scale factor correct is not that important, and
any downstream analysis method should not rely on the scale being
correct.  If it does, I claim it makes too many assumptions.  It is
very, very rare that you can trust the scale/absolute values of the CN
estimates; what you should be able to trust is their relative
ordering.  Another way to put this: whenever you plot CNs along the
genome, you know no less about what is going on in the genome if you
drop the numbers from the y-axis.
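
To illustrate the point about relative ordering, here is a small
sketch (plain Python; the helper ranks() and the ratio values are
hypothetical) showing that the choice of scale factor leaves the
ordering of the CN estimates unchanged:

```python
def ranks(values):
    """Return the rank (0 = smallest) of each value, assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, idx in enumerate(order):
        result[idx] = rank
    return result


ratios = [0.5, 1.0, 1.5, 0.9]       # hypothetical theta/thetaR values
scaled = [2 * x for x in ratios]    # the same loci as C = 2*theta/thetaR

# The ordering along the genome is identical under either scale factor.
print(ranks(ratios) == ranks(scaled))
```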

>
> In my work, I got the LRR data exported from GenomeStudio for Illumina SNP
> arrays. Since Illumina's LRR is calculated as log2(R_observed/R_expected),
> when I created the AromaUnitTotalCnBinaryFile data without the log2ratio
> tag, I stored them as 2^LRR.

So, that corresponds to C = 1 * theta/thetaR (using "1" not "2" as a
scale factor).
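
A sketch of that conversion, assuming the LRR values are plain log2
ratios (the function name lrr_to_ratio and the example values are
made up; this is not GenomeStudio or aroma code):

```python
def lrr_to_ratio(lrr, scale=1.0):
    """Convert a log2 ratio to scale * 2**lrr on the non-log scale."""
    return scale * 2.0 ** lrr


lrr = [0.0, -1.0, 1.0]  # hypothetical LRR values

# scale=1 gives C = theta/thetaR, which is what storing 2^LRR amounts to.
print([lrr_to_ratio(x) for x in lrr])

# scale=2 gives C = 2*theta/thetaR, centered at two for diploid loci.
print([lrr_to_ratio(x, scale=2.0) for x in lrr])
```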

>I noticed that you also had some examples using
> data from the Illumina platforms (e.g.
> http://aroma-project.org/vignettes/tumorboost-highlevel), so is this how you
> would import data into the aroma framework?

In that vignette ('TumorBoost - Normalization of allelic-specific
copy numbers in tumors with matched normals'), the data have already
been imported into the aroma framework (they sit in *,total.asb and
*,fracB.asb files).  When you pointed me to this vignette, I realized
that you may want to add the tag "ratio" to your CN files, i.e.

totalAndFracBData/<dataSet>(,<tags>)*/<chipType>/
  <sampleName>,ratio,total.asb
  <sampleName>,fracB.asb

It is not critical, but very useful to indicate that they hold TCN
*ratios* and not TCN *intensities* (which is actually what we are
working with in the above TumorBoost vignette).
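
The naming pattern can be sketched as follows.  This is only an
illustration of the comma-separated tag convention; asb_filename is a
hypothetical helper, not part of aroma, and "SampleA" is a made-up
sample name:

```python
def asb_filename(sample_name, kind, tags=()):
    """Build '<sampleName>(,<tags>)*,<kind>.asb'.

    kind is 'total' for TCNs or 'fracB' for BAFs.
    """
    if kind not in ("total", "fracB"):
        raise ValueError("kind must be 'total' or 'fracB'")
    return ",".join([sample_name, *tags, kind]) + ".asb"


print(asb_filename("SampleA", "total", tags=("ratio",)))
print(asb_filename("SampleA", "total", tags=("log2ratio",)))
print(asb_filename("SampleA", "fracB"))
```

The BAF file never carries a scale tag, since BAFs are not ratios on
a log or non-log CN scale.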

Hope this helps more than it confuses.  The take-home message is that
the definition of "log2 ratios" has become a de facto standard in our
field, whereas "CN" is a bit ambiguous and can mean "CN ratio"
(C=theta/thetaR, or C=2*theta/thetaR) as well as "CN intensity"
(theta).  By adding tags we try to make this less ambiguous.

/Henrik

>
> Thank you for sharing your experience.
>
> Best,
> Kai
>
>
> On Friday, January 20, 2012 6:37:25 PM UTC-8, Henrik Bengtsson wrote:
>>
>> > Also in the same vignette, the TotalCnBinary files are named something
>> > like "%s,log2ratio,total.asb". I remember seeing somewhere that the
>> > "log2ratio" tag is used by aroma to tell what type of data the file
>> > contains.
>>
>> You're correct.  An AromaUnitTotalCnBinaryFile data file that has tag
>> "log2ratio" will indicate to aroma (and anyone who browses the file
>> system) that it holds log2 ratios.  Without that tag, aroma will think
>> it holds CN ratios on the intensity scale, i.e. C=2*theta/thetaR.  (It
>> also supports "log10ratio".)  I recommend that you don't convert to
>> log2 ratios unless you already have them, because then you might turn
>> some zeros into -Inf (and some small negative CN ratios into NaN).
>> It's better to preserve those as far as possible (and deal with them
>> later, if even needed).  For instance, if you smooth over multiple
>> CNs, it might be that your smoothed average is positive although some
>> of the individual values are not.
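
A small sketch of why taking logs too early loses information (plain
Python with made-up values; Python's math.log2 raises an error for
non-positive input, so the hypothetical helper safe_log2 mimics what
R's log2() returns instead):

```python
import math


def safe_log2(x):
    """Return log2(x), with -inf for zero and NaN for negative input."""
    if x > 0:
        return math.log2(x)
    return -math.inf if x == 0 else math.nan


cn = [0.4, 0.0, -0.1, 0.5]  # hypothetical noisy CN ratios

# Log-transforming first makes the zero and the negative value unusable.
print([safe_log2(x) for x in cn])

# Smoothing on the non-log scale first gives a positive average,
# which can still be log-transformed afterwards.
smoothed = sum(cn) / len(cn)
print(smoothed, safe_log2(smoothed))
```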
>>
>> > If I also created binary files containing BAF data, do I need to keep
>> > this tag and simply change "total" to "fracB"?
>>
>> BAFs are not on the log scale, so they should not have the "log2ratio"
>> tag.  Other than that, the TCNs and BAFs should be in files with names
>> only differing by the "total" and "fracB" tags.  Thus, if storing TCNs
>> on the intensity scale, you'll have something like:
>>
>> totalAndFracBData/<dataSet>(,<tags>)*/<chipType>/
>>   <sampleName>,total.asb
>>   <sampleName>,fracB.asb
>>
>> If storing log2ratios, you'll have:
>>
>> totalAndFracBData/<dataSet>(,<tags>)*/<chipType>/
>>   <sampleName>,log2ratio,total.asb
>>   <sampleName>,fracB.asb
>>
>> FUTURE PLANS: For simplicity, we will (some day) move to storing TCNs
>> and BAFs in the same file, e.g. <sampleName>,pscn.asb.  The reason why
>> we decided to keep separate TCN and BAF files is that we (mostly me)
>> argued that it would help preserve disk space, for instance when one
>> does downstream normalization of BAFs leaving the TCNs untouched (e.g.
>> TumorBoost), or vice versa (e.g. MSCN).  However, after a few years
>> with PSCN projects, it is clear that it would be much more convenient
>> to have TCNs and BAFs in the same data file.
>>
>> Hope this helps
>>
>> /Henrik
>>
>> >
>> > Thank you very much.
>> >
>> > Best,
>> > Kai
>> >
>> > --
>> > When reporting problems on aroma.affymetrix, make sure 1) to run the
>> > latest version of the package, 2) to report the output of sessionInfo() and
>> > traceback(), and 3) to post a complete code example.
>> >
>> >
>> > You received this message because you are subscribed to the Google
>> > Groups "aroma.affymetrix" group with website http://www.aroma-project.org/.
>> > To post to this group, send email to aroma-af...@googlegroups.com
>>
>> > To unsubscribe and other options, go to
>> > http://www.aroma-project.org/forum/
