Hi.

On Mon, Jan 24, 2011 at 2:58 PM, Taku Tokuyasu <tok...@gmail.com> wrote:
> I would like to know if there are any recommendations on re-running
> 'aroma' on a subset of samples.  In my case, the first pass through a
> dataset (Affy Mouse expression) revealed an outlier array.  I would
> now like to re-run with the outlier array removed.

I guess you use this as an example only.  Dropping a single array
shouldn't make a big difference to the other arrays, because the
analysis steps used are all robust against outliers (as long as you
have enough samples in your data set, say, more than 10).  However, if
you want to compare the results with and without that outlier, below
is how to do it.

>
> 1) Is extract() the recommended way to do sample subsetting:
>  cs <- extract(cs, 2:length(cs))  ## first array is the outlier
> Using brackets [] produced a list, so I presume that does not work.

Yes, use extract() to subset a data set.  (Correct, brackets do not
do the same thing and are not part of the official API, simply
because we haven't decided on what they should do.)
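
For instance, a minimal sketch of dropping one array by its sample
name (the name "OutlierArray" is hypothetical):

  idx <- which(getNames(cs) != "OutlierArray")  # indices of the arrays to keep
  cs <- extract(cs, idx)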

>
> 2) An aroma re-run quickly returns at this point, because the output
> files already exist.  It appears necessary to remove the output files
> first.

Correct, if previous (intermediate and/or final) results already
exist with the same data set full name (name plus tags), the aroma
framework assumes their content is correct.

Alt 1: The easiest way to force a rerun is to simply delete those
intermediate results, which can typically be found in subdirectories
of the following "root" directories: probeData/ and plmData/.  Other
directories may also be created, depending on the analysis you do.
(See the removeDirectory() example further below for how to do this
from within R.)

Alt 2: An alternative is to add a new tag to the *first* step of the
analysis where you want to drop some samples.  For instance, in your
case it is sufficient to do it in the quantile normalization step,
because the RMA-style background correction is a truly single-array
method.  So, you can do:

  qn <- QuantileNormalization(csBC, typesToUpdate="pm", tags="*,v2")

That will append your custom tag "v2" to the default ones (hence the
"*").  Since any downstream steps inherit the tags from previous
steps, this also makes sure that new intermediate and final results
will be generated.
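
Putting Alt 2 together, a minimal sketch (the data set name
"MyDataSet" is an assumption; adjust the name and chip type to your
setup):

  library("aroma.affymetrix")
  cs <- AffymetrixCelSet$byName("MyDataSet", chipType="Mouse430_2")
  bc <- RmaBackgroundCorrection(cs)
  csBC <- process(bc, verbose=TRUE)        # single-array step; existing results are reused
  csBC <- extract(csBC, 2:length(csBC))    # drop the outlier (here: the first array)
  qn <- QuantileNormalization(csBC, typesToUpdate="pm", tags="*,v2")
  csN <- process(qn, verbose=TRUE)         # written under the new ",v2" tag
  plm <- RmaPlm(csN)
  fit(plm, verbose=TRUE)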

More comments below:

> The following code appears apropos (pulled from
> http://www.agron-omics.eu/uploads/Tiling%20array%20files/agronomicsTools01.r):

Just FYI, that script contains lots of "tricks" for aroma.*, R.cache,
R.utils etc., some of which I do not recommend that others use.  There
is also some code in there that the author of that script may want to
fix (if they're listening on this channel).

>>>>  CODE
> force <- TRUE
> bc <- RmaBackgroundCorrection(celSet);
> if (force & !is.null(getOutputFiles(bc))){
>    file.remove(getOutputFiles(bc))
> }

The easiest way to delete a data set from within R is to do:

removeDirectory(getPath(bc), mustExist=FALSE);
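
Here getPath(bc) is the output directory of that processing step (a
subdirectory of probeData/), and mustExist=FALSE makes the call a
no-op if nothing has been processed yet.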


> csBC <- process(bc, verbose=verbose, force=force, overwrite=force);
> qn <- QuantileNormalization(csBC, typesToUpdate="pm");
> if (force) file.remove(getTargetDistributionPathname(qn))

I don't encourage this.  First,
getTargetDistributionPathname() is not part of the public API.
Second, it is unnecessary to delete the so-called target
distribution file (here "getTargetDistributionPathname(qn)"), because
it is calculated as the robust average of all arrays in the 'csBC'
data set, and its filename is generated using checksums such that it
will be unique for any set of data files.

> clearCache(qn)

Again, this is an internal method and shouldn't be needed; I'm not
sure why it is used here.

> if (force & !is.null(getOutputFiles(qn))){
>    file.remove(getOutputFiles(qn))
> }
> csN <- process(qn, verbose=verbose, force=force);
> plm <- RmaPlm(csN);
> fit(plm, unit=NULL, verbose=verbose, force=force)
> <<<   END OF CODE
>
> I feel having the output files consistent (i.e. not mixed from
> different runs) is a good idea.

Yes.

> The flip side is, analyzing subsets
> of samples in parallel (e.g. before a strict decision on outliers has
> been made) is probably best handled by treating each one as a separate
> dataset, starting from the original CEL files.

Or better, split them only at the first step that is a multi-array
method, as suggested above (cf. the sketch after Alt 2).

Thus, it is useful to understand how the models/methods/algorithms
work, so one can tell which are truly single-array methods and which
are multi-array methods.  (Yes, I've been considering annotating the
methods/classes to carry this information.  The problem is that some
methods can be either, depending on which parameters are used.  It is
also a matter of priorities.)
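
For the steps used in this thread, my reading (not an official
annotation) is:

  Single-array (safe to subset before or after):
    RmaBackgroundCorrection - corrects each array independently
  Multi-array (subset *before* these):
    QuantileNormalization - the target distribution is averaged over all arrays
    RmaPlm - the probe-level model is fitted across all arrays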

Hope this helps

Henrik

>
> Regards,
>
> _Taku
>
