Hi.

On Mon, Jan 24, 2011 at 2:58 PM, Taku Tokuyasu <tok...@gmail.com> wrote:
> I would like to know if there are any recommendations on re-running
> 'aroma' on a subset of samples.  In my case, the first pass through a
> dataset (Affy Mouse expression) revealed an outlier array.  I would
> now like to re-run with the outlier array removed.

I guess you use this as an example only.  Dropping a single array shouldn't make a big difference to the other arrays, because the analysis steps used are all robust against outliers (as long as you have enough samples in your data set, say, more than 10).  However, if you want to compare the results with and without that outlier, below is how to do it.

>
> 1) Is extract() the recommended way to do sample subsetting:
>     cs <- extract(cs, 2:length(cs))  ## first array is the outlier
> Using brackets [] produced a list, so I presume that does not work.

Yes, use extract() to subset a data set.  (Correct, brackets do not do the same thing, and they are not part of the official API, simply because we haven't decided on what they should do.)

>
> 2) An aroma re-run quickly returns at this point, because the output
> files already exist.  It appears necessary to remove the output files
> first.

Correct.  If previous (intermediate and/or final) results already exist with the same data set full name (name plus tags), the aroma framework assumes their content is correct.

Alt 1: The easiest way to force the rerun is to simply delete those intermediate results, which can typically be found in subdirectories of the following "root" directories: probeData/ and plmData/.  Other directories may also be created, depending on the analysis you do.

Alt 2: An alternative is to add a new tag at the *first* step of the analysis where you want to drop some samples.  For instance, in your case it is sufficient to do it in the quantile normalization step, because the RMA-style background correction is a truly single-array method.  So, you can do:

  qn <- QuantileNormalization(csBC, typesToUpdate="pm", tags="*,v2")

That will append your custom tag "v2" to the default ones (hence the "*").  Since all downstream steps inherit the tags from previous steps, this also ensures that new intermediate and final results are generated.

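To make Alt 2 concrete, here is a minimal sketch that combines extract() with a custom tag.  The helper name rmaWithoutArrays() and its arguments are made up for illustration and are not part of aroma.affymetrix; the data set name and chip type in the usage comment are placeholders as well.  It assumes the arrays are dropped right before the quantile normalization, i.e. at the first multi-array step.

```r
library(aroma.affymetrix)

# Hypothetical helper (not part of aroma.affymetrix): rerun an RMA-style
# pipeline without the arrays at the indices given in 'drop'.
rmaWithoutArrays <- function(cs, drop, tag = "v2", verbose = FALSE) {
  # RMA-style background correction is a single-array method, so it is run
  # on all arrays; already existing results are reused.
  bc <- RmaBackgroundCorrection(cs)
  csBC <- process(bc, verbose = verbose)

  # Drop the outlier(s) before the first multi-array step.
  keep <- setdiff(seq_len(length(csBC)), drop)
  csBCsub <- extract(csBC, keep)

  # "*" keeps the default tags; the extra tag gives the output a new
  # full name, so older results are not picked up by mistake.
  qn <- QuantileNormalization(csBCsub, typesToUpdate = "pm",
                              tags = paste("*", tag, sep = ","))
  csN <- process(qn, verbose = verbose)

  # Downstream steps inherit the tag from the normalized data set.
  plm <- RmaPlm(csN)
  fit(plm, verbose = verbose)
  plm
}

# Usage (placeholder data set name and chip type):
# cs <- AffymetrixCelSet$byName("MyDataSet", chipType = "Mouse430_2")
# plm <- rmaWithoutArrays(cs, drop = 1)  # first array is the outlier
```

With this layout the results from the full data set and from the subset should end up under different full names, so both can be kept side by side for comparison.
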
More comments below:

> The following code appears apropos (pulled from
> http://www.agron-omics.eu/uploads/Tiling%20array%20files/agronomicsTools01.r):

Just FYI, that script contains lots of "tricks" for aroma.*, R.cache, R.utils etc., some of which I do not recommend others to use.  There is also some code in there that the author of that script may want to fix (if they're listening on this channel).

> >>> CODE
> force <- TRUE
> bc <- RmaBackgroundCorrection(celSet);
> if (force & !is.null(getOutputFiles(bc))){
>   file.remove(getOutputFiles(bc))
> }

The easiest way to delete a data set from within R is to do:

  removeDirectory(getPath(bc), mustExist=FALSE);

> csBC <- process(bc, verbose=verbose, force=force, overwrite=force);
> qn <- QuantileNormalization(csBC, typesToUpdate="pm");
> if (force) file.remove(getTargetDistributionPathname(qn))

I do not encourage this.  First, getTargetDistributionPathname() is not part of the public API.  Second, it is not even necessary to delete the so-called target distribution file (here "getTargetDistributionPathname(qn)"), because it is calculated as the robust average of all arrays in the 'csBC' data set, and its filename is generated using checksums such that the filename will be unique for any set of data files.

> clearCache(qn)

Again, an internal method is used and it shouldn't be needed; I'm not sure why it is used here.

> if (force & !is.null(getOutputFiles(qn))){
>   file.remove(getOutputFiles(qn))
> }
> csN <- process(qn, verbose=verbose, force=force);
> plm <- RmaPlm(csN);
> fit(plm, unit=NULL, verbose=verbose, force=force)
> <<< END OF CODE
>
> I feel having the output files consistent (i.e. not mixed from
> different runs) is a good idea.

Yes.

> The flip side is, analyzing subsets
> of samples in parallel (e.g. before a strict decision on outliers has
> been made) is probably best handled by treating each one as a separate
> dataset, starting from the original CEL files.

Or better, only at the first step that is a multi-array method, as suggested above.  Thus, it is useful to understand how the models/methods/algorithms work, so that one can tell which are truly single-array methods and which are multi-array methods.  (Yes, I've been considering annotating the methods/classes with this information.  The problem is that some methods can be either, depending on which parameters are used.  It is also a matter of priorities.)

Hope this helps

Henrik

>
> Regards,
>
> _Taku
>
> --
> When reporting problems on aroma.affymetrix, make sure 1) to run the latest
> version of the package, 2) to report the output of sessionInfo() and
> traceback(), and 3) to post a complete code example.
>
>
> You received this message because you are subscribed to the Google Groups
> "aroma.affymetrix" group with website http://www.aroma-project.org/.
> To post to this group, send email to aroma-affymetrix@googlegroups.com
> To unsubscribe and other options, go to http://www.aroma-project.org/forum/