Re: [aroma.affymetrix] Identify outlier density curves by cel file ID, on density plots?

Henrik Bengtsson Sat, 26 Nov 2011 14:29:18 -0800

Hi.

On Fri, Nov 25, 2011 at 10:07 AM, evansa <drannaev...@gmail.com> wrote:
> Hi,
>
> Sorry if this is dumb, I am new to microarray analysis, and
> aroma.affymetrix, and am having difficulties figuring out how data is
> structured and how to access it.


If you have specific questions/use cases on accessing data, I'll try
to explain how to do it.  Some of the 'how to' pages contains useful
information how to retrieve array signals etc:

http://www.aroma-project.org/howtos/

>
> I am working with 156 HG-U133a cel files and a csv file of patient
> characteristics and outcomes (and a similar number of HG-U133 Plus2
> chips).
>
> Down the line I am hoping to be able to perform semi-supervised
> expression data analysis, along the lines of the methods of Bair and
> Tibshirani (2004, PLoS Biology p. 0511-0522), but for the moment am
> attempting to examine my cel files, perform some QC and normalize my
> datasets.
>
> [I managed to create CDF and SAF files, and also  ACS and UPG files
> (though I'm not sure if I have done this correctly). So far, I have
> failed to create UFL files, but I don't think that I actually need
> them, for my purposes?]

Affymetrix provides CDF files for the HG-U133A and the HG-U133_Plus_2
chip types so you should not have to create one yourself, unless you
wish to use special custom genome annotations.

An ACS file holds the cell (probe) sequences of a chip type.  That is
needed if you wish to do probe-sequence normalization, e.g.
BasePositionNormalization.

An UGP (not UPG) file contains genome location information for each
unit, which is only used for SNP/CN DNA analysis.  I doubt you want to
use that for your expression analysis.

The UFL file contains PCR fragment-length information for each unit,
i.e. how long each DNA fragment is that a unit ("gene") sits on after
been digested by the restriction enzymes.  This is used in the
FragmentLengthNormalization methods which was developed specifically
for DNA analysis.

It is not clear what your preprocessing pipeline is, but if you're
only doing plain RMA, you need neither of the ACS, UGP or UFL files.
The CDF alone (in addition to your SAF) should be enough.

>
> I have just figured out how to plot raw log2 probe signal density
> plots colour coded by sample sub-type (by adapting code from
> http://www.aroma-project.org/node/141 and /node/16), defined by SAF
> file attribute values, but I would like to be able to identify
> outlying probe signal density curves by CEL file ID on the plot.
>
> Is there a way to do this, similar to R's identify function,
> perhaps?

there exist no special solution for this "problem" - it is the same
issue have with plain R density plots, e.g.

x <- matrix(rnorm(1000), ncol=50);
plot(density(x[,1]), xlim=c(-5,5), ylim=c(0,0.8), main="", xlab="");
for (cc in 2:ncol(x)) lines(density(x[,cc]));

Just as identify() won't do anything for you here, it won't in aroma.

What you can do is to narrow down arrays of interest using different
color ('col') and line type ('lty') arguments as well as plotting only
a subset of arrays.  To do the latter, just extract a subset of the
arrays and pass that to plotDensity(), e.g.

csRs <- extract(csR, 51:60);
plotDensity(csRs);

Ideally one want a pipeline that "identifes" (read calls) arrays that
are outliers (have densities that stands out from the rest)
automatically and highlights them in the density plots as well as
lists their names.  This requires a density-curve outlier calling
algorithms - I am not aware of such a method.

/Henrik

>
> Thanks in advance for any assistance.
>
> Kind regards
>
> Anna Evans
>
>
>
>
>
>
>> sessionInfo()
> R version 2.13.2 (2011-09-30)
> Platform: i386-pc-mingw32/i386 (32-bit)
>
> locale:
> [1] LC_COLLATE=English_United Kingdom.1252  LC_CTYPE=English_United
> Kingdom.1252
> [3] LC_MONETARY=English_United Kingdom.1252
> LC_NUMERIC=C
> [5] LC_TIME=English_United Kingdom.1252
>
> attached base packages:
> [1] stats     graphics  grDevices datasets  utils     methods
> base
>
> other attached packages:
>  [1] hgu133aprobe_2.8.0     AnnotationDbi_1.14.1
> hgu133acdf_2.8.0       Biobase_2.12.2
>  [5] aroma.affymetrix_2.3.0 affxparser_1.24.0
> aroma.apd_0.2.0        R.huge_0.3.0
>  [9] aroma.core_2.3.2       aroma.light_1.22.0
> matrixStats_0.4.0      R.rsp_0.6.10
> [13] R.cache_0.5.2          R.filesets_1.1.3
> digest_0.5.1           R.utils_1.9.4
> [17] R.oo_1.8.3             affy_1.30.0
> R.methodsS3_1.2.1      rcom_2.2-3.1
> [21] rscproxy_1.3-1
>
> loaded via a namespace (and not attached):
> [1] affyio_1.20.0         DBI_0.2-5             preprocessCore_1.14.0
> RColorBrewer_1.0-5
> [5] RSQLite_0.10.0        tools_2.13.2
>
>
> #================================================
> # Script used
> #================================================
> # load packages and set working directory
> #================================================
> library(aroma.affymetrix) ;
> setwd("D:/AE/AE_Haematology/Microarray/AE_microarray/cel_files/
> 133a") ;
> #================================================
> # Create CDF
> #================================================
> # env2Cdf("hgu133acdf", "C20.CEL", overwrite=TRUE); # works, but use
> method below
> #================================================
> # Install annotation packages from bioconductor
> #================================================
> # 
> http://www.bioconductor.org/packages/release/data/annotation/html/hgu133a.db.html
> source("http://www.bioconductor.org/biocLite.R";)  ;
> biocLite("hgu133a.db")  ;
>
> #================================================
> # Create CDF file from an R package
> # From: http://www.aroma-project.org/node/41
> #
> # create hgu122a.cdf in /cel_files/133a
> # then move manually to /cel_files/133a/annotationData/chipTypes/
> hgu133a
> #================================================
> library("hgu133a.db")  ;
>
> # creating a cdf file
> # created in 133a folder
> env2Cdf("hgu133acdf","C20.CEL",overwrite=TRUE)  ;
>
> #================================================
> # Create ACS file
> # From: http://www.aroma-project.org/node/100
> #================================================
> # downloaded U133Agb_probe_tab.gz from nih.gov
> # at: masker.nci.nih.gov/ev/
> # used Rtools gzip from C:/Rtools/bin on FredRM
> #
> # C:\Rtools\bin>gzip -d "C:\Users\wpcae\Desktop\U133Agb_probe_tab.gz"
> # moved file manually to:
> # D:\AE\AE_Haematology\Microarray\AE_microarray\cel_files\133a
> \annotationData\chipTypes\hgu133a
> # renamed file: hgu133a.probe_tab
> #
> # In R, with library(aroma.affymetrix) loaded, created and ACS file
> [aroma cell sequence file]
>
> ptb<-AffymetrixProbeTabFile$byChipType(chipType) ;
>
> # Allocate ACS of same chip type and probe dimensions as CDF
> cdf<-AffymetrixCdfFile$byChipType(chipType)  ;
> acs<-AromaCellSequenceFile$allocateFromCdf(cdf,tags="")  ;
>
> # import from the probe-tab file (which contains only PMs) . This will
> also
> # update the file footer.
> # AE: This generates warnings
> #================================================
> # Warning messages:
> # 1: In readDataFrame.TabularTextFile(srcFile, colClassPatterns =
> colClassPatterns,  ... :
> #  Argument 'rows' was out of range [1,199641]. Ignored rows beyond
> this range.
> # 2: In readDataFrame.TabularTextFile(srcFile, colClassPatterns =
> colClassPatterns,  ... :
> #  Argument 'rows' was out of range [1,199641]. Ignored rows beyond
> this range.
> #================================================
>
> importFrom(acs, ptb, verbose=-10) ;
>
> # For chips with MMs, infer them from the PM sequences (which will
> give an error
> # if no MMs)
> inferMmFromPm(acs, cdf=cdf, verbose=-10)  ;
>
> #================================================
> # create empty UGP file (re: http://aroma-project.org/node/43)
> #================================================
> ugp<-AromaUgpFile$allocateFromCdf(cdf,tags="")  ;
> print(ugp)  ;
>
> # Failed Attempt to create a UGP using a NetAffx file downloaded from
> # http://www.affymetrix.com/support/technical/byproduct.affx?product=hgu133
> # HG-U133A.na32.annot.csv , which I renamed as hgu133a,na32.csv
> #
> # I don't think that the file is in the right format (?)
> #
> # chipType<-"hgu133a"
> # cdf <- AffymetrixCdfFile$byChipType("hgu133a");
> # # Creates an empty UGP file for the CDF, if missing.
> # ugp <- AromaUgpFile$allocateFromCdf(cdf, tags=c("na32"));
> #
> # # Import NetAffx unit position data
> # csv <- AffymetrixNetAffxCsvFile$byChipType(chipType, tags="na32");
> # units <- importFrom(ugp, csv);
> # str(units)
>
> #================================================
> # Access CEL files in a non-standard directory using links
> #================================================
> target <- "D:/AE/AE_Haematology/Microarray/AE_microarray/cel_files/
> 133a/" ;
> path <- filePath("rawData", basename(target), "hgu133a") ;
> createLink(path, target=target) ;
>
> #================================================
> # Get cdf for chip type HG-U133a
> #================================================
> chipType="hgu133a"  ;
> cdf <- AffymetrixCdfFile$byChipType(chipType) ;
>
> #================================================
> # Get raw cel file data
> #================================================
> dataSet <- "133a" ;
> csR <- AffymetrixCelSet$byName(dataset, cdf=cdf)  ;
> setFullName(csR,dataSet)  ;
>
> #================================================
> # Get annotation files for specified dataset of chipType
> #================================================
> cdf<-AffymetrixCdfFile$byChipType(chipType) ;
> print(cdf) ;
>
> #================================================
> # check data loaded properly
> #================================================
> gi<-getGenomeInformation(cdf) ;
> print(gi) ;
>
> cs <- AffymetrixCelSet$byName(dataSet, cdf=cdf) ;
> print(cs) ;
>
>
> #================================================
> # get index of cell file (cs) using my hg133a C7.CEL
> #================================================
> cf<-getFile(cs,indexOf(cs,"C7"))
>
> #================================================
> # extract attributes, from SAF file, of the cell file to a list
> #================================================
> attrs<-getAttributes(cf)
>
> #================================================
> #================================================
> # Density plots - Here, colour coded by MRCI Group
> #================================================
> #================================================
> # Get mrcigrp attributes for cell files
> # Convert character to numeric entries (as [mi] statement does not
> work with chars)
> # assign colour by mrcigrp to a numeric vector
> #================================================
> mi<-sapply(csR,getAttribute,"mrcigrp")
> mi<-as.numeric(mi)
> micols<-c("blue","red","green","orange")[mi]
>
> #============================================
> # replace NA values with orange
> # To display density curves
> # for cel files with attribute values of ""
> #============================================
> # micols[is.na(micols)]<-"orange"
>
> #============================================
> # To exclude produce single mrcigrp group plots
> # probably a better way to do this
> #============================================
> # replace all colours with NA except for xx
> # make sure you leave at lease one of the below
> # un-commented out
>
> #mi<-sapply(csR,getAttribute,"mrcigrp")
> #mi<-as.numeric(mi)
> #micols<-c("blue","red","green","orange")[mi]
> #micols[is.na(micols)]<-"orange"
> #
> #micols[micols=="orange"]<-"NA" ;
> #micols[micols=="red"]<-"NA" ;
> #micols[micols=="blue"]<-"NA" ;
> #micols[micols=="green"]<-"NA" ;
>
> #================================================
> # Density plots - color by mrc index grouping
> #================================================
> if(interactive()) savehistory();
>
> # use a nicer palette of colours
> colors<-RColorBrewer::brewer.pal(12,"Paired");
> palette(colors);
>
> # defines the scale of the plot, but not the limits
> width <- 7;
>
> # not doing multi-probe type plots
> #types <- c("all", "pmmm", "pm", "mm");
> #cells <- lapply(types, FUN=function(type) identifyCells(cdf,
> types=type));
> #names(cells) <- types;
> #str(cells);
>
> # plot for one probe type at a time
> types <- c("pm");
>
> height <- 0.618*width;
>
> cols <- seq(along=types);
> # 1, or 1, 2 etc
> print(cols)
>
> fig <- "plotDensity,raw";
> if (!devIsOpen(fig)) {
>  devNew(width=width, height=height, label=fig);
>    for (kk in seq(along=types)) {
>      plotDensity(csR, types="pm", col=micols, subset=NULL,
>      lwd=2, ylim=c(0,1.0), add=(kk > 1));
>  }
>
>  # legend not correct
>  #legend("topleft", col=cols, lwd=4, micols);
>  title("Raw probe signals");
>  devDone();
> }
>
>
>
>
> --
> When reporting problems on aroma.affymetrix, make sure 1) to run the latest 
> version of the package, 2) to report the output of sessionInfo() and 
> traceback(), and 3) to post a complete code example.
>
>
> You received this message because you are subscribed to the Google Groups 
> "aroma.affymetrix" group with website http://www.aroma-project.org/.
> To post to this group, send email to aroma-affymetrix@googlegroups.com
> To unsubscribe and other options, go to http://www.aroma-project.org/forum/
>

-- 
When reporting problems on aroma.affymetrix, make sure 1) to run the latest 
version of the package, 2) to report the output of sessionInfo() and 
traceback(), and 3) to post a complete code example.


You received this message because you are subscribed to the Google Groups 
"aroma.affymetrix" group with website http://www.aroma-project.org/.
To post to this group, send email to aroma-affymetrix@googlegroups.com
To unsubscribe and other options, go to http://www.aroma-project.org/forum/

Re: [aroma.affymetrix] Identify outlier density curves by cel file ID, on density plots?

Reply via email to