I am attempting to replicate what Cluster 3.0 and Treeview (both by Mike
Eisen) to cluster both microarray genes and arrays does using R with
hclust. I basically utilized the plot.mat function in sma library with
some layout() and hclust().

1. Can I know if some has already written such a function or available
in some package.

2. If not, would appreciate if someone could take the time to test the
function as I only had time to test on a few datasets. Any feed back is
much appreciated. The function is called ecluster.fn and I have also
attached a simple example on how to use it. 

So far my code works (attached) fine. The only problem is that subtree
are ordered in terms of the 'tightest' leftmost (according to the
details section of hclust) . So sometimes I have blocks of red, green
and then red (again) although the branches can be mannually reorder ed
so that most of the reds are to one side and most of the reds. This gets
a bit more complicated when rows are also to be ordered. I don't time to
manually reorder every dataset and want to automate the process.

3. How do we create a criterion to arrange the subtrees in such a way to
show most visual distinction. Sorry if this is a bit vague.

Thank you very much.

Regards, Adai.

##############
## Introduction##
##############

# clab.col and rlab.col are colours to identify the arrays or genes in the dataset. 
This can be used for checking # the correlation between some suspected factors (such 
as ER or cell-cycle regulated genes etc) against the # #  clustered dendrogram.

# 'key.txt' are the key label according to the increasing values of clab.col. If 
clab.col={1,2} representing ER # #   negative and ER positive, then perhaps 
key.txt=c("ER negative", "ER positive").

###################################################################################

library(sma); library(amap)
source(file="ecluster.fn.R")

# An artificial example.

fake.data <- matrix(rnorm(4000), ncol=40)      # 100 genes and 40 arrays/tumors
ER <- sample( c( rep(-1,25), rep(1, 15))) # suppose 25 of the tumors are ER negative 
and 15 ER positive
fake.data[,which(ER==1)] <- fake.data[,which(ER==1)] + runif(150,-0.5, 3)

colnames(fake.data)[which(ER==1)]  <- LETTERS[1:15] # ER positive have capital letters
colnames(fake.data)[which(ER==-1)] <- letters[1:25]
rownames(fake.data) <- paste("Gene", 1:nrow(fake.data), sep="")
ER.col <- ifelse(ER==-1, 1, 2)                 # So that ER positive labels are red
ER.txt <- c("ER negative", "ER positive")

# Usually the data is row z-scored before inputting in Cluster 3.0 and but assume our 
fake.data is already z-scored #

ecluster.fn(data=fake.data, "pearson", "complete", key.txt=ER.txt, clab.col=ER.col)
ecluster.fn(data=fake.data, "pearson", "average", key.txt=ER.txt, clab.col=ER.col) 

# You might have noticed that sometimes the cluster need to be rearranged

#########################
## Identified problems so far##
#########################

# Problems with too many row labels being squeezed when large number of genes (> 100) 
are involved
# Need inclusion of title which can be used for the plot.mat
#  Maybe need some rearranging within array dendrograms to make the results more 
correlated with visual #  #   output.
ecluster.fn <- function(data, distance.method, linkage.method, key.txt=NULL,
                        clab=NULL, rlab=NULL, clab.col=NULL, rlab.col=NULL, 
gamma=0.75){

  array.hclust <- hclust( dist(t(data),distance.method), linkage.method) # cluster the 
arrays/columns
  genes.hclust <- hclust( dist(data, distance.method), linkage.method)   # cluster the 
genes/rows
  genes.dend <- as.dendrogram(genes.hclust)

  data.re <- data[ rev(genes.hclust$order), array.hclust$order ] # reorder the data
  def.par <- par()
  nf <- layout(matrix(c(4,3,2,1),2,2,byrow=TRUE), c(1,3), c(1,3), TRUE); 
layout.show(nf)

  if(is.null(rlab))     rlab <- dimnames(data.re)[[1]] else rlab <- 
rlab[rev(genes.hclust$order)]
  if(is.null(clab))     clab <- dimnames(data.re)[[2]] else clab <- 
clab[array.hclust$order]
  if(is.null(rlab.col)) rlab.col <- 1 else rlab.col <- 
rlab.col[rev(genes.hclust$order)]
  if(is.null(clab.col)) clab.col <- 1 else clab.col <- clab.col[array.hclust$order]
 
  par(mar=c(1,1,1,1))
  plot.mat(data.re, nrgcol=100, rlabels=rlab, clabels=clab, rcols=rlab.col, ccols= 
clab.col, gamma=gamma)

  par(mar=c(0,3,0.25,2))
  plot(genes.dend, horiz=T, ann=F, axes=F)

  par(mar=c(1,0.25,1,0.25))
  plclust(array.hclust, axes=F, ann=F, label=F, hang=-1)

  par(mar=c(0,0.25,1,0.25))
  plot.new()
  text(0.25, 0, paste(dim(data.re)[1], "genes used."), cex=0.8, adj=c(0,0)) # Number 
of genes used
  if(is.null(key.txt)==F){
    for(i in 1:length(key.txt)){
      text(0.25, 0.125*i, key.txt[i], col=i, cex=0.8, adj=c(0,0))}}
}
 

Reply via email to