I am attempting to replicate what Cluster 3.0 and Treeview (both by Mike Eisen) to cluster both microarray genes and arrays does using R with hclust. I basically utilized the plot.mat function in sma library with some layout() and hclust().
1. Can I know if some has already written such a function or available in some package. 2. If not, would appreciate if someone could take the time to test the function as I only had time to test on a few datasets. Any feed back is much appreciated. The function is called ecluster.fn and I have also attached a simple example on how to use it. So far my code works (attached) fine. The only problem is that subtree are ordered in terms of the 'tightest' leftmost (according to the details section of hclust) . So sometimes I have blocks of red, green and then red (again) although the branches can be mannually reorder ed so that most of the reds are to one side and most of the reds. This gets a bit more complicated when rows are also to be ordered. I don't time to manually reorder every dataset and want to automate the process. 3. How do we create a criterion to arrange the subtrees in such a way to show most visual distinction. Sorry if this is a bit vague. Thank you very much. Regards, Adai.
############## ## Introduction## ############## # clab.col and rlab.col are colours to identify the arrays or genes in the dataset. This can be used for checking # the correlation between some suspected factors (such as ER or cell-cycle regulated genes etc) against the # # clustered dendrogram. # 'key.txt' are the key label according to the increasing values of clab.col. If clab.col={1,2} representing ER # # negative and ER positive, then perhaps key.txt=c("ER negative", "ER positive"). ################################################################################### library(sma); library(amap) source(file="ecluster.fn.R") # An artificial example. fake.data <- matrix(rnorm(4000), ncol=40) # 100 genes and 40 arrays/tumors ER <- sample( c( rep(-1,25), rep(1, 15))) # suppose 25 of the tumors are ER negative and 15 ER positive fake.data[,which(ER==1)] <- fake.data[,which(ER==1)] + runif(150,-0.5, 3) colnames(fake.data)[which(ER==1)] <- LETTERS[1:15] # ER positive have capital letters colnames(fake.data)[which(ER==-1)] <- letters[1:25] rownames(fake.data) <- paste("Gene", 1:nrow(fake.data), sep="") ER.col <- ifelse(ER==-1, 1, 2) # So that ER positive labels are red ER.txt <- c("ER negative", "ER positive") # Usually the data is row z-scored before inputting in Cluster 3.0 and but assume our fake.data is already z-scored # ecluster.fn(data=fake.data, "pearson", "complete", key.txt=ER.txt, clab.col=ER.col) ecluster.fn(data=fake.data, "pearson", "average", key.txt=ER.txt, clab.col=ER.col) # You might have noticed that sometimes the cluster need to be rearranged ######################### ## Identified problems so far## ######################### # Problems with too many row labels being squeezed when large number of genes (> 100) are involved # Need inclusion of title which can be used for the plot.mat # Maybe need some rearranging within array dendrograms to make the results more correlated with visual # # output.
ecluster.fn <- function(data, distance.method, linkage.method, key.txt=NULL, clab=NULL, rlab=NULL, clab.col=NULL, rlab.col=NULL, gamma=0.75){ array.hclust <- hclust( dist(t(data),distance.method), linkage.method) # cluster the arrays/columns genes.hclust <- hclust( dist(data, distance.method), linkage.method) # cluster the genes/rows genes.dend <- as.dendrogram(genes.hclust) data.re <- data[ rev(genes.hclust$order), array.hclust$order ] # reorder the data def.par <- par() nf <- layout(matrix(c(4,3,2,1),2,2,byrow=TRUE), c(1,3), c(1,3), TRUE); layout.show(nf) if(is.null(rlab)) rlab <- dimnames(data.re)[[1]] else rlab <- rlab[rev(genes.hclust$order)] if(is.null(clab)) clab <- dimnames(data.re)[[2]] else clab <- clab[array.hclust$order] if(is.null(rlab.col)) rlab.col <- 1 else rlab.col <- rlab.col[rev(genes.hclust$order)] if(is.null(clab.col)) clab.col <- 1 else clab.col <- clab.col[array.hclust$order] par(mar=c(1,1,1,1)) plot.mat(data.re, nrgcol=100, rlabels=rlab, clabels=clab, rcols=rlab.col, ccols= clab.col, gamma=gamma) par(mar=c(0,3,0.25,2)) plot(genes.dend, horiz=T, ann=F, axes=F) par(mar=c(1,0.25,1,0.25)) plclust(array.hclust, axes=F, ann=F, label=F, hang=-1) par(mar=c(0,0.25,1,0.25)) plot.new() text(0.25, 0, paste(dim(data.re)[1], "genes used."), cex=0.8, adj=c(0,0)) # Number of genes used if(is.null(key.txt)==F){ for(i in 1:length(key.txt)){ text(0.25, 0.125*i, key.txt[i], col=i, cex=0.8, adj=c(0,0))}} }