Re: [Bioc-devel] ExperimentHub::GSE62944 outdated

2016-06-04 Thread Marini, Federico
Hi Valerie,

This is true. I also did the same thing for the normal samples, likewise 
already as a SummarizedExperiment.

I can check my scripts once I am back in the office, if you want to use them 
as a starting point.

Cheers,
Federico


From: Obenchain, Valerie <valerie.obench...@roswellpark.org>
Sent: Friday, June 3, 2016 5:28 PM
To: Ludwig Geistlinger; bioc-devel@r-project.org
Cc: Marini, Federico; Sonali Arora
Subject: Re: [Bioc-devel] ExperimentHub::GSE62944 outdated

Hi Ludwig and Federico,

Yes, we plan to update these data in the next couple of weeks.

Sonali mentioned that the current data only include the tumor samples
and she'd like to add the normals. The new data will likely be added as
SummarizedExperiment objects instead of ExpressionSets.

Valerie
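
A minimal sketch of what a combined tumor/normal object could look like as a
SummarizedExperiment. The matrices, sample names, and the `type` column name
are made up for illustration; this is not the actual ExperimentHub recipe.

```r
library(SummarizedExperiment)

# Toy count matrices (made-up values): genes in rows, samples in columns
tumor <- matrix(c(10L, 3L, 0L, 7L), nrow = 2,
                dimnames = list(c("g1", "g2"), c("T1", "T2")))
normal <- matrix(c(4L, 8L), nrow = 2,
                 dimnames = list(c("g1", "g2"), "N1"))

# Combine the samples and record where each one came from
counts <- cbind(tumor, normal)
sample_info <- DataFrame(
  type = factor(rep(c("tumor", "normal"), c(ncol(tumor), ncol(normal)))),
  row.names = colnames(counts))

se <- SummarizedExperiment(assays = list(counts = counts),
                           colData = sample_info)

# The annotation column can then be used to subset the samples
se_tumor <- se[, se$type == "tumor"]
```

Keeping both sample classes in one object with a `type` column in `colData`
means downstream code can filter or model on it directly.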


On 06/03/2016 04:57 AM, Ludwig Geistlinger wrote:
> FYI
>
> That works for me, but maybe this is also of interest to others, so I
> wonder if somebody from the Bioc annotation/experiment team (Sonali,
> Valerie, Martin?) could update this accordingly for ExperimentHub?
>
> Best,
> Ludwig
>




___
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel


Re: [Bioc-devel] ExperimentHub::GSE62944 outdated

2016-06-03 Thread Ludwig Geistlinger
FYI

That works for me, but maybe this is also of interest to others, so I
wonder if somebody from the Bioc annotation/experiment team (Sonali,
Valerie, Martin?) could update this accordingly for ExperimentHub?

Best,
Ludwig

-- 
Dr. Ludwig Geistlinger

Lehr- und Forschungseinheit für Bioinformatik
Institut für Informatik
Ludwig-Maximilians-Universität München
Amalienstrasse 17, 2. Stock, Büro A201
80333 München

Tel.: 089-2180-4067
eMail: ludwig.geistlin...@bio.ifi.lmu.de

> Hi Ludwig,
>
> In November I sent the updated recipe to Martin, but I think it has not
> been updated yet.
>
> Anyway, you can do it yourself with the code below:
>
> library("GEOquery")
> library("Biobase")
>
> suppl <- GEOquery::getGEOSuppFiles("GSE62944")
>
> setwd("GSE62944")
>
> clinvar <- read.delim(
>   "GSE62944_06_01_15_TCGA_24_548_Clinical_Variables_9264_Samples.txt.gz")
> clinvar2 <- t(clinvar)
>
> # add variable names
> colnames(clinvar2) <- clinvar2[1,]
> # and remove the 2nd abbreviation, with the CDE_ID too
> clinvar3 <- clinvar2[-c(1:3),]
>
> # substitute dots with dashes in the ids, to be consistent with
> # previous object
> clinvar4 <- clinvar3
> rownames(clinvar4) <- gsub("\\.","-",rownames(clinvar3))
> clinvar4 <- as.data.frame(clinvar4)
>
> CancerType <-
> read.delim("GSE62944_06_01_15_TCGA_24_CancerType_Samples.txt.gz",
>   header=FALSE, colClasses=c("character", "factor"),
>   col.names=c("sample", "type"))
> idx <- match(rownames(clinvar4), CancerType$sample)
> # these are already nicely sorted
> clinvar4$CancerType <- CancerType$type[idx]
>
>
> countFile <-
> "GSM1536837_06_01_15_TCGA_24.tumor_Rsubread_FeatureCounts.txt.gz"
> untar("GSE62944_RAW.tar", countFile)
>
> counts <- local({
>   data <- scan(countFile, what=character(), sep="\t", quote="")
>   m <- matrix(data, 9265)
>   dimnames(m) <- list(m[,1], m[1,])
>   m <- t(m[-1, -1])
>   mode(m) <- "integer"
>   m
> })
>
> # just to be sure: they are all there, but not correctly sorted
> gplots::venn(list(colnames(counts),rownames(clinvar4)))
> head(colnames(counts))
> head(rownames(clinvar4))
>
> # re-sorting according to the counts object
> cl5 <- clinvar4[match(colnames(counts),rownames(clinvar4)),]
> head(rownames(cl5),20)
> head(colnames(counts),20)
>
> # as in your example
> eset_new <- Biobase::ExpressionSet(counts, AnnotatedDataFrame(cl5))
>
> # or as SummarizedExperiment (the constructor is provided by the
> # SummarizedExperiment package)
> library("SummarizedExperiment")
> se <- SummarizedExperiment(assays=list(counts))
> colData(se) <- S4Vectors::DataFrame(cl5)
>
> # data exploration to see how samples are related to each other
> library("DESeq2")
> ddsTCGA <- DESeqDataSet(se,design=~CancerType)
>
> ddsTCGA <- estimateSizeFactors(ddsTCGA)
> log2tcga <- log2(1+counts(ddsTCGA,normalized=TRUE))
> se_log2tcga <- SummarizedExperiment(assays=list(log2tcga))
> # the rlog transform takes a very long time, so just a quick and dirty check
> colData(se_log2tcga) <- colData(ddsTCGA)
>
> # customized principal components
> pca_d4 <- function (x, intgroup = "condition", ntop = 500,
>                     returnData = FALSE, title = NULL,
>                     pcX = 1, pcY = 2, text_labels = TRUE, point_size = 3)
> {
>   library("DESeq2")
>   library("genefilter")
>   library("ggplot2")
>   # keep the ntop most variable rows
>   rv <- rowVars(assay(x))
>   select <- order(rv, decreasing = TRUE)[seq_len(min(ntop, length(rv)))]
>   pca <- prcomp(t(assay(x)[select, ]))
>   percentVar <- pca$sdev^2/sum(pca$sdev^2)
>
>   intgroup.df <- as.data.frame(colData(x)[, intgroup, drop = FALSE])
>   group <- factor(apply(intgroup.df, 1, paste, collapse = " : "))
>   d <- data.frame(PC1 = pca$x[, pcX], PC2 = pca$x[, pcY], group = group,
>                   intgroup.df, names = colnames(x))
>   colnames(d)[1] <- paste0("PC", pcX)
>   colnames(d)[2] <- paste0("PC", pcY)
>   if (returnData) {
>     # report the variance explained by the two selected components
>     attr(d, "percentVar") <- percentVar[c(pcX, pcY)]
>     return(d)
>   }
>   # clever way of positioning the labels
>   # (1 + varname.adjust * sign(PC1))/2)
>   d$hjust <- ifelse(sign(d[, paste0("PC", pcX)]) == 1, 0.9, 0.1)
>   g <- ggplot(data = d, aes_string(x = paste0("PC", pcX),
>                                    y = paste0("PC", pcY),
>                                    color = "group")) +
>     geom_point(size = point_size) +
>     xlab(paste0("PC", pcX, ": ",
>                 round(percentVar[pcX] * 100, digits = 2), "% variance")) +
>     ylab(paste0("PC", pcY, ": ",
>                 round(percentVar[pcY] * 100, digits = 2), "% variance"))
>   if (text_labels) g <- g + geom_text(
>     mapping = aes(label = names, hjust = hjust, vjust = -0.5),
>     show.legend = FALSE)
>   if (!is.null(title)) g <- g + ggtitle(title)
>   g
> }
>
> pdf("allTCGA_diy.pdf",height=30,width=30)
> 
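
The reading and re-sorting steps in the quoted recipe can be tried on toy data
first. The sketch below is only an illustration under stated assumptions: the
file contents, sample ids, and cancer types are made up, and the field count
(4) replaces the 9265 used for the real file.

```r
# Toy tab-delimited count table: one header line, then one line per gene
tmp <- tempfile(fileext = ".txt")
writeLines(c("gene\tS-1\tS-2\tS-3",
             "g1\t10\t0\t5",
             "g2\t3\t7\t1"), tmp)

# scan() returns all fields as one vector, read line by line
fields <- scan(tmp, what = character(), sep = "\t", quote = "", quiet = TRUE)

# 4 fields per line, so each column of m holds one line of the file
m <- matrix(fields, 4)
dimnames(m) <- list(m[, 1], m[1, ])  # sample ids from header, gene ids from column 1
counts <- t(m[-1, -1])               # drop the label row/column, genes in rows
mode(counts) <- "integer"

# Toy annotation with dots in the ids and a different sample order
annot <- data.frame(CancerType = c("BRCA", "LUAD", "ACC"),
                    row.names = c("S.3", "S.1", "S.2"))
rownames(annot) <- gsub("\\.", "-", rownames(annot))

# Re-sort the annotation so that its rows follow the columns of counts
idx <- match(colnames(counts), rownames(annot))
stopifnot(!anyNA(idx))
annot <- annot[idx, , drop = FALSE]
stopifnot(identical(rownames(annot), colnames(counts)))
```

Checking `identical()` on the ids after re-sorting is a cheap guard before
building an ExpressionSet or SummarizedExperiment from the two pieces.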

[Bioc-devel] ExperimentHub::GSE62944 outdated

2016-06-02 Thread Ludwig Geistlinger
Hi,

I would like to do some analysis on the TCGA data as provided in
ExperimentHub's GSE62944 ExpressionSet.

The Description of the dataset reads:

"TCGA re-processed RNA-Seq data from 9264 Tumor Samples and 741 normal
samples across 24 cancer types"

However, when loading the dataset via

> library(ExperimentHub)
> eh <- ExperimentHub()
> query(eh , "GSE62944")
> tcga_data <- eh[["EH1"]]

and counting the samples

> dim(tcga_data)
Features  Samples
   23368     7706

as well as the cancer types

> length(table(pData(tcga_data)[,"CancerType"]))

I observe discrepancies with the above description,
indicating that this is an outdated version of the dataset.

Is it possible to

(1) update it accordingly, and
(2) include a varLabel, i.e. a pData column indicating whether a sample is a
tumor or an adjacent normal sample for the respective cancer type?

That would be great!

Thx & Best,
Ludwig

-- 
Dr. Ludwig Geistlinger

Lehr- und Forschungseinheit für Bioinformatik
Institut für Informatik
Ludwig-Maximilians-Universität München
Amalienstrasse 17, 2. Stock, Büro A201
80333 München

Tel.: 089-2180-4067
eMail: ludwig.geistlin...@bio.ifi.lmu.de
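
A minimal sketch of how such a tumor/normal column could be derived, assuming
the sample names are full TCGA barcodes. In the TCGA barcode scheme the fourth
dash-separated field starts with a two-digit sample type code (01 to 09 tumor,
10 to 19 normal, 20 to 29 control). The helper name `tcga_sample_class` and the
example barcodes are hypothetical.

```r
# Hypothetical helper, not part of any package
tcga_sample_class <- function(barcodes) {
  # Fourth dash-separated field, e.g. "01A": sample type code plus vial letter
  field4 <- vapply(strsplit(barcodes, "-", fixed = TRUE),
                   function(parts) parts[4], character(1))
  code <- as.integer(substr(field4, 1, 2))
  cls <- ifelse(code <= 9, "tumor",
                ifelse(code <= 19, "normal", "control"))
  factor(cls, levels = c("tumor", "normal", "control"))
}

# Made-up barcodes for illustration only
barcodes <- c("TCGA-AA-0001-01A-11R-0000-07",
              "TCGA-AA-0001-11A-11R-0000-07")
sample_class <- tcga_sample_class(barcodes)
```

With an ExpressionSet, the result could be stored as an extra pData column,
for example `pData(eset)$SampleClass <- tcga_sample_class(sampleNames(eset))`,
where `eset` and `SampleClass` are again placeholder names.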
