Salut Emilie,

On Thu, Jul 1, 2010 at 10:13 AM, EmilieT <temilie...@gmail.com> wrote:
> Hello,
>
> I am using your R framework with a set of Affymetrix SNP 6 data and I
> have a problem with the extractDataFrame function.
> The result is an incomplete matrix with row duplication.
>
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> x86_64-apple-darwin9.8.0
>
> locale:
> [1] fr_FR.UTF-8/fr_FR.UTF-8/C/C/fr_FR.UTF-8/fr_FR.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] aroma.cn_0.5.0 aroma.affymetrix_1.6.0 aroma.apd_0.1.7 affxparser_1.20.0 R.huge_0.2.0
> [6] aroma.core_1.6.0 matrixStats_0.2.1 R.rsp_0.3.6 R.cache_0.3.0 R.filesets_0.8.2
> [11] digest_0.4.2 R.utils_1.4.0 R.oo_1.7.2 aroma.light_1.16.0 R.methodsS3_1.2.0
>
> I use the standard doCRMAv2 function:
>
> ds <- doCRMAv2("data", chipType="GenomeWideSNP_6", combineAlleles=FALSE);
>
>> ds
> $total
> AromaUnitTotalCnBinarySet:
> Name: data
> Tags: ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY
> Full name: data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY
> Number of files: 14
> Names: A,B, ..., C [14]
> Path (to the first file): totalAndFracBData/data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY/GenomeWideSNP_6
> Total file size: 99.13 MB
> RAM: 0.02MB
>
> $fracB
> AromaUnitFracBCnBinarySet:
> Name: data
> Tags: ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY
> Full name: data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY
> Number of files: 14
> Names: A,B, ..., C [14]
> Path (to the first file): totalAndFracBData/data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY/GenomeWideSNP_6
> Total file size: 99.13 MB
> RAM: 0.02MB
>
> It seems to be impossible to use this 'ds' object (or ds$fracB or
> ds$total) as input to the extractDataFrame() function.

Yes: this is because extractDataFrame is meant to extract *chip effects*
(http://aroma-project.org/howtos/extractDataFrame), which in your case
would be total and allele-specific *intensities*. Your ds$total and
ds$fracB are already one step further in the analysis: they are
AromaUnit*CnBinaryFile:s. For these you can use writeDataFrame
(http://aroma-project.org/howtos/writeDataFrame), as you seem to be
doing below.

> So I must do:
>
>> rootPath <- "totalAndFracBData"
>> dataSet <- "data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY"
>> ds <- AromaUnitFracBCnBinarySet$byName(dataSet, chipType="GenomeWideSNP_6", paths=rootPath);
>> ds
> AromaUnitFracBCnBinarySet:
> Name: data
> Tags: ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY
> Full name: data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY
> Number of files: 14
> Names: A,B, ..., C [14]
> Path (to the first file): totalAndFracBData/data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY/GenomeWideSNP_6
> Total file size: 99.13 MB
> RAM: 0.02MB

You don't really need to do this: your new 'ds' is exactly your
previous 'ds$fracB' (more on this below).

> When I use the extractDataFrame function, I obtain the following
> object:

Below you are using writeDataFrame, not extractDataFrame, right?

>> dfTxt <- writeDataFrame(ds, columns=c("unitName", "chromosome", "position", "*"))
>> d <- readDataFrame(dfTxt)
>> str(d)
> 'data.frame': 1857154 obs. of 17 variables:
> $ unitName : Factor w/ 71429 levels "AFFX-5Q-123",..: 1 2 3 4 487 490 493 496 499 502 ...
> $ chromosome : int NA NA NA NA NA NA NA NA NA NA ...
> $ position : int NA NA NA NA NA NA NA NA NA NA ...
> $ A,fracB : num NA NA NA NA NA NA NA NA NA NA ...
> $ B,fracB : num NA NA NA NA NA NA NA NA NA NA ...
> $ C,fracB : num NA NA NA NA NA NA NA NA NA NA ...
> $ ...
>
> First of all, you can see that there are only the fracB columns. The
> first "ds" object had a "total" item, it seems to have been lost.
> The directory
> /totalAndFracBData/data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY/GenomeWideSNP_6
> also contains the ....,total.asb files. There is maybe a problem with
> my new 'ds' object (which refers to only 14 files).

Yes, this is expected because your new 'ds' has been created using

  ds <- AromaUnitFracBCnBinarySet$byName(dataSet, chipType="GenomeWideSNP_6", paths=rootPath);

As the "FracB" indicates, this 'ds' only contains allele B fractions.
You can do

  totalDs <- AromaUnitTotalCnBinarySet$byName(dataSet, chipType="GenomeWideSNP_6", paths=rootPath);

to get the corresponding total CN data.

> There is also a problem of row duplication: you can see that the
> number of rows is the same as Affymetrix SNP 6 number of units (so the
> result seems to be good).

Well, I've tried to reproduce what you have and I'm getting 2000000 rows:

> str(d);
'data.frame': 2000000 obs. of 5 variables:
 $ unitName : Factor w/ 500000 levels "AFFX-5Q-123",..: 1 2 3 4 487 490 493 496 499 502 ...
 $ chromosome : int NA NA NA NA NA NA NA NA NA NA ...
 $ position : int NA NA NA NA NA NA NA NA NA NA ...
 $ STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E03_238454,fracB: num NA NA NA NA NA NA NA NA NA NA ...
 $ STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E04_238456,fracB: num NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, "fileHeader")=List of 6
  ..$ comments: chr "# name: TumorBoostPaper" "# tags: pairs,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY" "# fullName: TumorBoostPaper,pairs,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY" "# nbrOfFiles: 2" ...
  ..$ sep : chr "\t"
  ..$ quote : chr "\""
  ..$ skip : num 0
  ..$ topRows :List of 10
  .. ..$ : chr "unitName" "chromosome" "position" "STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E03_238454,fracB" ...
  .. ..$ : chr "AFFX-5Q-123" "NA" "NA" "NA" ...
  .. ..$ : chr "AFFX-5Q-456" "NA" "NA" "NA" ...
  .. ..$ : chr "AFFX-5Q-789" "NA" "NA" "NA" ...
  .. ..$ : chr "AFFX-5Q-ABC" "NA" "NA" "NA" ...
  .. ..$ : chr "AFR_A02_SB" "NA" "NA" "NA" ...
  .. ..$ : chr "AFR_A04_SB" "NA" "NA" "NA" ...
  .. ..$ : chr "AFR_A06_SB" "NA" "NA" "NA" ...
  .. ..$ : chr "AFR_A08_SB" "NA" "NA" "NA" ...
  .. ..$ : chr "AFR_A10_SB" "NA" "NA" "NA" ...
  ..$ columns : chr "unitName" "chromosome" "position" "STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E03_238454,fracB" ...

> But there are only 71429 unique unitNames. In
> fact, there are only 71429 unique rows:
>
>> str(unique(d))
> 'data.frame': 71429 obs. of 17 variables:
> $ unitName : Factor w/ 71429 levels "AFFX-5Q-123",..: 1 2 3 4 487 490 493 496 499 502 ...
> $ chromosome : int NA NA NA NA NA NA NA NA NA NA ...
> $ position : int NA NA NA NA NA NA NA NA NA NA ...
> $ A,fracB : num NA NA NA NA NA NA NA NA NA NA ...
> $ B,fracB : num NA NA NA NA NA NA NA NA NA NA ...
> $ C,fracB : num NA NA NA NA NA NA NA NA NA NA ...
> $ ...
>
> Each row seems to be duplicated 26 times:
>> unique(table(d$unitName))
> [1] 26

I can't reproduce this. Here is what I get:

> str(unique(d))
'data.frame': 500000 obs. of 5 variables:
 $ unitName : Factor w/ 500000 levels "AFFX-5Q-123",..: 1 2 3 4 487 490 493 496 499 502 ...
 $ chromosome : int NA NA NA NA NA NA NA NA NA NA ...
 $ position : int NA NA NA NA NA NA NA NA NA NA ...
 $ STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E03_238454,fracB: num NA NA NA NA NA NA NA NA NA NA ...
 $ STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E04_238456,fracB: num NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, "fileHeader")=List of 6
  ..$ comments: chr "# name: TumorBoostPaper" "# tags: pairs,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY" "# fullName: TumorBoostPaper,pairs,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY" "# nbrOfFiles: 2" ...
  ..$ sep : chr "\t"
  ..$ quote : chr "\""
  ..$ skip : num 0
  ..$ topRows :List of 10
  .. ..$ : chr "unitName" "chromosome" "position" "STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E03_238454,fracB" ...
  .. ..$ : chr "AFFX-5Q-123" "NA" "NA" "NA" ...
  .. ..$ : chr "AFFX-5Q-456" "NA" "NA" "NA" ...
  .. ..$ : chr "AFFX-5Q-789" "NA" "NA" "NA" ...
  .. ..$ : chr "AFFX-5Q-ABC" "NA" "NA" "NA" ...
  .. ..$ : chr "AFR_A02_SB" "NA" "NA" "NA" ...
  .. ..$ : chr "AFR_A04_SB" "NA" "NA" "NA" ...
  .. ..$ : chr "AFR_A06_SB" "NA" "NA" "NA" ...
  .. ..$ : chr "AFR_A08_SB" "NA" "NA" "NA" ...
  .. ..$ : chr "AFR_A10_SB" "NA" "NA" "NA" ...
  ..$ columns : chr "unitName" "chromosome" "position" "STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E03_238454,fracB" ...

> I use the extractDataFrame function on the ugp object and it seems to
> work so my ugp file is probably correct.

What exactly have you done here?

> I also notice that the 71429 unitNames of the 'd' object are the first
> 71429 lines of my ugp matrix.

Can you delete the txt file and (re)do

  ds <- doCRMAv2("data", chipType="GenomeWideSNP_6", combineAlleles=FALSE);
  dfTxt <- writeDataFrame(ds$fracB, columns=c("unitName", "chromosome", "position", "*"))
  d <- readDataFrame(dfTxt)

? Do you still have the same problem?

Pierre

> I hope you can help me out. Thank you
>
> --
> When reporting problems on aroma.affymetrix, make sure 1) to run the latest
> version of the package, 2) to report the output of sessionInfo() and
> traceback(), and 3) to post a complete code example.
>
> You received this message because you are subscribed to the Google Groups
> "aroma.affymetrix" group with website http://www.aroma-project.org/.
> To post to this group, send email to aroma-affymetrix@googlegroups.com
> To unsubscribe and other options, go to http://www.aroma-project.org/forum/
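
For reference, the duplication check used in this thread (unique(table(d$unitName)) together with unique(d)) can be tried on a toy data frame in base R. This is only a minimal sketch: the unit names, the values, and the variable names copiesPerUnit, nbrOfDuplicates and dUnique are made up for illustration, and nothing here touches aroma.affymetrix or the real data set. The numbers reported above are at least consistent with each other, since 26 x 71429 = 1857154.

```r
# Toy stand-in for the data frame returned by readDataFrame();
# the unit names and values are hypothetical.
d <- data.frame(
  unitName = rep(c("unit1", "unit2", "unit3"), times = 4),
  fracB    = rep(c(0.1, 0.5, 0.9), times = 4),
  stringsAsFactors = TRUE
)

# How many times does each unit name occur? A single value here means
# every unit is repeated the same number of times.
copiesPerUnit <- unique(as.vector(table(d$unitName)))

# Number of rows that are exact repeats of an earlier row.
nbrOfDuplicates <- sum(duplicated(d))

# Keep only the first occurrence of each distinct row.
dUnique <- unique(d)

print(copiesPerUnit)
print(nbrOfDuplicates)
str(dUnique)
```

If every unit name occurs the same number of times and unique(d) has exactly one row per unit name, then whole rows are being repeated, rather than units carrying several distinct records.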