Re: [aroma.affymetrix] uncomplete extractDataFrame()

Pierre Neuvial Fri, 02 Jul 2010 01:40:53 -0700

Hi Emilie,

On Thu, Jul 1, 2010 at 10:13 AM, EmilieT <temilie...@gmail.com> wrote:
> Hello,
>
> I am using your R framework with a set of Affymetrix SNP 6 data and I
> have a problem with the extractDataFrame function.
> The result is an incomplete matrix with row duplication.
>
>> sessionInfo()
> R version 2.11.1 (2010-05-31)
> x86_64-apple-darwin9.8.0
>
> locale:
> [1] fr_FR.UTF-8/fr_FR.UTF-8/C/C/fr_FR.UTF-8/fr_FR.UTF-8
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>  [1] aroma.cn_0.5.0         aroma.affymetrix_1.6.0 aroma.apd_0.1.7        affxparser_1.20.0      R.huge_0.2.0
>  [6] aroma.core_1.6.0       matrixStats_0.2.1      R.rsp_0.3.6            R.cache_0.3.0          R.filesets_0.8.2
> [11] digest_0.4.2           R.utils_1.4.0          R.oo_1.7.2             aroma.light_1.16.0     R.methodsS3_1.2.0
>
> I use the standard doCRMAv2 function:
>> ds <- doCRMAv2("data", chipType="GenomeWideSNP_6", combineAlleles=FALSE);
>
>> ds
> $total
> AromaUnitTotalCnBinarySet:
> Name: data
> Tags: ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY
> Full name: data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY
> Number of files: 14
> Names: A,B, ..., C [14]
> Path (to the first file): totalAndFracBData/data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY/GenomeWideSNP_6
> Total file size: 99.13 MB
> RAM: 0.02MB
>
> $fracB
> AromaUnitFracBCnBinarySet:
> Name: data
> Tags: ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY
> Full name: data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY
> Number of files: 14
> Names: A,B, ..., C [14]
> Path (to the first file): totalAndFracBData/data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY/GenomeWideSNP_6
> Total file size: 99.13 MB
> RAM: 0.02MB
>
> It seems to be impossible to use this 'ds' object (or ds$fracB or
> ds$total) as input to the extractDataFrame() function.


Yes: this is because extractDataFrame is meant to extract *chip
effects* (http://aroma-project.org/howtos/extractDataFrame), which in
your case are total and allele-specific *intensities*. Your ds$total
and ds$fracB are already one step further in the analysis: they are
AromaUnit*CnBinaryFile:s.  For these you can use writeDataFrame
(http://aroma-project.org/howtos/writeDataFrame), as you seem to be
doing below.

> So I must do :
>
>> rootPath <- "totalAndFracBData"
>> dataSet <- "data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY"
>> ds <- AromaUnitFracBCnBinarySet$byName(dataSet, chipType="GenomeWideSNP_6", paths=rootPath);
>> ds
> AromaUnitFracBCnBinarySet:
> Name: data
> Tags: ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY
> Full name: data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY
> Number of files: 14
> Names: A,B, ..., C [14]
> Path (to the first file): totalAndFracBData/data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY/GenomeWideSNP_6
> Total file size: 99.13 MB
> RAM: 0.02MB

You don't really need to do this: your new 'ds' is exactly your previous
'ds$fracB' (more on this below).

>
> When I use the extractDataFrame function, I obtain the following
> object:

Below you are using writeDataFrame, not extractDataFrame, right?

>
>> dfTxt <- writeDataFrame(ds, columns=c("unitName", "chromosome", "position", "*"))
>> d <- readDataFrame(dfTxt)
>> str(d)
> 'data.frame':   1857154 obs. of  17 variables:
>  $ unitName  : Factor w/ 71429 levels "AFFX-5Q-123",..: 1 2 3 4 487 490 493 496 499 502 ...
>  $ chromosome: int  NA NA NA NA NA NA NA NA NA NA ...
>  $ position  : int  NA NA NA NA NA NA NA NA NA NA ...
>  $ A,fracB   : num  NA NA NA NA NA NA NA NA NA NA ...
>  $ B,fracB   : num  NA NA NA NA NA NA NA NA NA NA ...
>  $ C,fracB   : num  NA NA NA NA NA NA NA NA NA NA ...
>  $ ...
>
> First of all, you can see that there are only fracB columns. The
> first "ds" object had a "total" item, which seems to have been lost. The
> directory
> /totalAndFracBData/data,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY/GenomeWideSNP_6
> also contains the ....,total.asb files. Maybe there is a problem with
> my new 'ds' object (which refers to only 14 files).

Yes, this is expected because your new 'ds' has been created using

ds <- AromaUnitFracBCnBinarySet$byName(dataSet,
chipType="GenomeWideSNP_6", paths=rootPath);

As the "FracB" indicates, this 'ds' only contains allele B fractions. You can do

totalDs <- AromaUnitTotalCnBinarySet$byName(dataSet,
chipType="GenomeWideSNP_6", paths=rootPath);

to get the corresponding total CN data.
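
If you then want both signals in a single table, one option is to write
each set out separately and merge the two tables on the annotation
columns. Here is a minimal sketch in base R. The two small data frames
are made-up stand-ins for what readDataFrame() would return for the
fracB and the total sets: the column names follow the pattern in your
str() output, but the unit names and values are invented.

```r
# Toy stand-ins for the data frames read back from the fracB and total
# text files (hypothetical values; real ones come from readDataFrame()).
dFracB <- data.frame(
  unitName   = c("SNP_A-1", "SNP_A-2", "SNP_A-3"),
  chromosome = c(1L, 1L, 2L),
  position   = c(100L, 200L, 300L),
  `A,fracB`  = c(0.02, 0.51, 0.98),
  check.names = FALSE, stringsAsFactors = FALSE
)
dTotal <- data.frame(
  unitName   = c("SNP_A-1", "SNP_A-2", "SNP_A-3"),
  chromosome = c(1L, 1L, 2L),
  position   = c(100L, 200L, 300L),
  `A,total`  = c(1850, 2100, 1990),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Join on the shared annotation columns so that each unit appears once,
# with its fracB and total values side by side.
dBoth <- merge(dFracB, dTotal, by = c("unitName", "chromosome", "position"))
print(dBoth)
```

Merging on all three annotation columns rather than on unitName alone
avoids duplicated chromosome/position columns in the result.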

>
> There is also a problem of row duplication: you can see that the
> number of rows is the same as the number of Affymetrix SNP 6 units (so
> the result seems to be good).

Well, I've tried to reproduce what you have and I'm getting 2000000 rows:

> str(d);
'data.frame':   2000000 obs. of  5 variables:
 $ unitName                                                   : Factor w/ 500000 levels "AFFX-5Q-123",..: 1 2 3 4 487 490 493 496 499 502 ...
 $ chromosome                                                 : int  NA NA NA NA NA NA NA NA NA NA ...
 $ position                                                   : int  NA NA NA NA NA NA NA NA NA NA ...
 $ STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E03_238454,fracB: num  NA NA NA NA NA NA NA NA NA NA ...
 $ STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E04_238456,fracB: num  NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, "fileHeader")=List of 6
  ..$ comments: chr  "# name: TumorBoostPaper" "# tags: pairs,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY" "# fullName: TumorBoostPaper,pairs,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY" "# nbrOfFiles: 2" ...
  ..$ sep     : chr "\t"
  ..$ quote   : chr "\""
  ..$ skip    : num 0
  ..$ topRows :List of 10
  .. ..$ : chr  "unitName" "chromosome" "position" "STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E03_238454,fracB" ...
  .. ..$ : chr  "AFFX-5Q-123" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFFX-5Q-456" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFFX-5Q-789" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFFX-5Q-ABC" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFR_A02_SB" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFR_A04_SB" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFR_A06_SB" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFR_A08_SB" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFR_A10_SB" "NA" "NA" "NA" ...
  ..$ columns : chr  "unitName" "chromosome" "position" "STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E03_238454,fracB" ...

> But there are only 71429 unique unitNames. In
> fact, there are only 71429 unique rows:
>
>> str(unique(d))
> 'data.frame':   71429 obs. of  17 variables:
>  $ unitName  : Factor w/ 71429 levels "AFFX-5Q-123",..: 1 2 3 4 487 490 493 496 499 502 ...
>  $ chromosome: int  NA NA NA NA NA NA NA NA NA NA ...
>  $ position  : int  NA NA NA NA NA NA NA NA NA NA ...
>  $ A,fracB   : num  NA NA NA NA NA NA NA NA NA NA ...
>  $ B,fracB   : num  NA NA NA NA NA NA NA NA NA NA ...
>  $ C,fracB   : num  NA NA NA NA NA NA NA NA NA NA ...
>  $ ...
>
> Each row seems to be duplicated 26 times:
>> unique(table(d$unitName))
> [1] 26
>

I can't reproduce this.  Here is what I get:

> str(unique(d))
'data.frame':   500000 obs. of  5 variables:
 $ unitName                                                   : Factor w/ 500000 levels "AFFX-5Q-123",..: 1 2 3 4 487 490 493 496 499 502 ...
 $ chromosome                                                 : int  NA NA NA NA NA NA NA NA NA NA ...
 $ position                                                   : int  NA NA NA NA NA NA NA NA NA NA ...
 $ STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E03_238454,fracB: num  NA NA NA NA NA NA NA NA NA NA ...
 $ STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E04_238456,fracB: num  NA NA NA NA NA NA NA NA NA NA ...
 - attr(*, "fileHeader")=List of 6
  ..$ comments: chr  "# name: TumorBoostPaper" "# tags: pairs,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY" "# fullName: TumorBoostPaper,pairs,ACC,ra,-XY,BPN,-XY,AVG,FLN,-XY" "# nbrOfFiles: 2" ...
  ..$ sep     : chr "\t"
  ..$ quote   : chr "\""
  ..$ skip    : num 0
  ..$ topRows :List of 10
  .. ..$ : chr  "unitName" "chromosome" "position" "STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E03_238454,fracB" ...
  .. ..$ : chr  "AFFX-5Q-123" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFFX-5Q-456" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFFX-5Q-789" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFFX-5Q-ABC" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFR_A02_SB" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFR_A04_SB" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFR_A06_SB" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFR_A08_SB" "NA" "NA" "NA" ...
  .. ..$ : chr  "AFR_A10_SB" "NA" "NA" "NA" ...
  ..$ columns : chr  "unitName" "chromosome" "position" "STAIR_p_TCGA_Batch7_Affx_N_GenomeWideSNP_6_E03_238454,fracB" ...

>
> I use the extractDataFrame function on the ugp object and it seems to
> work so my ugp file is probably correct.

What exactly did you do here?

> I also notice that the 71429 unitNames of the 'd' object are the first
> 71429 lines of my ugp matrix.
>

Can you delete the txt file and rerun the following?

ds <- doCRMAv2("data", chipType="GenomeWideSNP_6",combineAlleles=FALSE);
dfTxt <- writeDataFrame(ds$fracB, columns=c("unitName", "chromosome",
"position", "*"))
d <- readDataFrame(dfTxt)

Do you still have the same problem?
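
To quantify the duplication independently of aroma, a base R check along
these lines may help. It is only a sketch: the small 'd' below is
made-up data standing in for the data frame returned by readDataFrame(),
built so that every unit appears twice.

```r
# Toy table in which every unit appears twice (made-up data standing in
# for the 'd' returned by readDataFrame()).
d <- data.frame(
  unitName = rep(c("AFFX-5Q-123", "AFFX-5Q-456", "AFFX-5Q-789"), times = 2),
  fracB    = rep(c(NA, 0.5, 1), times = 2),
  stringsAsFactors = FALSE
)

# How many times does each unit name occur? A single value of 1 means
# that no unit is repeated.
copies <- unique(as.vector(table(d$unitName)))
print(copies)

# Number of rows that are exact copies of an earlier row.
nDup <- sum(duplicated(d))
print(nDup)

# Keep only the first occurrence of each row.
dClean <- d[!duplicated(d), , drop = FALSE]
print(nrow(dClean))
```

On your real 'd', it would also be worth checking whether nrow(dClean)
equals the number of units you expect for the chip type.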

Pierre

> I hope you can help me out. Thank you
>
> --
> When reporting problems on aroma.affymetrix, make sure 1) to run the latest 
> version of the package, 2) to report the output of sessionInfo() and 
> traceback(), and 3) to post a complete code example.
>
>
> You received this message because you are subscribed to the Google Groups 
> "aroma.affymetrix" group with website http://www.aroma-project.org/.
> To post to this group, send email to aroma-affymetrix@googlegroups.com
> To unsubscribe and other options, go to http://www.aroma-project.org/forum/
>

