Hi everyone,

I just attended the Human Cell Atlas meeting at Stanford, and people were 
talking about gene expression matrices for >1 million cells. If we assume that 
we can get non-zero expression profiles for ~5000 genes, we'd be talking about 
a 5000 x 1 million matrix for the raw count data. This would be 20-40 GB in 
size if stored densely, which would clearly benefit from sparse (via Matrix) or 
disk-backed representations (bigmemory, BufferedMatrix, rhdf5, etc.).

I'm wondering whether there is any appetite amongst us for making a consistent 
BioC API to handle these matrices, sort of like what BiocParallel does for 
multicore and snow. It goes without saying that the different matrix 
representations should have consistent functions at the R level (rbind/cbind, 
etc.), but it would also be nice to have an integrated C/C++ API (accessible 
via LinkingTo). There are many non-trivial things that can be done with this 
type of data, and it is often faster and more memory-efficient to do these 
complex operations in compiled code.

I was thinking of something where you could supply any supported matrix 
representation to a registered function via .Call; the C++ constructor would 
recognise the type of matrix during class instantiation; and operations 
(row/column/random read access, and possibly various ways of writing a matrix) 
would be overloaded and behave as required for each class. Only the 
implementation of the API would need to care about the nitty-gritty of each 
representation, and the rest of us would be free to write code that actually 
does the interesting analytical stuff.
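
To make that a bit more concrete, here is a rough sketch of what the compiled 
side might look like. Everything below is illustrative only: the names 
(any_matrix, dense_matrix, sparse_matrix, create_matrix, column_sums) are made 
up, and it only covers ordinary double matrices and dgCMatrix objects from 
Matrix, but HDF5-backed or file-backed representations would slot in the same 
way.

// Illustrative sketch only; all class and function names are hypothetical.
#define R_NO_REMAP
#include <Rinternals.h>

#include <algorithm>
#include <cstddef>
#include <memory>
#include <vector>

// The interface that analysis code would see, regardless of representation.
class any_matrix {
public:
    virtual ~any_matrix() {}
    virtual int nrow() const = 0;
    virtual int ncol() const = 0;
    virtual void get_col(int j, double* out) const = 0; // fill 'out' with column j
};

// Ordinary dense double-precision matrix.
class dense_matrix : public any_matrix {
public:
    dense_matrix(SEXP mat) : ptr(REAL(mat)), nr(Rf_nrows(mat)), nc(Rf_ncols(mat)) {}
    int nrow() const { return nr; }
    int ncol() const { return nc; }
    void get_col(int j, double* out) const {
        const double* src = ptr + static_cast<std::size_t>(j) * nr;
        std::copy(src, src + nr, out);
    }
private:
    const double* ptr;
    int nr, nc;
};

// Column-compressed sparse matrix (dgCMatrix from the Matrix package).
class sparse_matrix : public any_matrix {
public:
    sparse_matrix(SEXP mat) :
        i(INTEGER(R_do_slot(mat, Rf_install("i")))),
        p(INTEGER(R_do_slot(mat, Rf_install("p")))),
        x(REAL(R_do_slot(mat, Rf_install("x")))) {
        SEXP dim = R_do_slot(mat, Rf_install("Dim"));
        nr = INTEGER(dim)[0];
        nc = INTEGER(dim)[1];
    }
    int nrow() const { return nr; }
    int ncol() const { return nc; }
    void get_col(int j, double* out) const {
        std::fill(out, out + nr, 0.0);
        for (int k = p[j]; k < p[j + 1]; ++k) { out[i[k]] = x[k]; }
    }
private:
    const int *i, *p;
    const double* x;
    int nr, nc;
};

// The constructor/factory that recognises the representation at instantiation.
// (Disk-backed classes would be added as extra branches here.)
std::unique_ptr<any_matrix> create_matrix(SEXP incoming) {
    if (Rf_inherits(incoming, "dgCMatrix")) {
        return std::unique_ptr<any_matrix>(new sparse_matrix(incoming));
    }
    if (Rf_isReal(incoming) && Rf_isMatrix(incoming)) {
        return std::unique_ptr<any_matrix>(new dense_matrix(incoming));
    }
    Rf_error("unsupported matrix representation");
    return std::unique_ptr<any_matrix>(); // not reached
}

// Downstream analysis code is oblivious to the representation, e.g. via .Call:
extern "C" SEXP column_sums(SEXP incoming) {
    std::unique_ptr<any_matrix> mat = create_matrix(incoming);
    SEXP out = PROTECT(Rf_allocVector(REALSXP, mat->ncol()));
    std::vector<double> buffer(mat->nrow());
    for (int j = 0; j < mat->ncol(); ++j) {
        mat->get_col(j, buffer.data());
        double total = 0;
        for (int r = 0; r < mat->nrow(); ++r) { total += buffer[r]; }
        REAL(out)[j] = total;
    }
    UNPROTECT(1);
    return out;
}

From R, something like .Call("column_sums", mat) would then give the same 
answer whether mat is an ordinary matrix or a dgCMatrix, and the headers could 
be exported so that other packages pick up create_matrix() through LinkingTo.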

Anyway, just throwing some thoughts out there. Any comments appreciated.

Cheers,

Aaron
