Aaron
Can you describe use cases, i.e. intended computations on these
matrices, esp. those for which C++ access is needed?
I'm asking b/c the goals of efficient code and abstraction from how the
data are stored may be conflicting - in which case critical algorithms
may end up circumventing a prematurely defined API.
Wolfgang
25.2.17 00:37, Aaron Lun scripsit:
Yes, I think double-precision would be necessary for general use. Only the raw
count data would be integer, and even then that's not guaranteed (e.g., if
people are using kallisto or salmon for quantification).
-Aaron
________________________________
From: Vincent Carey <st...@channing.harvard.edu>
Sent: Saturday, 25 February 2017 9:25 AM
To: Aaron Lun
Cc: Tim Triche, Jr.; bioc-devel@r-project.org
Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
What is the data type for an expression value? Is it assumed that double
precision will be needed?
On Fri, Feb 24, 2017 at 4:50 PM, Aaron Lun
<a...@wehi.edu.au<mailto:a...@wehi.edu.au>> wrote:
It's a good place to start, though it would be very handy to have a C(++) API
that can be linked against. I'm not sure how much work that would entail but it
would give downstream developers a lot more options. Sort of like how we can
link to Rhtslib, which speeds up a lot of BAM file processing, instead of just
relying on Rsamtools.
-Aaron
________________________________
From: Tim Triche, Jr. <tim.tri...@gmail.com<mailto:tim.tri...@gmail.com>>
Sent: Saturday, 25 February 2017 8:34:58 AM
To: Aaron Lun
Cc: bioc-devel@r-project.org<mailto:bioc-devel@r-project.org>
Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
yes
the DelayedArray framework that handles HDF5Array, etc. seems like the right
choice?
--t
On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun
<a...@wehi.edu.au<mailto:a...@wehi.edu.au>>
wrote:
Hi everyone,
I just attended the Human Cell Atlas meeting in Stanford, and people were talking
about gene expression matrices for >1 million cells. If we assume that we can
get non-zero expression profiles for ~5000 genes, we'd be talking about a 5000 x 1
million matrix for the raw count data. This would be 20-40 GB in size, which would
clearly benefit from sparse (via Matrix) or disk-backed representations
(bigmatrix, BufferedMatrix, rhdf5, etc.).
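For scale, the 20-40 GB figure follows directly from the matrix dimensions and element size; a minimal arithmetic check (illustrative constants only, not part of any proposed API):

```cpp
#include <cstdint>

// Back-of-envelope check of the 20-40 GB figure for a dense
// 5000-gene x 1,000,000-cell matrix.
constexpr std::int64_t NROW = 5000;
constexpr std::int64_t NCOL = 1000000;
constexpr std::int64_t N = NROW * NCOL;                  // 5e9 entries
constexpr std::int64_t GB_INT    = N * 4 / 1000000000;   // 20 GB as 32-bit ints
constexpr std::int64_t GB_DOUBLE = N * 8 / 1000000000;   // 40 GB as doubles
```

Hence the 20 GB (integer counts) to 40 GB (double precision) range quoted above, before any sparsity or disk backing is exploited.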
I'm wondering whether there is any appetite amongst us for making a consistent
BioC API to handle these matrices, sort of like what BiocParallel does for
multicore and snow. It goes without saying that the different matrix
representations should have consistent functions at the R level (rbind/cbind,
etc.) but it would also be nice to have an integrated C/C++ API (accessible via
LinkingTo). There are many non-trivial things that can be done with this type of
data, and it is often faster and more memory-efficient to do these complex
operations in compiled code.
I was thinking of something where you could supply any supported matrix
representation to a registered function via .Call; the C++ constructor would
recognise the type of matrix during class instantiation; and operations
(row/column/random read access, and possibly various ways of writing a matrix)
would be overloaded and behave as required for the class. Only the
implementation of the API would need to care about the nitty gritty of each
representation, and we would all be free to write code that actually does the
interesting analytical stuff.
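As a rough illustration of that idea, here is a minimal C++ sketch of what such an overloaded-access interface might look like. All class and method names below are hypothetical, invented for this sketch, and not an existing BioC API; the dispatch-on-construction step is only hinted at in a comment.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Abstract interface: analysis code is written once against any_matrix,
// and the concrete subclass is chosen when the R object is inspected
// during construction (e.g. ordinary matrix vs dgCMatrix vs HDF5-backed).
class any_matrix {
public:
    virtual ~any_matrix() = default;
    virtual std::size_t nrow() const = 0;
    virtual std::size_t ncol() const = 0;
    // Fill 'out' (length ncol()) with row 'r'; backends override this.
    virtual void get_row(std::size_t r, double* out) const = 0;
    // Fill 'out' (length nrow()) with column 'c'.
    virtual void get_col(std::size_t c, double* out) const = 0;
};

// One concrete backend: dense, column-major storage, as in an ordinary
// R numeric matrix.
class dense_matrix : public any_matrix {
    std::vector<double> data;
    std::size_t nr, nc;
public:
    dense_matrix(std::vector<double> d, std::size_t r, std::size_t c)
        : data(std::move(d)), nr(r), nc(c) {}
    std::size_t nrow() const override { return nr; }
    std::size_t ncol() const override { return nc; }
    void get_row(std::size_t r, double* out) const override {
        for (std::size_t c = 0; c < nc; ++c) out[c] = data[c * nr + r];
    }
    void get_col(std::size_t c, double* out) const override {
        for (std::size_t r = 0; r < nr; ++r) out[r] = data[c * nr + r];
    }
};
```

A sparse or disk-backed class would implement the same virtuals (reading only non-zero entries, or pulling chunks from file), so downstream code holding an any_matrix pointer would not need to care which representation it received; only the implementations of the interface would deal with each backend's storage details.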
Anyway, just throwing some thoughts out there. Any comments appreciated.
Cheers,
_______________________________________________
Bioc-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/bioc-devel