It would probably be much easier to add the generics to the Matrix package and have everyone depend on that.
On Wed, Nov 1, 2017 at 1:59 PM, Hervé Pagès <hpa...@fredhutch.org> wrote:

> That's probably a good idea but a clean solution would need to
> involve all players, including the Matrix package. Right now there
> are conflicts for some S4 generics defined in Matrix and in
> BiocGenerics (e.g. rowSums). I'm not sure that moving rowSums from
> BiocGenerics to a new MatrixGenerics package would address this.
> Unless MatrixGenerics is on CRAN and Matrix depends on it ;-)
>
> How likely is this to happen?
>
> H.
>
> On 11/01/2017 01:44 PM, Peter Hickey wrote:
>
>> I think that's a good idea, Kylie.
>> Pete (DelayedMatrixStats developer)
>>
>> On Thu., 2 Nov. 2017, 6:09 am Kasper Daniel Hansen
>> <kasperdanielhan...@gmail.com> wrote:
>>
>>> I think it makes sense. A lot of sense. Might be useful to involve
>>> Henrik (matrixStats) as well.
>>>
>>> Who are the players, apart from DelayedArray/DelayedMatrixStats and
>>> matter? (And some very old stuff in Biobase which should really be
>>> deprecated in favor of matrixStats.)
>>>
>>> Best,
>>> Kasper
>>>
>>> On Wed, Nov 1, 2017 at 3:03 PM, Bemis, Kylie
>>> <k.be...@northeastern.edu> wrote:
>>>
>>>> Hi all,
>>>>
>>>> To continue a variant of this conversation, with the latest BioC
>>>> release, we now have quite a few packages that are implementing
>>>> various matrix-related S4 generic functions, many of them relying
>>>> on matrixStats as a template.
>>>>
>>>> I was wondering if there is any interest or intention to create a
>>>> common MatrixGenerics/ArrayGenerics package on which we can depend
>>>> to import the relevant S4 generic functions. Although BiocGenerics
>>>> has a few like 'rowSums()' and 'colMeans()', etc., there are many
>>>> more that are implemented across 'DelayedArray',
>>>> 'DelayedMatrixStats', my own package 'matter', etc., including
>>>> 'apply()', 'rowSds()', 'colVars()', and so forth.
>>>>
>>>> It would be nice to have a single package with minimal additional
>>>> dependencies (a la BiocGenerics) where we could import the various
>>>> S4 generics and avoid unwanted namespace collisions.
>>>>
>>>> Have there been any thoughts on this?
>>>>
>>>> Many thanks,
>>>> Kylie
>>>>
>>>> ~~~
>>>> Kylie Ariel Bemis
>>>> Future Faculty Fellow
>>>> College of Computer and Information Science
>>>> Northeastern University
>>>> kuwisdelu.github.io
>>>>
>>>> On Mar 3, 2017, at 11:27 AM, Kasper Daniel Hansen
>>>> <kasperdanielhan...@gmail.com> wrote:
>>>>
>>>> On Fri, Mar 3, 2017 at 10:22 AM, Vincent Carey
>>>> <st...@channing.harvard.edu> wrote:
>>>>
>>>> On Fri, Mar 3, 2017 at 10:07 AM, Kasper Daniel Hansen
>>>> <kasperdanielhan...@gmail.com> wrote:
>>>>
>>>> Some comments on Aaron's stuff.
>>>>
>>>> One possibility for doing things like this is if your code can be
>>>> done in C++ using a subset of rows or columns. That can sometimes
>>>> give the necessary speed up. What I mean is this:
>>>>
>>>> Say you can safely process 1000 cells (not matrix cells, but
>>>> biological cells, aka columns) at a time in RAM.
>>>>
>>>> iterate in R:
>>>>   get chunk i containing 1000 cells from the backend data storage
>>>>   do something on this sub matrix where everything is in a normal
>>>>     matrix and you just use C++
>>>>   write results out to whatever backend you're using
>>>>
>>>> Then, with a million cells you iterate over 1000 chunks in R.
>>>> And you don't need to "touch" the full dataset which can be stored
>>>> on an arbitrary backend.
>>>>
>>>> you "touch" it, but you never ingest the whole thing at any time,
>>>> is that what you mean?
>>>>
>>>> Yes, you load the chunk into RAM and then just deal with it.
>>>>
>>>> Think of doing 10^10 linear models. If this was 10^6 I would just
>>>> use lmFit. But 10^10 doesn't fit into memory. So I load 10^7 into
>>>> memory, run lmFit, store results, redo. This is bound to be much
>>>> more efficient than loading a single row into memory and doing lm
>>>> 10^10 times, because lmFit is written to do many linear models at
>>>> the same time.
>>>>
>>>> I am suggesting that this is a potential general strategy.
>>>>
>>>> And this approach could be run even (potentially) with different
>>>> chunks on different nodes.
>>>>
>>>> that seems to me to be an important if not essential desideratum.
>>>>
>>>> what then is the role of C++? extracting a chunk? preexisting
>>>> utilities?
>>>>
>>>> When I say C++ I just mean write an efficient implementation that
>>>> works on a chunk, like lmFit. It is true that anything that works
>>>> on a chunk will work on a single row/column (like lmFit) but there
>>>> are possibilities for optimization when you work at the chunk
>>>> level.
>>>>
>>>> Obviously not all computations can be done chunkwise. But for
>>>> those that can, this is a strategy which is independent of the data
>>>> backend.
>>>>
>>>> I wonder whether this "obviously not" needs to be rethought.
>>>> Algorithms that are implemented to work with data holistically may
>>>> need to be reexpressed so that they can succeed with chunkwise
>>>> access. Is this a new mindset needed for holist developers, or can
>>>> the effective data decompositions occur autonomously?
>>>>
>>>> Well, I would say it is obvious that not all computations can be
>>>> done chunkwise.
>>>> But of course, in the limit of extremely large data, algorithms
>>>> which need to cycle over everything no longer scale. So in that
>>>> case all practical computations can be done chunkwise, out of
>>>> necessity. For single cell right now, where it is just millions of
>>>> cells on the horizon, people will think that they can get
>>>> "standard" holistic approaches to work (and that is probably true).
>>>> If they had a billion cells they probably wouldn't think about
>>>> that.
>>>>
>>>> Kasper
>>>>
>>>> If you need direct access to the data in the backend in C++ it will
>>>> be extremely backend dependent what is fast and how to do it. That
>>>> doesn't mean we shouldn't do it though.
>>>>
>>>> Best,
>>>> Kasper
>>>>
>>>> On Fri, Mar 3, 2017 at 6:47 AM, Vincent Carey
>>>> <st...@channing.harvard.edu> wrote:
>>>>
>>>> Kylie, thanks for reminding us of matter -- I saw you speak about
>>>> this at the first Bioconductor Boston Meetup, but it went like
>>>> lightning. For developers contemplating an approach to
>>>> representing high-volume rectangular data, where there is no
>>>> dominant legacy format, it is natural to wonder whether HDF5 would
>>>> be adequate, and, further, to wonder how to demonstrate that it is
>>>> or is not dominated by some other approach for a given set of
>>>> tasks. Should we devise a set of bioinformatic benchmark problems
>>>> to foster comparison and informed decision-making? @becker.gabe:
>>>> is ALTREP far enough along that one could contemplate benchmarking
>>>> with it?
>>>>
>>>> On Fri, Feb 24, 2017 at 7:08 PM, Bemis, Kylie
>>>> <k.be...@northeastern.edu> wrote:
>>>>
>>>>> It's not there yet, but I plan to expose a C++ API for my
>>>>> disk-backed matrix objects in the next version of my 'matter'
>>>>> package.
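The chunkwise strategy Kasper describes above (fetch a block of columns, compute on it as an ordinary in-memory matrix, store the result, repeat) can be sketched roughly as follows. This is only an illustration under stated assumptions, not code from any of the packages mentioned: `ChunkLoader`, `in_memory_loader` and `row_sums_chunked` are hypothetical names, the "backend" is a plain in-memory vector standing in for HDF5, matter or similar storage, and per-row sums stand in for whatever per-chunk computation is needed.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical loader type: returns columns [first, first + count) as a
// dense, column-major block. A real backend would read from disk here.
using ChunkLoader =
    std::function<std::vector<double>(std::size_t first, std::size_t count)>;

// Stand-in backend for the example: serves chunks from an in-memory,
// column-major matrix with nrow rows.
ChunkLoader in_memory_loader(std::vector<double> data, std::size_t nrow) {
    return [data, nrow](std::size_t first, std::size_t count) {
        const auto begin =
            data.begin() + static_cast<std::ptrdiff_t>(first * nrow);
        return std::vector<double>(
            begin, begin + static_cast<std::ptrdiff_t>(count * nrow));
    };
}

// Accumulate per-row sums over all columns while holding only one chunk
// of at most chunk_size columns in memory at a time.
std::vector<double> row_sums_chunked(const ChunkLoader& load,
                                     std::size_t nrow, std::size_t ncol,
                                     std::size_t chunk_size) {
    assert(chunk_size > 0);
    std::vector<double> sums(nrow, 0.0);
    for (std::size_t first = 0; first < ncol; first += chunk_size) {
        // The last chunk may be shorter than chunk_size.
        const std::size_t count = std::min(chunk_size, ncol - first);
        const std::vector<double> chunk = load(first, count);
        assert(chunk.size() == nrow * count);
        for (std::size_t c = 0; c < count; ++c)
            for (std::size_t r = 0; r < nrow; ++r)
                sums[r] += chunk[c * nrow + r];
    }
    return sums;
}
```

Because the loop only sees the loader, the same function would work unchanged with a different backend, which is the backend independence argued for in the thread. The chunk size would be chosen so that one block fits comfortably in RAM.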
>>>>>
>>>>> It's getting easier to interchange matter/HDF5Array/bigmemory/etc.
>>>>> objects at the R level, especially if using a frontend like
>>>>> DelayedArray on top of them, but it would be nice to have a common
>>>>> C++ API that I could hook into as well (a la Rcpp), so new C/C++
>>>>> code could be re-used across various backends more easily.
>>>>>
>>>>> Kylie
>>>>>
>>>>> ~~~
>>>>> Kylie Ariel Bemis
>>>>> Future Faculty Fellow
>>>>> College of Computer and Information Science
>>>>> Northeastern University
>>>>> kuwisdelu.github.io
>>>>>
>>>>> On Feb 24, 2017, at 4:50 PM, Aaron Lun <a...@wehi.edu.au> wrote:
>>>>>
>>>>> It's a good place to start, though it would be very handy to have a
>>>>> C(++) API that can be linked against. I'm not sure how much work
>>>>> that would entail but it would give downstream developers a lot
>>>>> more options.
>>>>> Sort of like how we can link to Rhtslib, which speeds up a lot of
>>>>> BAM file processing, instead of just relying on Rsamtools.
>>>>>
>>>>> -Aaron
>>>>>
>>>>> ________________________________
>>>>> From: Tim Triche, Jr. <tim.tri...@gmail.com>
>>>>> Sent: Saturday, 25 February 2017 8:34:58 AM
>>>>> To: Aaron Lun
>>>>> Cc: bioc-devel@r-project.org
>>>>> Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
>>>>>
>>>>> yes
>>>>>
>>>>> the DelayedArray framework that handles HDF5Array, etc. seems like
>>>>> the right choice?
>>>>>
>>>>> --t
>>>>>
>>>>> On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun <a...@wehi.edu.au> wrote:
>>>>>
>>>>> Hi everyone,
>>>>>
>>>>> I just attended the Human Cell Atlas meeting in Stanford, and people
>>>>> were talking about gene expression matrices for >1 million cells.
>>>>> If we assume that we can get non-zero expression profiles for ~5000
>>>>> genes, we'd be talking about a 5000 x 1 million matrix for the raw
>>>>> count data. This would be 20-40 GB in size, which would clearly
>>>>> benefit from sparse (via Matrix) or disk-backed representations
>>>>> (bigmatrix, BufferedMatrix, rhdf5, etc.).
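For reference, the 20-40 GB range quoted above is what a fully dense 5000 x 1 million matrix takes at 4 or 8 bytes per entry. The message does not say which storage types were meant, so reading the two ends as 32-bit integers and 64-bit doubles is an assumption, and `dense_bytes` is a hypothetical helper written only for this check. GB here means 10^9 bytes.

```cpp
#include <cstdint>

// Bytes needed to hold an nrow x ncol matrix densely at a fixed entry size.
constexpr std::uint64_t dense_bytes(std::uint64_t nrow, std::uint64_t ncol,
                                    std::uint64_t bytes_per_entry) {
    return nrow * ncol * bytes_per_entry;
}

constexpr std::uint64_t kGenes = 5000;     // rows, as in the message above
constexpr std::uint64_t kCells = 1000000;  // columns, as in the message above

// 4 bytes per entry (assumed: 32-bit integer counts) gives 2e10 bytes = 20 GB.
static_assert(dense_bytes(kGenes, kCells, 4) == 20000000000ULL, "20 GB");
// 8 bytes per entry (assumed: double precision) gives 4e10 bytes = 40 GB.
static_assert(dense_bytes(kGenes, kCells, 8) == 40000000000ULL, "40 GB");
```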
>>>>>
>>>>> I'm wondering whether there is any appetite amongst us for making a
>>>>> consistent BioC API to handle these matrices, sort of like what
>>>>> BiocParallel does for multicore and snow. It goes without saying
>>>>> that the different matrix representations should have consistent
>>>>> functions at the R level (rbind/cbind, etc.) but it would also be
>>>>> nice to have an integrated C/C++ API (accessible via LinkingTo).
>>>>> There are many non-trivial things that can be done with this type
>>>>> of data, and it is often faster and more memory efficient to do
>>>>> these complex operations in compiled code.
>>>>>
>>>>> I was thinking of something where you could supply any supported
>>>>> matrix representation to a registered function via .Call; the C++
>>>>> constructor would recognise the type of matrix during class
>>>>> instantiation; and operations (row/column/random read access, also
>>>>> possibly various ways of writing a matrix) would be overloaded and
>>>>> behave as required for the class. Only the implementation of the
>>>>> API would need to care about the nitty gritty of each
>>>>> representation, and we would all be free to write code that
>>>>> actually does the interesting analytical stuff.
>>>>>
>>>>> Anyway, just throwing some thoughts out there. Any comments
>>>>> appreciated.
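One way to picture the API described above is a small abstract interface with one implementation per representation, so that analysis code is written once against the interface. The sketch below is purely illustrative and is not the API of any existing package: `MatrixView`, `DenseView`, `SparseCscView` and `col_sums` are hypothetical names, only column read access is shown, and plain C++ constructors stand in for the step where a real package would inspect the R object passed through .Call.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical common interface: every backend offers the same read access.
class MatrixView {
public:
    virtual ~MatrixView() = default;
    virtual std::size_t nrow() const = 0;
    virtual std::size_t ncol() const = 0;
    // Return column c as a dense vector of length nrow().
    virtual std::vector<double> get_col(std::size_t c) const = 0;
};

// Dense backend: values stored column-major, like an ordinary R matrix.
class DenseView : public MatrixView {
public:
    DenseView(std::size_t nr, std::size_t nc, std::vector<double> x)
        : nr_(nr), nc_(nc), x_(std::move(x)) {
        assert(x_.size() == nr_ * nc_);
    }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return nc_; }
    std::vector<double> get_col(std::size_t c) const override {
        assert(c < nc_);
        const auto first = x_.begin() + static_cast<std::ptrdiff_t>(c * nr_);
        return std::vector<double>(first,
                                   first + static_cast<std::ptrdiff_t>(nr_));
    }
private:
    std::size_t nr_, nc_;
    std::vector<double> x_;
};

// Sparse backend in compressed sparse column layout: p holds nc + 1 column
// pointers, i the row index of each stored value, x the stored values.
class SparseCscView : public MatrixView {
public:
    SparseCscView(std::size_t nr, std::size_t nc, std::vector<std::size_t> p,
                  std::vector<std::size_t> i, std::vector<double> x)
        : nr_(nr), nc_(nc), p_(std::move(p)), i_(std::move(i)),
          x_(std::move(x)) {
        assert(p_.size() == nc_ + 1 && i_.size() == x_.size());
    }
    std::size_t nrow() const override { return nr_; }
    std::size_t ncol() const override { return nc_; }
    std::vector<double> get_col(std::size_t c) const override {
        assert(c < nc_);
        std::vector<double> out(nr_, 0.0);  // entries not stored are zero
        for (std::size_t k = p_[c]; k < p_[c + 1]; ++k) out[i_[k]] = x_[k];
        return out;
    }
private:
    std::size_t nr_, nc_;
    std::vector<std::size_t> p_, i_;
    std::vector<double> x_;
};

// Analysis code sees only the interface, never the storage details.
std::vector<double> col_sums(const MatrixView& m) {
    std::vector<double> sums(m.ncol(), 0.0);
    for (std::size_t c = 0; c < m.ncol(); ++c)
        for (double v : m.get_col(c)) sums[c] += v;
    return sums;
}
```

With this shape, `col_sums` gives the same answer for a dense and a sparse copy of the same matrix, and adding a disk-backed representation would mean adding one more subclass rather than touching the analysis code. A production design would likely also want row access, chunked access and a way to avoid copying each column.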
>>>>>
>>>>> Cheers,
>>>>>
>>>>> Aaron
>>>>>
>>>>> [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel@r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpa...@fredhutch.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319