[Bioc-devel] Bioconductor's GIT transition

2017-03-03 Thread Turaga, Nitesh
Dear Bioconductor Developers,

Big news! We are planning to migrate from SVN to git. This is a major change in 
our version control model. We understand this may be disruptive to some 
developers and are working to make the transition as smooth as possible.  The 
end goal is to provide a versioning system that supports both robust code
development and community coding.

Git has emerged as a modern replacement for SVN and is widely used in
the bioinformatics community. Many Bioconductor packages are maintained 
primarily on git and perhaps a majority of the commits to our SVN repository 
are from git. Git encourages broad community participation in development of 
both Bioconductor infrastructure and contributed software / experiment data 
packages. Recent developers of new packages have been using git and GitHub for
their package development, and this has worked very well for both developers 
and core team reviewers.

We are producing robust scripts to create git repositories of each package in 
our current SVN repository. The git repositories contain the complete commit 
history for 'devel', and for all releases as branches. Details for interaction 
with our git server, including the role of GitHub, are still being finalized.

More information about the specifics of the transition plan will be announced 
in the middle of March. We anticipate a fully functional 'beta' version 
available for broad testing immediately after the next Bioconductor release, in 
mid-April. Once we are confident in the new repositories and work flows, we 
will switch to an exclusively git-based version control model; SVN repositories 
will remain available as a 'read only' resource for as long as is feasible.

We welcome all feedback during this test period; please respond to this post 
with comments, or contact 
nitesh.tur...@roswellpark.org directly.

Best,

Nitesh Turaga
Bioconductor Core Team




Re: [Rd] Control statements with condition with greater than one should give error (not just warning) [PATCH]

2017-03-03 Thread Henrik Bengtsson
On Fri, Mar 3, 2017 at 9:55 AM, Hadley Wickham  wrote:
>> But how do you propose a warning-to-error transition should be made
>> without wreaking havoc?  Just flip the switch in R-devel and see CRAN
>> and Bioconductor packages break overnight?  Particularly Bioconductor
>> devel might become non-functional (since at times it requires
>> R-devel).  For my own code / packages, I would be able to handle such
>> a change, but I'm completely out of control if one of the packages I'm
>> depending on does not provide a quick fix (with the only option to
>> remove package tests for those dependencies).
>
> Generally, a package cannot be on CRAN if it has any warnings, so I
> don't think this change would have any impact on CRAN packages.  Isn't
> this also true for Bioconductor?

Having a tests/warn.R file with:

warning("boom")

passes through R CMD check --as-cran unnoticed.  Same with:

if (sample(2) == 1) message("It's your lucky day today!")
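
A stopgap that already works today (a sketch relying on base R's
documented 'warn' option; not part of the proposal): promote warnings to
errors inside the test file itself, so R CMD check fails on them:

options(warn = 2)  # documented base R: every warning becomes an error
warning("boom")    # now an error, so the check no longer passes silently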

/Henrik

PS. Does testthat signal that?

>
> Hadley
>
> --
> http://hadley.nz



Re: [Rd] Control statements with condition with greater than one should give error (not just warning) [PATCH]

2017-03-03 Thread Henrik Bengtsson
On Fri, Mar 3, 2017 at 9:22 AM, Martin Maechler
 wrote:
>> Henrik Bengtsson 
>> on Fri, 3 Mar 2017 00:52:16 -0800 writes:
>
> > I'd like to propose that whenever the length of the condition passed
> > to an if or a while statement differs from one, an error is produced
> > rather than just a warning as today:
>
> >> x <- 1:2
> >> if (x == 1) message("x == 1")
> > x == 1
> > Warning message:
> > In if (x == 1) message("x == 1") :
> > the condition has length > 1 and only the first element will be used
>
> > There are probably legacy reasons for why this is accepted by R in the
> > first place, but I cannot imagine that anyone wants to use an if/while
> > statement this way on purpose.  The warning about this misuse was
> > introduced in November 2002 (R-devel thread 'vector arguments to
> > if()'; 
> https://stat.ethz.ch/pipermail/r-devel/2002-November/025537.html).
>
> yes, before, there was *no* warning at all and so the problem existed
> in several partly important R packages.
>
> Now is a different time, I agree, and I even tend to agree we
> should make this an error... probably however not for the
> upcoming R 3.4.0 (in April which is somewhat soon) but rather
> for the next version.
>
>
> > Below is a patch (also attached) that introduces the option
> > 'check.condition' such that when TRUE,
>
> ouch ouch ouch!   There are many sayings starting with
>   "The way to hell "
>
> Here:
>
> The way to R hell starts (or "widens", your choice) by
> introducing options() that influence basic language semantics
>
> !!
>
> For robust code you will start to test all code of R for all
> different possible combinations of these options being set.  I am
> sure you would not want this.

You would only want to test with check.condition = TRUE.  No new code,
package updates, etc. should be accepted if they pass only with
check.condition = FALSE.

>
> No --- don't even think of allowing an option for something so basic!

But how do you propose a warning-to-error transition should be made
without wreaking havoc?  Just flip the switch in R-devel and see CRAN
and Bioconductor packages break overnight?  Particularly Bioconductor
devel might become non-functional (since at times it requires
R-devel).  For my own code / packages, I would be able to handle such
a change, but I'm completely out of control if one of the packages I'm
depending on does not provide a quick fix (with the only option to
remove package tests for those dependencies).

My idea is that, with this option, the new behavior can be tested at
runtime locally by users and developers (cf. warnPartialMatchArgs), but
also via R CMD check.  It would also give CRAN a way to check incoming
submissions and the packages on the test farm, until eventually all
CRAN packages pass without errors.  The option would only exist for a
number of R releases (first defaulting to FALSE, then to TRUE) before
eventually being deprecated and removed.  Does this clarify my design?

As an alternative to an option, one could use an environment variable
R_CHECK_CONDITION that is a bit "hidden" from misuse.
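
For illustration, a minimal sketch of how the opt-in could look in a
user's .Rprofile (both R_CHECK_CONDITION and check.condition are names
proposed here, not existing R features):

# opt in to the stricter check only when the environment variable is set
if (nzchar(Sys.getenv("R_CHECK_CONDITION")))
    options(check.condition = TRUE)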

/Henrik

>
> Martin Maechler
> ETH Zurich (and R Core)
>
> > it will generate an error
> > rather than a warning (default).  This option allows for a smooth
> > migration as it can be added to 'R CMD check --as-cran' and developers
> > can be given time to check and fix their packages.  Eventually,
> > check.condition=TRUE can become the new default.
>
> > With options(check.condition = TRUE), one gets:
>
> >> x <- 1:2
> >> if (x == 1) message("x == 1")
> > Error in if (x == 1) message("x == 1") : the condition has length > 1
>
> > and
>
> >> while (x < 2) message("x < 2")
> > Error in while (x < 2) message("x < 2") : the condition has length > 1
>
>
> > Index: src/library/base/man/options.Rd
> > ===
> > --- src/library/base/man/options.Rd (revision 72298)
> > +++ src/library/base/man/options.Rd (working copy)
> > @@ -86,6 +86,11 @@
> > vector (atomic or \code{\link{list}}) is extended, by something
> > like \code{x <- 1:3; x[5] <- 6}.}
>
> > +\item{\code{check.condition}:}{logical, defaulting to \code{FALSE}.  If
> > +  \code{TRUE}, an error is produced whenever the condition to an
> > +  \code{if} or a \code{while} control statement is of length greater
> > +  than one.  If \code{FALSE}, a \link{warning} is produced.}
> > +
> > \item{\code{CBoundsCheck}:}{logical, controlling whether
> > \code{\link{.C}} and \code{\link{.Fortran}} make copies to check for
> > array over-runs on the atomic vector arguments.
> > @@ -445,6 +450,7 @@
> > \tabular{ll}{
> > \code{add.smooth} \tab \code{TRUE}\cr
> > \code{check.bounds} \tab \code{FALSE}\cr
> > +\code{check.condition} 

[Rd] Trouble installing packages when history mechanism is modified by user profile

2017-03-03 Thread Hugo Raguet
I tried installing the 'ks' package from my interactive R session; it
failed with the following:

Erreur dans .External2(C_loadhistory, file) :
  aucun mécanisme d'historique des commandes disponible
Calls: 
Exécution arrêtée

The second line is French for "no command history mechanism available";
the fourth is "execution stopped".
This does not happen when I comment out the following line from my
.Rprofile:
utils::loadhistory(file = "~/.Rhistory")

On Stack Overflow, someone else has similar trouble with another package,
which also seems to be related to the command history:
http://stackoverflow.com/questions/18240863/installing-packages-on-r-fails-when-loading-rprofile#18256224

Is this a bug in R, or in the packages concerned?
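
In the meantime, a defensive workaround (a sketch, assuming the failure
comes from the non-interactive R subprocess spawned during package
installation, which has no command history mechanism):

# .Rprofile: only touch the history in interactive sessions, never fail hard
if (interactive())
    try(utils::loadhistory(file = "~/.Rhistory"), silent = TRUE)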


Re: [Rd] Control statements with condition with greater than one should give error (not just warning) [PATCH]

2017-03-03 Thread Martin Maechler
> Henrik Bengtsson 
> on Fri, 3 Mar 2017 00:52:16 -0800 writes:

> I'd like to propose that whenever the length of the condition passed
> to an if or a while statement differs from one, an error is produced
> rather than just a warning as today:

>> x <- 1:2
>> if (x == 1) message("x == 1")
> x == 1
> Warning message:
> In if (x == 1) message("x == 1") :
> the condition has length > 1 and only the first element will be used

> There are probably legacy reasons for why this is accepted by R in the
> first place, but I cannot imagine that anyone wants to use an if/while
> statement this way on purpose.  The warning about this misuse was
> introduced in November 2002 (R-devel thread 'vector arguments to
> if()'; https://stat.ethz.ch/pipermail/r-devel/2002-November/025537.html).

yes, before, there was *no* warning at all and so the problem existed
in several partly important R packages.

Now is a different time, I agree, and I even tend to agree we
should make this an error... probably however not for the
upcoming R 3.4.0 (in April which is somewhat soon) but rather
for the next version.


> Below is a patch (also attached) that introduces the option
> 'check.condition' such that when TRUE, 

ouch ouch ouch!   There are many sayings starting with
  "The way to hell "

Here:

The way to R hell starts (or "widens", your choice) by
introducing options() that influence basic language semantics

!!

For robust code you will start to test all code of R for all
different possible combinations of these options being set.  I am
sure you would not want this.

No --- don't even think of allowing an option for something so basic!

Martin Maechler
ETH Zurich (and R Core)

> it will generate an error
> rather than a warning (default).  This option allows for a smooth
> migration as it can be added to 'R CMD check --as-cran' and developers
> can be given time to check and fix their packages.  Eventually,
> check.condition=TRUE can become the new default.

> With options(check.condition = TRUE), one gets:

>> x <- 1:2
>> if (x == 1) message("x == 1")
> Error in if (x == 1) message("x == 1") : the condition has length > 1

> and

>> while (x < 2) message("x < 2")
> Error in while (x < 2) message("x < 2") : the condition has length > 1


> Index: src/library/base/man/options.Rd
> ===
> --- src/library/base/man/options.Rd (revision 72298)
> +++ src/library/base/man/options.Rd (working copy)
> @@ -86,6 +86,11 @@
> vector (atomic or \code{\link{list}}) is extended, by something
> like \code{x <- 1:3; x[5] <- 6}.}

> +\item{\code{check.condition}:}{logical, defaulting to \code{FALSE}.  If
> +  \code{TRUE}, an error is produced whenever the condition to an
> +  \code{if} or a \code{while} control statement is of length greater
> +  than one.  If \code{FALSE}, a \link{warning} is produced.}
> +
> \item{\code{CBoundsCheck}:}{logical, controlling whether
> \code{\link{.C}} and \code{\link{.Fortran}} make copies to check for
> array over-runs on the atomic vector arguments.
> @@ -445,6 +450,7 @@
> \tabular{ll}{
> \code{add.smooth} \tab \code{TRUE}\cr
> \code{check.bounds} \tab \code{FALSE}\cr
> +\code{check.condition} \tab \code{FALSE}\cr
> \code{continue} \tab \code{"+ "}\cr
> \code{digits} \tab \code{7}\cr
> \code{echo} \tab \code{TRUE}\cr
> Index: src/library/utils/R/completion.R
> ===
> --- src/library/utils/R/completion.R (revision 72298)
> +++ src/library/utils/R/completion.R (working copy)
> @@ -1304,8 +1304,8 @@
> "plt", "ps", "pty", "smo", "srt", "tck", "tcl", "usr",
> "xaxp", "xaxs", "xaxt", "xpd", "yaxp", "yaxs", "yaxt")

> -options <- c("add.smooth", "browser", "check.bounds", "continue",
> - "contrasts", "defaultPackages", "demo.ask", "device",
> +options <- c("add.smooth", "browser", "check.bounds", "check.condition",
> +"continue", "contrasts", "defaultPackages", "demo.ask", "device",
> "digits", "dvipscmd", "echo", "editor", "encoding",
> "example.ask", "expressions", "help.search.types",
> "help.try.all.packages", "htmlhelp", "HTTPUserAgent",
> Index: src/main/eval.c
> ===
> --- src/main/eval.c (revision 72298)
> +++ src/main/eval.c (working copy)
> @@ -1851,9 +1851,13 @@
> Rboolean cond = NA_LOGICAL;

> if (length(s) > 1) {
> + int check = asInteger(GetOption1(install("check.condition")));
> PROTECT(s); /* needed as per PR#15990.  call gets protected by
> warningcall() */
> - warningcall(call,
> -_("the condition has length > 1 and only the first 

Re: [Rd] Bug in nlm()

2017-03-03 Thread Martin Maechler
> Boehnstedt, Marie 
> on Fri, 3 Mar 2017 10:23:12 + writes:

> Dear all,
> I have found a bug in nlm() and would like to submit a report on this.
> Since nlm() is in the stats package, which is maintained by the R Core
> team, bug reports should be submitted to R's Bugzilla. However, I'm not a
> member of Bugzilla. Could anyone be so kind as to add me to R's Bugzilla
> members or let me know to whom I should send the bug report?

Dear Marie,

I can do this ... but  are you really sure?  There is
 https://www.r-project.org/bugs.html
which you should spend some time reading if you haven't already.

I think you would post a MRE (Minimal Reproducible Example) here
{or on stackoverflow or ...} if you'd follow what the 'R bugs' web
page (above) recommends and only report a bug after some
feedback from "the public".

Of course, I could be wrong... and would be happy if you explain / tell me why.

Best,
Martin Maechler

> Thank you in advance.

> Kind regards,
> Marie Böhnstedt


> Marie Böhnstedt, MSc
> Research Scientist
> Max Planck Institute for Demographic Research
> Konrad-Zuse-Str. 1, 18057 Rostock, Germany
> www.demogr.mpg.de






Re: [Bioc-devel] any interest in a BiocMatrix core package?

2017-03-03 Thread Kasper Daniel Hansen
On Fri, Mar 3, 2017 at 10:22 AM, Vincent Carey 
wrote:

>
>
> On Fri, Mar 3, 2017 at 10:07 AM, Kasper Daniel Hansen <
> kasperdanielhan...@gmail.com> wrote:
>
>> Some comments on Aaron's stuff
>>
>> One possibility for doing things like this is if your code can be done in
>> C++ using a subset of rows or columns.  That can sometimes give the
>> necessary speed-up.  What I mean is this:
>>
>> Say you can safely process 1000 cells (not matrix cells, but biological
>> cells, aka columns) at a time in RAM
>>
>> iterate in R:
>>   get chunk i containing 1000 cells from the backend data storage
>>   do something on this sub matrix where everything is in a normal matrix
>> and you just use C++
>>   write results out to whatever backend you're using
>>
>> Then, with a million cells you iterate over 1000 chunks in R.  And you
>> don't need to "touch" the full dataset which can be stored on an arbitrary
>> backend.
>>
>
> you "touch" it, but you never ingest the whole thing at any time, is that
> what you mean?
>

Yes, you load the chunk into RAM and then just deal with it.

Think of doing 10^10 linear models.  If this was 10^6 I would just use
lmFit.  But 10^10 doesn't fit into memory.  So I load 10^7 into memory, run
lmFit, store results, redo.  This is bound to be much more efficient than
loading a single row into memory and doing lm 10^10 times, because lmFit is
written to do many linear models at the same time.
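
In code, the idea is roughly this (a sketch only; fetch_rows() and
store_fit() are hypothetical accessors for whatever backend is used, and
'design' is assumed to be the model matrix):

library(limma)
n_models   <- 1e10  # total number of linear models (rows)
chunk_size <- 1e7   # rows that fit comfortably in memory
for (i in seq_len(n_models / chunk_size)) {
  rows <- ((i - 1) * chunk_size + 1):(i * chunk_size)
  m    <- fetch_rows(backend, rows)  # ordinary matrix, chunk_size x n_samples
  fit  <- lmFit(m, design)           # limma fits all rows in one call
  store_fit(fit, rows)               # store results, redo with next chunk
}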

I am suggesting that this is a potential general strategy.


>> And this approach could be run even (potentially) with different chunks on
>> different nodes.
>>
>
> that seems to me to be an important if not essential desideratum.
>
> what then is the role of C++?  extracting a chunk?  preexisting utilities?
>

When I say C++ I just mean write an efficient implementation that works on
a chunk, like lmFit.  It is true that anything that works on a chunk will
work on a single row/column (like lmFit) but there are possibilities for
optimization when you work at the chunk level.

>> Obviously not all computations can be done chunkwise.  But for those that
>> can, this is a strategy which is independent of the data backend.
>>
>
> I wonder whether this "obviously not" needs to be rethought.  Algorithms
> that are implemented to work with data holistically may need
> to be reexpressed so that they can succeed with chunkwise access.  Is this
> a new mindset needed for holist developers, or can the
> effective data decompositions occur autonomously?
>

Well, I would say it is obvious that not all computations can be done
chunkwise.  But of course, in the limit of extremely large data, algorithms
which need to cycle over everything no longer scale.  So in that case all
practical computations can be done chunkwise, out of necessity.  For single
cell right now, where it is just millions of cells on the horizon, people
will think that they can get "standard" holistic approaches to work (and
that is probably true).  If they had a billion cells they probably wouldn't
think about that.

Kasper

>> If you need direct access to the data in the backend in C++, it will be
>> extremely backend dependent what is fast and how to do it.  That doesn't
>> mean we shouldn't do it though.
>>
>> Best,
>> Kasper
>>
>>
>>
>> On Fri, Mar 3, 2017 at 6:47 AM, Vincent Carey wrote:
>>
>>> Kylie, thanks for reminding us of matter -- I saw you speak about this at
>>> the first Bioconductor Boston Meetup, but it
>>> went like lightning.   For developers contemplating an approach to
>>> representing high-volume rectangular data,
>>> where there is no dominant legacy format, it is natural to wonder whether
>>> HDF5 would be adequate, and,
>>> further, to wonder how to demonstrate that it is or is not dominated by
>>> some other approach for a given set
>>> of tasks.  Should we devise a set of bioinformatic benchmark problems to
>>> foster comparison and informed
>>> decisionmaking?  @becker.gabe: is ALTREP far enough along that one could
>>> contemplate benchmarking with it?
>>>
>>> On Fri, Feb 24, 2017 at 7:08 PM, Bemis, Kylie 
>>> wrote:
>>>
>>> > It’s not there yet, but I plan to expose a C++ API for my disk-backed
>>> > matrix objects in the next version of my ‘matter’ package.
>>> >
>>> > It’s getting easier to interchange matter/HDF5Array/bigmemory/etc.
>>> > objects at the R level, especially if using a frontend like
>>> DelayedArray on
>>> > top of them, but it would be nice to have a common C++ API that I could
>>> > hook into as well (a la Rcpp), so new C/C++ could be re-used across
>>> various
>>> > backends more easily.
>>> >
>>> > Kylie
>>> >
>>> > ~~~
>>> > Kylie Ariel Bemis
>>> > Future Faculty Fellow
>>> > College of Computer and Information Science
>>> > Northeastern University
>>> > kuwisdelu.github.io
>>> >
>>> >
>>> >
>>> >
>>> > On Feb 24, 2017, at 4:50 PM, Aaron Lun wrote:

Re: [Bioc-devel] any interest in a BiocMatrix core package?

2017-03-03 Thread Vincent Carey
On Fri, Mar 3, 2017 at 10:07 AM, Kasper Daniel Hansen <
kasperdanielhan...@gmail.com> wrote:

> Some comments on Aaron's stuff
>
> One possibility for doing things like this is if your code can be done in
> C++ using a subset of rows or columns.  That can sometimes give the
> necessary speed-up.  What I mean is this:
>
> Say you can safely process 1000 cells (not matrix cells, but biological
> cells, aka columns) at a time in RAM
>
> iterate in R:
>   get chunk i containing 1000 cells from the backend data storage
>   do something on this sub matrix where everything is in a normal matrix
> and you just use C++
>   write results out to whatever backend you're using
>
> Then, with a million cells you iterate over 1000 chunks in R.  And you
> don't need to "touch" the full dataset which can be stored on an arbitrary
> backend.
>

you "touch" it, but you never ingest the whole thing at any time, is that
what you mean?


> And this approach could be run even (potentially) with different chunks on
> different nodes.
>

that seems to me to be an important if not essential desideratum.

what then is the role of C++?  extracting a chunk?  preexisting utilities?


>
> Obviously not all computations can be done chunkwise.  But for those that
> can, this is a strategy which is independent of the data backend.
>

I wonder whether this "obviously not" needs to be rethought.  Algorithms
that are implemented to work with data holistically may need
to be reexpressed so that they can succeed with chunkwise access.  Is this
a new mindset needed for holist developers, or can the
effective data decompositions occur autonomously?


>
> If you need direct access to the data in the backend in C++, it will be
> extremely backend dependent what is fast and how to do it.  That doesn't
> mean we shouldn't do it though.
>
> Best,
> Kasper
>
>
>
> On Fri, Mar 3, 2017 at 6:47 AM, Vincent Carey 
> wrote:
>
>> Kylie, thanks for reminding us of matter -- I saw you speak about this at
>> the first Bioconductor Boston Meetup, but it
>> went like lightning.   For developers contemplating an approach to
>> representing high-volume rectangular data,
>> where there is no dominant legacy format, it is natural to wonder whether
>> HDF5 would be adequate, and,
>> further, to wonder how to demonstrate that it is or is not dominated by
>> some other approach for a given set
>> of tasks.  Should we devise a set of bioinformatic benchmark problems to
>> foster comparison and informed
>> decisionmaking?  @becker.gabe: is ALTREP far enough along that one could
>> contemplate benchmarking with it?
>>
>> On Fri, Feb 24, 2017 at 7:08 PM, Bemis, Kylie 
>> wrote:
>>
>> > It’s not there yet, but I plan to expose a C++ API for my disk-backed
>> > matrix objects in the next version of my ‘matter’ package.
>> >
>> > It’s getting easier to interchange matter/HDF5Array/bigmemory/etc.
>> > objects at the R level, especially if using a frontend like
>> DelayedArray on
>> > top of them, but it would be nice to have a common C++ API that I could
>> > hook into as well (a la Rcpp), so new C/C++ could be re-used across
>> various
>> > backends more easily.
>> >
>> > Kylie
>> >
>> > ~~~
>> > Kylie Ariel Bemis
>> > Future Faculty Fellow
>> > College of Computer and Information Science
>> > Northeastern University
>> > kuwisdelu.github.io
>> >
>> >
>> >
>> >
>> > On Feb 24, 2017, at 4:50 PM, Aaron Lun wrote:
>> >
>> > It's a good place to start, though it would be very handy to have a
>> C(++)
>> > API that can be linked against. I'm not sure how much work that would
>> > entail but it would give downstream developers a lot more options. Sort
>> of
>> > like how we can link to Rhtslib, which speeds up a lot of BAM file
>> > processing, instead of just relying on Rsamtools.
>> >
>> >
>> > -Aaron
>> >
>> > 
>> > From: Tim Triche, Jr.
>> > Sent: Saturday, 25 February 2017 8:34:58 AM
>> > To: Aaron Lun
>> > Cc: bioc-devel@r-project.org
>> > Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
>> >
>> > yes
>> >
>> > the DelayedArray framework that handles HDF5Array, etc. seems like the
>> > right choice?
>> >
>> > --t
>> >
>> > On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun wrote:
>> > Hi everyone,
>> >
>> > I just attended the Human Cell Atlas meeting in Stanford, and people
>> were
>> > talking about gene expression matrices for >1 million cells. If we
>> assume
>> > that we can get non-zero expression profiles for ~5000 genes, we'd be
>> > talking about a 5000 x 1 million matrix for the raw count data. This
>> would
>> > be 20-40 GB in size, which would clearly benefit from sparse (via
>> Matrix)
>> > or disk-backed representations (bigmatrix, 

Re: [Bioc-devel] any interest in a BiocMatrix core package?

2017-03-03 Thread Kasper Daniel Hansen
Some comments on Aaron's stuff

One possibility for doing things like this is if your code can be done in
C++ using a subset of rows or columns.  That can sometimes give the
necessary speed-up.  What I mean is this:

Say you can safely process 1000 cells (not matrix cells, but biological
cells, aka columns) at a time in RAM

iterate in R:
  get chunk i containing 1000 cells from the backend data storage
  do something on this sub matrix where everything is in a normal matrix
and you just use C++
  write results out to whatever backend you're using

Then, with a million cells you iterate over 1000 chunks in R.  And you
don't need to "touch" the full dataset which can be stored on an arbitrary
backend.  And this approach could be run even (potentially) with different
chunks on different nodes.
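
Concretely, the loop might look like this (a sketch; read_chunk(),
process_chunk() and write_chunk() are hypothetical stand-ins for the
backend accessors and the C++-backed kernel):

n_cells    <- 1e6
chunk_size <- 1000
for (i in seq_len(n_cells / chunk_size)) {
  cols <- ((i - 1) * chunk_size + 1):(i * chunk_size)
  m    <- read_chunk(backend, cols)  # ordinary matrix in RAM
  res  <- process_chunk(m)           # C++ does the real work on the chunk
  write_chunk(backend, cols, res)    # results go back to the backend
}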

Obviously not all computations can be done chunkwise.  But for those that
can, this is a strategy which is independent of the data backend.

If you need direct access to the data in the backend in C++, it will be
extremely backend dependent what is fast and how to do it.  That doesn't
mean we shouldn't do it though.

Best,
Kasper



On Fri, Mar 3, 2017 at 6:47 AM, Vincent Carey 
wrote:

> Kylie, thanks for reminding us of matter -- I saw you speak about this at
> the first Bioconductor Boston Meetup, but it
> went like lightning.   For developers contemplating an approach to
> representing high-volume rectangular data,
> where there is no dominant legacy format, it is natural to wonder whether
> HDF5 would be adequate, and,
> further, to wonder how to demonstrate that it is or is not dominated by
> some other approach for a given set
> of tasks.  Should we devise a set of bioinformatic benchmark problems to
> foster comparison and informed
> decisionmaking?  @becker.gabe: is ALTREP far enough along that one could
> contemplate benchmarking with it?
>
> On Fri, Feb 24, 2017 at 7:08 PM, Bemis, Kylie 
> wrote:
>
> > It’s not there yet, but I plan to expose a C++ API for my disk-backed
> > matrix objects in the next version of my ‘matter’ package.
> >
> > It’s getting easier to interchange matter/HDF5Array/bigmemory/etc.
> > objects at the R level, especially if using a frontend like DelayedArray
> on
> > top of them, but it would be nice to have a common C++ API that I could
> > hook into as well (a la Rcpp), so new C/C++ could be re-used across
> various
> > backends more easily.
> >
> > Kylie
> >
> > ~~~
> > Kylie Ariel Bemis
> > Future Faculty Fellow
> > College of Computer and Information Science
> > Northeastern University
> > kuwisdelu.github.io
> >
> >
> >
> >
> > On Feb 24, 2017, at 4:50 PM, Aaron Lun wrote:
> >
> > It's a good place to start, though it would be very handy to have a C(++)
> > API that can be linked against. I'm not sure how much work that would
> > entail but it would give downstream developers a lot more options. Sort
> of
> > like how we can link to Rhtslib, which speeds up a lot of BAM file
> > processing, instead of just relying on Rsamtools.
> >
> >
> > -Aaron
> >
> > 
> > From: Tim Triche, Jr.
> > Sent: Saturday, 25 February 2017 8:34:58 AM
> > To: Aaron Lun
> > Cc: bioc-devel@r-project.org
> > Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
> >
> > yes
> >
> > the DelayedArray framework that handles HDF5Array, etc. seems like the
> > right choice?
> >
> > --t
> >
> > On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun wrote:
> > Hi everyone,
> >
> > I just attended the Human Cell Atlas meeting in Stanford, and people were
> > talking about gene expression matrices for >1 million cells. If we assume
> > that we can get non-zero expression profiles for ~5000 genes, we'd be
> > talking about a 5000 x 1 million matrix for the raw count data. This
> would
> > be 20-40 GB in size, which would clearly benefit from sparse (via Matrix)
> > or disk-backed representations (bigmatrix, BufferedMatrix, rhdf5, etc.).
> >
> > I'm wondering whether there is any appetite amongst us for making a
> > consistent BioC API to handle these matrices, sort of like what
> > BiocParallel does for multicore and snow. It goes without saying that the
> > different matrix representations should have consistent functions at the
> R
> > level (rbind/cbind, etc.) but it would also be nice to have an integrated
> > C/C++ API (accessible via LinkedTo). There are many non-trivial things that
> > can be done with this type of data, and it is often faster and more
> memory
> > efficient to do these complex operations in compiled code.
> >
> > I was thinking of something where you could supply any supported matrix
> > representation to a registered function via .Call; the C++ constructor
> > would recognise 

[Rd] Bug in nlm()

2017-03-03 Thread Boehnstedt, Marie
Dear all,

I have found a bug in nlm() and would like to submit a report on this.
Since nlm() is in the stats package, which is maintained by the R Core team,
bug reports should be submitted to R's Bugzilla. However, I'm not a member of
Bugzilla. Could anyone be so kind as to add me to R's Bugzilla members or let
me know to whom I should send the bug report?
Thank you in advance.

Kind regards,
Marie Böhnstedt


Marie Böhnstedt, MSc
Research Scientist
Max Planck Institute for Demographic Research
Konrad-Zuse-Str. 1, 18057 Rostock, Germany
www.demogr.mpg.de





Re: [Bioc-devel] any interest in a BiocMatrix core package?

2017-03-03 Thread Vincent Carey
Kylie, thanks for reminding us of matter -- I saw you speak about this at
the first Bioconductor Boston Meetup, but it
went like lightning.   For developers contemplating an approach to
representing high-volume rectangular data,
where there is no dominant legacy format, it is natural to wonder whether
HDF5 would be adequate, and,
further, to wonder how to demonstrate that it is or is not dominated by
some other approach for a given set
of tasks.  Should we devise a set of bioinformatic benchmark problems to
foster comparison and informed
decisionmaking?  @becker.gabe: is ALTREP far enough along that one could
contemplate benchmarking with it?

On Fri, Feb 24, 2017 at 7:08 PM, Bemis, Kylie 
wrote:

> It’s not there yet, but I plan to expose a C++ API for my disk-backed
> matrix objects in the next version of my ‘matter’ package.
>
> It’s getting easier to interchange matter/HDF5Array/bigmemory/etc.
> objects at the R level, especially if using a frontend like DelayedArray on
> top of them, but it would be nice to have a common C++ API that I could
> hook into as well (a la Rcpp), so new C/C++ could be re-used across various
> backends more easily.
>
> Kylie
>
> ~~~
> Kylie Ariel Bemis
> Future Faculty Fellow
> College of Computer and Information Science
> Northeastern University
> kuwisdelu.github.io
>
>
>
>
> On Feb 24, 2017, at 4:50 PM, Aaron Lun wrote:
>
> It's a good place to start, though it would be very handy to have a C(++)
> API that can be linked against. I'm not sure how much work that would
> entail but it would give downstream developers a lot more options. Sort of
> like how we can link to Rhtslib, which speeds up a lot of BAM file
> processing, instead of just relying on Rsamtools.
>
>
> -Aaron
>
> 
> From: Tim Triche, Jr.
> Sent: Saturday, 25 February 2017 8:34:58 AM
> To: Aaron Lun
> Cc: bioc-devel@r-project.org
> Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?
>
> yes
>
> the DelayedArray framework that handles HDF5Array, etc. seems like the
> right choice?
>
> --t
>
> On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun wrote:
> Hi everyone,
>
> I just attended the Human Cell Atlas meeting in Stanford, and people were
> talking about gene expression matrices for >1 million cells. If we assume
> that we can get non-zero expression profiles for ~5000 genes, we'd be
> talking about a 5000 x 1 million matrix for the raw count data. This would
> be 20-40 GB in size, which would clearly benefit from sparse (via Matrix)
> or disk-backed representations (bigmatrix, BufferedMatrix, rhdf5, etc.).
>
> I'm wondering whether there is any appetite amongst us for making a
> consistent BioC API to handle these matrices, sort of like what
> BiocParallel does for multicore and snow. It goes without saying that the
> different matrix representations should have consistent functions at the R
> level (rbind/cbind, etc.) but it would also be nice to have an integrated
> C/C++ API (accessible via LinkedTo). There are many non-trivial things that
> can be done with this type of data, and it is often faster and more memory
> efficient to do these complex operations in compiled code.
>
> I was thinking of something where you could supply any supported matrix
> representation to a registered function via .Call; the C++ constructor
> would recognise the type of matrix during class instantiation; and
> operations (row/column/random read access, also possibly various ways of
> writing a matrix) would be overloaded and behave as required for the class.
> Only the implementation of the API would need to care about the nitty
> gritty of each representation, and we would all be free to write code that
> actually does the interesting analytical stuff.
>
> Anyway, just throwing some thoughts out there. Any comments appreciated.
>
> Cheers,
>
> Aaron
>

[Rd] Control statements with condition with greater than one should give error (not just warning) [PATCH]

2017-03-03 Thread Henrik Bengtsson
I'd like to propose that whenever the length of the condition passed
to an if or a while statement differs from one, an error is produced
rather than just a warning as today:

> x <- 1:2
> if (x == 1) message("x == 1")
x == 1
Warning message:
In if (x == 1) message("x == 1") :
  the condition has length > 1 and only the first element will be used

There are probably legacy reasons for why this is accepted by R in the
first place, but I cannot imagine that anyone wants to use an if/while
statement this way on purpose.  The warning about this misuse was
introduced in November 2002 (R-devel thread 'vector arguments to
if()'; https://stat.ethz.ch/pipermail/r-devel/2002-November/025537.html).

Below is a patch (also attached) that introduces the option
'check.condition' such that when TRUE, it will generate an error
rather than a warning (default).  This option allows for a smooth
migration as it can be added to 'R CMD check --as-cran' and developers
can be given time to check and fix their packages.  Eventually,
check.condition=TRUE can become the new default.

With options(check.condition = TRUE), one gets:

> x <- 1:2
> if (x == 1) message("x == 1")
Error in if (x == 1) message("x == 1") : the condition has length > 1

and

> while (x < 2) message("x < 2")
Error in while (x < 2) message("x < 2") : the condition has length > 1


Index: src/library/base/man/options.Rd
===
--- src/library/base/man/options.Rd (revision 72298)
+++ src/library/base/man/options.Rd (working copy)
@@ -86,6 +86,11 @@
   vector (atomic or \code{\link{list}}) is extended, by something
   like \code{x <- 1:3; x[5] <- 6}.}

+\item{\code{check.condition}:}{logical, defaulting to \code{FALSE}.  If
+  \code{TRUE}, an error is produced whenever the condition to an
+  \code{if} or a \code{while} control statement is of length greater
+  than one.  If \code{FALSE}, a \link{warning} is produced.}
+
 \item{\code{CBoundsCheck}:}{logical, controlling whether
   \code{\link{.C}} and \code{\link{.Fortran}} make copies to check for
   array over-runs on the atomic vector arguments.
@@ -445,6 +450,7 @@
   \tabular{ll}{
 \code{add.smooth} \tab \code{TRUE}\cr
 \code{check.bounds} \tab \code{FALSE}\cr
+\code{check.condition} \tab \code{FALSE}\cr
 \code{continue} \tab \code{"+ "}\cr
 \code{digits} \tab \code{7}\cr
 \code{echo} \tab \code{TRUE}\cr
Index: src/library/utils/R/completion.R
===
--- src/library/utils/R/completion.R (revision 72298)
+++ src/library/utils/R/completion.R (working copy)
@@ -1304,8 +1304,8 @@
   "plt", "ps", "pty", "smo", "srt", "tck", "tcl", "usr",
   "xaxp", "xaxs", "xaxt", "xpd", "yaxp", "yaxs", "yaxt")

-options <- c("add.smooth", "browser", "check.bounds", "continue",
- "contrasts", "defaultPackages", "demo.ask", "device",
+options <- c("add.smooth", "browser", "check.bounds", "check.condition",
+"continue", "contrasts", "defaultPackages", "demo.ask", "device",
  "digits", "dvipscmd", "echo", "editor", "encoding",
  "example.ask", "expressions", "help.search.types",
  "help.try.all.packages", "htmlhelp", "HTTPUserAgent",
Index: src/main/eval.c
===
--- src/main/eval.c (revision 72298)
+++ src/main/eval.c (working copy)
@@ -1851,9 +1851,13 @@
 Rboolean cond = NA_LOGICAL;

 if (length(s) > 1) {
+ int check = asInteger(GetOption1(install("check.condition")));
  PROTECT(s); /* needed as per PR#15990.  call gets protected by
warningcall() */
- warningcall(call,
-_("the condition has length > 1 and only the first element will be used"));
+ if(check != NA_INTEGER && check > 0)
+errorcall(call, _("the condition has length > 1"));
+ else
+warningcall(call,
+ _("the condition has length > 1 and only the first element will be used"));
  UNPROTECT(1);
 }
 if (length(s) > 0) {
Index: src/main/options.c
===
--- src/main/options.c (revision 72298)
+++ src/main/options.c (working copy)
@@ -65,6 +65,7 @@
  * "timeout" ./connections.c

  * "check.bounds"
+ * "check.condition"
  * "error"
  * "error.messages"
  * "show.error.messages"
@@ -248,9 +249,9 @@
 char *p;

 #ifdef HAVE_RL_COMPLETION_MATCHES
+PROTECT(v = val = allocList(22));
+#else
 PROTECT(v = val = allocList(21));
-#else
-PROTECT(v = val = allocList(20));
 #endif

 SET_TAG(v, install("prompt"));
@@ -289,6 +290,10 @@
 SETCAR(v, ScalarLogical(0)); /* no checking */
 v = CDR(v);

+SET_TAG(v, install("check.condition"));
+SETCAR(v, ScalarLogical(0)); /* no checking */
+v = CDR(v);
+
 p = getenv("R_KEEP_PKG_SOURCE");
 R_KeepSource = (p && (strcmp(p, "yes") == 0)) ? 1 : 0;


I'm happy to file this via https://bugs.r-project.org, if preferred.

/Henrik
Index: 

Re: [Bioc-devel] any interest in a BiocMatrix core package?

2017-03-03 Thread Wolfgang Huber

Dear Aaron

Thank you. I think it's an important simplification of a potential API 
when you are saying that what you mostly need are accessors

  m[i, ] and m[, i]
with i scalar or a short contiguous range, such that the value could be
a relatively small ordinary matrix.  (Compared to operations
like matrix multiplication, SVD or other decompositions.)
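
For instance (a sketch, assuming m is backend-backed, e.g. a
DelayedArray, so that '[' and as.matrix() are available):

block <- as.matrix(m[, 101:200])  # 100 contiguous columns as an ordinary matrix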


Wolfgang

PS Loops per se in today's R are not as slow as some think: depending on 
the algorithm, the time "wasted" by the R interpreter on looking up 
symbols etc may (or may not) be negligible compared to the actual 
computations that are done at the C level anyway:


g = function(n) {
s = 0
for (i in seq_len(n))
s = s + i
s
}

cg = compiler::cmpfun(g)

print(system.time( g(1e6)))
   user  system elapsed
  0.161   0.000   0.161

print(system.time(cg(1e6)))
  user  system elapsed
  0.043   0.000   0.043



2.3.17 20:05, Aaron Lun scripsit:

I'll give two examples from the scran package. In both cases, the count
matrix is such that rows are genes and columns are cells. The first
example involves cell cycle phase assignment (from the cyclone()
function, FYI). Briefly, upon entry to C++, the function:

1. Loops through the cells, one at a time.
2. For each cell, it applies a classifier to the counts for that cell
(i.e., a column of the count matrix). This is not a straightforward
operation and also involves a number of random permutations.
3. Returns a set of scores representing the phase assignment.

For a few cells, I could conceivably move the loop into R and just
supply the column counts for each cell via .Call, which would avoid the
need to interact with the matrix in C++. However, if I were to process
one million cells, the slowness of R's loops would really hurt.

The second example involves normalization using a pooling and
deconvolution algorithm (from the computeSumFactors() function). Upon
entry into C++, the function:

1. Loops through an ordered set of cells.
2. At each cell, the neighbouring set of 20-100 cells defines a sliding
window. Counts for all cells in the window are summed together to create
a pooled expression profile.
3. The pooled profile is used to obtain a size factor, by computing the
median of the ratios between the pool and a pseudo-cell.
4. This is repeated for all cells in the set (i.e., all positions of the
window). Each window corresponds to a pool; the function stores the
identity of the cells in the pool and the size factor for the pool.
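
A rough R transcription of steps 2-3 (a sketch based only on the
description above; counts is genes x cells, window holds the indices of
the pooled cells, pseudo is the pseudo-cell profile -- the real
computeSumFactors() does this in C++):

pool_size_factor <- function(counts, window, pseudo) {
    pooled <- rowSums(counts[, window, drop = FALSE])  # step 2: pooled profile
    median(pooled / pseudo)                            # step 3: median ratio
}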

The output is used to construct a linear system at the R level, which is
solved to obtain cell-specific size factors. Again, the work done within
the loop is not obviously vectorizable with standard functions.

All of the cases I work with involve processing one row or column at a
time; I generally don't do matrix operations that require random access,
at least not at the C++ level.

Another motivation for moving into C++ is the greater control over
memory management. For a decent number of cells, this can make the
difference between something being runnable or not.

Cheers,

Aaron

On 02/03/17 18:09, Wolfgang Huber wrote:

Aaron

Can you describe use cases, i.e. intended computations on these
matrices, esp. those for which C++ access is needed for?

I'm asking b/c the goals of efficient code and abstraction from how the
data are stored may be conflicting - in which case critical algorithms
may end up circumventing a prematurely defined API.

Wolfgang


25.2.17 00:37, Aaron Lun scripsit:

Yes, I think double-precision would be necessary for general use. Only
the raw count data would be integer, and even then that's not
guaranteed (e.g., if people are using kallisto or salmon for
quantification).


-Aaron



From: Vincent Carey 
Sent: Saturday, 25 February 2017 9:25 AM
To: Aaron Lun
Cc: Tim Triche, Jr.; bioc-devel@r-project.org
Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?

What is the data type for an expression value?  Is it assumed that
double precision will be needed?

On Fri, Feb 24, 2017 at 4:50 PM, Aaron Lun
wrote:
It's a good place to start, though it would be very handy to have a
C(++) API that can be linked against. I'm not sure how much work that
would entail but it would give downstream developers a lot more
options. Sort of like how we can link to Rhtslib, which speeds up a
lot of BAM file processing, instead of just relying on Rsamtools.


-Aaron


From: Tim Triche, Jr.
Sent: Saturday, 25 February 2017 8:34:58 AM
To: Aaron Lun
Cc: bioc-devel@r-project.org
Subject: Re: [Bioc-devel] any interest in a BiocMatrix core package?

yes

the DelayedArray framework that handles HDF5Array, etc. seems like the
right choice?

--t

On Fri, Feb 24, 2017 at 1:26 PM, Aaron Lun