Re: [Rd] sum() returns NA on a long *logical* vector when nb of TRUE values exceeds 2^31

2018-02-05 Thread Martin Maechler
> Martin Maechler 
> on Thu, 1 Feb 2018 16:34:04 +0100 writes:

> > Hervé Pagès 
> > on Tue, 30 Jan 2018 13:30:18 -0800 writes:
> 
> > Hi Martin, Henrik,
> > Thanks for the follow up.
> 
> > @Martin: I vote for 2) without *any* hesitation :-)
> 
> > (and uniformity could be restored at some point in the
> > future by having prod(), rowSums(), colSums(), and others
> > align with the behavior of length() and sum())
> 
> As a matter of fact, I had procrastinated and worked at
> implementing '2)' already a bit on the weekend and made it work
> - more or less.  It needs a bit more work, and I had also been considering
> replacing the numbers in the current overflow check
> 
>   if (ii++ > 1000) {                                          \
>       ii = 0;                                                 \
>       if (s > 9000000000000000L || s < -9000000000000000L) {  \
>           if (!updated) updated = TRUE;                       \
>           *value = NA_INTEGER;                                \
>           warningcall(call, _("integer overflow - use sum(as.numeric(.))")); \
>           return updated;                                     \
>       }                                                       \
>   }                                                           \
> 
> i.e., I thought of tweaking the '1000' and '9000000000000000L',
> but decided to leave these and add comments there about why, for
> the moment.
> They may look arbitrary, but are not at all: If you multiply
> them (which looks correct, if we check the sum 's' only every 1000-th
> time ...((still not sure they *are* correct))) you get  9*10^18
> which is only slightly smaller than  2^63 - 1 which may be the
> maximal "LONG_INT" integer we have.
> 
> So, in the end, at least for now, we do not quite go all the way
> but overflow a bit earlier,... but do potentially gain a bit of
> speed, notably with the ITERATE_BY_REGION(..) macros
> (which I did not show above).
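> The headroom arithmetic behind those two constants can be checked
> directly. A minimal sketch (Python here, just to verify the numbers,
> not R's actual C code):

```python
# Verify the headroom claim: checking the accumulator only every 1000th
# addition against a 9e15 guard keeps the product of the two bounds
# (9 * 10^18) just under the int64 maximum, 2^63 - 1.
CHECK_EVERY = 1000                 # the '1000' in the macro
GUARD = 9_000_000_000_000_000      # the 9e15 guard threshold
INT64_MAX = 2**63 - 1              # 9223372036854775807

product = CHECK_EVERY * GUARD
print(product)                     # 9000000000000000000
print(product < INT64_MAX)         # True: slightly below 2^63 - 1
```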
> 
> Will hopefully become available in R-devel real soon now.
>
> Martin

After finishing that... I challenged myself that one should be able to do
better, namely "no overflow" (because of large/many
integer/logical), and so introduced  irsum()  which uses a double 
precision accumulator for integer/logical  ... but would really
only be used when the 64-bit int accumulator would get close to
overflow.
The resulting code is not really beautiful, and also contains a
comment " (a waste, rare; FIXME ?) ".
If anybody feels like finding a more elegant version without the
"waste" case, go ahead and be our guest ! 
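For readers without the C source at hand, the scheme can be sketched as
follows. This is a hypothetical Python model (the name `irsum_sketch`
and its details are invented here for illustration), not the code that
went into R:

```python
# Hypothetical model of the irsum() idea: sum exactly in a 64-bit-style
# accumulator, and spill into a double-precision accumulator only once
# the running sum approaches the overflow guard.
GUARD = 9_000_000_000_000_000   # same 9e15 guard as the integer code path

def irsum_sketch(xs):
    s = 0              # plays the role of the exact LONG_INT accumulator
    dsum = 0.0         # double accumulator, used only near overflow
    use_double = False
    for ii, x in enumerate(xs, start=1):
        if use_double:
            dsum += x
        else:
            s += x
            # like the macro: check only every 1000th addition
            if ii % 1000 == 0 and not (-GUARD <= s <= GUARD):
                dsum, use_double = float(s), True   # spill and switch over
    return dsum if use_double else s
```

Once the spill happens, all further additions go through the double
accumulator, so the result can no longer overflow (at the usual cost of
double-precision rounding).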

Testing the code does need access to a platform with enough GB
RAM, say 32 (and I have run the checks only on servers with >
100 GB RAM). This concerns the new checks at the (current) end
of /tests/reg-large.R
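The memory requirement is easy to sanity-check with back-of-envelope
arithmetic (assuming R's 4 bytes per logical and 8 bytes per double):

```python
# Why so much RAM: reproducing the bug needs a logical vector with more
# than 2^31 TRUE values.  R stores a logical in 4 bytes and a double in
# 8, so the vector alone is ~8 GiB and any coerced double copy ~16 GiB,
# before counting other intermediate allocations.
n = 2**31 + 1
logical_gib = n * 4 / 2**30
double_gib = n * 8 / 2**30
print(round(logical_gib, 2))   # ~8.0 GiB for the logical vector
print(round(double_gib, 2))    # ~16.0 GiB for a double copy
```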

In R-devel svn rev >= 74208  for a few minutes now.

Martin

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] CRAN indices out of whack (for at least macOS)

2018-02-05 Thread Thierry Onkelinx
Another benefit of Winston's proposal is that it makes it easy to
install specific package versions from source. For the time being I'm
using a construct like
https://github.com/inbo/Rstable/blob/master/cran_install.sh to
generate a Docker image.

Best regards,

ir. Thierry Onkelinx
Statisticus / Statistician

Vlaamse Overheid / Government of Flanders
INSTITUUT VOOR NATUUR- EN BOSONDERZOEK / RESEARCH INSTITUTE FOR NATURE
AND FOREST
Team Biometrie & Kwaliteitszorg / Team Biometrics & Quality Assurance
thierry.onkel...@inbo.be
Havenlaan 88 bus 73, 1000 Brussel
www.inbo.be

///
To call in the statistician after the experiment is done may be no
more than asking him to perform a post-mortem examination: he may be
able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
The plural of anecdote is not data. ~ Roger Brinner
The combination of some data and an aching desire for an answer does
not ensure that a reasonable answer can be extracted from a given body
of data. ~ John Tukey
///




2018-02-03 20:31 GMT+01:00 Winston Chang :
> Although it may not have been the cause of this particular index
> inconsistency, there are other causes of intermittent index
> inconsistencies. They could be avoided if there were a different
> directory structure on CRAN servers.
>
> One of the causes of inconsistencies is caching. With
> cloud.r-project.org (note that this is not cran.r-project.org),
> there is a CDN in front of the server; the CDN has caching endpoints
> around the world, and will serve files to the user from the nearest
> endpoint.
>
> The cache timeout for each file is 30 minutes. Suppose a user
> downloads file X from some endpoint at 1:00. If the endpoint doesn't
> already have X in the cache, then it will fetch the file from the
> server, and then send it to the user. The endpoint will consider the
> cached file valid until 1:30. If another user requests X at 1:20, the
> endpoint will serve up the file from its cache without checking with
> the server. If someone requests X at 1:40, the endpoint will check
> with the server to see if its cached version is still valid (and
> download an updated version if necessary), then it will send the file
> to the user.
>
> Because the caching is on a per-file basis, this can lead to a
> situation where the PACKAGES file served by an endpoint is out of sync
> with the .tgz package files. Imagine this scenario:
>
> 1:00 Someone downloads PACKAGES. It is not yet in the endpoint's
> cache, so it fetches it from the server. This version of PACKAGES says
> that the current version of PkgA is 1.0.
> 1:10 The server performs an rsync from the central CRAN mirror. It
> gets an updated version of PACKAGES, which says that the current
> version of PkgA is 2.0. The rsync also removes the PkgA_1.0.tgz file
> and adds PkgA_2.0.tgz.
> 1:20 Someone else wants to install PkgA, so their R session first
> downloads PACKAGES, which points to PkgA_1.0.tgz. Then R tries to
> download PkgA_1.0.tgz; it is not in the endpoint's cache, so the
> endpoint tries to fetch it from the server, but the file is not
> present there so it sends a 404 missing message. The endpoint passes
> this to the R session, and the package installation fails.
>
> Anyone else who tries to install PkgA (and hits the same CDN endpoint)
> will get the same installation failure, until the cache for PACKAGES
> expires at 1:30. However, another person who happens to hit another
> endpoint may be able to install PkgA, because each endpoint does its
> caching independently.
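The failure mode above can be reproduced with a toy model of the
per-file, 30-minute endpoint cache; this Python sketch uses invented
names and contents purely for illustration:

```python
# Toy model of the per-file endpoint cache described above: each file is
# cached for 30 minutes independently, so PACKAGES and the .tgz files
# can go out of sync.  Times are minutes after 1:00.
CACHE_TTL = 30  # minutes

class Endpoint:
    def __init__(self, origin):
        self.origin = origin       # dict: filename -> content (absent = 404)
        self.cache = {}            # filename -> (content, fetched_at)

    def get(self, name, now):
        hit = self.cache.get(name)
        if hit is not None and now < hit[1] + CACHE_TTL:
            return hit[0]          # fresh: served without asking the origin
        content = self.origin.get(name)   # None models a 404 from the server
        self.cache[name] = (content, now)
        return content

origin = {"PACKAGES": "PkgA 1.0", "PkgA_1.0.tgz": "PkgA v1"}
ep = Endpoint(origin)

ep.get("PACKAGES", 0)                 # 1:00: "PkgA 1.0" enters the cache
origin.pop("PkgA_1.0.tgz")            # 1:10 rsync: old tarball removed ...
origin["PkgA_2.0.tgz"] = "PkgA v2"    # ... new one added ...
origin["PACKAGES"] = "PkgA 2.0"       # ... and PACKAGES updated

stale = ep.get("PACKAGES", 20)        # 1:20: still the cached "PkgA 1.0"
missing = ep.get("PkgA_1.0.tgz", 20)  # 1:20: cache miss, origin 404 -> None
fresh = ep.get("PACKAGES", 40)        # 1:40: cache expired, origin re-checked
print(stale, missing, fresh)          # PkgA 1.0 None PkgA 2.0
```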
>
> Something similar can happen even without a CDN, because download.packages()
> caches the contents of PACKAGES. However, that can be worked around by
> telling download.packages() to not use the cache, or by simply
> restarting R.
>
> One reason that package installations fail in these cases is that the
> current version of a package is in one directory, and the old
> (archived) versions of a package are in another directory. If current
> and old versions were in the same directory, then package installation
> would not fail.
>
>
> -Winston
>
>
>
> On Tue, Jan 30, 2018 at 1:19 PM, Dirk Eddelbuettel  wrote:
>>
>> I have received three distinct (non-)bug reports where someone claimed a
>> recent package of mine was broken ... simply because the macOS binary was not
>> there.
>>
>> Is there something wrong with the cronjob providing the indices? Why is it
>> pointing people to binaries that do not exist?
>>
>> Concretely, file
>>
>>   https://cloud.r-project.org/bin/macosx/el-capitan/contrib/3.4/PACKAGES
>>
>> contains
>>
>>   Package: digest
>>   Version: 0.6.15
>>   Title: Create Compact Hash Digests of R Objects
>>   Depends: R (>= 2.4.1)
>>   Suggests: knitr, rmarkdown
>>   Built: R 3.4.3; x86_64-apple-darwin15.6.0; 2018-01-29 05:21:06 UTC; unix
>>   Archs: digest.so.dSYM
>>
>