Re: [R-pkg-devel] [EXTERNAL] Re: CMake on CRAN Systems

2024-01-17 Thread Sameh Abdulah
Thank you, Ivan and everyone else, for your help. We are working on modifying 
the package in line with your suggestions.

Regarding the Rcpp template we previously advertised, we are updating it to align 
better with the rules and constraints of R-exts. If you believe the template could 
still be a useful example, we would welcome the community's contributions to making 
it better tailored to the CRAN and R-exts rules; if not, please just ignore it.

Again, thanks for helping.


Best,
--Sameh

From: R-package-devel  on behalf of Reed 
A. Cartwright 
Date: Thursday, January 18, 2024 at 1:59 AM
To: R Package Development 
Subject: [EXTERNAL] Re: [R-pkg-devel] CMake on CRAN Systems
I think this is the same group that advertised an R package template a
while back that also clearly didn't follow R-exts rules or use any of
the best practices mentioned on this mailing list.

https://github.com/stsds/Template-Rcpp


On Wed, Jan 17, 2024 at 3:24 PM Simon Urbanek
 wrote:
>
> I had a quick look and that package (assuming it's
> https://github.com/stsds/MPCR) does not adhere to any rules from R-exts (hence
> the removal from CRAN I presume), so the failure to detect cmake is the least
> problem. I would strongly recommend reading the R documentation, as cmake is
> just the wrong tool for the job in this case. R already has a fully working
> build system which will compile the package using the correct flags and tools -
> you only need to provide the C++ sources. You cannot generate the package
> shared object with cmake by definition - you must let R build it. [In rare
> cases, dependent static libraries are built with cmake inside the package if
> there is no other option and cmake is used upstream, but those are rare and
> you still have to use R to build the final shared object.]
>
> Cheers,
> Simon
>
>
> > On Jan 17, 2024, at 8:54 PM, Ivan Krylov via R-package-devel 
> >  wrote:
> >
> > Dear Sameh,
> >
> > Regarding your question about the MPCR package and the use of CMake:
> > on a Mac, you have to look for the cmake executable in more than one
> > place because it is not guaranteed to be on the $PATH. As described in
> > Writing R Extensions, the
> > following is one way to work around the problem:
> >
> > if test -z "$CMAKE"; then CMAKE="`which cmake`"; fi
> > if test -z "$CMAKE"; then
> > CMAKE=/Applications/CMake.app/Contents/bin/cmake;
> > fi
> > if test ! -f "$CMAKE"; then echo "no ‘cmake’ command found"; exit 1; fi
> >
> > Please don't reply to existing threads when starting a new topic on
> > mailing lists. Your message had a mangled link that went to
> > urldefense.com instead of cran-archive.r-project.org, letting Amazon
> > (who host the website) know about every visit to the link:
> > https://stat.ethz.ch/pipermail/r-package-devel/2024q1/010328.html
> >
> > --
> > Best regards,
> > Ivan
> >
> > __
> > R-package-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-package-devel
> >
>
> __
> R-package-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel

Re: [Rd] Determining the size of a package

2024-01-17 Thread Simon Urbanek
William,

the check does not apply to binary installations (such as the Mac builds), 
because those depend heavily on the static libraries included in the package 
binary which can be quite big and generally cannot be reduced in size - for 
example:
https://www.r-project.org/nosvn/R.check/r-release-macos-arm64/terra-00check.html
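
One rough way to see what those check machines actually install is to look at
the size of the already-published CRAN binary itself. A minimal sketch in R
(assuming the usual CRAN URL layout for macOS binaries; the compressed .tgz is
only a rough proxy for the installed size, and "terra" is just an illustrative
package name):

``` r
## Sketch: inspect the size of the published macOS arm64 binary of a package.
## The contrib URL below is an assumption about the current CRAN layout;
## adjust the R version component as needed.
repo <- "https://cloud.r-project.org/bin/macosx/big-sur-arm64/contrib/4.3"
ap   <- available.packages(contriburl = repo)
pkg  <- "terra"                     # illustrative package name
url  <- sprintf("%s/%s_%s.tgz", repo, pkg, ap[pkg, "Version"])
tf   <- tempfile(fileext = ".tgz")
download.file(url, tf, mode = "wb")
file.size(tf) / 2^20                # compressed binary size in MB
```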

Cheers,
Simon


> On Jan 18, 2024, at 12:26 PM, William Revelle  wrote:
> 
> Dear fellow developers,
> 
> Is there an easy way to determine how big my packages  (psych and psychTools) 
>  will be on various versions of CRAN?
> 
> I have been running into the dreaded "you are bigger than 5 MB" message for
> some installations of R on CRAN but not others. The particular problem seems
> to be some of the Mac versions (specifically r-oldrel-macos-arm64 and
> r-release-macos-x86_64).
>
> When I build it on my Mac M1 it is well within the limits, but when pushing
> to CRAN, I run into the size message.
>
> Is there a way I can find what the size will be on these various
> implementations without bothering the nice people at CRAN?
> 
> Thanks.
> 
> William Revelle                 personality-project.org/revelle.html
> Professor                       personality-project.org
> Department of Psychology        www.wcas.northwestern.edu/psych/
> Northwestern University         www.northwestern.edu/
> Use R for psychology            personality-project.org/r
> It is 90 seconds to midnight    www.thebulletin.org
> 
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel
> 

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


[Rd] Determining the size of a package

2024-01-17 Thread William Revelle
Dear fellow developers,

Is there an easy way to determine how big my packages  (psych and psychTools)  
will be on various versions of CRAN?

I have been running into the dreaded "you are bigger than 5 MB" message for some
installations of R on CRAN but not others. The particular problem seems to be
some of the Mac versions (specifically r-oldrel-macos-arm64 and
r-release-macos-x86_64).

When I build it on my Mac M1 it is well within the limits, but when pushing to
CRAN, I run into the size message.

Is there a way I can find what the size will be on these various
implementations without bothering the nice people at CRAN?

Thanks.

William Revelle                 personality-project.org/revelle.html
Professor                       personality-project.org
Department of Psychology        www.wcas.northwestern.edu/psych/
Northwestern University         www.northwestern.edu/
Use R for psychology            personality-project.org/r
It is 90 seconds to midnight    www.thebulletin.org

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [R-pkg-devel] CMake on CRAN Systems

2024-01-17 Thread Reed A. Cartwright
I think this is the same group that advertised an R package template a
while back that also clearly didn't follow R-exts rules or use any of
the best practices mentioned on this mailing list.

https://github.com/stsds/Template-Rcpp


On Wed, Jan 17, 2024 at 3:24 PM Simon Urbanek
 wrote:
>
> I had a quick look and that package (assuming it's
> https://github.com/stsds/MPCR) does not adhere to any rules from R-exts (hence
> the removal from CRAN I presume), so the failure to detect cmake is the least
> problem. I would strongly recommend reading the R documentation, as cmake is
> just the wrong tool for the job in this case. R already has a fully working
> build system which will compile the package using the correct flags and tools -
> you only need to provide the C++ sources. You cannot generate the package
> shared object with cmake by definition - you must let R build it. [In rare
> cases, dependent static libraries are built with cmake inside the package if
> there is no other option and cmake is used upstream, but those are rare and
> you still have to use R to build the final shared object.]
>
> Cheers,
> Simon
>
>
> > On Jan 17, 2024, at 8:54 PM, Ivan Krylov via R-package-devel 
> >  wrote:
> >
> > Dear Sameh,
> >
> > Regarding your question about the MPCR package and the use of CMake:
> > on a Mac, you have to look for the cmake executable in more than one
> > place because it is not guaranteed to be on the $PATH. As described in
> > Writing R Extensions, the
> > following is one way to work around the problem:
> >
> > if test -z "$CMAKE"; then CMAKE="`which cmake`"; fi
> > if test -z "$CMAKE"; then
> > CMAKE=/Applications/CMake.app/Contents/bin/cmake;
> > fi
> > if test ! -f "$CMAKE"; then echo "no ‘cmake’ command found"; exit 1; fi
> >
> > Please don't reply to existing threads when starting a new topic on
> > mailing lists. Your message had a mangled link that went to
> > urldefense.com instead of cran-archive.r-project.org, letting Amazon
> > (who host the website) know about every visit to the link:
> > https://stat.ethz.ch/pipermail/r-package-devel/2024q1/010328.html
> >
> > --
> > Best regards,
> > Ivan
> >
> > __
> > R-package-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-package-devel
> >
>
> __
> R-package-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] CMake on CRAN Systems

2024-01-17 Thread Simon Urbanek
I had a quick look and that package (assuming it's
https://github.com/stsds/MPCR) does not adhere to any rules from R-exts (hence
the removal from CRAN I presume), so the failure to detect cmake is the least
problem. I would strongly recommend reading the R documentation, as cmake is
just the wrong tool for the job in this case. R already has a fully working
build system which will compile the package using the correct flags and tools -
you only need to provide the C++ sources. You cannot generate the package
shared object with cmake by definition - you must let R build it. [In rare
cases, dependent static libraries are built with cmake inside the package if
there is no other option and cmake is used upstream, but those are rare and
you still have to use R to build the final shared object.]

Cheers,
Simon


> On Jan 17, 2024, at 8:54 PM, Ivan Krylov via R-package-devel 
>  wrote:
> 
> Dear Sameh,
> 
> Regarding your question about the MPCR package and the use of CMake:
> on a Mac, you have to look for the cmake executable in more than one
> place because it is not guaranteed to be on the $PATH. As described in
> Writing R Extensions, the
> following is one way to work around the problem:
> 
> if test -z "$CMAKE"; then CMAKE="`which cmake`"; fi
> if test -z "$CMAKE"; then
> CMAKE=/Applications/CMake.app/Contents/bin/cmake;
> fi
> if test ! -f "$CMAKE"; then echo "no ‘cmake’ command found"; exit 1; fi
> 
> Please don't reply to existing threads when starting a new topic on
> mailing lists. Your message had a mangled link that went to
> urldefense.com instead of cran-archive.r-project.org, letting Amazon
> (who host the website) know about every visit to the link:
> https://stat.ethz.ch/pipermail/r-package-devel/2024q1/010328.html
> 
> -- 
> Best regards,
> Ivan
> 
> __
> R-package-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel
> 

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-17 Thread Dipterix Wang


> 
> We have one in vctrs but it's not exported:
> https://github.com/r-lib/vctrs/blob/main/src/hash.c
> 
> The main use is vectorised hashing:
> 

Thanks for showing me this function. I have read the source code. That's a 
great idea. 

However, I think I might have missed something. When I tried vctrs::obj_hash, I 
couldn't get identical outputs.


``` r
options(keep.source = TRUE)
a <- function(){}
vctrs:::obj_hash(a)
#> [1] 68 e8 5a 0c
a <- function(){}
vctrs:::obj_hash(a)
#> [1] b2 6a 55 9c
a <-   function(){}
vctrs:::obj_hash(a)
#> [1] 01 a9 bc 30
options(keep.source = FALSE)
a <- function(){}
vctrs:::obj_hash(a)
#> [1] 93 d7 f2 72
a <- function(){}
vctrs:::obj_hash(a)
#> [1] f3 1d d2 f4
```

Created on 2024-01-17 with [reprex v2.1.0](https://reprex.tidyverse.org)
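
For completeness, a minimal sketch of the simpler, non-nested case, where
stripping source references does restore agreement (an illustration only; as
noted above, this does not carry over directly to nested environments with
multiple functions):

``` r
options(keep.source = TRUE)
f <- function() {}
g <- function() {}   # same body, but a different source reference
identical(serialize(f, NULL), serialize(g, NULL))
#> FALSE is expected here, because the attached srcref differs
identical(serialize(removeSource(f), NULL), serialize(removeSource(g), NULL))
#> TRUE is expected here for plain closures once the srcref is removed
```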

> 
> Best,
> Lionel
> 
> On Wed, Jan 17, 2024 at 10:32 AM Tomas Kalibera
>  wrote:
>> 
>> I think one could implement hashing on the fly without any
>> serialization, similarly to how identical works, but I am not aware of
>> any existing implementation. Again, if that wasn't clear: I don't think
>> trying to compute a hash of an object from its serialized representation
>> is a good idea - it is of course convenient, but has problems like the
>> one you have run into.
>> 
>> In some applications it may still be good enough: if by various tweaks,
>> such as ensuring source references are off in your case, you achieve a
>> state when false alarms are rare (identical objects have different
>> hashes), and hence say unnecessary re-computation is rare, maybe it is
>> good enough.

I really appreciate you answering my questions and solving my puzzles. I went back
and read the R internals code for `serialize` and totally agree that
serialization is not a good idea for digesting R objects, especially for
environments, expressions, and functions.

What I want is a function that can produce the same, stable hash for
identical objects. However, to the best of our knowledge, there is no existing
function that can do this. `digest::digest` and `rlang::hash` are the first
functions that come to mind. Both are widely used, but they use serialize.
The author of `digest` said:
> "As you know,  digest takes and (ahem) "digests" what serialize gives 
it, so you would have to look into what serialize lets you do."

vctrs:::obj_hash is probably the closest to the implementation of `identical`, 
but the above examples give different results for identical objects.

The existence of digest::digest and rlang::hash shows that there is a huge
demand for this "ideal" hash function. However, I bet most people are using
digest/hash "incorrectly".

>> 
>> Tomas
>> 


[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [R-pkg-devel] CMake on CRAN Systems

2024-01-17 Thread Matthias Gondan
Here's an example. It first checks if CMAKE_C_BYTE_ORDER is defined, which is 
available in recent versions of cmake. If it isn't, cmake's own macro 
TestBigEndian is invoked (deprecated, but still available). It would normally 
compile an executable, but we change the compile target to a static library 
(the test for endianness works anyway).

if(DEFINED CMAKE_C_BYTE_ORDER)
  if(CMAKE_C_BYTE_ORDER STREQUAL "BIG_ENDIAN")
    set(WORDS_BIGENDIAN 1)
  else()
    set(WORDS_BIGENDIAN 0)
  endif()
else()
  # From cmake docs: If CMAKE_OSX_ARCHITECTURES specifies multiple architectures,
  # the value of CMAKE_<LANG>_BYTE_ORDER is non-empty only if all architectures
  # share the same byte order.
  include(TestBigEndian)
  SET(CMAKE_TRY_COMPILE_TARGET_TYPE_SAVE ${CMAKE_TRY_COMPILE_TARGET_TYPE})
  SET(CMAKE_TRY_COMPILE_TARGET_TYPE STATIC_LIBRARY)
  TEST_BIG_ENDIAN(WORDS_BIGENDIAN)
  SET(CMAKE_TRY_COMPILE_TARGET_TYPE ${CMAKE_TRY_COMPILE_TARGET_TYPE_SAVE})
endif()
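
For a quick cross-check at the R level, the byte order of the running build is
exposed directly (run-time information only, not a substitute for the
configure-time test above):

``` r
# Byte order of the running R build: "little" or "big".
.Platform$endian
```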



> Gesendet: Mittwoch, 17. Januar 2024 um 16:52 Uhr
> Von: "Uwe Ligges" 
> An: "Matthias Gondan" , "Sameh Abdulah" 
> 
> Cc: "R Package Development" 
> Betreff: Re: [R-pkg-devel] CMake on CRAN Systems
>
>
>
> On 17.01.2024 08:59, Matthias Gondan wrote:
> > For package rswipl, cmake still seems to work, but
> >
> > * one has to search for it on MacOS, see the src/Makevars, as well as the 
> > relevant sections in Writing R extensions
> > * Windows Defender (also on CRAN) complains about dubious exe-files when 
> > checking the "endianness" of the target system. That can be circumvented by 
> > telling cmake to compile static libraries instead of executables.
>
> Indeed, currently Windows Defender gives false positives for some
> temporary .exe files that CMake creates. As the filenames and locations
> are random, there is no straightforward way to tell Defender about
> exceptions. Hence please follow the advice and tell cmake to compile
> static libraries instead of executables (an excellent idea, thanks!).
> [Microsoft has known about this for several weeks now without action.]
>
> Best,
> Uwe Ligges
>
>
>
> >
> > I am unsure if my response is specific to your problem, but the links below 
> > do not seem to work.
> >
> >> Gesendet: Mittwoch, den 17.01.2024 um 08:37 Uhr
> >> Von: "Sameh Abdulah" 
> >> An: "R Package Development" 
> >> Betreff: [R-pkg-devel] CMake on CRAN Systems
> >>
> >> Hi All,
> >>
> >> We recently encountered an installation issue with our package on CRAN. 
> >> We've been depending on CMake, assuming it is readily available by 
> >> default, but it appears to be only available on the M1mac system but not 
> >> on the others. Should we include the CMake installation within our package?
> >>
> >> We encountered another issue with OpenMP, but we managed to resolve it by 
> >> consulting the manual.
> >>
> >> https://cran-archive.r-project.org/web/checks/2024/2024-01-12_check_results_MPCR.html
> >>
> >>
> >>
> >> Best,
> >> --Sameh
> >>
> >> --
> >>
> >> This message and its contents, including attachments are intended solely
> >> for the original recipient. If you are not the intended recipient or have
> >> received this message in error, please notify me immediately and delete
> >> this message from your computer system. Any unauthorized use or
> >> distribution is prohibited. Please consider the environment before printing
> >> this email.
> >>
> >>[[alternative HTML version deleted]]
> >>
> >> __
> >> R-package-devel@r-project.org mailing list
> >> https://stat.ethz.ch/mailman/listinfo/r-package-devel
> >
> > __
> > R-package-devel@r-project.org mailing list
> > https://stat.ethz.ch/mailman/listinfo/r-package-devel
>

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] CMake on CRAN Systems

2024-01-17 Thread Uwe Ligges



On 17.01.2024 08:59, Matthias Gondan wrote:

For package rswipl, cmake still seems to work, but

* one has to search for it on MacOS, see the src/Makevars, as well as the 
relevant sections in Writing R extensions
* Windows Defender (also on CRAN) complains about dubious exe-files when checking the 
"endianness" of the target system. That can be circumvented by telling cmake to 
compile static libraries instead of executables.


Indeed, currently Windows Defender gives false positives for some 
temporary .exe files that CMake creates. As the filenames and locations 
are random, there is no straightforward way to tell Defender about 
exceptions. Hence please follow the advice and tell cmake to compile 
static libraries instead of executables (an excellent idea, thanks!). 
[Microsoft has known about this for several weeks now without action.]


Best,
Uwe Ligges





I am unsure if my response is specific to your problem, but the links below do 
not seem to work.


Gesendet: Mittwoch, den 17.01.2024 um 08:37 Uhr
Von: "Sameh Abdulah" 
An: "R Package Development" 
Betreff: [R-pkg-devel] CMake on CRAN Systems

Hi All,

We recently encountered an installation issue with our package on CRAN. We've 
been depending on CMake, assuming it is readily available by default, but it 
appears to be only available on the M1mac system but not on the others. Should 
we include the CMake installation within our package?

We encountered another issue with OpenMP, but we managed to resolve it by 
consulting the manual.

https://cran-archive.r-project.org/web/checks/2024/2024-01-12_check_results_MPCR.html



Best,
--Sameh

--

This message and its contents, including attachments are intended solely
for the original recipient. If you are not the intended recipient or have
received this message in error, please notify me immediately and delete
this message from your computer system. Any unauthorized use or
distribution is prohibited. Please consider the environment before printing
this email.

[[alternative HTML version deleted]]

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel
__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [Rd] cwilcox - new version

2024-01-17 Thread Andrew Robbins via R-devel

Hi All,

Figured I'd put my two cents in here, as the Welch lab's LIGER package 
currently uses Mann-Whitney on datasets much larger than m = 200. Our 
current version uses a modified PRESTO 
(https://github.com/immunogenomics/presto) implementation over the 
built-in tests because of the lack of scaling. I stumbled into this 
thread while working on some improvements for it and would like to make 
it known that there is absolutely an audience for the high-member use case.


Best,

-Andrew Robbins

On 1/17/2024 5:55 AM, Andreas Löffler wrote:


Performance statistics are interesting. If we assume the two populations
have a total of `m` members, then this implementation runs slightly slower
for m < 20, and much slower for 50 < m < 100. However, this implementation
works significantly *faster* for m > 200. The breakpoint is precisely when
each population has a size of 50; `qwilcox(0.5,50,50)` runs in 8
microseconds in the current version, but `qwilcox(0.5, 50, 51)` runs in 5
milliseconds. The new version runs in roughly 1 millisecond for both. This
is probably because of internal logic that requires many more `free/calloc`
calls if either population is larger than `WILCOX_MAX`, which is set to 50.


Also because cwilcox_sigma has to be evaluated, and this is slightly more
demanding since it uses k%d.

There is a tradeoff here between memory usage and time of execution. I am
not a heavy user of the U test but I think the typical use case does not
involve several hundreds of tests in a session so execution time (my 2
cents) is less important. But if R crashes one execution is already
problematic.

But the takeaway is probably: we should implement both approaches in the
code and leave it to the user which one she prefers. If time is important,
memory is not an issue, and m, n are low, go for the "traditional
approach". Otherwise, use my formula?

PS (@Aidan): I applied for a Bugzilla account two days ago and have not heard
back from them. My spam folder is also empty. Is that OK, or shall I do something?

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


--
Andrew Robbins
Systems Analyst, Welch Lab
University of Michigan
Department of Computational Medicine and Bioinformatics



__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] cwilcox - new version

2024-01-17 Thread Aidan Lakshman
Hi everyone,

I’ve opened a Bugzilla report for Andreas with the most recent implementation 
here: https://bugs.r-project.org/show_bug.cgi?id=18655. Feedback would be 
greatly appreciated.


The most straightforward approach is likely to implement both methods and 
determine which to use based on population sizes. The cutoff at n=50 is very 
sharp; it would be a large improvement to just call Andreas’s algorithm when 
either population is larger than 50, and use the current method otherwise.

For the Bugzilla report I’ve only submitted the new version for benchmarking 
purposes. I think that if there is a way to improve this algorithm such that it 
matches the performance of the current version for population sizes under 50, 
that would be significantly cleaner than having two algorithms with an 
internal switch.

As for remaining performance improvements:

1. cwilcox_sigma is definitely a performance loss. It would improve performance 
to instead just loop from 1 to min(m, sqrt(k)) and from n+1 to min(m+n, 
sqrt(k)), since the formula just finds potential factors of k. Maybe there are 
other ways to improve this, but factorization is a notoriously 
intensive problem, so further optimization may be intractable.

2. Calculation of the distribution values has quadratic scaling. Maybe there’s 
a way to optimize that further? See lines 91-103 in the most recent version.

Regardless of runtime, memory is certainly improved. For calculation on 
population sizes m,n, the current version has memory complexity O((mn)^2), 
whereas Andreas’s version has complexity O(mn). Running `qwilcox(0.5,500,500)` 
crashes my R session with the old version, but runs successfully in about 10s 
with the new version.
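
For anyone who wants to reproduce the comparison on their own build, a minimal
timing sketch (this assumes the 'microbenchmark' package is installed and
simply exercises whichever qwilcox implementation the running R binary was
built with):

``` r
library(microbenchmark)

# Timings straddling the reported size-50 cutoff; absolute numbers will vary
# by machine and by implementation.
microbenchmark(
  below_cutoff = qwilcox(0.5, 50, 50),
  above_cutoff = qwilcox(0.5, 50, 51),
  times = 20
)
```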

I’ve written up all the information so far on the Bugzilla report, and I’m sure 
Andreas will add more information if necessary when his account is approved. 
Thanks again to Andreas for introducing this algorithm—I’m hopeful that this is 
able to improve performance of the wilcox functions.

-Aidan


---
Aidan Lakshman (he/him)
PhD Candidate, Wright Lab
University of Pittsburgh School of Medicine
Department of Biomedical Informatics
www.AHL27.com
ah...@pitt.edu | (724) 612-9940

On 17 Jan 2024, at 5:55, Andreas Löffler wrote:

>>
>>
>> Performance statistics are interesting. If we assume the two populations
>> have a total of `m` members, then this implementation runs slightly slower
>> for m < 20, and much slower for 50 < m < 100. However, this implementation
>> works significantly *faster* for m > 200. The breakpoint is precisely when
>> each population has a size of 50; `qwilcox(0.5,50,50)` runs in 8
>> microseconds in the current version, but `qwilcox(0.5, 50, 51)` runs in 5
>> milliseconds. The new version runs in roughly 1 millisecond for both. This
>> is probably because of internal logic that requires many more `free/calloc`
>> calls if either population is larger than `WILCOX_MAX`, which is set to 50.
>>
> Also because cwilcox_sigma has to be evaluated, and this is slightly more
> demanding since it uses k%d.
>
> There is a tradeoff here between memory usage and time of execution. I am
> not a heavy user of the U test but I think the typical use case does not
> involve several hundreds of tests in a session so execution time (my 2
> cents) is less important. But if R crashes one execution is already
> problematic.
>
> But the takeaway is probably: we should implement both approaches in the
> code and leave it to the user which one she prefers. If time is important,
> memory is not an issue, and m, n are low, go for the "traditional
> approach". Otherwise, use my formula?
>
> PS (@Aidan): I applied for a Bugzilla account two days ago and have not heard
> back from them. My spam folder is also empty. Is that OK, or shall I do something?

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Sys.which() caching path to `which`

2024-01-17 Thread Harmen Stoppels
On Friday, January 12th, 2024 at 16:11, Ivan Krylov  wrote:

> unlike `which`, `command -v` returns names of shell builtins if
> something is both an executable and a builtin. So for things like `[`,
> Sys.which would behave differently if changed to use command -v

Then can we revisit my simple fix, which refers to `which` through a
symlink instead of a hard-coded absolute path in an R source file:

From 3f2b1b6c94460fd4d3e9f03c9f17a25db2d2b473 Mon Sep 17 00:00:00 2001
From: Harmen Stoppels 
Date: Wed, 10 Jan 2024 12:40:40 +0100
Subject: [PATCH] base: use a symlink for which instead of hard-coded string

---
 share/make/basepkg.mk | 8 
 src/library/base/R/unix/system.unix.R | 6 +++---
 2 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/share/make/basepkg.mk b/share/make/basepkg.mk
index c0a69c8a0af..4cf63878709 100644
--- a/share/make/basepkg.mk
+++ b/share/make/basepkg.mk
@@ -72,16 +72,16 @@ mkRbase:
  else \
cat $(RSRC) > "$${f}"; \
  fi; \
- f2=$${TMPDIR:-/tmp}/R2; \
- sed -e "s:@WHICH@:${WHICH}:" "$${f}" > "$${f2}"; \
- rm -f "$${f}"; \
- $(SHELL) $(top_srcdir)/tools/move-if-change "$${f2}" all.R)
+ $(SHELL) $(top_srcdir)/tools/move-if-change "$${f}" all.R)
@if ! test -f $(top_builddir)/library/$(pkg)/R/$(pkg); then \
  $(INSTALL_DATA) all.R $(top_builddir)/library/$(pkg)/R/$(pkg); \
else if test all.R -nt $(top_builddir)/library/$(pkg)/R/$(pkg); then \
  $(INSTALL_DATA) all.R $(top_builddir)/library/$(pkg)/R/$(pkg); \
  fi \
fi
+   @if ! test -f $(top_builddir)/library/$(pkg)/R/which; then \
+ cd $(top_builddir)/library/$(pkg)/R/ && $(LN_S) $(WHICH) which; \
+   fi
 
 mkdesc:
@if test -f DESCRIPTION; then \
diff --git a/src/library/base/R/unix/system.unix.R 
b/src/library/base/R/unix/system.unix.R
index 3bb7d0cb27c..78271c8c12c 100644
--- a/src/library/base/R/unix/system.unix.R
+++ b/src/library/base/R/unix/system.unix.R
@@ -114,9 +114,9 @@ system2 <- function(command, args = character(),
 Sys.which <- function(names)
 {
 res <- character(length(names)); names(res) <- names
-## hopefully configure found [/usr]/bin/which
-which <- "@WHICH@"
-if (!nzchar(which)) {
+which <- file.path(R.home(), "library", "base", "R", "which")
+## which should be a symlink to the system's which
+if (!file.exists(which)) {
 warning("'which' was not found on this platform")
 return(res)
 }

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] cwilcox - new version

2024-01-17 Thread Andreas Löffler
>
>
> Performance statistics are interesting. If we assume the two populations
> have a total of `m` members, then this implementation runs slightly slower
> for m < 20, and much slower for 50 < m < 100. However, this implementation
> works significantly *faster* for m > 200. The breakpoint is precisely when
> each population has a size of 50; `qwilcox(0.5,50,50)` runs in 8
> microseconds in the current version, but `qwilcox(0.5, 50, 51)` runs in 5
> milliseconds. The new version runs in roughly 1 millisecond for both. This
> is probably because of internal logic that requires many more `free/calloc`
> calls if either population is larger than `WILCOX_MAX`, which is set to 50.
>
Also because cwilcox_sigma has to be evaluated, and this is slightly more
demanding since it uses k%d.

There is a tradeoff here between memory usage and time of execution. I am
not a heavy user of the U test but I think the typical use case does not
involve several hundreds of tests in a session so execution time (my 2
cents) is less important. But if R crashes one execution is already
problematic.

But the takeaway is probably: we should implement both approaches in the
code and leave it to the user which one she prefers. If time is important,
memory is not an issue, and m, n are low, go for the "traditional
approach". Otherwise, use my formula?

PS (@Aidan): I applied for a Bugzilla account two days ago and have not heard
back from them. My spam folder is also empty. Is that OK, or shall I do something?

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-17 Thread Lionel Henry via R-devel
> I think one could implement hashing on the fly without any
> serialization, similarly to how identical works, but I am not aware of
> any existing implementation

We have one in vctrs but it's not exported:
https://github.com/r-lib/vctrs/blob/main/src/hash.c

The main use is vectorised hashing:

```
# Non-vectorised
vctrs:::obj_hash(1:10)
#> [1] 1e 77 ce 48

# Vectorised
vctrs:::vec_hash(1L)
#> [1] 70 a2 85 ef
vctrs:::vec_hash(1:2)
#> [1] 70 a2 85 ef bf 3c 2c cf

# vctrs semantics so dfs are vectors of rows
length(vctrs:::vec_hash(mtcars)) / 4
#> [1] 32
nrow(mtcars)
#> [1] 32
```

Best,
Lionel

On Wed, Jan 17, 2024 at 10:32 AM Tomas Kalibera
 wrote:
>
> On 1/16/24 20:16, Dipterix Wang wrote:
> > Could you recommend any packages/functions that compute hash such that
> > the source references and sexpinfo_struct are ignored? Basically a
> > version of `serialize` that convert R objects to raw without storing
> > the ancillary source reference and sexpinfo.
> > I think most people would think of `digest` but that package uses
> > `serialize` (see discussion
> > https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875)
>
> I think one could implement hashing on the fly without any
> serialization, similarly to how identical works, but I am not aware of
> any existing implementation. Again, if that wasn't clear: I don't think
> trying to compute a hash of an object from its serialized representation
> is a good idea - it is of course convenient, but has problems like the
> one you have run into.
>
> In some applications it may still be good enough: if by various tweaks,
> such as ensuring source references are off in your case, you achieve a
> state when false alarms are rare (identical objects have different
> hashes), and hence say unnecessary re-computation is rare, maybe it is
> good enough.
>
> Tomas
>
> >
> >> On Jan 12, 2024, at 11:33 AM, Tomas Kalibera
> >>  wrote:
> >>
> >>
> >> On 1/12/24 06:11, Dipterix Wang wrote:
> >>> Dear R devs,
> >>>
> >>> I was digging into a package issue today when I realized the R serialize
> >>> function does not always generate the same results on equivalent objects
> >>> when users choose to run it differently. For example, the following code
> >>>
> >>> serialize(with(new.env(), { function(){} }), NULL, TRUE)
> >>>
> >>> generates different results when I copy-paste into console vs when I
> >>> use ctrl+shift+enter to source the file in RStudio.
> >>>
> >>> With a deeper inspection into the cause, I found that function and
> >>> language objects get a source reference when getOption("keep.source") is TRUE.
> >>> This means the source reference will make the functions differ,
> >>> while in most cases keeping the function source does not
> >>> impact how a function behaves.
> >>>
> >>> While it's OK that function serialize generates different results,
> >>> functions such as `rlang::hash` and `digest::digest`, which depend
> >>> on `serialize` might eventually deliver false positives on same
> >>> inputs. I've checked source code in digest package hoping to get
> >>> around this issue (for example serialize(..., refhook = ...)).
> >>> However, my workaround did not work. It seems that the markers to
> >>> the objects are different even if I used `refhook` to force srcref
> >>> to be the same. I also tried `removeSource` and `rlang::zap_srcref`.
> >>> None of them works directly on nested environments with multiple
> >>> functions.
> >>>
> >>> I wonder how hard it would be to have options to discard source when
> >>> serializing R objects?
> >>>
> >>> Currently my analyses heavily depend on digest function to generate
> >>> file caches and automatically schedule pipelines (to update cache)
> >>> when changes are detected. The pipelines save the hashes of source
> >>> code, inputs, and outputs together so other people can easily verify
> >>> the calculation without accessing the original data (which could be
> >>> sensitive), or running hour-long analyses, or having to buy servers.
> >>> All of these require `serialize` to produce the same results
> >>> regardless of how users choose to run the code.
> >>>
> >>> It would be great if this feature could be in the future R. Other
> >>> pipeline packages such as `targets` and `drake` can also benefit
> >>> from it.
> >>
> >> I don't think such functionality would belong to serialize(). This
> >> function is not meant to produce stable results based on the input,
> >> the serialized representation may even differ based on properties not
> >> seen by users.
> >>
> >> I think an option to ignore source code would belong to a function
> >> that computes the hash, as other options of identical().
> >>
> >> Tomas
> >>
> >>
> >>> Thanks,
> >>>
> >>> - Dipterix
> >>> [[alternative HTML version deleted]]
> >>>
> >>> __
> >>> R-devel@r-project.orgmailing list
> >>> https://stat.ethz.ch/mailman/listinfo/r-devel
> >
>
> __
> R-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-devel

Re: [R-pkg-devel] CMake on CRAN Systems

2024-01-17 Thread Tomas Kalibera



On 1/17/24 08:37, Sameh Abdulah wrote:

Hi All,

We recently encountered an installation issue with our package on CRAN. We've 
been depending on CMake, assuming it is readily available by default, but it 
appears to be only available on the M1mac system but not on the others. Should 
we include the CMake installation within our package?


Re Windows, see the documentation:

https://cran.r-project.org/bin/windows/base/howto-R-devel.html
https://cran.r-project.org/bin/windows/base/howto-R-4.3.html

The cmake executable is available. But another issue is how well maintained 
the cmake configuration files used to find the software, etc., are.


You have the most control when you specify the libraries for linking 
explicitly by yourself (and use just make/Makevars files), even though 
this can sometimes require manual changes for newer versions of Rtools 
(some libraries change linking often, most don't). This is the common 
way to do it (see the documentation).


You can also use pkg-config with the latest Rtools; pkg-config is 
used internally by MXE, which provides some testing, and I've manually 
fixed a number of issues not detected by that testing. The advantage of 
pkg-config is that you don't have to specify the libraries yourself and 
it should reduce the need for updating your Makevars on Rtools updates. 
At the same time, it is much more likely to work than cmake, yet you 
could still easily run into issues, typically an omitted dependency.
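
As a quick sanity check you can also query pkg-config from R itself; a small
sketch (the "libcurl" module name is purely illustrative, and this assumes
pkg-config is on the PATH of the build shell):

``` r
# Illustrative only: ask pkg-config which flags a given module would add.
Sys.which("pkg-config")
system2("pkg-config", c("--cflags", "libcurl"), stdout = TRUE)
system2("pkg-config", c("--libs",   "libcurl"), stdout = TRUE)
```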


I would not use cmake for an R package on Windows.

Tomas



We encountered another issue with OpenMP, but we managed to resolve it by 
consulting the manual.

https://cran-archive.r-project.org/web/checks/2024/2024-01-12_check_results_MPCR.html



Best,
--Sameh



__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [Rd] Choices to remove `srcref` (and its buddies) when serializing objects

2024-01-17 Thread Tomas Kalibera

On 1/16/24 20:16, Dipterix Wang wrote:
Could you recommend any packages/functions that compute hash such that 
the source references and sexpinfo_struct are ignored? Basically a 
version of `serialize` that convert R objects to raw without storing 
the ancillary source reference and sexpinfo.
I think most people would think of `digest` but that package uses 
`serialize` (see discussion 
https://github.com/eddelbuettel/digest/issues/200#issuecomment-1894289875)


I think one could implement hashing on the fly without any 
serialization, similarly to how identical works, but I am not aware of 
any existing implementation. Again, if that wasn't clear: I don't think 
trying to compute a hash of an object from its serialized representation 
is a good idea - it is of course convenient, but has problems like the 
one you have ran into.


In some applications it may still be good enough: if by various tweaks, 
such as ensuring source references are off in your case, you achieve a 
state when false alarms are rare (identical objects have different 
hashes), and hence say unnecessary re-computation is rare, maybe it is 
good enough.


Tomas



On Jan 12, 2024, at 11:33 AM, Tomas Kalibera 
 wrote:



On 1/12/24 06:11, Dipterix Wang wrote:

Dear R devs,

I was digging into a package issue today when I realized the R serialize 
function does not always generate the same results on equivalent objects 
when users choose to run it differently. For example, the following code


serialize(with(new.env(), { function(){} }), NULL, TRUE)

generates different results when I copy-paste into console vs when I 
use ctrl+shift+enter to source the file in RStudio.


With a deeper inspection into the cause, I found that function and 
language objects get a source reference when getOption("keep.source") is TRUE. 
This means the source reference will make the functions differ, 
while in most cases keeping the function source does not 
impact how a function behaves.


While it's OK that function serialize generates different results, 
functions such as `rlang::hash` and `digest::digest`, which depend 
on `serialize` might eventually deliver false positives on same 
inputs. I've checked source code in digest package hoping to get 
around this issue (for example serialize(..., refhook = ...)). 
However, my workaround did not work. It seems that the markers to 
the objects are different even if I used `refhook` to force srcref 
to be the same. I also tried `removeSource` and `rlang::zap_srcref`. 
None of them works directly on nested environments with multiple 
functions.


I wonder how hard it would be to have options to discard source when 
serializing R objects?


Currently my analyses heavily depend on digest function to generate 
file caches and automatically schedule pipelines (to update cache) 
when changes are detected. The pipelines save the hashes of source 
code, inputs, and outputs together so other people can easily verify 
the calculation without accessing the original data (which could be 
sensitive), or running hour-long analyses, or having to buy servers. 
All of these require `serialize` to produce the same results 
regardless of how users choose to run the code.


It would be great if this feature could be in the future R. Other 
pipeline packages such as `targets` and `drake` can also benefit 
from it.


I don't think such functionality would belong to serialize(). This 
function is not meant to produce stable results based on the input, 
the serialized representation may even differ based on properties not 
seen by users.


I think an option to ignore source code would belong to a function 
that computes the hash, as other options of identical().


Tomas



Thanks,

- Dipterix
[[alternative HTML version deleted]]

__
R-devel@r-project.orgmailing list
https://stat.ethz.ch/mailman/listinfo/r-devel




__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel


Re: [R-pkg-devel] Additional Issues: Intel

2024-01-17 Thread Tomas Kalibera



On 1/17/24 09:41, Ivan Krylov via R-package-devel wrote:

On Wed, 17 Jan 2024 10:30:36 +1100
Hugh Parsonage wrote:


I am unable to immediately see where in the test suite this error has
occurred.

Without testthat, you would have gotten a line by line printout of the code, 
letting you pinpoint the (top-level) place of the crash. With
testthat, you will need a more verbose reporter that would print tests
as they are executed to find out which test causes the crash.


The only hunch I have is that the package uses C code and includes
structs with arrays on the stack, which perhaps are excessive for the
Intel check machine, but am far from confident that's the issue.

According to GNU cflow, your only recursive C functions are
getListElement (from getListElement.c) and nthOffset (from Offset.c),
but the recursion seems bounded in both cases.

I've tried looking for variable-length arrays in your code using a
Coccinelle patch, but found none. If you had variable-bounded recursion
or variable-length stack arrays (VLA or alloca()), it would be prudent
to use R_CheckStack() or R_CheckStack2(size_of_VLA), but your C code
contains neither, so there's no obvious culprit. If you know about
R-level recursion happening in your code and have a way to reduce it,
that might help too.

Otherwise, it's time to install Intel Everything and reproduce and
debug the problem the hard way.


You could also try debugging using your toolchain, but with reduced 
stack size (e.g. ulimit -s). If you can make the error appear with a 
smaller but still reasonable stack size, chances are it is due to the 
same underlying problem.
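
One way to confirm from within R that a reduced limit is actually in effect (a
small sketch; the exact ulimit value is just an example):

``` r
# Start R from a shell where e.g. `ulimit -s 8192` was set first;
# Cstack_info() reports the stack size R's C stack checking mechanism sees.
Cstack_info()
```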


Tomas





__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] Additional Issues: Intel

2024-01-17 Thread Ivan Krylov via R-package-devel
On Wed, 17 Jan 2024 10:30:36 +1100
Hugh Parsonage wrote:

> I am unable to immediately see where in the test suite this error has
> occurred.

Without testthat, you would have gotten a line by line printout of the code, 
letting you pinpoint the (top-level) place of the crash. With
testthat, you will need a more verbose reporter that would print tests
as they are executed to find out which test causes the crash.

> The only hunch I have is that the package uses C code and includes
> structs with arrays on the stack, which perhaps are excessive for the
> Intel check machine, but am far from confident that's the issue.

According to GNU cflow, your only recursive C functions are
getListElement (from getListElement.c) and nthOffset (from Offset.c),
but the recursion seems bounded in both cases.

I've tried looking for variable-length arrays in your code using a
Coccinelle patch, but found none. If you had variable-bounded recursion
or variable-length stack arrays (VLA or alloca()), it would be prudent
to use R_CheckStack() or R_CheckStack2(size_of_VLA), but your C code
contains neither, so there's no obvious culprit. If you know about
R-level recursion happening in your code and have a way to reduce it,
that might help too.

Otherwise, it's time to install Intel Everything and reproduce and
debug the problem the hard way.

-- 
Best regards,
Ivan

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel


Re: [R-pkg-devel] CMake on CRAN Systems

2024-01-17 Thread Matthias Gondan
For package rswipl, cmake still seems to work, but

* one has to search for it on MacOS, see the src/Makevars, as well as the 
relevant sections in Writing R extensions
* Windows Defender (also on CRAN) complains about dubious exe-files when 
checking the "endianness" of the target system. That can be circumvented by 
telling cmake to compile static libraries instead of executables.

I am unsure if my response is specific to your problem, but the links below do 
not seem to work.

> Gesendet: Mittwoch, den 17.01.2024 um 08:37 Uhr
> Von: "Sameh Abdulah" 
> An: "R Package Development" 
> Betreff: [R-pkg-devel] CMake on CRAN Systems
>
> Hi All,
>
> We recently encountered an installation issue with our package on CRAN. We've 
> been depending on CMake, assuming it is readily available by default, but it 
> appears to be only available on the M1mac system but not on the others. 
> Should we include the CMake installation within our package?
>
> We encountered another issue with OpenMP, but we managed to resolve it by 
> consulting the manual.
>
> https://cran-archive.r-project.org/web/checks/2024/2024-01-12_check_results_MPCR.html
>
>
>
> Best,
> --Sameh
>
> --
>
> This message and its contents, including attachments are intended solely
> for the original recipient. If you are not the intended recipient or have
> received this message in error, please notify me immediately and delete
> this message from your computer system. Any unauthorized use or
> distribution is prohibited. Please consider the environment before printing
> this email.
>
>   [[alternative HTML version deleted]]
>
> __
> R-package-devel@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-package-devel

__
R-package-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-package-devel