[Rd] Petition to set warnPartialMatch* options to TRUE during R CMD check by default
Hi all,

What it says in the title. This is likely to cause a lot of CRAN packages to fail (I can try to quantify this if seen fit), but I think it's for the best. Packages should not (IMHO) be relying on partial matching in package code / tests. One might be more permissive for vignette/examples code, though I still find it poor practice.

Among many reasons why package authors should resist using partial matching, today I saw this:

 - Upstream package A adds a new argument 'nb' to its print method; now there are two arguments, 'na' and 'nb', starting with 'n'.
 - Downstream package B importing A starts failing, because it was using print(n=...) and relying on that resolving to na=, but now it's ambiguous.

This feels like an unnecessary risk for considerate package A authors to need to take into account when designing their API. Yes, downstream package B "should" be broken & resubmitted to CRAN, but (1) there is some perceived "blame" for package A "causing B to be removed" and (2) the re-submitted package is by no means assured to be "safe" -- time-constrained B authors may just fix the minimum set of partially-matched calls while leaving potentially many other similar at-risk call sites unchanged. Better to enforce exact matching in package code globally, I think.

It seems likely (given how options work in R) that committed authors will be able to sidestep the issue by disabling the PartialMatch warnings, but better to make this inconvenient with a stricter default.

Mike C

[[alternative HTML version deleted]]

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
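To make the hazard concrete, here is a small self-contained sketch (the print_A_* function names are hypothetical, invented for illustration; the options() call shows the warnPartialMatch* settings the petition refers to):

```r
# Hypothetical illustration of the partial-matching hazard described above.
# "Version 1" of an upstream function has a single argument starting with 'n':
print_A_v1 <- function(x, na = "NA") paste(x, na)

# A downstream caller uses a partially-matched argument name:
print_A_v1("value", n = "?")   # 'n' silently resolves to 'na'

# "Version 2" adds a second argument starting with 'n' -- the same call now errors:
print_A_v2 <- function(x, na = "NA", nb = "NB") paste(x, na, nb)
try(print_A_v2("value", n = "?"))  # Error: argument 1 matches multiple formal arguments

# The options the petition proposes enabling during R CMD check by default:
options(warnPartialMatchArgs   = TRUE,
        warnPartialMatchDollar = TRUE,
        warnPartialMatchAttr   = TRUE)
print_A_v1("value", n = "?")   # now warns about the partial argument match
```

With the options set, the at-risk call site is flagged before the upstream API ever changes, which is the point of enforcing it at check time.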
Re: [Rd] [External] Re: Repeated library() of one package with different include.only= entries
OK, I've moved your suggestions into the tracker for further discussion:

https://bugs.r-project.org/show_bug.cgi?id=18703
https://bugs.r-project.org/show_bug.cgi?id=18704
https://bugs.r-project.org/show_bug.cgi?id=18705

I will tackle a patch for 18705 to start with as the lowest-hanging fruit.

On Mon, Apr 15, 2024 at 6:46 AM Martin Maechler wrote:
> I think we should try to advance and hopefully finalize this
> thread before we forget about it ..
>
> >>>>> Michael Chirico on Thu, 11 Apr 2024 09:10:11 -0700 writes:
>
> >> I would assume that
> >> library(Matrix, include.only="isDiagonal")
> >> implies that only `isDiagonal` ends up on the search path
>
> > This could also be a reasonable behavior, but neither does that happen
> > today.
>
> Indeed; I think we should signal a warning at least in the case
> it does not happen --- namely when "parts of Matrix" are already
> in the search path.
>
> >> I think a far better approach to solve Michael's problem is simply to use
> >> fac2sparse <- Matrix::fac2sparse
>
> > This does not fully simulate attachment, e.g. running package hooks &
> > resolving Depends.
>
> (indeed; as library() is really about search() path attachment)
>
> >> ?library could also mention using detach() followed by library() or
> >> attachNamespace() with a new include.only specification.
>
> > This is the "quite hard to accomplish" I alluded to, admittedly I hadn't
> > forced myself to write it all out -- maybe it's not as bad as all that.
> > After some iterations, today I think we'd want to do...
> > > modify_attach <- function(pkg, new_names) { > > if (!startsWith(pkg, "package:")) pkg <- paste0("package:", pkg) > > old <- tryCatch(ls(pkg, all.names=TRUE), error=\(c) character()) > > if (length(old)) detach(pkg) > > attachNamespace(.rmpkg(pkg), include.only=c(new_names, old)) > > } > > > Perhaps detach() could invisibly return the exported names to make > this a > > tiny bit easier (today it returns NULL): > > I agree that such changed detach() behavior would be "nice" here; > but I wouldn't like to change its default behavior, notably as > in 99.5% of its use, the names would not be used. > I'd agree to add a new option for detach() to return the names > (visibly in that case). > > > modify_attach <- function(pkg, new_names) { > > if (!startsWith(pkg, "package:")) pkg <- paste0("package:", pkg) > > old <- tryCatch(detach(pkg), error=\(c) character()) > > attachNamespace(.rmpkg(pkg), include.only=c(new_names, old)) > > } > > > Regardless, I think your suggestion to just point to > > detach()+attachNamespace() is reasonable enough, the rare users that > care > > about this are likely to be able to figure out the rest from there. > > So I think we agree here; mentioning such a modify_attach() > ... I'd use an example *without* tryCatch() as I think the user > should choose their own line of action in such cases > ... > on the help page would then be appropriate. 
> > Martin
>
> > On Thu, Apr 11, 2024 at 7:37 AM wrote:
>
> >> On Thu, 11 Apr 2024, Duncan Murdoch wrote:
> >>
> >> > On 11/04/2024 7:04 a.m., Martin Maechler wrote:
> >> >>>>>>> Michael Chirico
> >> >>>>>>> on Mon, 8 Apr 2024 10:19:29 -0700 writes:
> >> >>
> >> >> > Right now, attaching the same package with different
> >> >> > include.only= has no effect:
> >> >>
> >> >> > library(Matrix, include.only="fac2sparse")
> >> >> > library(Matrix)
> >> >> > ls("package:Matrix")
> >> >> > # [1] "fac2sparse"
> >> >>
> >> >> > ?library does not cover this case -- what is covered is the _loading_
> >> >> > behavior of repeated calls:
> >> >>
> >> >> >> [library and require] check and update the list of currently attached
> >> >> > packages and do not reload a namespace which is already loaded
> >> >>
> >> >> > But here we're looking at the _attach_ behavior of repeated calls.
Re: [Rd] [External] Re: Repeated library() of one package with different include.only= entries
> I would assume that
> library(Matrix, include.only="isDiagonal")
> implies that only `isDiagonal` ends up on the search path

This could also be a reasonable behavior, but that is not what happens today.

> I think a far better approach to solve Michael's problem is simply to use
> fac2sparse <- Matrix::fac2sparse

This does not fully simulate attachment, e.g. running package hooks & resolving Depends.

> ?library could also mention using detach() followed by library() or
> attachNamespace() with a new include.only specification.

This is the "quite hard to accomplish" I alluded to; admittedly I hadn't forced myself to write it all out -- maybe it's not as bad as all that. After some iterations, today I think we'd want to do...

modify_attach <- function(pkg, new_names) {
  if (!startsWith(pkg, "package:")) pkg <- paste0("package:", pkg)
  old <- tryCatch(ls(pkg, all.names=TRUE), error=\(c) character())
  if (length(old)) detach(pkg)
  attachNamespace(.rmpkg(pkg), include.only=c(new_names, old))
}

Perhaps detach() could invisibly return the exported names to make this a tiny bit easier (today it returns NULL):

modify_attach <- function(pkg, new_names) {
  if (!startsWith(pkg, "package:")) pkg <- paste0("package:", pkg)
  old <- tryCatch(detach(pkg), error=\(c) character())
  attachNamespace(.rmpkg(pkg), include.only=c(new_names, old))
}

Regardless, I think your suggestion to just point to detach()+attachNamespace() is reasonable enough; the rare users that care about this are likely to be able to figure out the rest from there.
On Thu, Apr 11, 2024 at 7:37 AM wrote: > On Thu, 11 Apr 2024, Duncan Murdoch wrote: > > > On 11/04/2024 7:04 a.m., Martin Maechler wrote: > >>>>>>> Michael Chirico > >>>>>>> on Mon, 8 Apr 2024 10:19:29 -0700 writes: > >> > >> > Right now, attaching the same package with different > include.only= > >> has no > >> > effect: > >> > >> > library(Matrix, include.only="fac2sparse") > >> > library(Matrix) > >> > ls("package:Matrix") > >> > # [1] "fac2sparse" > >> > >> > ?library does not cover this case -- what is covered is the > >> _loading_ > >> > behavior of repeated calls: > >> > >> >> [library and require] check and update the list of currently > >> attached > >> > packages and do not reload a namespace which is already loaded > >> > >> > But here we're looking at the _attach_ behavior of repeated > calls. > >> > >> > I am particularly interested in allowing the exports of a > package to > >> be > >> > built up gradually: > >> > >> > library(Matrix, include.only="fac2sparse") > >> > library(Matrix, include.only="isDiagonal") # want: > >> ls("package:Matrix") --> > >> > c("fac2sparse", "isDiagonal") > >> > ... > >> > >> > It seems quite hard to accomplish this at the moment. Is the > >> behavior to > >> > ignore new inclusions intentional? Could there be an argument to > get > >> > different behavior? > >> > >> As you did not get an answer yet, ..., some remarks by an > >> R-corer who has tweaked library() behavior in the past : > >> > >> - The `include.only = *` argument to library() has been a > >>*relatively* recent addition {given the 25+ years of R history}: > >> > >>It was part of the extensive new features by Luke Tierney for > >>R 3.6.0 [r76248 | luke | 2019-03-18 17:29:35 +0100], with NEWS entry > >> > >> • library() and require() now allow more control over handling > >>search path conflicts when packages are attached. The policy is > >>controlled by the new conflicts.policy option. 
> >> > >> - I haven't seen these (then) new features been used much, > unfortunately, > >>also not from R-core members, but I'd be happy to be told a > different > >> story. > >> > >> For the above reasons, it could well be that the current > >> implementation {of these features} has not been exercised a lot > >> yet, and limitations as you found them haven't been noticed yet, > >> or at least not noticed on the public R mailing lists, nor > >> otherwise by R-core (?). >
[Rd] Repeated library() of one package with different include.only= entries
Right now, attaching the same package with different include.only= has no effect:

library(Matrix, include.only="fac2sparse")
library(Matrix)
ls("package:Matrix")
# [1] "fac2sparse"

?library does not cover this case -- what is covered is the _loading_ behavior of repeated calls:

> [library and require] check and update the list of currently attached packages and do not reload a namespace which is already loaded

But here we're looking at the _attach_ behavior of repeated calls. I am particularly interested in allowing the exports of a package to be built up gradually:

library(Matrix, include.only="fac2sparse")
library(Matrix, include.only="isDiagonal")
# want: ls("package:Matrix") --> c("fac2sparse", "isDiagonal")
...

It seems quite hard to accomplish this at the moment. Is the behavior to ignore new inclusions intentional? Could there be an argument to get different behavior?
[Rd] readChar() could read the whole file by default?
I am curious why readLines() has a default (n=-1L) to read the full file while readChar() has no default for nchars= (i.e., readChar(file) is an error). Is there a technical reason for this?

I often[1] see code like paste(readLines(f), collapse="\n") which would be better served by readChar(), especially given issues with the global string cache I've come across[2]. But lacking the default, the replacement comes across as less clean. For my own purposes the incantation readChar(file, file.size(file)) is ubiquitous.

Taking CRAN code[3] as a sample[4], 41% of readChar() calls use either readChar(f, file.info(f)$size) or readChar(f, file.size(f))[5].

Thanks for the consideration and feedback,

Mike C

[1] e.g. a quick search shows O(100) usages in CRAN packages: https://github.com/search?q=org%3Acran+%2Fpaste%5B%28%5D%5Cs*readLines%5B%28%5D.*%5B%29%5D%2C%5Cs*collapse%5Cs*%3D%5Cs*%5B%27%22%5D%5B%5C%5C%5D%2F+lang%3AR&type=code, and O(1000) usages generally on GitHub: https://github.com/search?q=lang%3AR+%2Fpaste%5B%28%5D%5Cs*readLines%5B%28%5D.*%5B%29%5D%2C%5Cs*collapse%5Cs*%3D%5Cs*%5B%27%22%5D%5B%5C%5C%5D%2F+lang%3AR&type=code
[2] AIUI the readLines() approach "pollutes" the global string cache with potentially 1000s of strings, one per line, only to get them gc()'d after combining everything with paste(collapse="\n")
[3] The mirror on GitHub, which includes archived packages as well as current (well, eventually-consistent) versions.
[4] Note that usage in packages is likely not representative of usage in scripts, e.g. I often saw readChar(f, 1), or eol-finders like readChar(f, 500) + grep("[\n\r]"), which makes more sense to me as something to find in package internals than in analysis scripts. FWIW I searched an internal codebase (scripts and packages) and found 70% of usages reading the full file.
[5] repro: https://gist.github.com/MichaelChirico/247ea9500460dca239f031e74bdcf76b requires GitHub PAT in env GITHUB_PAT for API permissions.
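For reference, a self-contained sketch of the two whole-file-read idioms compared above, using a temporary file. The only divergence I'd expect is the trailing newline that readLines() strips (on a platform with "\n" line endings):

```r
# Demonstration of the two whole-file-read idioms compared above.
f <- tempfile()
writeLines(c("line one", "line two"), f)

# Common idiom: reads line-by-line, then re-joins (populating the global
# string cache with one entry per line along the way):
via_lines <- paste(readLines(f), collapse = "\n")

# The readChar() incantation described as ubiquitous above:
via_char <- readChar(f, file.size(f))

# They differ only by the trailing newline that readLines() strips:
identical(paste0(via_lines, "\n"), via_char)  # TRUE with "\n" line endings

unlink(f)
```

This is exactly the call pattern a nchars= default of, say, file.size(file) would collapse to readChar(f).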
Re: [Rd] c(NA, 0+1i) not the same as c(as.complex(NA), 0+1i)?
Thanks Martin. My hang-up was not on what the outcome of as.complex(NA) should be, but rather, how I should read code like c(x, y) generally.

Till now, I have thought of it like `c(x, y)` is `c(as(x, typeof(y)), y)` when "type(y) > type(x)". Basically in my mind, "coercion" in R <-> as.(.) (or coerceVector() in C).

So I tracked down the source (which admittedly has been this way for much longer than the present discussion) to see what exactly c() is doing in this case:

https://github.com/r-devel/r-svn/blob/71e7480b07767f3b7d5c45a4247959aa4d83d910/src/main/bind.c#L418-L425

And indeed! It's not "coercion" in the sense I just described... there's a branch for the 'x == NA_LOGICAL' case to _convert_ to NA_complex_.

On Mon, Nov 6, 2023 at 3:08 AM Martin Maechler wrote: > >>>>> Michael Chirico > >>>>> on Sun, 5 Nov 2023 09:41:42 -0800 writes: > > > This is another follow-up to the thread from September > > "Recent changes to as.complex(NA_real_)". > > > A test in data.table was broken by the changes for NA > > coercion to complex; the breakage essentially comes from > > > c(NA, 0+1i) > > # vs > > c(as.complex(NA), 0+1i) > > > The former is the output we tested against; the latter is > essentially (via > > coerceVector() in C) what's generated by our data.table::shift() > > > However, these are now (r85472) different: > > > Im(c(NA, 0+1i)) > > # [1] NA 1 > > Im(c(as.complex(NA), 0+1i)) > > # [1] 0 1 > > > > The former matches the behavior of directly using NA_complex_: > > > Im(c(NA_complex_, 0+1i)) > > # [1] NA 1 > > > On R 4.3.2, they both match the NA_complex_ behavior: > > Im(c(NA, 0+1i)) > > # [1] NA 1 > > Im(c(as.complex(NA), 0+1i)) > > # [1] NA 1 > > > Is this intended behavior, does something need to be updated for c() > as > > well? > > > Certainly it's messing with my understanding of how c() behaves, > e.g.
in ?c > > >> All arguments are coerced to a common type which is the type of the > > returned value > > I think you have confused yourself, and everything behaves as expected: > > As we now have (in R-devel, since {r85233 | maechler | 2023-09-29 }) > > • ‘as.complex(x)’ now returns ‘complex(real=x, imaginary=0)’ > for _all_ numerical and logical ‘x’, notably also for ‘NA’ > or ‘NA_integer_’. > > ==> as.complex(NA) is indeed complex(real = NA, imaginary = 0) > > And now, in your > >c(as.complex(NA), 0+1i) > > you are calling c() on two complex numbers, i.e., there is *no* coercion > (and c(.) is rather "trivial"), and the same is true for > >c(NA_complex_, 0+1i) > > > However, in 85233, I had only modified & added examples to ?as.complex, > and now have added more (corresponding to the above NEWS entry); > -> svn rev 85475 > > . > > The underlying "dilemma" that nobody can help us with is that > "almost infinitely" many different complex numbers z fulfill > is.na(z) |--> TRUE > and only one of them is NA_complex_ and that may be unintuitive. > > OTOH, we already have for the doubles that there are at least two > different x fulfilling is.na(x), namely NaN and NA > and from C's point of view there are even considerably more > different NaN's .. but now I'm definitely digressing. > > Martin >
[Rd] c(NA, 0+1i) not the same as c(as.complex(NA), 0+1i)?
This is another follow-up to the thread from September "Recent changes to as.complex(NA_real_)". A test in data.table was broken by the changes for NA coercion to complex; the breakage essentially comes from

c(NA, 0+1i)
# vs
c(as.complex(NA), 0+1i)

The former is the output we tested against; the latter is essentially (via coerceVector() in C) what's generated by our data.table::shift(). However, these are now (r85472) different:

Im(c(NA, 0+1i))
# [1] NA 1
Im(c(as.complex(NA), 0+1i))
# [1] 0 1

The former matches the behavior of directly using NA_complex_:

Im(c(NA_complex_, 0+1i))
# [1] NA 1

On R 4.3.2, they both match the NA_complex_ behavior:

Im(c(NA, 0+1i))
# [1] NA 1
Im(c(as.complex(NA), 0+1i))
# [1] NA 1

Is this intended behavior, does something need to be updated for c() as well? Certainly it's messing with my understanding of how c() behaves, e.g. in ?c

> All arguments are coerced to a common type which is the type of the returned value
Re: [Rd] FR: valid_regex() to test string validity as a regular expression
Great to know this exists in package space! Of course, using re2 validation for a regex to be executed with TRE (via grep*) is just begging for trouble (e.g. [1] suggests re2 is closer to PCRE, [2] says "mostly" PCRE compatible). And overhauling everything to use re2 just for regex validation is hardly practical. [1] https://laurikari.net/tre/google-releases-the-re2-library/ [2] https://hackerboss.com/is-your-regex-matcher-up-to-snuff/ On Wed, Oct 11, 2023 at 4:02 PM Toby Hocking wrote: > > Hi Michael, it sounds like you don't want to use a CRAN package for > this, but you may try re2, see below. > > > grepl("(invalid","subject",perl=TRUE) > Error in grepl("(invalid", "subject", perl = TRUE) : > invalid regular expression '(invalid' > In addition: Warning message: > In grepl("(invalid", "subject", perl = TRUE) : > PCRE pattern compilation error > 'missing closing parenthesis' > at '' > > > grepl("(invalid","subject",perl=FALSE) > Error in grepl("(invalid", "subject", perl = FALSE) : > invalid regular expression '(invalid', reason 'Missing ')'' > In addition: Warning message: > In grepl("(invalid", "subject", perl = FALSE) : > TRE pattern compilation error 'Missing ')'' > > > re2::re2_regexp("(invalid") > Error: missing ): (invalid > > On Tue, Oct 10, 2023 at 7:57 AM Michael Chirico via R-devel > wrote: > > > > > Grepping an empty string might work in many cases... > > > > That's precisely why a base R offering is important, as a surer way of > > validating in all cases. To be clear I am trying to directly access the > > results of tre_regcomp(). > > > > > it is probably more portable to simply be prepared to propagate such > > errors from the actual use on real inputs > > > > That works best in self-contained calls -- foo(re) and we execute re inside > > foo(). > > > > But the specific context where I found myself looking for a regex validator > > is more complicated (https://github.com/r-lib/lintr/pull/2225). 
User > > supplies a regular expression in a configuration file, only "later" is it > > actually supplied to grepl(). > > > > Till now, we've done your suggestion -- just surface the regex error at run > > time. But our goal is to make it friendlier and fail earlier at "compile > > time" as the config is loaded, "long" before any regex is actually executed. > > > > At a bare minimum this is a good place to return a classed warning (say > > invalid_regex_warning) to allow finer control than tryCatch(condition=). > > > > On Mon, Oct 9, 2023, 11:30 PM Tomas Kalibera > > wrote: > > > > > > > > On 10/10/23 01:57, Michael Chirico via R-devel wrote: > > > > > > It will be useful to package authors trying to validate input which is > > > supposed to be a valid regular expression. > > > > > > As near as I can tell, the only way we can do so now is to run any > > > regex function and check for the warning and/or condition to bubble > > > up: > > > > > > valid_regex <- function(str) { > > > stopifnot(is.character(str), length(str) == 1L) > > > !inherits(tryCatch(grepl(str, ""), condition = identity), "condition") > > > } > > > > > > That's pretty hefty/inscrutable for such a simple validation. I see a > > > variety of similar approaches in CRAN packages [1], all slightly > > > different. It would be good for R to expose a "canonical" way to run > > > this validation. > > > > > > At root, the problem is that R does not expose the regex compilation > > > routines like 'tre_regcomp', so from the R side we have to resort to > > > hacky approaches. > > > > > > Hi Michael, > > > > > > I don't think you need compilation functions for that. If a regular > > > expression is found invalid by a specific third party library R uses, the > > > library should return an error to R and R should return an error to you, > > > and you should probably propagate that to your users.
Grepping an empty > > > string might work in many cases as a test, but it is probably more > > > portable > > > to simply be prepared to propagate such errors from the actual use on real > > > inputs. In theory, there could be some optimization for a particular case.
Re: [Rd] FR: valid_regex() to test string validity as a regular expression
> Grepping an empty string might work in many cases... That's precisely why a base R offering is important, as a surer way of validating in all cases. To be clear I am trying to directly access the results of tre_regcomp(). > it is probably more portable to simply be prepared to propagate such errors from the actual use on real inputs That works best in self-contained calls -- foo(re) and we execute re inside foo(). But the specific context where I found myself looking for a regex validator is more complicated (https://github.com/r-lib/lintr/pull/2225). User supplies a regular expression in a configuration file, only "later" is it actually supplied to grepl(). Till now, we've done your suggestion -- just surface the regex error at run time. But our goal is to make it friendlier and fail earlier at "compile time" as the config is loaded, "long" before any regex is actually executed. At a bare minimum this is a good place to return a classed warning (say invalid_regex_warning) to allow finer control than tryCatch(condition=). On Mon, Oct 9, 2023, 11:30 PM Tomas Kalibera wrote: > > On 10/10/23 01:57, Michael Chirico via R-devel wrote: > > It will be useful to package authors trying to validate input which is > supposed to be a valid regular expression. > > As near as I can tell, the only way we can do so now is to run any > regex function and check for the warning and/or condition to bubble > up: > > valid_regex <- function(str) { > stopifnot(is.character(str), length(str) == 1L) > !inherits(tryCatch(grepl(str, ""), condition = identity), "condition") > } > > That's pretty hefty/inscrutable for such a simple validation. I see a > variety of similar approaches in CRAN packages [1], all slightly > different. It would be good for R to expose a "canonical" way to run > this validation. > > At root, the problem is that R does not expose the regex compilation > routines like 'tre_regcomp', so from the R side we have to resort to > hacky approaches. 
> > Hi Michael, > > I don't think you need compilation functions for that. If a regular > expression is found invalid by a specific third party library R uses, the > library should return an error to R and R should return an error to you, > and you should probably propagate that to your users. Grepping an empty > string might work in many cases as a test, but it is probably more portable > to simply be prepared to propagate such errors from the actual use on real > inputs. In theory, there could be some optimization for a particular case, > the checking may not be the same - but that is the same say for compilation > and checking. > > Things get slightly complicated by encoding/useBytes modes > (tre_regwcomp, tre_regncomp, tre_regwncomp, tre_regcompb, > tre_regncompb; all in tre.h), but all are already present in other > regex routines, so this is doable. > > Re encodings, simply R strings should be valid in their encoding. This is > not just for regular expressions but also for anything else. You shouldn't > assume that R can handle invalid strings in any reasonable way. Definitely > you shouldn't try adding invalid strings in tests - behavior with invalid > strings is unspecified. To test whether a string is valid, there is > validEnc() (or validUTF8()). But, again, it is probably safest to propagate > errors from the regular expression R functions (in case the checks differ, > particularly for non-UTF-8), also, duplicating the encoding checks can be a > non-trivial overhead. > > If there was a strong need to have an automated way to somehow classify > specifically errors from the regex libraries, perhaps R could attach some > classes to them when the library tells. > > Tomas > > Exposing a function to compile regular expressions is common in other > languages, e.g. Go [2], Python [3], JavaScript [4].
> > [1] > https://github.com/search?q=lang%3AR+%2Fis%5Ba-zA-Z0-9._%5D*reg%5Ba-zA-Z0-9._%5D*ex.*%28%3C-%7C%3D%29%5Cs*function%2F+org%3Acran&type=code > [2] https://pkg.go.dev/regexp#Compile > [3] https://docs.python.org/3/library/re.html#re.compile > [4] > https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
[Rd] FR: valid_regex() to test string validity as a regular expression
It will be useful to package authors trying to validate input which is supposed to be a valid regular expression.

As near as I can tell, the only way we can do so now is to run any regex function and check for the warning and/or condition to bubble up:

valid_regex <- function(str) {
  stopifnot(is.character(str), length(str) == 1L)
  !inherits(tryCatch(grepl(str, ""), condition = identity), "condition")
}

That's pretty hefty/inscrutable for such a simple validation. I see a variety of similar approaches in CRAN packages [1], all slightly different. It would be good for R to expose a "canonical" way to run this validation.

At root, the problem is that R does not expose the regex compilation routines like 'tre_regcomp', so from the R side we have to resort to hacky approaches. Things get slightly complicated by encoding/useBytes modes (tre_regwcomp, tre_regncomp, tre_regwncomp, tre_regcompb, tre_regncompb; all in tre.h), but all are already present in other regex routines, so this is doable.

Exposing a function to compile regular expressions is common in other languages, e.g. Go [2], Python [3], JavaScript [4].

[1] https://github.com/search?q=lang%3AR+%2Fis%5Ba-zA-Z0-9._%5D*reg%5Ba-zA-Z0-9._%5D*ex.*%28%3C-%7C%3D%29%5Cs*function%2F+org%3Acran&type=code
[2] https://pkg.go.dev/regexp#Compile
[3] https://docs.python.org/3/library/re.html#re.compile
[4] https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/RegExp
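In the meantime, a hedged sketch of a slightly friendlier userland wrapper, building on the valid_regex() idea above (check_regex() is a hypothetical name, not an existing or proposed base R function):

```r
# A sketch extending valid_regex() above: instead of returning only
# TRUE/FALSE, capture the regex engine's own condition message so callers
# can report *why* a pattern failed to compile.
check_regex <- function(str, perl = FALSE) {
  stopifnot(is.character(str), length(str) == 1L)
  cond <- tryCatch(
    { grepl(str, "", perl = perl); NULL },  # NULL if the pattern compiles
    condition = identity                     # otherwise the first condition signaled
  )
  if (is.null(cond)) TRUE
  else structure(FALSE, reason = conditionMessage(cond))
}

check_regex("[a-z]+")          # TRUE
res <- check_regex("(invalid") # FALSE, with the engine's message attached
attr(res, "reason")            # e.g. TRE's complaint about the missing ')'
```

This still has the "grep an empty string" shape Tomas cautions about below, so it inherits all the caveats from the thread; it only packages the idiom with a usable error message.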
[Rd] Unclear provenance of message from S4
MRE to produce the message is the following:

setClass("Foo")
setOldClass("Bar")
setAs("Bar", "Foo", \(x) x)
# NOTE: arguments in definition for coerce changed from (x) to (from)

In an interactive setting, that may be fine, but I first encountered this message in the install log of a package for which I am not the author (nor do I have any context on the package, really). As such it took me much longer than it otherwise would have to pick up on the connection between 'coerce' and 'setAs' (as searching the sources for 'coerce' yielded nothing). That the faulty argument is the ubiquitous 'x' is also a major impediment.

It would be much better to surface the relationship to 'setAs' in this message. I believe that logic has to live in setAs itself, as the delegated substituteFunctionArgs loses the proper context (unless we dare to resort to sys.calls()).

Sharing here first per usual as I am not very familiar with S4, in case this all is much clearer to someone with a sharper eye for S4.
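For contrast, a sketch of the same MRE written with the argument named `from`, which is the spelling ?setAs documents for the coerce definition and which avoids the renaming note entirely:

```r
# Same classes as the MRE above, but with the coerce function's argument
# named 'from' as ?setAs documents -- no renaming message is emitted.
setClass("Foo")
setOldClass("Bar")
setAs("Bar", "Foo", function(from) from)
```

The point of the report stands: nothing in the original message connects the dots from 'coerce'/'x' back to this documented `from` convention.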
Re: [Rd] Improving user-friendliness of S4 dispatch failure when mis-naming arguments?
> I'm not entirely sure about the extra `call. = FALSE`

My thinking is the signature of the internal .InheritForDispatch() is not likely to help anyone, in fact having the opposite effect for beginners unsure how to use that info.

> Now I'd like to find an example that only uses packages with priority
> base | Recommended

Sure, here are a few.

library(Matrix)
# searching for Matrix-owned generics
matrix_generics <- getGenerics(where = asNamespace("Matrix"))
matrix_generics@.Data[matrix_generics@package == "Matrix"]

# simple signature, one argument 'x'
symmpart()
# Error: unable to find an inherited method for function ‘symmpart’ for signature ‘x="missing"’

# more complicated signature, especially including ...
Cholesky(a = 1)
# Error: unable to find an inherited method for function ‘Cholesky’ for signature ‘A="missing"’
Cholesky(a = 1, perm = TRUE)
# Error: unable to find an inherited method for function ‘Cholesky’ for signature ‘A="missing"’
Cholesky(a = 1, perm = TRUE, IMult = 2)
# Error: unable to find an inherited method for function ‘Cholesky’ for signature ‘A="missing"’

---

'base' is a bit harder since stats4 just provides classes over stats functions, so the missingness error comes from non-S4 code.

library(stats4)
coef()
# Error in coef.default() : argument "object" is missing, with no default

Defining our own generic:

setGeneric("BaseGeneric", \(a, ...) standardGeneric("BaseGeneric"))
BaseGeneric()
# Error: unable to find an inherited method for function ‘BaseGeneric’ for signature ‘a="missing"’

# getting multiple classes to show up requires setting the signature:
setMethod("BaseGeneric", signature(x = "double", y = "double"), \(x, y, ...)
x + y) BaseGeneric(X = 1, Y = 2) # Error: unable to find an inherited method for function ‘BaseGeneric’ for signature ‘x="missing", y="missing"’ On Fri, Aug 11, 2023 at 2:26 AM Martin Maechler wrote: > > >>>>> Michael Chirico via R-devel > >>>>> on Thu, 10 Aug 2023 23:56:45 -0700 writes: > > > Here's a trivial patch that offers some improvement: > > > Index: src/library/methods/R/methodsTable.R > > === > > --- src/library/methods/R/methodsTable.R (revision 84931) > > +++ src/library/methods/R/methodsTable.R (working copy) > > @@ -752,11 +752,12 @@ > > if(length(methods) == 1L) > > return(methods[[1L]]) # the method > > else if(length(methods) == 0L) { > > -cnames <- paste0("\"", vapply(classes, as.character, ""), "\"", > > +cnames <- paste0(head(fdef@signature, length(classes)), "=\"", > > vapply(classes, as.character, ""), "\"", > > collapse = ", ") > > stop(gettextf("unable to find an inherited method for function %s > > for signature %s", > > sQuote(fdef@generic), > > sQuote(cnames)), > > + call. = FALSE, > > domain = NA) > > } > > else > > > Here's the upshot for the example on DBI: > > > dbGetQuery(connection = conn, query = query) > > Error: unable to find an inherited method for function ‘dbGetQuery’ > > for signature ‘conn="missing", statement="missing"’ > > > I don't have any confidence about edge cases / robustness of this > > patch for generic S4 use cases (make check-all seems fine), > > Good you checked, but you are right that that's not all enough to be sure: > > Checking error *messages* is not something we do often {not > the least because you'd need to consider message translations > and hence ensure you only check in case of English ...}. > > > but I don't suppose a full patch would be dramatically different from > the > > above. > > I agree: The patch looks to make sense to me, too, > while I'm not entirely sure about the extra call. 
= FALSE > (which I of course understand you'd prefer for the above example) > > Now I'd like to find an example that only uses packages with priority > base | Recommended > > Martin > > -- > Martin Maechler > ETH Zurich and R Core team > > > > Mike C > > > On Thu, Aug 10, 2023 at 12:39 PM Gabriel Becker > wrote: > >> > >> I just want to add my 2 cents that I think it would be very useful and > beneficial to improve S4 to surface that information as well. > >> > >> Mo
Re: [Rd] Improving user-friendliness of S4 dispatch failure when mis-naming arguments?
Here's a trivial patch that offers some improvement:

Index: src/library/methods/R/methodsTable.R
===
--- src/library/methods/R/methodsTable.R (revision 84931)
+++ src/library/methods/R/methodsTable.R (working copy)
@@ -752,11 +752,12 @@
     if(length(methods) == 1L)
         return(methods[[1L]]) # the method
     else if(length(methods) == 0L) {
-        cnames <- paste0("\"", vapply(classes, as.character, ""), "\"",
+        cnames <- paste0(head(fdef@signature, length(classes)), "=\"", vapply(classes, as.character, ""), "\"",
                          collapse = ", ")
         stop(gettextf("unable to find an inherited method for function %s for signature %s",
                       sQuote(fdef@generic), sQuote(cnames)),
+             call. = FALSE,
              domain = NA)
     }
     else

Here's the upshot for the example on DBI:

dbGetQuery(connection = conn, query = query)
Error: unable to find an inherited method for function ‘dbGetQuery’ for signature ‘conn="missing", statement="missing"’

I don't have any confidence about edge cases / robustness of this patch for generic S4 use cases (make check-all seems fine), but I don't suppose a full patch would be dramatically different from the above.

Mike C

On Thu, Aug 10, 2023 at 12:39 PM Gabriel Becker wrote:
>
> I just want to add my 2 cents that I think it would be very useful and
> beneficial to improve S4 to surface that information as well.
>
> More information about the way that the dispatch failed would be of great
> help in situations like the one Michael pointed out.
>
> ~G
>
> On Thu, Aug 10, 2023 at 9:59 AM Michael Chirico via R-devel
> wrote:
>>
>> I forwarded that along to the original reporter with positive feedback
>> -- including the argument names is definitely a big help for cuing
>> what exactly is missing.
>>
>> Would a patch to do something similar for S4 be useful?
>>
>> On Thu, Aug 10, 2023 at 6:46 AM Hadley Wickham wrote:
>> >
>> > Hi Michael,
>> >
>> > I can't help with S4, but I can help to make sure this isn't a problem
>> > with S7. What do you think of the current error message?
Do you see >> > anything obvious we could do to improve? >> > >> > library(S7) >> > >> > dbGetQuery <- new_generic("dbGetQuery", c("conn", "statement")) >> > dbGetQuery(connection = NULL, query = NULL) >> > #> Error: Can't find method for generic `dbGetQuery(conn, statement)`: >> > #> - conn : MISSING >> > #> - statement: MISSING >> > >> > Hadley >> > >> > On Wed, Aug 9, 2023 at 10:02 PM Michael Chirico via R-devel >> > wrote: >> > > >> > > I fielded a debugging request from a non-expert user today. At root >> > > was running the following: >> > > >> > > dbGetQuery(connection = conn, query = query) >> > > >> > > The problem is that they've named the arguments incorrectly -- it >> > > should have been [1]: >> > > >> > > dbGetQuery(conn = conn, statement = query) >> > > >> > > The problem is that the error message "looks" highly confusing to the >> > > untrained eye: >> > > >> > > Error in (function (classes, fdef, mtable) : unable to find an >> > > inherited method for function ‘dbGetQuery’ for signature ‘"missing", >> > > "missing"’ >> > > >> > > In retrospect, of course, this makes sense -- the mis-named arguments >> > > are getting picked up by '...', leaving the required arguments >> > > missing. >> > > >> > > But I was left wondering how we could help users right their own ship >> > > here. >> > > >> > > Would it help to mention the argument names? To include some code >> > > checking for weird combinations of missing arguments? Any other >> > > suggestions? >> > > >> > > Mike C >> > > >> > > [1] >> > > https://github.com/r-dbi/DBI/blob/97934c885749dd87a6beb10e8ccb6a5ebea3675e/R/dbGetQuery.R#L62-L64 >> > > >> > > __ >> > > R-devel@r-project.org mailing list >> > > https://stat.ethz.ch/mailman/listinfo/r-devel >> > >> > >> > >> > -- >> > http://hadley.nz >> >> __ >> R-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Improving user-friendliness of S4 dispatch failure when mis-naming arguments?
I forwarded that along to the original reporter with positive feedback -- including the argument names is definitely a big help for cuing what exactly is missing. Would a patch to do something similar for S4 be useful? On Thu, Aug 10, 2023 at 6:46 AM Hadley Wickham wrote: > > Hi Michael, > > I can't help with S4, but I can help to make sure this isn't a problem > with S7. What do you think of the current error message? Do you see > anything obvious we could do to improve? > > library(S7) > > dbGetQuery <- new_generic("dbGetQuery", c("conn", "statement")) > dbGetQuery(connection = NULL, query = NULL) > #> Error: Can't find method for generic `dbGetQuery(conn, statement)`: > #> - conn : MISSING > #> - statement: MISSING > > Hadley > > On Wed, Aug 9, 2023 at 10:02 PM Michael Chirico via R-devel > wrote: > > > > I fielded a debugging request from a non-expert user today. At root > > was running the following: > > > > dbGetQuery(connection = conn, query = query) > > > > The problem is that they've named the arguments incorrectly -- it > > should have been [1]: > > > > dbGetQuery(conn = conn, statement = query) > > > > The problem is that the error message "looks" highly confusing to the > > untrained eye: > > > > Error in (function (classes, fdef, mtable) : unable to find an > > inherited method for function ‘dbGetQuery’ for signature ‘"missing", > > "missing"’ > > > > In retrospect, of course, this makes sense -- the mis-named arguments > > are getting picked up by '...', leaving the required arguments > > missing. > > > > But I was left wondering how we could help users right their own ship here. > > > > Would it help to mention the argument names? To include some code > > checking for weird combinations of missing arguments? Any other > > suggestions? 
> > > > Mike C > > > > [1] > > https://github.com/r-dbi/DBI/blob/97934c885749dd87a6beb10e8ccb6a5ebea3675e/R/dbGetQuery.R#L62-L64 > > > > __ > > R-devel@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-devel > > > > -- > http://hadley.nz __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Improving user-friendliness of S4 dispatch failure when mis-naming arguments?
I fielded a debugging request from a non-expert user today. At root was running the following:

dbGetQuery(connection = conn, query = query)

The problem is that they've named the arguments incorrectly -- it should have been [1]:

dbGetQuery(conn = conn, statement = query)

The problem is that the error message "looks" highly confusing to the untrained eye:

Error in (function (classes, fdef, mtable) : unable to find an
inherited method for function ‘dbGetQuery’ for signature ‘"missing",
"missing"’

In retrospect, of course, this makes sense -- the mis-named arguments are getting picked up by '...', leaving the required arguments missing.

But I was left wondering how we could help users right their own ship here.

Would it help to mention the argument names? To include some code checking for weird combinations of missing arguments? Any other suggestions?

Mike C

[1] https://github.com/r-dbi/DBI/blob/97934c885749dd87a6beb10e8ccb6a5ebea3675e/R/dbGetQuery.R#L62-L64

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
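The failure mode described in this thread is easy to reproduce with a toy generic, without DBI installed. The generic `getQuery` below is a hypothetical stand-in for `DBI::dbGetQuery(conn, statement, ...)`: the mis-named arguments are absorbed by `...`, so dispatch sees both signature arguments as missing.

```r
library(methods)

# Hypothetical stand-in for DBI::dbGetQuery(conn, statement, ...)
setGeneric("getQuery", function(conn, statement, ...) standardGeneric("getQuery"))
setMethod("getQuery", signature("character", "character"),
          function(conn, statement, ...) paste(conn, statement))

# Correctly-named arguments dispatch fine:
getQuery(conn = "db", statement = "SELECT 1")
# [1] "db SELECT 1"

# Mis-named arguments do not match any formal, so they fall through to
# '...'; both dispatch arguments are then "missing" and lookup fails:
tryCatch(getQuery(connection = "db", query = "SELECT 1"),
         error = function(e) conditionMessage(e))
```

Note that `connection` does not even partially match the formal `conn`, since partial matching requires the supplied name to be a prefix of the formal name, not the other way around.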
Re: [Rd] tab-complete for non-syntactic names could attempt backtick-wrapping
I personally wouldn't like using a string, and this comment makes me think it's against the r-core preference as well: https://bugs.r-project.org/show_bug.cgi?id=18429#c1 Thanks both for catching the sloppy mistake in vapply() :) Let's continue discussion on the bug/PR. On Thu, Mar 2, 2023 at 12:39 AM Ivan Krylov wrote: > > There turn out to be a few more things to fix. > > One problem is easy to solve: vapply() needs a third argument > specifying the type of the return value. (Can we have unit tests for > tab completion?) > > The other problem is harder: `comps` defaults to an empty string, and > you can't have a symbol consisting of an empty string, because this > value is internally reserved for missing function arguments. I think > you can return(paste0(prefix, op)) if length(comps) == 0L, but this is > still somewhat worrying. R tries to prevent empty names, so I wouldn't > expect specialOpCompletionsHelper() to return an empty string, but I > can't prove it right now. > > On the other hand, x$'a string' is the same as x$`a string`. Could we > just drop as.name() and keep the return value being a string literal? > I'm not sure about this, either. > > -- > Best regards, > Ivan __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] tab-complete for non-syntactic names could attempt backtick-wrapping
Great suggestion! I've started a patch: https://bugs.r-project.org/show_bug.cgi?id=18479 On Wed, Mar 1, 2023 at 1:56 AM Ivan Krylov wrote: > > В Wed, 1 Mar 2023 01:36:02 -0800 > Michael Chirico via R-devel пишет: > > > +comps[non_syntactic] <- paste0("`", comps[non_syntactic], "`") > > There are a few more corner cases. For example, comps could contain > backticks (which should be escaped with backslashes) and backslashes > (which should also be escaped). Thankfully, \u-style Unicode escape > sequences are not currently supported inside backticks, and "escape the > backslash" rule already takes care of them. > > The deparse() function already knows these rules: > > name <- 'hello world ` \\uFF' > cat(deparse1(as.name(name), backtick=TRUE), '\n') > # `hello world \` \\uFF` > `hello world \` \\uFF` <- 'hello' > `hello world \` \\uFF` > # [1] "hello" > > -- > Best regards, > Ivan __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] tab-complete for non-syntactic names could attempt backtick-wrapping
Consider:

x <- list(`a b` = 1)
x$a

(i.e., press the 'tab' key after typing 'x$a')

The auto-complete mechanism will fill the buffer like so:

x$a b

This is not particularly helpful because this is now a syntax error. It seems to me there's a simple fix -- in utils:::specialCompletions(), we can wrap the result of utils:::specialOpCompletionsHelper() with backticks for non-syntactic names ([1]):

 comps <- specialOpCompletionsHelper(op, suffix, prefix)
 if (length(comps) == 0L)
     comps <- ""
+non_syntactic <- make.names(comps) != comps
+comps[non_syntactic] <- paste0("`", comps[non_syntactic], "`")
 sprintf("%s%s%s", prefix, op, comps)

I'm somewhat surprised this hasn't come up before (I searched for 'completeToken', 'specialCompletions', and 'specialOpCompletionsHelper' here and on Bugzilla), so I'm checking with the list first if I'm missing anything before filing a patch.

Mike C

[1] https://github.com/r-devel/r-svn/blob/4657f65a377cb5ef318c6548bc264e3b0f9517a0/src/library/utils/R/completion.R#L536-L538

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
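The proposed two-line change can be sketched as a standalone helper (the name `wrap_nonsyntactic` is hypothetical; the real patch lives inside `utils:::specialCompletions()`):

```r
# Sketch of the proposed fix as a standalone helper: backtick-wrap any
# completion candidate that is not a syntactically valid name.
wrap_nonsyntactic <- function(comps) {
  non_syntactic <- make.names(comps) != comps
  comps[non_syntactic] <- paste0("`", comps[non_syntactic], "`")
  comps
}

x <- list(`a b` = 1, ab = 2)
wrap_nonsyntactic(names(x))
# [1] "`a b`" "ab"

# Per Ivan's follow-up, names containing backticks or backslashes also
# need escaping, which deparse() already knows how to do:
deparse(as.name("a b"), backtick = TRUE)
# [1] "`a b`"
```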
[Rd] FR: names= argument for load()
I have three use cases in mind for an argument to load specifying which variables to load from the input 'file':

1. Avoid overwriting existing variables. ?load recommends using envir= or attach() to avoid overwrites. An argument like names= would allow even finer control to avoid collisions.

2. Avoid loading too many (in quantity or in memory size) variables from a large file. We might save dozens or hundreds of models to the same file and then load them more lazily.

3. Helping static analysis. Currently, when load() is used in a script, it becomes impossible to distinguish truly undefined variables from those defined implicitly by load(). With a names= argument, it can be possible to know statically precisely which names were introduced by load, and which might be a bug. (of course there's nothing stopping authors from having non-literal entries to names=, but that's a more general issue of static analysis).

Michael Chirico

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
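Until such an argument exists, use cases 1 and 3 can be approximated in user code by loading into a scratch environment and copying over only the requested names. A sketch with a hypothetical `load_only()` helper; note it still reads and allocates the entire file, so it addresses collisions and static analysis but not use case 2 (memory):

```r
# Load only the requested variables from an .rda file into 'envir',
# by way of a throwaway environment.
load_only <- function(file, names, envir = parent.frame()) {
  tmp <- new.env(parent = emptyenv())
  found <- load(file, envir = tmp)     # load() returns the names it created
  keep <- intersect(names, found)
  for (nm in keep)
    assign(nm, get(nm, envir = tmp), envir = envir)
  invisible(keep)
}

# usage sketch:
a <- 1; b <- 2
f <- tempfile(fileext = ".rda")
save(a, b, file = f)
e <- new.env()
load_only(f, "a", envir = e)
ls(e)
# [1] "a"
```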
Re: [Rd] Line-terminal \ in character constants -- missing from ?Quotes ?
I'm still hung up on ?Quotes -- I can't see mention of 'newline' as a valid escape. It mentions the literal sequence '\' 'n', where 'n' is being escaped. Glanced at the parser blame and apparently the terminal '\' is the older behavior, and what I'm used to, i.e. literal newlines in char constants to make multi-line strings, is new (though still 20 years old): https://github.com/r-devel/r-svn/commit/bc3f20e4e686be556877bb6bd2882ae8029fd17f The NEWS entry there does say the same thing as you -- "escaping the newlines with backslashes". >From the parser, I think ?Quotes is just missing "newline" as being a valid escaped character, c.f. https://github.com/r-devel/r-svn/blob/f55b24945d56e824f124638c596b99887441354a/src/main/gram.y#L2823-L2830 ('\n' is treated like '\') https://github.com/r-devel/r-svn/blob/f55b24945d56e824f124638c596b99887441354a/src/main/gram.y#L2978-L3008 ('\n' is in the list of valid items after '\') I don't see any special handling for '\r', so there may be a gap in the R parser? Or I just don't understand what I'm reading in the parser :) Mike C On Sun, Feb 12, 2023 at 3:38 AM Duncan Murdoch wrote: > > On 12/02/2023 12:07 a.m., Michael Chirico via R-devel wrote: > > I'm coming across some code that uses the fact the parser ignores a > > line-terminal '\', e.g. > > > > identical("\ > > ", "\n") > > # [1] TRUE > > > > x = "abc \ > > def" > > y = "abc \ndef" > > identical(x, y) > > # [1] TRUE > > > > However: > > identical("\\n", "\n") > > # [1] FALSE > > > > This appears to be undocumented behavior; the closest thing I see in > > ?Quotes suggests it should be an error: > > > >> Escaping a character not in the following table is an error. > > > > ('\n' is in the table, but my understanding is the 'n' is what's being > > escaped v-a-v the "error", which seems confirmed by the third, FALSE, > > example above) > > > > Is this a bug, is the omission from ?Quotes a bug, or is this just > > undocumented behavior? 
> > In your first example, you have a backslash which says to escape the > next char. The next char is a newline char. The result is an escaped > newline, which apparently is a newline. > > The same thing happens in the second example. > > The third example is an escaped backslash, i.e. a backslash, followed by > n. That's not the same as an escaped n, which gives a newline. > > So I think the behaviour might be reasonable. > > The thing I'd worry about is whether things are handled properly on > Windows, where the newline is two characters (CR LF). It might be that > the backslash at the end of the line escapes the CR, and you get a \r > out of it instead of a \n. But maybe not, the parser knows about CR LF > and internally converts it to \n, so if that happens early enough, > things would be fine. > > Duncan Murdoch > __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Line-terminal \ in character constants -- missing from ?Quotes ?
I'm coming across some code that uses the fact the parser ignores a line-terminal '\', e.g.

identical("\
", "\n")
# [1] TRUE

x = "abc \
def"
y = "abc \ndef"
identical(x, y)
# [1] TRUE

However:

identical("\\n", "\n")
# [1] FALSE

This appears to be undocumented behavior; the closest thing I see in ?Quotes suggests it should be an error:

> Escaping a character not in the following table is an error.

('\n' is in the table, but my understanding is the 'n' is what's being escaped v-a-v the "error", which seems confirmed by the third, FALSE, example above)

Is this a bug, is the omission from ?Quotes a bug, or is this just undocumented behavior?

Mike C

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Shouldn't "Loading" be "Attaching" when referring to library() calls?
require() and library() both emit this message immediately before running library(): "Loading required package: %s" https://github.com/r-devel/r-svn/blob/4198a2941b702d965bb2374c2b908f48c369f40a/src/library/base/R/library.R#L967-L968 https://github.com/r-devel/r-svn/blob/4198a2941b702d965bb2374c2b908f48c369f40a/src/library/base/R/library.R#L684-L685 Shouldn't "Loading" be "Attaching" instead? My understanding is "a package is loaded" is equivalent to "isNamespaceLoaded()", i.e., loadNamespace() was run. And that "a package is attached" is equivalent to "pkg %in% .packages()", i.e., library(pkg) was run. Is the terminology not so precise? There's certainly ambiguity in the internal variable names referenced above -- in require, we see loaded <- paste0("package:", package) %in% search() https://github.com/r-devel/r-svn/blob/4198a2941b702d965bb2374c2b908f48c369f40a/src/library/base/R/library.R#L680 Whereas in library() [via .getRequiredPackages2()] we see attached <- paste0("package:", pkg) %in% search() https://github.com/r-devel/r-svn/blob/4198a2941b702d965bb2374c2b908f48c369f40a/src/library/base/R/library.R#L931 Mike C __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
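The loaded/attached distinction is directly observable. A quick sketch using the base-priority tools package, which in a plain Rscript session is typically neither loaded nor attached at startup:

```r
# "Loaded": the namespace is in memory; its exports are only reachable
# via tools:: (or getNamespace()).
loadNamespace("tools")
isNamespaceLoaded("tools")     # TRUE  -- loaded
"tools" %in% .packages()       # FALSE -- but not attached

# "Attached": library() additionally puts "package:tools" on search(),
# making the exports visible without the tools:: prefix.
library(tools)
"tools" %in% .packages()       # TRUE
"package:tools" %in% search()  # TRUE
```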
Re: [Rd] trace() an S3-classed function in .GlobalEnv
Thanks, that seems like a reasonable assessment. Also worth noting that tracing wipes out a function's attributes, which is also not documented: foo = function(x) { invisible(1 + 1) } attr(foo, "bar") <- 2 trace(foo, quote(message('traced'))) names(attributes(foo)) # [1] "original" "source" "class" PS one day I hope to master the dark art of choosing r-devel or bugzilla for issues :) On Wed, Jun 8, 2022 at 3:44 AM Martin Maechler wrote: > > >>>>> Michael Chirico > >>>>> on Mon, 6 Jun 2022 23:09:29 -0700 writes: > > > Consider: > > > foo <- function() { > > invisible(1 + 1) > > } > > trace(foo, quote(message("traced")), print=FALSE) > > foo() > > # traced > > > But with a minimal change: > > > class(foo) <- "bar" > > trace(foo, quote(message("traced")), print=FALSE) > > # Constructing traceable class “barWithTrace” > > # Error in .classEnv(className) : > > # unable to find an environment containing class “bar” > > > I don't see anything like this mentioned in ?trace (nor does a Google > > search turn up more than a handful of references to this error), > > and from trying to debug what trace() is doing, we arrive to the error > line[1]: > > > .makeTraceClass(traceClass, class(original)) # > > methods:::.makeTraceClass("barWithTrace", "bar") > > > I don't quite follow what's going on here, but it looks like trace() > > is trying to determine an S4 class definition for "bar", but isS4(bar) > > is FALSE. > > > I can (apparently -- not sure if there are as yet-unseen downstream > > consequences) work around the issue by unsetting the class, tracing, > > then re-setting: > > > class(foo) <- NULL > > trace(foo, quote(message("traced")), print=FALSE) > > class(foo) <- "bar" > > > But obviously this is a bit clunky. Is this a bug, or am I missing > something? > > Just a short note of semi-confirmation: > > At the time S4 generics and methods were introduced into R, > trace() was made much more flexible, notably to be able to trace > S4 methods. 
> > It can well be that it originally also worked for functions with > an explicit S3 class, but as such functions are very rare, it > could well be you've found a bug, namely that trace() assumes > that if a function has a non-trivial class, it must be an S4 > one. > > ... and I know you know how to report bugs ;-) > > Thank you in advance! > Martin > > > Mike C > > > [1] > https://github.com/r-devel/r-svn/blob/e2a64a4e14adbc4e9e8635eaa8cbd2835ce1d764/src/library/methods/R/trace.R#L240 > > > __ > > R-devel@r-project.org mailing list > > https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] .onLoad, packageStartupMessage, and R CMD check
Examining more closely, it's a NOTE produced by R CMD check -- originally I had thought it was a WARNING, which I think would have been too strong for this case. A NOTE actually seems fine, on second thought. For a tiny bit of context, it's common for us to issue messaging around some state initialization, which has to happen after some (ex-ante unknown) set of packages are loaded. It's important to do so whether or not the package is attached, so the proviso in .onLoad() indeed makes the most sense. Thanks! On Thu, Nov 4, 2021 at 1:02 PM Gabriel Becker wrote: > > Hi Michael, > > Indeed, just to elaborate further on what I believe Duncan's point is, can > you give any examples, "dire" or not, that are appropriate when the package > is loaded but not attached (ie none of its symbols are visible to the user > without using :::)? > > The only things I can think of are a package that changes the behavior of > other, attached package code, such as conflicted. Doing so is very much an > anti-pattern imo generally, with something like conflicted being an > (arguable) exception. And that's assuming conflicted even works/does anything > when loaded but not attached (I have not confirmed whether thats the case or > not). That or a package that is at end-of-life and is or soon will be > unsupported entirely. > > The examples don't need to be yours, per se, if you know what those pushing > back against your linter were using messages from .onLoad for... 
> > Best, > ~G > > > > On Thu, Nov 4, 2021 at 12:37 PM Duncan Murdoch > wrote: >> >> On 04/11/2021 2:50 p.m., Michael Chirico via R-devel wrote: >> > I wrote a linter to stop users from using packageStartupMessage() in >> > their .onLoad() hook because of the R CMD check warning it triggers: >> > >> > https://github.com/wch/r-source/blob/8b6625e39cd62424dc23399dade37f20fa8afa91/src/library/tools/R/QC.R#L5167 >> > >> > However, this received some pushback which I ultimately agree with, >> > and moreover ?.onLoad seems to agree as well: >> > >> >> Loading a namespace should where possible be silent, with startup >> > messages given by \code{.onAttach}. These messages (**and any essential >> > ones from \code{.onLoad}**) should use \code{\link{packageStartupMessage}} >> > so they can be silenced where they would be a distraction. >> > >> > **emphasis** mine. That is, if we think some message is _essential_ to >> > print during loadNamespace(), we are told to use >> > packageStartupMessage(). >> > >> > Should we remove this R CMD check warning? >> >> The help page doesn't define what an "essential" message would be, but I >> would assume it's a message about some dire condition, not just "Hi! I >> just got loaded!". So I think a note or warning would be appropriate, >> but not an error. >> >> Do you have an example of something that should routinely print, but >> that triggers a warning when checked? >> >> Duncan Murdoch >> >> __ >> R-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] .onLoad, packageStartupMessage, and R CMD check
I wrote a linter to stop users from using packageStartupMessage() in their .onLoad() hook because of the R CMD check warning it triggers: https://github.com/wch/r-source/blob/8b6625e39cd62424dc23399dade37f20fa8afa91/src/library/tools/R/QC.R#L5167 However, this received some pushback which I ultimately agree with, and moreover ?.onLoad seems to agree as well: > Loading a namespace should where possible be silent, with startup messages given by \code{.onAttach}. These messages (**and any essential ones from \code{.onLoad}**) should use \code{\link{packageStartupMessage}} so they can be silenced where they would be a distraction. **emphasis** mine. That is, if we think some message is _essential_ to print during loadNamespace(), we are told to use packageStartupMessage(). Should we remove this R CMD check warning? __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
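For reference, the pattern ?.onLoad appears to endorse looks like the following in a package's R/zzz.R (a sketch; the package name myPkg and the messages are hypothetical):

```r
# R/zzz.R (sketch). An *essential* load-time message goes through
# packageStartupMessage() so downstream users can still silence it, e.g.
#   suppressPackageStartupMessages(loadNamespace("myPkg"))
.onLoad <- function(libname, pkgname) {
  packageStartupMessage(pkgname, ": initializing shared state")
}

# Purely informational banners belong in .onAttach() instead, since they
# only make sense when the user actually library()'s the package:
.onAttach <- function(libname, pkgname) {
  packageStartupMessage("Welcome to ", pkgname)
}
```

packageStartupMessage() signals a condition of class "packageStartupMessage", which is what makes suppressPackageStartupMessages() able to target it specifically.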
[Rd] Should seq.Date() return double storage?
today <- Sys.Date()
typeof(today)
# [1] "double"
typeof(seq(today, by=1, length.out=2))
# [1] "integer"

Clearly minor as it doesn't seem to have come up before (e.g. coercion to numeric will happen automatically whenever fractional dates are needed); I only noticed because of a test using identical failing:

identical(
  seq(today, by=1, length.out=10),
  today + 0:9
)
# [1] FALSE

It's easy in this case to fix the test using coercion, but this could (and did in practice) come up at deeper levels of nesting that become more onerous to handle. And using all.equal() comes with its own tradeoffs.

The fix would be easy enough -- at a glance there are two usages of .Date(seq.int(...)) in seq.Date() that could be replaced by .Date(as.numeric(seq.int(...))).

Mike C

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
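Until/unless seq.Date() is changed, tests can normalize the storage mode before comparing. A sketch (storage.mode<- changes the underlying type while preserving attributes, including the "Date" class):

```r
today <- as.Date("2024-01-01")
s <- seq(today, by = 1, length.out = 10)

identical(s, today + 0:9)
# [1] FALSE  -- same dates, different storage (integer vs double)

# Normalize storage; attributes (incl. class "Date") are kept:
storage.mode(s) <- "double"
identical(s, today + 0:9)
# [1] TRUE
```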
[Rd] translation domain is not inferred correctly from a package's print methods -- intended behavior?
Here is a reprex:

# initialize reprex package
cd /tmp
mkdir myPkg && cd myPkg
echo "Package: myPkg" > DESCRIPTION
echo "Version: 0.0.1" >> DESCRIPTION
mkdir R
echo "print.my_class = function(x, ...) { cat(gettext(\"'%s' is deprecated.\"), '\n', gettext(\"'%s' is deprecated.\", domain='R-myPkg'), '\n') }" > R/foo.R
echo "S3method(print, my_class)" > NAMESPACE

# extract string for translation
Rscript -e "tools::update_pkg_po('.')"

# add dummy translation
msginit -i po/R-myPkg.pot -o po/R-ja.po -l ja --no-translator
head -n -1 po/R-ja.po > tmp && mv tmp po/R-ja.po
echo 'msgstr "%s successfully translated"' >> po/R-ja.po

# install .mo translations
Rscript -e "tools::update_pkg_po('.')"

# install package & test
R CMD INSTALL .
LANGUAGE=ja Rscript -e "library(myPkg); print(structure(1, class = 'my_class'))"
# '%s' は廃止予定です
# %s successfully translated

Note that the first gettext() call, which doesn't supply domain=, returns the corresponding translation from base R (i.e., the output is the same as gettext("'%s' is deprecated.", domain="R-base")). The second gettext() call, where domain= is supplied, returns our dummy translation, which is what I would have expected from the first execution.

Here is what's in ?gettext:

> If domain is NULL or "", and gettext or ngettext is called from a function in
> the namespace of package pkg the domain is set to "R-pkg". Otherwise there is
> no default domain.

Does that mean the S3 print method is not "in the namespace of myPkg"? Or is there a bug here? If the former, is the edge case of concern here just S3 methods where the "top level" S3 method is defined in another package? Can we refine the manual text wording here to be more clear about when we should expect we need to supply domain= vs have it set automatically?

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] tools::update_pkg_po() doesn't work on Solaris?
Hi all, I am about 99% sure tools::update_pkg_po() is using an invocation that is not supported by the Solaris version of msgfmt. The invocation[1]: msgfmt -c --statistics -o /path/to/file.mo /path/to/file.po AFAICT neither -c nor --statistics appear to be supported by Solaris [2]: I don't have access to a Solaris machine where I can try this, but I did manage to get r-hub to pass by disabling those options in potools which similarly relies on msgfmt to generate .mo files [3], [4]. So I wanted to check with the list if anyone can confirm this before filing a bug report. Thanks, Mike C [1] per https://github.com/wch/r-source/blob/59a1965239143ca6242b9cc948d8834e1194e84a/src/library/tools/R/translations.R#L142 [2] https://docs.oracle.com/cd/E36784_01/html/E36870/msgfmt-1.html#REFMAN1msgfmt-1 [3] https://github.com/MichaelChirico/potools/pull/219 [4] https://builder.r-hub.io/status/potools_0.2.1.tar.gz-f4e936eec1b74bebb91e07c59c23b1cf __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Should last default to .Machine$integer.max-1 for substring()
Thanks all, great points well taken. Indeed it seems the default of 1000000 predates SVN tracking in 1997.

I think a NULL default behaving as "end of string" regardless of encoding makes sense and avoids the overheads of a $ call and a much heavier nchar() calculation.

Mike C

On Mon, Jun 21, 2021 at 1:32 AM Martin Maechler wrote:
> >>>>> Tomas Kalibera
> >>>>> on Mon, 21 Jun 2021 10:08:37 +0200 writes:
>
> > On 6/21/21 9:35 AM, Martin Maechler wrote:
> >>>>>>> Michael Chirico
> >>>>>>> on Sun, 20 Jun 2021 15:20:26 -0700 writes:
> >> > Currently, substring defaults to last=1000000L, which
> >> > strongly suggests the intent is to default to "nchar(x)"
> >> > without having to compute/allocate that up front.
> >>
> >> > Unfortunately, this default makes no sense for "very
> >> > large" strings which may exceed 1000000L in "width".
> >>
> >> Yes; and I tend to agree with you that this default is outdated
> >> (Remember : R was written to work and run on 2 (or 4?) MB of RAM on the
> >> student lab Macs in Auckland in ca 1994).
> >>
> >> > The max width of a string is .Machine$integer.max-1:
> >>
> >> (which Brodie showed was only almost true)
> >>
> >> > So it seems to me either .Machine$integer.max or
> >> > .Machine$integer.max-1L would be a more sensible default. Am I missing
> >> > something?
> >>
> >> The "drawback" is of course that .Machine$integer.max is still
> >> a function call (as R beginners may forget) contrary to 1000000L,
> >> but that may even be inlined by the byte compiler (? how would we check ?)
> >> and even if it's not, it does more clearly convey the concept
> >> and idea *and* would probably even port automatically if ever
> >> integer would be increased in R.
>
> > We still have the problem that we need to count characters, not bytes,
> > if we want the default semantics of "until the end of the string".
>
> > I think we would have to fix this either by really using
> > "nchar(type="c"))" or by using e.g.
NULL and then treating this as a > > special case, that would be probably faster. > > > Tomas > > You are right, as always, Tomas. > I agree that would be better and we should do it if/when we change > the default there. > > Martin __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Should last default to .Machine$integer.max-1 for substring()
Currently, substring defaults to last=1000000L, which strongly suggests the intent is to default to "nchar(x)" without having to compute/allocate that up front.

Unfortunately, this default makes no sense for "very large" strings which may exceed 1000000L in "width". The max width of a string is .Machine$integer.max-1:

# works
x = strrep(" ", .Machine$integer.max-1L)
# fails
x = strrep(" ", .Machine$integer.max)
Error in strrep(" ", .Machine$integer.max) :
  'Calloc' could not allocate memory (18446744071562067968 of 1 bytes)

(see also the comment in src/main/character.c: "Character strings in R are less than 2^31-1 bytes, so we use int not size_t.")

So it seems to me either .Machine$integer.max or .Machine$integer.max-1L would be a more sensible default. Am I missing something?

Mike C

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
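In current R, substring()'s signature is substring(text, first, last = 1000000L), so the silent truncation that motivates this thread is easy to demonstrate with a string just past that default (sketch; 2e6 characters is about 2 MB):

```r
x <- strrep("a", 2e6)            # a 2,000,000-character string
nchar(x)
# [1] 2000000

nchar(substring(x, 1L))          # silently cut at the last = 1000000L default
# [1] 1000000

nchar(substr(x, 1L, nchar(x)))   # explicit last gets the whole string
# [1] 2000000
```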
[Rd] usage of #import in grDevices/src/qdCocoa.h
I happened to notice that this header file uses:

#import <Cocoa/Cocoa.h>

This is the first time I came across the preprocessor directive #import; the first thing I found about it is this Q&A suggesting it's not portable nor standard C:

https://stackoverflow.com/q/39280248/3576984

On the other hand, this exact invocation seems pretty common on GitHub:

https://github.com/search?l=C&q=%22%23import+%3CCocoa%2FCocoa.h%3E%22&type=Code

I don't see much in the way of relevant documentation for Cocoa besides stuff like this page from 2001:

http://cocoadevcentral.com/articles/31.php

Is this something that should be updated to use #include? Other packages appear to have done so, e.g. from GitHub:

https://github.com/search?q=%22%23include+cocoa%2Fcocoa.h%22&type=code

Michael Chirico

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Translations and snprintf on Windows
Following up on this after Matt D truffled out the issue -- snprintf was a *bit* of a red herring. The root of the issue is the use of positional identifiers (like %1$d, see https://stackoverflow.com/questions/19327441/gcc-dollar-sign-in-printf-format-string) in the format string for translations. These identifiers are quite useful for translations since grammar changes by language mean the inputs should naturally change order when changing languages -- however, they are a POSIX extension. snprintf produced the issue since all other messages are ultimately R functions, and base R handles the issue (AFAICT via trioremap.h). On Thu, Apr 30, 2020 at 4:16 PM Michael Chirico wrote: > [a bit unsure on if this is maybe better for r-package-devel] > > We recently added translations to messages at the R and C level to > data.table. > > At the C level, we did _() wrapping for char arrays supplied to the > following functions: error, warning, Rprintf, Error, and snprintf. > > This seemed OK but the use of snprintf specifically appears to have caused > a crash on Windows: > > https://github.com/Rdatatable/data.table/issues/4402 > > Is there any guidance against using gettext with snprintf, or perhaps > guidance on which "outputters" *are* OK for translation? > > Michael Chirico > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
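For reference, the reordering that makes positional identifiers attractive for translations can be seen from R itself, whose sprintf()/gettextf() support the %n$ form in their own implementation (so it works portably at the R level, including on Windows):

```r
# Positional specifiers let a translation reorder arguments without
# changing the call site; R's sprintf() supports the %n$ form:
sprintf("%1$s has %2$d rows", "DT", 5L)
# [1] "DT has 5 rows"
sprintf("%2$d rows are in %1$s", "DT", 5L)   # same arguments, reordered
# [1] "5 rows are in DT"
```

The crash described above arose at the C level, where the platform's own snprintf (without the POSIX extension) was being used directly.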
Re: [Rd] Should 0L * NA_integer_ be 0L?
OK, so maybe one way to paraphrase: For R, the boundedness of integer vectors is an implementation detail, rather than a deeper mathematical fact that can be exploited for this case. One might also expect then that overflow wouldn't result in NA, but rather be automatically cast up to numeric? But that this doesn't happen for efficiency reasons? Would it make any sense to have a different carveout for the logical case? For logical, storage as integer might be seen as a similar type of implementation detail (though if we're being this strict, the question arises of what multiplication of logical values even means). Under such a carveout, FALSE * NA would give 0L. On Sat, May 23, 2020 at 6:49 PM Martin Maechler wrote: > > >>>>> Michael Chirico > >>>>> on Sat, 23 May 2020 18:08:22 +0800 writes: > > > I don't see this specific case documented anywhere (I also tried to > search > > the r-devel archives, as well as I could); the only close reference > > mentions NA & FALSE = FALSE, NA | TRUE = TRUE. And there's also this > > snippet from R-lang: > > > In cases where the result of the operation would be the same for all > >> possible values the NA could take, the operation may return this value. > >> > > > This begs the question -- shouldn't 0L * NA_integer_ be 0L? > > > Because this is an integer operation, and according to this definition > of > > NA: > > > Missing values in the statistical sense, that is, variables whose value > >> is not known, have the value @code{NA} > >> > > > NA_integer_ should be an unknown integer value between -2^31+1 and > 2^31-1. > > Multiplying any of these values by 0 results in 0 -- that is, the > result of > > the operation would be 0 for all possible values the NA could take. > > > > This came up from what seems like an inconsistency to me: > > > all(NA, FALSE) > > # [1] FALSE > > NA * FALSE > > # [1] NA > > > I agree with all(NA, FALSE) being FALSE because we know for sure that > all > > cannot be true. 
The same can be said of the multiplication -- whether NA > > represents TRUE or FALSE, the resulting value is 0 (FALSE). > > > I also agree with the numeric case, FWIW: NA_real_ * 0 has to be > NA_real_, > > because NA_real_ could be Inf or NaN, for both of which multiplication > by 0 > > gives NaN, hence 0 * NA_real_ is either 0 or NaN, hence it must be > NA_real_. > > I agree about almost everything you say above. ... > but possibly the main conclusion. > > The problem with your proposed change would be that integer > arithmetic gives a different result than the corresponding > "numeric" computation. > (I don't remember other such cases in R, at least as long as the > integer arithmetic does not overflow.) > > One principle to decide such problems in S and R has been that > the user should typically *not* have to know if their data is > stored in float/double or in integer, and the results should be the same > (possibly apart from staying integer for some operations). > > > {{Note that there are also situations where it's really > undesirable that 0 * NA does *not* give 0 (but NA); > notably in sparse matrix operations where you'd very often > know that NA was not Inf (or NaN) and you really would like to > preserve sparseness ...}} > > > > [[alternative HTML version deleted]] > > (as you did not use plain text ..) __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Should 0L * NA_integer_ be 0L?
I don't see this specific case documented anywhere (I also tried to search the r-devel archives, as well as I could); the only close reference mentions NA & FALSE = FALSE, NA | TRUE = TRUE. And there's also this snippet from R-lang:

> In cases where the result of the operation would be the same for all
> possible values the NA could take, the operation may return this value.

This begs the question -- shouldn't 0L * NA_integer_ be 0L? Because this is an integer operation, and according to this definition of NA:

> Missing values in the statistical sense, that is, variables whose value
> is not known, have the value @code{NA}

NA_integer_ should be an unknown integer value between -2^31+1 and 2^31-1. Multiplying any of these values by 0 results in 0 -- that is, the result of the operation would be 0 for all possible values the NA could take. This came up from what seems like an inconsistency to me:

all(NA, FALSE)
# [1] FALSE
NA * FALSE
# [1] NA

I agree with all(NA, FALSE) being FALSE because we know for sure that all cannot be true. The same can be said of the multiplication -- whether NA represents TRUE or FALSE, the resulting value is 0 (FALSE). I also agree with the numeric case, FWIW: NA_real_ * 0 has to be NA_real_, because NA_real_ could be Inf or NaN, for both of which multiplication by 0 gives NaN, hence 0 * NA_real_ is either 0 or NaN, hence it must be NA_real_. [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
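The contrast described in this thread, alongside the purely-logical operations that already apply the "same result for all possible values" rule, can be checked directly:

```r
# Logical operations already exploit the "same result for any value of NA"
# rule, while multiplication (which coerces logical to integer) does not:
NA & FALSE        # FALSE -- FALSE regardless of what NA stands for
all(NA, FALSE)    # FALSE -- same reasoning
NA * FALSE        # NA    -- logical coerced to integer first
0L * NA_integer_  # NA    -- the case in question
```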
[Rd] Translations and snprintf on Windows
[a bit unsure on if this is maybe better for r-package-devel] We recently added translations to messages at the R and C level to data.table. At the C level, we did _() wrapping for char arrays supplied to the following functions: error, warning, Rprintf, Error, and snprintf. This seemed OK but the use of snprintf specifically appears to have caused a crash on Windows: https://github.com/Rdatatable/data.table/issues/4402 Is there any guidance against using gettext with snprintf, or perhaps guidance on which "outputters" *are* OK for translation? Michael Chirico [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] head.matrix can return 1000s of columns -- limit to n or add new argument?
Awesome. Gabe, since you already have a workshopped version, would you like to proceed? Feel free to ping me to review the patch once it's posted. On Mon, Sep 16, 2019 at 3:26 PM Martin Maechler wrote: > >>>>> Michael Chirico > >>>>> on Sun, 15 Sep 2019 20:52:34 +0800 writes: > > > Finally read in detail your response Gabe. Looks great, > > and I agree it's quite intuitive, as well as agree against > > non-recycling. > > > Once the length(n) == length(dim(x)) behavior is enabled, > > I don't think there's any need/desire to have head() do > > x[1:6,1:6] anymore. head(x, c(6, 6)) is quite clear for > > those familiar with head(x, 6), it would seem to me. > > > Mike C > > Thank you, Gabe, and Michael. > I did like Gabe's proposal already back in July but was > busy and/or vacationing then ... > > If you submit this with a patch (that includes changes to both > *.R and *.Rd , including some example) as "wishlist" item to R's > bugzilla, I'm willing/happy to check and commit this to R-devel. > > Martin > > > > On Sat, Jul 13, 2019 at 8:35 AM Gabriel Becker > > wrote: > > >> Hi Michael and Abby, > >> > >> So one thing that could happen that would be backwards > >> compatible (with the exception of something that was an > >> error no longer being an error) is head and tail could > >> take vectors of length (dim(x)) rather than integers of > >> length for n, with the default being n=6 being equivalent > >> to n = c(6, dim(x)[2], <...>, dim(x)[k]), at least for > >> the deprecation cycle, if not permanently. It not > >> recycling would be unexpected based on the behavior of > >> many R functions but would preserve the current behavior > >> while granting more fine-grained control to users that > >> feel they need it. > >> > >> A rapidly thrown-together prototype of such a method for > >> the head of a matrix case is as follows: > >> > >> head2 = function(x, n = 6L, ...) 
{ indvecs = lapply(seq_along(dim(x)), function(i) {
> >>   if(length(n) >= i) {
> >>     ni = n[i]
> >>   } else {
> >>     ni = dim(x)[i]
> >>   }
> >>   if(ni < 0L) ni = max(nrow(x) + ni, 0L)
> >>   else ni = min(ni, dim(x)[i])
> >>   seq_len(ni)
> >> })
> >> lstargs = c(list(x), indvecs, drop = FALSE)
> >> do.call("[", lstargs)
> >> }
> >>
> >> > mat = matrix(1:100, 10, 10)
> >> > *head(mat)*
> >>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> >> [1,]    1   11   21   31   41   51   61   71   81    91
> >> [2,]    2   12   22   32   42   52   62   72   82    92
> >> [3,]    3   13   23   33   43   53   63   73   83    93
> >> [4,]    4   14   24   34   44   54   64   74   84    94
> >> [5,]    5   15   25   35   45   55   65   75   85    95
> >> [6,]    6   16   26   36   46   56   66   76   86    96
> >> > *head2(mat)*
> >>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> >> [1,]    1   11   21   31   41   51   61   71   81    91
> >> [2,]    2   12   22   32   42   52   62   72   82    92
> >> [3,]    3   13   23   33   43   53   63   73   83    93
> >> [4,]    4   14   24   34   44   54   64   74   84    94
> >> [5,]    5   15   25   35   45   55   65   75   85    95
> >> [6,]    6   16   26   36   46   56   66   76   86    96
> >> > *head2(mat, c(2, 3))*
> >>      [,1] [,2] [,3]
> >> [1,]    1   11   21
> >> [2,]    2   12   22
> >> > *head2(mat, c(2, -9))*
> >>      [,1]
> >> [1,]    1
> >> [2,]    2
> >>
> >> Now one thing to keep in mind here, is that I think we'd either a) have to make the non-recycling behavior permanent, or b) have head treat data.frames and matrices different with respect to the subsets they grab (which strikes me as a *Bad Plan *(tm)).
> >>
> >> So I don't think the default behavior would ever be mat[1:6, 1:6], not because of backwards compatibility, but because at least in my intuition that is just not what he
Re: [Rd] head.matrix can return 1000s of columns -- limit to n or add new argument?
Finally read in detail your response Gabe. Looks great, and I agree it's quite intuitive, as well as agree against non-recycling. Once the length(n) == length(dim(x)) behavior is enabled, I don't think there's any need/desire to have head() do x[1:6,1:6] anymore. head(x, c(6, 6)) is quite clear for those familiar with head(x, 6), it would seem to me. Mike C On Sat, Jul 13, 2019 at 8:35 AM Gabriel Becker wrote: > Hi Michael and Abby, > > So one thing that could happen that would be backwards compatible (with > the exception of something that was an error no longer being an error) is > head and tail could take vectors of length (dim(x)) rather than integers of > length for n, with the default being n=6 being equivalent to n = c(6, > dim(x)[2], <...>, dim(x)[k]), at least for the deprecation cycle, if not > permanently. It not recycling would be unexpected based on the behavior of > many R functions but would preserve the current behavior while granting > more fine-grained control to users that feel they need it. > > A rapidly thrown-together prototype of such a method for the head of a > matrix case is as follows: > > head2 = function(x, n = 6L, ...) 
{
>   indvecs = lapply(seq_along(dim(x)), function(i) {
>     if(length(n) >= i) {
>       ni = n[i]
>     } else {
>       ni = dim(x)[i]
>     }
>     if(ni < 0L)
>       ni = max(nrow(x) + ni, 0L)
>     else
>       ni = min(ni, dim(x)[i])
>     seq_len(ni)
>   })
>   lstargs = c(list(x), indvecs, drop = FALSE)
>   do.call("[", lstargs)
> }
>
> > mat = matrix(1:100, 10, 10)
>
> > *head(mat)*
>
>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> [1,]    1   11   21   31   41   51   61   71   81    91
> [2,]    2   12   22   32   42   52   62   72   82    92
> [3,]    3   13   23   33   43   53   63   73   83    93
> [4,]    4   14   24   34   44   54   64   74   84    94
> [5,]    5   15   25   35   45   55   65   75   85    95
> [6,]    6   16   26   36   46   56   66   76   86    96
>
> > *head2(mat)*
>
>      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
> [1,]    1   11   21   31   41   51   61   71   81    91
> [2,]    2   12   22   32   42   52   62   72   82    92
> [3,]    3   13   23   33   43   53   63   73   83    93
> [4,]    4   14   24   34   44   54   64   74   84    94
> [5,]    5   15   25   35   45   55   65   75   85    95
> [6,]    6   16   26   36   46   56   66   76   86    96
>
> > *head2(mat, c(2, 3))*
>
>      [,1] [,2] [,3]
> [1,]    1   11   21
> [2,]    2   12   22
>
> > *head2(mat, c(2, -9))*
>
>      [,1]
> [1,]    1
> [2,]    2
>
> Now one thing to keep in mind here, is that I think we'd either a) have to make the non-recycling behavior permanent, or b) have head treat data.frames and matrices different with respect to the subsets they grab (which strikes me as a *Bad Plan *(tm)).
>
> So I don't think the default behavior would ever be mat[1:6, 1:6], not because of backwards compatibility, but because at least in my intuition that is just not what head on a data.frame should do by default, and I think the behaviors for the basic rectangular datatypes should "stick together". I mean, also because of backwards compatibility, but that could *in theory* change across a long enough deprecation cycle, but the conceptually right thing to do with a data.frame probably won't. 
> > All of that said, is head(mat, c(6, 6)) really that much easier to > type/better than just mat[1:6, 1:6, drop=FALSE] (I know this will behave > differently if any of the dims of mat are less than 6, but if so why are > you heading it in the first place ;) )? I don't really have a strong > feeling on the answer to that. > > I'm happy to put a patch for head.matrix, head.data.frame, tail.matrix and > tail.data.frame, plus documentation, if people on R-core are interested in > this. > > Note, as most here probably know, and as alluded to above, length(n) > 1 > for head or tail currently give an error, so this would be an extension > of the existing functionality in the mathematical extension sense, where > all existing behavior would remain identical, but the support/valid > parameter space would grow. > > Best, > ~G > > > On Fri, Jul 12, 2019 at 4:03 PM Abby Spurdle wrote: > >> > I assume there are lots of backwards-compatibility issues as well as >> valid >> > use cases for this behavior, so I guess defaulting to M[1:6, 1:6] is out >> of >> > the question. >> >> Agree. >> >> > Is there any scope for adding a new argument to head.matrix that would >> > allow this flexibility? >> >> I agree with what you're trying to achieve. >> However, I'm not sure this is as simple as you're suggesting. >> >> What if the user wants "head" in rows but "tail" in columns. >> Or "head" in rows, and both "head" and "tail" in columns. >> With head and tail alone, there's a combinatorial explosion. >> >> Also, when using tail on an unnamed matrix,
[Rd] --disable-long-double or --enable-long-double=no?
There's a bit of confusion about how to disable long double support in an R build. I see --disable-long-double scattered about, e.g.
- R-exts: https://cran.r-project.org/doc/manuals/r-release/R-exts.html#Writing-portable-packages
- R-admin: https://cran.r-project.org/doc/manuals/r-release/R-admin.html#Solaris
- CRAN noLD check description: https://www.stats.ox.ac.uk/pub/bdr/noLD/README.txt
- ?capabilities: https://stat.ethz.ch/R-manual/R-devel/library/base/html/capabilities.html

However, it's *missing* from ./configure (cd r-source && grep "disable-long-double" configure). Instead there appears to be some code built around enable-long-double:

./configure:1808:  --enable-long-double    use long double type [yes]
./configure:24723:# Check whether --enable-long-double was given.

I see the option apparently introduced here in 2012 & the ambiguity is immediate -- the commit mentions disable-long-double but builds enable-long-double. https://github.com/wch/r-source/commit/fb8e36f8be0aaf47a9c54c9effb219dae34f0e41 Could someone please help clear up the confusion? Thanks Michael Chirico [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
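For what it's worth, this pattern matches the standard autoconf convention: a generated configure script that declares --enable-foo automatically accepts --disable-foo as shorthand for --enable-foo=no, so only the --enable-* spelling appears in the script itself. A minimal sketch of that normalization (illustrative only, not R's actual configure code):

```r
# autoconf-style option normalization: --disable-foo is treated as
# --enable-foo=no (illustrative helper, not part of any build tool)
normalize_option <- function(opt) {
  sub("^--disable-(.*)$", "--enable-\\1=no", opt)
}
normalize_option("--disable-long-double")
# [1] "--enable-long-double=no"
```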
[Rd] Why no tz argument for format.POSIXlt?
Was a bit surprised to see

oldtz = Sys.getenv('TZ')
Sys.setenv(TZ = 'Asia/Jakarta')
format(Sys.time())
# [1] "2019-08-13 16:05:03"
format(Sys.time(), tz = 'UTC') # all is well
# [1] "2019-08-13 09:05:03"
format(trunc(Sys.time(), 'hours')) # correctly truncated in local time
# [1] "2019-08-13 16:00:00"
format(trunc(Sys.time(), 'hours'), tz = 'UTC') # no effect!
# [1] "2019-08-13 16:00:00"
Sys.setenv(TZ = oldtz)

The reason for the discrepancy is that trunc.POSIXt returns a POSIXlt object (not POSIXct), whereas Sys.time() is POSIXct. And while format.POSIXct has a tz argument, format.POSIXlt does not:

names(formals(format.POSIXct))
# [1] "x"      "format" "tz"     "usetz"  "..."
names(formals(format.POSIXlt))
# [1] "x"      "format" "usetz"  "..."

Is there any reason not to accept a tz argument for format.POSIXlt? It's quite convenient to be able to specify an output timezone format on the fly with format.POSIXct; in the case at hand, I'm trying to force UTC time on input. format(as.POSIXct(x), tz = 'UTC') seems to work just fine, is there a reason why this wouldn't be done internally? Michael Chirico [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
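The workaround mentioned at the end can be packaged up as a small user-level helper (format_lt is a hypothetical name, not a base R function): route the POSIXlt through POSIXct, whose format() method does accept tz=.

```r
# Work around format.POSIXlt's missing tz= by converting to POSIXct first:
format_lt <- function(x, tz = "", ...) format(as.POSIXct(x), tz = tz, ...)

t0 <- as.POSIXlt("2019-08-13 16:00:00", tz = "Asia/Jakarta")
format(t0)                 # "2019-08-13 16:00:00" -- a tz= here would be ignored
format_lt(t0, tz = "UTC")  # "2019-08-13 09:00:00" -- Jakarta is UTC+7
```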
[Rd] bug: write.dcf converts hyphen in field name to period
write.dcf(list('my-field' = 1L), tmp <- tempfile())
cat(readLines(tmp))
# my.field: 1

However there's nothing wrong with hyphenated field names per the Debian standard: https://www.debian.org/doc/debian-policy/ch-controlfields.html And in fact we see hyphenated fields used there, and indeed read.dcf handles them just fine:

writeLines(gsub('.', '-', readLines(tmp), fixed = TRUE), tmp)
read.dcf(tmp)
#      my-field
# [1,] "1"

The guilty line is the as.data.frame() call:

if(!is.data.frame(x)) x <- as.data.frame(x, stringsAsFactors = FALSE)

Simply adding check.names = FALSE to this call would solve the issue in my case, but I think not in general. Here's what I see in the standard:

> The field name is composed of US-ASCII characters excluding control characters, space, and colon (i.e., characters in the ranges U+0021 (!) through U+0039 (9), and U+003B (;) through U+007E (~), inclusive). Field names must not begin with the comment character (U+0023 #), nor with the hyphen character (U+002D -).

This could be handled by an adjustment to the next line:

nmx <- names(x)

becomes

nmx <- gsub('^[#-]', '', gsub('[^\U{0021}-\U{0039}\U{003B}-\U{007E}]', '.', names(x)))

(Or maybe erroring for invalid names) Michael Chirico [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
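The proposed sanitization, illustrated as a standalone helper (sanitize_dcf_names is a hypothetical name for demonstration, not the actual write.dcf internals):

```r
# Replace characters outside the Debian field-name ranges with '.',
# then strip a leading '#' or '-' (per the policy quoted above):
sanitize_dcf_names <- function(nmx) {
  nmx <- gsub("[^\u0021-\u0039\u003B-\u007E]", ".", nmx)
  gsub("^[#-]", "", nmx)
}
sanitize_dcf_names(c("my-field", "bad name", "#comment"))
# [1] "my-field" "bad.name" "comment"
```

Note that "my-field" survives intact (hyphen is only forbidden as the *first* character), while the space and the leading comment character are handled.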
[Rd] head.matrix can return 1000s of columns -- limit to n or add new argument?
I think of head() as a standard helper for "glancing" at objects, so I'm sometimes surprised that head() produces massive output:

M = matrix(nrow = 10L, ncol = 10L)
print(head(M)) # <- beware, could be a huge print

I assume there are lots of backwards-compatibility issues as well as valid use cases for this behavior, so I guess defaulting to M[1:6, 1:6] is out of the question. Is there any scope for adding a new argument to head.matrix that would allow this flexibility? IINM a head.array method could essentially be as simple as:

do.call(`[`, c(list(x, drop = FALSE), lapply(pmin(dim(x), n), seq_len)))

(with extra decoration to handle -n, etc.) [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
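A slightly expanded sketch of that one-liner (head_array is a hypothetical name; only positive n is handled here, with negative-n support left as the "extra decoration" the post mentions):

```r
# Take at most n elements along each dimension; n recycles against dim(x):
head_array <- function(x, n = 6L) {
  idx <- lapply(pmin(dim(x), n), seq_len)
  do.call(`[`, c(list(x), idx, list(drop = FALSE)))
}

M <- matrix(seq_len(100L), 10L, 10L)
dim(head_array(M))             # 6 6
dim(head_array(M, c(2L, 3L)))  # 2 3
```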
Re: [Rd] rbind has confusing result for custom sub-class (possible bug?)
Thanks for following up! In fact that's exactly what was done here: https://github.com/Rdatatable/data.table/pull/3602/files On Sun, Jun 2, 2019 at 8:00 PM Joshua Ulrich wrote: > I thought it would be good to summarize my thoughts, since I made a > few hypotheses that turned out to be false. > > This isn't a bug in base R, in either rbind() or `[<-.Date`. > > To summarize the root cause: > base::rbind.data.frame() calls `[<-` for each column of the > data.frame, and there is no `[<-.IDate` method to ensure the > replacement value is converted to integer. And, in fact, `[<-.Date` > calls as.Date() and data.table::as.Date.IDate() calls as.numeric() on > the IDate object. So the problem exists, and can be fixed, in > data.table. > > Best, > Josh > > On Mon, May 27, 2019 at 9:34 AM Joshua Ulrich > wrote: > > > > Follow-up (inline) on my comment about a potential issue in `[<-.Date`. > > > > On Mon, May 27, 2019 at 9:31 AM Michael Chirico > > wrote: > > > > > > Yes, thanks for following up on thread here. And thanks again for > clearing things up, your email was a finger snap of clarity on the whole > issue. > > > > > > I'll add that actually it was data.table's code at fault on the > storage conversion -- note that if you use an arbitrary sub-class 'foo' > with no methods defined, it'll stay integer. > > > > > > That's because [<- calls as.Date and then as.Date.IDate, and that > method (ours) has as.numeric(); earlier I had recognized that if we > commented that line, the issue was "fixed" but I still wasn't understanding > the root cause. > > > > > > My last curiosity on this issue will be in my follow-up thread. 
> > > > > > Mike C > > > > > > On Mon, May 27, 2019, 10:25 PM Joshua Ulrich > wrote: > > >> > > >> On Sun, May 26, 2019 at 6:47 AM Joshua Ulrich < > josh.m.ulr...@gmail.com> wrote: > > >> > > > >> > On Sun, May 26, 2019 at 4:06 AM Michael Chirico > > >> > wrote: > > >> > > > > >> > > Have finally managed to come up with a fix after checking out > sys.calls() > > >> > > from within the as.Date.IDate debugger, which shows something > like: > > >> > > > > >> > > [[1]] rbind(DF, DF) > > >> > > [[2]] rbind(deparse.level, ...) > > >> > > [[3]] `[<-`(`*tmp*`, ri, value = 18042L) > > >> > > [[4]] `[<-.Date`(`*tmp*`, ri, value = 18042L) > > >> > > [[5]] as.Date(value) > > >> > > [[6]] as.Date.IDate(value) > > >> > > > > >> > > I'm not sure why [<- is called, I guess the implementation is to > assign to > > >> > > the output block by block? Anyway, we didn't have a [<- method. > And > > >> > > [<-.Date looks like: > > >> > > > > >> > > value <- unclass(as.Date(value)) # <- converts to double > > >> > > .Date(NextMethod(.Generic), oldClass(x)) # <- restores 'IDate' > class > > >> > > > > >> > > So we can fix our bug by defining a [<- class; the question that > I still > > >> > > don't see answered in documentation or source code is, why/where > is [<- > > >> > > called, exactly? > > >> > > > > >> > Your rbind(DF, DF) call dispatches to base::rbind.data.frame(). The > > >> > `[<-` call is this line: > > >> > value[[jj]][ri] <- if (is.factor(xij)) as.vector(xij) else xij > > >> > > > >> > That's where the storage.mode changes from integer to double. 
> > >> > > > >> > debug: value[[jj]][ri] <- if (is.factor(xij)) as.vector(xij) else > xij > > >> > Browse[2]> > > >> > debug: xij > > >> > Browse[2]> storage.mode(xij) > > >> > [1] "integer" > > >> > Browse[2]> value[[jj]][ri] > > >> > [1] "2019-05-26" > > >> > Browse[2]> storage.mode(value[[jj]][ri]) > > >> > [1] "integer" > > >> > Browse[2]> > > >> > debug: if (!is.null(nm <- names(xij))) names(value[[jj]])[ri] <- nm > > >> > Browse[2]> storage.mode(value[[jj]][ri]) > > >> > [1] "double" > > >> > > > >> To be clear, I don't think this is a
[Rd] Why is R in Japanese (only in Mac terminal)?
Since a while ago, R on my Mac terminal is being started in Japanese: R version 3.5.2 (2018-12-20) -- "Eggshell Igloo" Copyright (C) 2018 The R Foundation for Statistical Computing Platform: x86_64-apple-darwin15.6.0 (64-bit) R は、自由なソフトウェアであり、「完全に無保証」です。 一定の条件に従えば、自由にこれを再配布することができます。 配布条件の詳細に関しては、'license()' あるいは 'licence()' と入力してください。 Natural language support but running in an English locale R is a collaborative project with many contributors. Type 'contributors()' for more information and 'citation()' on how to cite R or R packages in publications. 'demo()' と入力すればデモをみることができます。 'help()' とすればオンラインヘルプが出ます。 'help.start()' で HTML ブラウザによるヘルプがみられます。 'q()' と入力すれば R を終了します。 I never gave it too much mind since I understand Japanese and am mostly working in RStudio anyway (RStudio is in English). But I found a "bug" in testthat's is_english (which tests whether the current session is reporting base messages in English) and reported here: https://github.com/r-lib/testthat/issues/879 I say "bug" because as near as I can tell is_english is built assuming the logic laid out in ?gettext, ?locales. So even though my machine appears to have none of the "symptoms" of a non-English locale, nevertheless I get Japanese. My session info: R version 3.5.2 (2018-12-20) Platform: x86_64-apple-darwin15.6.0 (64-bit) Running under: macOS High Sierra 10.13.6 Matrix products: default BLAS: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRblas.0.dylib LAPACK: /Library/Frameworks/R.framework/Versions/3.5/Resources/lib/libRlapack.dylib locale: [1] C/UTF-8/C/C/C/C attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] compiler_3.5.2 My Sys.getenv() and "Languages & Region" settings are in the issue link. Where else should I be looking in my R session or terminal to figure out why it's in Japanese? 
Mike C [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] rbind has confusing result for custom sub-class (possible bug?)
Yes, thanks for following up on thread here. And thanks again for clearing things up, your email was a finger snap of clarity on the whole issue. I'll add that actually it was data.table's code at fault on the storage conversion -- note that if you use an arbitrary sub-class 'foo' with no methods defined, it'll stay integer. That's because [<- calls as.Date and then as.Date.IDate, and that method (ours) has as.numeric(); earlier I had recognized that if we commented that line, the issue was "fixed" but I still wasn't understanding the root cause. My last curiosity on this issue will be in my follow-up thread. Mike C On Mon, May 27, 2019, 10:25 PM Joshua Ulrich wrote: > On Sun, May 26, 2019 at 6:47 AM Joshua Ulrich > wrote: > > > > On Sun, May 26, 2019 at 4:06 AM Michael Chirico > > wrote: > > > > > > Have finally managed to come up with a fix after checking out > sys.calls() > > > from within the as.Date.IDate debugger, which shows something like: > > > > > > [[1]] rbind(DF, DF) > > > [[2]] rbind(deparse.level, ...) > > > [[3]] `[<-`(`*tmp*`, ri, value = 18042L) > > > [[4]] `[<-.Date`(`*tmp*`, ri, value = 18042L) > > > [[5]] as.Date(value) > > > [[6]] as.Date.IDate(value) > > > > > > I'm not sure why [<- is called, I guess the implementation is to > assign to > > > the output block by block? Anyway, we didn't have a [<- method. And > > > [<-.Date looks like: > > > > > > value <- unclass(as.Date(value)) # <- converts to double > > > .Date(NextMethod(.Generic), oldClass(x)) # <- restores 'IDate' class > > > > > > So we can fix our bug by defining a [<- class; the question that I > still > > > don't see answered in documentation or source code is, why/where is [<- > > > called, exactly? > > > > > Your rbind(DF, DF) call dispatches to base::rbind.data.frame(). The > > `[<-` call is this line: > > value[[jj]][ri] <- if (is.factor(xij)) as.vector(xij) else xij > > > > That's where the storage.mode changes from integer to double. 
> > > > debug: value[[jj]][ri] <- if (is.factor(xij)) as.vector(xij) else xij > > Browse[2]> > > debug: xij > > Browse[2]> storage.mode(xij) > > [1] "integer" > > Browse[2]> value[[jj]][ri] > > [1] "2019-05-26" > > Browse[2]> storage.mode(value[[jj]][ri]) > > [1] "integer" > > Browse[2]> > > debug: if (!is.null(nm <- names(xij))) names(value[[jj]])[ri] <- nm > > Browse[2]> storage.mode(value[[jj]][ri]) > > [1] "double" > > > To be clear, I don't think this is a bug in rbind() or > rbind.data.frame(). The confusion is that rbind.data.frame() calls > `[<-` for each column of the data.frame, and there is no `[<-.IDate` > method. So the parent class method is dispatched, which converts the > storage mode to double. > > Someone may argue that this is an issue with `[<-.Date`, and that it > shouldn't convert the storage.mode from integer to double. > > > > > Mike C > > > > > > On Sun, May 26, 2019 at 1:16 PM Michael Chirico < > michaelchiri...@gmail.com> > > > wrote: > > > > > > > Debugging this issue: > > > > > > > > https://github.com/Rdatatable/data.table/issues/2008 > > > > > > > > We have custom class 'IDate' which inherits from 'Date' (it just > forces > > > > integer storage for efficiency, hence, I). > > > > > > > > The concatenation done by rbind, however, breaks this and returns a > double: > > > > > > > > library(data.table) > > > > DF = data.frame(date = as.IDate(Sys.Date())) > > > > storage.mode(rbind(DF, DF)$date) > > > > # [1] "double" > > > > > > > > This is specific to base::rbind (data.table's rbind returns an > integer as > > > > expected); in ?rbind we see: > > > > > > > > The method dispatching is not done via UseMethod(), but by C-internal > > > > dispatching. Therefore there is no need for, e.g., rbind.default. > > > > The dispatch algorithm is described in the source file > > > > (‘.../src/main/bind.c’) as > > > > 1. For each argument we get the list of possible class memberships > from > > > > the class attribute. > > > > 2. 
*We inspect each class in turn to see if there is an applic
Re: [Rd] rbind has confusing result for custom sub-class (possible bug?)
Have finally managed to come up with a fix after checking out sys.calls() from within the as.Date.IDate debugger, which shows something like:

[[1]] rbind(DF, DF)
[[2]] rbind(deparse.level, ...)
[[3]] `[<-`(`*tmp*`, ri, value = 18042L)
[[4]] `[<-.Date`(`*tmp*`, ri, value = 18042L)
[[5]] as.Date(value)
[[6]] as.Date.IDate(value)

I'm not sure why [<- is called, I guess the implementation is to assign to the output block by block? Anyway, we didn't have a [<- method. And [<-.Date looks like:

value <- unclass(as.Date(value)) # <- converts to double
.Date(NextMethod(.Generic), oldClass(x)) # <- restores 'IDate' class

So we can fix our bug by defining a [<- method; the question that I still don't see answered in documentation or source code is, why/where is [<- called, exactly? Mike C On Sun, May 26, 2019 at 1:16 PM Michael Chirico wrote: > Debugging this issue: > > https://github.com/Rdatatable/data.table/issues/2008 > > We have custom class 'IDate' which inherits from 'Date' (it just forces > integer storage for efficiency, hence, I). > > The concatenation done by rbind, however, breaks this and returns a double: > > library(data.table) > DF = data.frame(date = as.IDate(Sys.Date())) > storage.mode(rbind(DF, DF)$date) > # [1] "double" > > This is specific to base::rbind (data.table's rbind returns an integer as > expected); in ?rbind we see: > > The method dispatching is not done via UseMethod(), but by C-internal > dispatching. Therefore there is no need for, e.g., rbind.default. > The dispatch algorithm is described in the source file > (‘.../src/main/bind.c’) as > 1. For each argument we get the list of possible class memberships from > the class attribute. > 2. *We inspect each class in turn to see if there is an applicable > method.* > 3. If we find an applicable method we make sure that it is identical to > any method determined for prior arguments. If it is identical, we proceed, > otherwise we immediately drop through to the default code. 
> > It's not clear what #2 means -- an applicable method *for what*? Glancing > at the source code would suggest it's looking for rbind.IDate: > > https://github.com/wch/r-source/blob/trunk/src/main/bind.c#L1051-L1063 > > const char *generic = ((PRIMVAL(op) == 1) ? "cbind" : "rbind"); // should > be rbind here > const char *s = translateChar(STRING_ELT(classlist, i)); // iterating over > the classes, should get to IDate first > sprintf(buf, "%s.%s", generic, s); // should be rbind.IDate > > but adding this method (or even exporting it) is no help [ simply defining > rbind.IDate = function(...) as.IDate(NextMethod()) ] > > Lastly, it appears that as.Date.IDate is called, which is causing the type > conversion: > > debug(data.table:::as.Date.IDate) > rbind(DF, DF) # launches debugger > x > # [1] "2019-05-26" <-- singleton, so apparently applied to DF$date, not > c(DF$date, DF$date) > undebug(data.table:::as.Date.IDate) > > I can't really wrap my head around why as.Date is being called here, and > even allowing that, why the end result is still the original class [ > class(rbind(DF, DF)$date) == c('IDate', 'Date') ] > > So, I'm beginning to think this might be a bug. Am I missing something? > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
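A minimal sketch of the kind of `[<-` method described in the reply above (illustrative only, not necessarily data.table's actual fix): it unclasses first so the default integer subassignment runs, never reaching the double coercion inside `[<-.Date`:

```r
# Sketch: assumes 'IDate' stores integer days since the epoch, and that
# 'value' is either Date-like or already an integer day count.
`[<-.IDate` <- function(x, ..., value) {
  cls <- oldClass(x)
  x <- unclass(x)                      # plain integer vector from here on
  if (inherits(value, "Date"))
    value <- unclass(as.Date(value))   # days since epoch (possibly double)
  value <- as.integer(value)           # force integer storage
  x[...] <- value                      # default (non-Date) subassignment
  oldClass(x) <- cls
  x
}
```

The key difference from `[<-.Date` is that the replacement value is coerced with as.integer() rather than left as the double produced by unclass(as.Date(value)).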
[Rd] rbind has confusing result for custom sub-class (possible bug?)
Debugging this issue:

https://github.com/Rdatatable/data.table/issues/2008

We have custom class 'IDate' which inherits from 'Date' (it just forces integer storage for efficiency, hence, I).

The concatenation done by rbind, however, breaks this and returns a double:

library(data.table)
DF = data.frame(date = as.IDate(Sys.Date()))
storage.mode(rbind(DF, DF)$date)
# [1] "double"

This is specific to base::rbind (data.table's rbind returns an integer as expected); in ?rbind we see:

The method dispatching is not done via UseMethod(), but by C-internal dispatching. Therefore there is no need for, e.g., rbind.default. The dispatch algorithm is described in the source file (‘.../src/main/bind.c’) as

1. For each argument we get the list of possible class memberships from the class attribute.
2. *We inspect each class in turn to see if there is an applicable method.*
3. If we find an applicable method we make sure that it is identical to any method determined for prior arguments. If it is identical, we proceed, otherwise we immediately drop through to the default code.

It's not clear what #2 means -- an applicable method *for what*? Glancing at the source code would suggest it's looking for rbind.IDate:

https://github.com/wch/r-source/blob/trunk/src/main/bind.c#L1051-L1063

const char *generic = ((PRIMVAL(op) == 1) ? "cbind" : "rbind"); // should be rbind here
const char *s = translateChar(STRING_ELT(classlist, i)); // iterating over the classes, should get to IDate first
sprintf(buf, "%s.%s", generic, s); // should be rbind.IDate

but adding this method (or even exporting it) is no help [ simply defining rbind.IDate = function(...) as.IDate(NextMethod()) ]

Lastly, it appears that as.Date.IDate is called, which is causing the type conversion:

debug(data.table:::as.Date.IDate)
rbind(DF, DF) # launches debugger
x
# [1] "2019-05-26" <-- singleton, so apparently applied to DF$date, not c(DF$date, DF$date)
undebug(data.table:::as.Date.IDate)

I can't really wrap my head around why as.Date is being called here, and even allowing that, why the end result is still the original class [ class(rbind(DF, DF)$date) == c('IDate', 'Date') ]

So, I'm beginning to think this might be a bug. Am I missing something?

[[alternative HTML version deleted]]
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] \dots used improperly in ?Rprof examples
\dots is used in the Usage section of the Rprof manual, but it's not rendered as ... I'm not sure if this should be \ldots, or just written manually as ...

Also, I think the Rprof() on the first line is intended to be on the second line, so that the flow looks like:

Rprof() # start profiling
## some code to be profiled
Rprof(NULL) # shut off profiling
## some code NOT to be profiled
Rprof(append = TRUE) # turn profiling back on and append output to current file
## some code to be profiled
Rprof(NULL)
## ... et cetera
## Now post-process the output as described in Details

As it is, the first line looks like it's commented out.

Michael Chirico

[[alternative HTML version deleted]]
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] strtoi output of empty string inconsistent across platforms
Thanks Martin. For what it's worth, this extremely representative, highly scientific Twitter poll suggests the Mac/Linux split is pretty stark (NA on Mac, 0 on Linux):
https://twitter.com/michael_chirico/status/1083649190117306369?s=17

On Sat, Jan 12, 2019, 2:00 AM Martin Maechler wrote:
> >>>>> Martin Maechler
> >>>>> on Fri, 11 Jan 2019 09:44:14 +0100 writes:
>
> >>>>> Michael Chirico
> >>>>> on Fri, 11 Jan 2019 14:36:17 +0800 writes:
>
> >> Identified as root cause of a bug in data.table:
> >> https://github.com/Rdatatable/data.table/issues/3267
>
> >> On my machine, strtoi("", base = 2L) produces NA_integer_
> >> (which seems consistent with ?strtoi: "Values which
> >> cannot be interpreted as integers or would overflow are
> >> returned as NA_integer_").
>
> > indeed consistent with R's documentation on strtoi().
> > What machine would that be?
>
> >> But on all the other machines I've seen, 0L is
> >> returned. This seems to be consistent with the output of
> >> a simple C program using the underlying strtol function
> >> (see data.table link for this program, and for full
> >> sessionInfo() of some environments with differing
> >> output).
>
> >> So, what is the correct output of strtoi("", base = 2L)?
> >> Is the cross-platform inconsistency to be
> >> expected/documentable?
>
> > The inconsistency is certainly undesirable. The relevant
> > utility function in R's source (/src/main/character.c) is
>
> > static int strtoi(SEXP s, int base)
> > {
> >     long int res;
> >     char *endp;
> >
> >     /* strtol might return extreme values on error */
> >     errno = 0;
> >
> >     if (s == NA_STRING) return(NA_INTEGER);
> >     res = strtol(CHAR(s), &endp, base); /* ASCII */
> >     if (errno || *endp != '\0') res = NA_INTEGER;
> >     if (res > INT_MAX || res < INT_MIN) res = NA_INTEGER;
> >     return (int) res;
> > }
>
> > and so it clearly is a platform-inconsistency in the
> > underlying C library's strtol().
>
> (corrected typos here:)
>
> > I think we should make this cross-platform consistent ...
> > and indeed it makes much sense to ensure the result of
> > strtoi("", base=2L) to become NA_integer_
> > but chances are that would break code that has relied on
> > the current behavior {on "all but your computer" ;-)} ?
>
> I still think that such a change should be done.
>
> 'make check all' on the R source (+ Recommended packages) seems
> not to signal any error or warning with such a change, so I plan
> to commit that change to "the trunk" / "R-devel" soon, unless
> concerns are raised highly (and quickly enough).
>
> Martin

[[alternative HTML version deleted]]
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] strtoi output of empty string inconsistent across platforms
Identified as root cause of a bug in data.table: https://github.com/Rdatatable/data.table/issues/3267 On my machine, strtoi("", base = 2L) produces NA_integer_ (which seems consistent with ?strtoi: "Values which cannot be interpreted as integers or would overflow are returned as NA_integer_"). But on all the other machines I've seen, 0L is returned. This seems to be consistent with the output of a simple C program using the underlying strtol function (see data.table link for this program, and for full sessionInfo() of some environments with differing output). So, what is the correct output of strtoi("", base = 2L)? Is the cross-platform inconsistency to be expected/documentable? Michael Chirico [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Dead link in documentation of ?timezones
Indeed! Sorry, I need more sleep, should have known better. Thanks! On Fri, Dec 7, 2018 at 6:22 PM Martin Maechler wrote: > >>>>> Michael Chirico > >>>>> on Fri, 7 Dec 2018 10:36:37 +0800 writes: > > > This link is referenced in ?timezones and appears to have been > > moved/removed. Is there a replacement? > > > http://www.twinsun.com/tz/tz-link.htm > > Yes, already in the sources (*) of R at > >https://svn.r-project.org/R/trunk/src/library/base/man/timezones.Rd > > We (Kurt \in {R-core}) do regularly (but not daily!) check all our URLs > --- as they are also checked for all CRAN packages -- and so > found and fixed the problems there. > > So, (in the future) you can look into the development sources to > see if a URL problem has already been addressed. > > Still, of course "thank you!" for noticing and caring about it! > > Best, > Martin > > > -- > *) the only official source, everything else is a mirror > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Dead link in documentation of ?timezones
This link is referenced in ?timezones and appears to have been moved/removed. Is there a replacement? http://www.twinsun.com/tz/tz-link.htm [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Error message truncation
Help pages for stop/warning reference the option "warning.length", e.g. from ?stop:

Errors will be truncated to getOption("warning.length") characters, default 1000.

Essentially the same is in ?warning. Neither of these mentions the hard-coded limits on the acceptable values of this option in options.c <https://github.com/wch/r-source/blob/a7356bf91b511287aacd3a992abfbcb75b60d93c/src/main/options.c#L546-L552>:

if (streql(CHAR(namei), "warning.length")) {
    int k = asInteger(argi);
    if (k < 100 || k > 8170)
        error(_("invalid value for '%s'"), CHAR(namei));
    R_WarnLength = k;
    SET_VECTOR_ELT(value, i, SetOption(tag, argi));
}

Further, it appears there's a physical limit on the length of the error message itself which is only slightly larger than 8170:

set.seed(1023)
NN = 1L
str = paste(sample(letters, NN, TRUE), collapse = '') # should of course be 1
tryCatch(stop(str), error = function(e) nchar(e$message))
# [1] 8190

My questions are:
- Can we add some information to the help pages indicating valid values of options('warning.length')?
- Is there any way to increase the limit on error message length?

I understand having such a limit is safer than potentially crashing a system that wants to print a massive error string. This came up in relation to this SO Q&A: https://stackoverflow.com/a/50387968/3576984 -- the user is submitting a database query; the error message will first reproduce the entirety of the query and then give some diagnostic information. Queries can get quite long, so it stands to reason that this 8190-character limit might be binding.

Thanks,

Michael Chirico

[[alternative HTML version deleted]]
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
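A quick illustration of the bounds implied by the options.c snippet quoted in the email (a sketch; the exact error text may differ by R version):

```r
# warning.length is accepted only in [100, 8170]; values outside
# that range are rejected immediately by options().
old <- options(warning.length = 8170)   # maximum accepted value
try(options(warning.length = 8171))     # rejected: invalid value
try(options(warning.length = 99))       # also rejected, below the minimum
options(old)                            # restore the previous setting
```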
[Rd] model.frame strips class as promised, but fails to strip OBJECT in C
Full thread here: https://github.com/tidyverse/broom/issues/287

Reproducible example:

is.object(freeny$y)
# [1] TRUE
attr(freeny$y, 'class')
# [1] "ts"
class(freeny$y)
# [1] "ts"

# ts attribute wiped by model.frame
class(model.frame(y ~ ., data = freeny)$y)
# [1] "numeric"
attr(model.frame(y ~ ., data = freeny)$y, 'class')
# NULL

# but still:
is.object(model.frame(y ~ ., data = freeny)$y)
# [1] TRUE

That is, given a numeric vector with class "ts", model.frame strips the "ts" attribute but keeps the is.object property. This behavior is alluded to in ?model.frame:

> Unless na.action = NULL, time-series attributes will be removed from the
> variables found (since they will be wrong if NAs are removed).

And in fact explicitly setting na.action = NULL prevents dropping the class:

class(model.frame(y ~ ., data = freeny, na.action = NULL)$y)
# [1] "ts"

The reason this looks especially like a bug is that it differs from how na.omit behaves:

DF <- data.frame(x = c(1, 2, 3), y = c(0, 10, NA))
is.object(DF$y)
# [1] FALSE
class(DF$y) = 'foo'
is.object(DF$y)
# [1] TRUE
class(na.omit(DF)$y)
# [1] "numeric"
is.object(na.omit(DF)$y)
# [1] FALSE

That is, similarly presented with a classed object, na.omit strips the class *and* the OBJECT attribute.

Thanks,

Michael Chirico

[[alternative HTML version deleted]]
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
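For comparison, the usual user-level ways of removing a class do clear the internal OBJECT bit, which is what makes the model.frame behavior described above stand out:

```r
# Removing the class attribute normally clears the OBJECT bit too,
# so is.object() tracks the presence of a class attribute:
x <- structure(1:3, class = "foo")
is.object(x)              # TRUE
attr(x, "class") <- NULL  # equivalently: oldClass(x) <- NULL, or unclass()
is.object(x)              # FALSE
```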
Re: [Rd] scale.default gives an incorrect error message when is.numeric() fails on a dgeMatrix
thanks. I know the setup code is a mess, just duct-taped something together from the examples in lars (which are a mess in turn). in fact when I messaged Prof. Hastie he recommended using glmnet. I wonder why lars is kept on CRAN if they've no intention of maintaining it... but I digress... On Mar 2, 2018 1:52 AM, "Martin Maechler" wrote: > >>>>> Michael Chirico > >>>>> on Tue, 27 Feb 2018 20:18:34 +0800 writes: > > Slightly amended 'Subject': (unimportant mistake: a dgeMatrix is *not* > sparse) > > MM: modified to commented R code, slightly changed from your post: > > > ## I am attempting to use the lars package with a sparse input feature > matrix, > ## but the following fails: > > library(Matrix) > library(lars) > data(diabetes) # from 'lars' > ##UAagghh! not like this -- both attach() *and* as.data.frame() are > horrific! > ##UA attach(diabetes) > ##UA x = as(as.matrix(as.data.frame(x)), 'dgCMatrix') > x <- as(unclass(diabetes$x), "dgCMatrix") > lars(x, y, intercept = FALSE) > ## Error in scale.default(x, FALSE, normx) : > ## length of 'scale' must equal the number of columns of 'x' > > ## More specifically, scale.default fails as called from lars(): > normx <- new("dgeMatrix", > x = c(4, 0, 9, 1, 1, -1, 4, -2, 6, 6)*1e-14, Dim = c(1L, 10L), > Dimnames = list(NULL, > c("x.age", "x.sex", "x.bmi", "x.map", "x.tc", > "x.ldl", "x.hdl", "x.tch", "x.ltg", "x.glu"))) > scale.default(x, center=FALSE, scale = normx) > ## Error in scale.default(x, center = FALSE, scale = normx) : > ## length of 'scale' must equal the number of columns of 'x' > > > The problem is that this check fails because is.numeric(normx) is FALSE: > > > if (is.numeric(scale) && length(scale) == nc) > > > So, the error message is misleading. In fact length(scale) is the same > as > > nc. > > Correct, twice. > > > At a minimum, the error message needs to be repaired; do we also want to > > attempt as.numeric(normx) (which I believe would have allowed scale to > work > > in this case)? 
> > It seems sensible to allow both 'center' and 'scale' to only > have to *obey* as.numeric(.) rather than fulfill is.numeric(.). > > Though that is not a bug in scale() as its help page has always > said that 'center' and 'scale' should either be a logical value > or a numeric vector. > > For that reason I can really claim a bug in 'lars' which should > really not use > >scale(x, FALSE, normx) > > but rather > >scale(x, FALSE, scale = as.numeric(normx)) > > and then all would work. > > > - > > > (I'm aware that there's some import issues in lars, as the offending > line > > to create normx *should* work, as is.numeric(sqrt(drop(rep(1, nrow(x)) > %*% > > (x^2 is TRUE -- it's simply that lars doesn't import the > appropriate S4 > > methods) > > > Michael Chirico > > Yes, 'lars' has _not_ been updated since Spring 2013, notably > because its authors have been saying (for rather more than 5 > years I think) that one should really use > > require("glmnet") > > instead. > > Your point is still valid that it would be easy to enhance > base :: scale.default() so it'd work in more cases. > > Thank you for that. I do plan to consider such a change in > R-devel (planned to become R 3.5.0 in April). > > Martin Maechler, > ETH Zurich > > > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] scale.default gives an incorrect error message when is.numeric() fails on a sparse row matrix (dgeMatrix)
I am attempting to use the lars package with a sparse input feature matrix, but the following fails:

library(Matrix)
library(lars)
data(diabetes)
attach(diabetes)
x = as(as.matrix(as.data.frame(x)), 'dgCMatrix')
lars(x, y, intercept = FALSE)
# Error in scale.default(x, FALSE, normx) :
#   length of 'scale' must equal the number of columns of 'x'

More specifically, scale.default fails:

normx = new(
  "dgeMatrix",
  x = c(1.04, 1, 1.09, 1.01, 1.01, 0.992, 1.04, 0.975, 1.06, 1.06),
  Dim = c(1L, 10L),
  Dimnames = list(NULL, c("x.age", "x.sex", "x.bmi", "x.map", "x.tc",
                          "x.ldl", "x.hdl", "x.tch", "x.ltg", "x.glu")),
  factors = list()
)
scale(x, FALSE, normx)

The problem is that this check fails because is.numeric(normx) is FALSE:

if (is.numeric(scale) && length(scale) == nc)

So, the error message is misleading. In fact length(scale) is the same as nc. At a minimum, the error message needs to be repaired; do we also want to attempt as.numeric(normx) (which I believe would have allowed scale to work in this case)?

(I'm aware that there's some import issues in lars, as the offending line to create normx *should* work, as is.numeric(sqrt(drop(rep(1, nrow(x)) %*% (x^2 is TRUE -- it's simply that lars doesn't import the appropriate S4 methods)

Michael Chirico

[[alternative HTML version deleted]]
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
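A self-contained illustration of the workaround Martin suggests in the reply (assuming the Matrix package is installed; the caller, here standing in for lars, coerces the S4 object before passing it to scale()):

```r
library(Matrix)

m <- matrix(1:6, nrow = 2)
# A 1 x 3 dgeMatrix, playing the role of 'normx' from the thread:
s <- new("dgeMatrix", x = c(1, 2, 3), Dim = c(1L, 3L))

is.numeric(s)  # FALSE: an S4 dgeMatrix is not a base numeric vector,
               # so scale.default()'s is.numeric(scale) check fails

# Coercing first sidesteps the misleading length error:
scale(m, center = FALSE, scale = as.numeric(s))
```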
Re: [Rd] Inconsistency in handling of numeric input with %d by sprintf
https://github.com/Rdatatable/data.table/issues/2171 The fix was easy, it's just surprising to see the behavior change almost on a whim. Just wanted to point it out in case this is unknown behavior, but Evan seems to have found this as well. On Tue, May 23, 2017 at 12:00 PM, Michael Chirico wrote: > Astute observation. And of course we should be passing integer when we use > %d. It's an edge case in how we printed ITime objects in data.table: > > > On Tue, May 23, 2017 at 11:53 AM, Joris Meys wrote: > >> I initially thought this is "documented behaviour". ?sprintf says: >> >> Numeric variables with __exactly integer__ values will be coerced to >> integer. (emphasis mine). >> >> Turns out this only works when the first value is numeric and not NA, as >> shown by the following example: >> >> > sprintf("%d", as.numeric(c(NA,1))) >> Error in sprintf("%d", as.numeric(c(NA, 1))) : >> invalid format '%d'; use format %f, %e, %g or %a for numeric objects >> > sprintf("%d", as.numeric(c(1,NA))) >> [1] "1" "NA" >> >> So the safest thing is indeed passing the right type, but the behaviour >> is indeed confusing. I checked this on both Windows and Debian, and on both >> systems I get the exact same response. >> >> Cheers >> Joris >> >> On Tue, May 23, 2017 at 4:53 PM, Evan Cortens >> wrote: >> >>> Hi Michael, >>> >>> I posted something on this topic to R-devel several weeks ago, but never >>> got a response. My ultimate conclusion is that sprintf() isn't super >>> consistent in how it handles coercion: sometimes it'll coerce real to >>> integer without complaint, other times it won't. (My particular email had >>> to do with the vectors longer than 1 and their positioning vis-a-vis the >>> format string.) The safest thing is just to pass the right type. In this >>> case, sprintf('%d', as.integer(NA_real_)) works. 
>>> >>> Best, >>> >>> Evan >>> >>> On Fri, May 19, 2017 at 9:23 AM, Michael Chirico < >>> michaelchiri...@gmail.com> >>> wrote: >>> >>> > Consider >>> > >>> > #as.numeric for emphasis >>> > sprintf('%d', as.numeric(1)) >>> > # [1] "1" >>> > >>> > vs. >>> > >>> > sprintf('%d', NA_real_) >>> > >>> > > Error in sprintf("%d", NA_real_) : >>> > >>> >invalid format '%d'; use format %f, %e, %g or %a for numeric object >>> > > >>> > >>> > I understand the error is correct, but if it works for other numeric >>> input, >>> > why doesn't R just coerce NA_real_ to NA_integer_? >>> > >>> > Michael Chirico >>> > >>> > [[alternative HTML version deleted]] >>> > >>> > __ >>> > R-devel@r-project.org mailing list >>> > https://stat.ethz.ch/mailman/listinfo/r-devel >>> > >>> >>> >>> >>> -- >>> Evan Cortens, PhD >>> Institutional Analyst - Office of Institutional Analysis >>> Mount Royal University >>> 403-440-6529 >>> >>> [[alternative HTML version deleted]] >>> >>> __ >>> R-devel@r-project.org mailing list >>> https://stat.ethz.ch/mailman/listinfo/r-devel >>> >> >> >> >> -- >> Joris Meys >> Statistical consultant >> >> Ghent University >> Faculty of Bioscience Engineering >> Department of Mathematical Modelling, Statistics and Bio-Informatics >> >> tel : +32 (0)9 264 61 79 <+32%209%20264%2061%2079> >> joris.m...@ugent.be >> --- >> Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php >> > > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Inconsistency in handling of numeric input with %d by sprintf
Astute observation. And of course we should be passing integer when we use %d. It's an edge case in how we printed ITime objects in data.table: On Tue, May 23, 2017 at 11:53 AM, Joris Meys wrote: > I initially thought this is "documented behaviour". ?sprintf says: > > Numeric variables with __exactly integer__ values will be coerced to > integer. (emphasis mine). > > Turns out this only works when the first value is numeric and not NA, as > shown by the following example: > > > sprintf("%d", as.numeric(c(NA,1))) > Error in sprintf("%d", as.numeric(c(NA, 1))) : > invalid format '%d'; use format %f, %e, %g or %a for numeric objects > > sprintf("%d", as.numeric(c(1,NA))) > [1] "1" "NA" > > So the safest thing is indeed passing the right type, but the behaviour is > indeed confusing. I checked this on both Windows and Debian, and on both > systems I get the exact same response. > > Cheers > Joris > > On Tue, May 23, 2017 at 4:53 PM, Evan Cortens wrote: > >> Hi Michael, >> >> I posted something on this topic to R-devel several weeks ago, but never >> got a response. My ultimate conclusion is that sprintf() isn't super >> consistent in how it handles coercion: sometimes it'll coerce real to >> integer without complaint, other times it won't. (My particular email had >> to do with the vectors longer than 1 and their positioning vis-a-vis the >> format string.) The safest thing is just to pass the right type. In this >> case, sprintf('%d', as.integer(NA_real_)) works. >> >> Best, >> >> Evan >> >> On Fri, May 19, 2017 at 9:23 AM, Michael Chirico < >> michaelchiri...@gmail.com> >> wrote: >> >> > Consider >> > >> > #as.numeric for emphasis >> > sprintf('%d', as.numeric(1)) >> > # [1] "1" >> > >> > vs. 
>> > >> > sprintf('%d', NA_real_) >> > >> > > Error in sprintf("%d", NA_real_) : >> > >> >invalid format '%d'; use format %f, %e, %g or %a for numeric object >> > > >> > >> > I understand the error is correct, but if it works for other numeric >> input, >> > why doesn't R just coerce NA_real_ to NA_integer_? >> > >> > Michael Chirico >> > >> > [[alternative HTML version deleted]] >> > >> > __ >> > R-devel@r-project.org mailing list >> > https://stat.ethz.ch/mailman/listinfo/r-devel >> > >> >> >> >> -- >> Evan Cortens, PhD >> Institutional Analyst - Office of Institutional Analysis >> Mount Royal University >> 403-440-6529 >> >> [[alternative HTML version deleted]] >> >> __ >> R-devel@r-project.org mailing list >> https://stat.ethz.ch/mailman/listinfo/r-devel >> > > > > -- > Joris Meys > Statistical consultant > > Ghent University > Faculty of Bioscience Engineering > Department of Mathematical Modelling, Statistics and Bio-Informatics > > tel : +32 (0)9 264 61 79 <+32%209%20264%2061%2079> > joris.m...@ugent.be > --- > Disclaimer : http://helpdesk.ugent.be/e-maildisclaimer.php > [[alternative HTML version deleted]] __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] Inconsistency in handling of numeric input with %d by sprintf
Consider

# as.numeric for emphasis
sprintf('%d', as.numeric(1))
# [1] "1"

vs.

sprintf('%d', NA_real_)
# Error in sprintf("%d", NA_real_) :
#   invalid format '%d'; use format %f, %e, %g or %a for numeric objects

I understand the error is correct, but if it works for other numeric input, why doesn't R just coerce NA_real_ to NA_integer_?

Michael Chirico

[[alternative HTML version deleted]]
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] New strptime conversion specification for ordinal suffixes
As touched on briefly on SO <http://stackoverflow.com/questions/39237299>, base R has what appears to me to be a serious deficiency in its inability to recognize dates formatted as character strings with ordinal suffixes:

ord_dates <- c("September 1st, 2016", "September 2nd, 2016",
               "September 3rd, 2016", "September 4th, 2016")

?strptime lists no conversion specification which could match ord_dates in one pass (as I discovered, even lubridate only manages to succeed by going through the vector in several passes). How difficult would it be to add a new conversion specification to handle this, which would seem to me to be a pretty common form for dates found in the raw-data wild? My suggestion would be %o for ordinal suffixes. These would obviously be locale-specific, but in English, %o would match st, nd, rd, and th (in both plain and superscript forms). Other languages may be covered by this <https://en.wikipedia.org/wiki/Ordinal_indicator> and/or this <https://en.wikipedia.org/wiki/Unicode_subscripts_and_superscripts> Wikipedia page on ordinal indicators & Unicode superscripts, respectively.

With this implemented, converting ord_dates to a Date or POSIXct would be as simple as:

as.Date(ord_dates, format = "%B %d%o, %Y")

Is there something at the C level preventing this from happening?

Michael Chirico

[[alternative HTML version deleted]]
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
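Until such a specification exists, a single-pass workaround is to delete the suffixes before parsing (a sketch; the regex assumes English suffixes, and %B requires an English locale to match the month names):

```r
ord_dates <- c("September 1st, 2016", "September 2nd, 2016",
               "September 3rd, 2016", "September 4th, 2016")

# Drop an ordinal suffix that immediately follows a digit, then parse:
as.Date(sub("(?<=[0-9])(st|nd|rd|th)", "", ord_dates, perl = TRUE),
        format = "%B %d, %Y")
```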
[Rd] Suggestions for improvement as regards `as` methods, and a call for consistency in `as.Date` methods
Good evening all, This topic is gone into at a bit more length at my related Stack Overflow question here: http://stackoverflow.com/questions/34647674/why-do-as-methods-remove-vector-names-and-is-there-a-way-around-it There are two lingering issues despite the abundant insight received at SO, namely: 1) _Why_ do as methods remove their arguments' names attribute? This is a fact which is mentioned briefly in a select few of the related help files, namely ?as.vector ("removes *all* attributes including names for results of atomic mode"), ?as.double ("strips attributes including names.") and ?as.character ("strips attributes including names"); however, it appears (1) neither of these references gives a satisfactory explanation of the reasoning behind this (I can only think of speed) and (2) it would be much more digestible to users if this information (even copy-pasting the same blurb) was placed in all of the as reference files (e.g., ?as, ?as.numeric, ?as.Date, ?as.POSIXct, etc.) Personally, I think that unless there's a substantial efficiency cost to doing so, the default should in fact be to retain names (if not other attributes). 
2) All as.Date methods should behave consistently as regards attribute retention

As explicated in the referenced SO topic, the following should all give the same result (as they would for similar examples involving other as methods), but don't:

datesc <- c(ind = "2015-07-04", nyd = "2016-01-01")
datesn <- c(ind = 16620, nyd = 16801)
datesp <- structure(c(1435982400, 1451624400),
                    .Names = c("ind", "nyd"),
                    class = c("POSIXct", "POSIXt"), tzone = "")
datesl <- structure(list(sec = c(0, 0), min = c(0L, 0L), hour = c(0L, 0L),
                         mday = c(4L, 1L), mon = c(6L, 0L),
                         year = structure(115:116, .Names = c("ind", "nyd")),
                         wday = c(6L, 5L), yday = c(184L, 0L),
                         isdst = c(1L, 0L), zone = c("EDT", "EST"),
                         gmtoff = c(NA_integer_, NA_integer_)),
                    .Names = c("sec", "min", "hour", "mday", "mon", "year",
                               "wday", "yday", "isdst", "zone", "gmtoff"),
                    class = c("POSIXlt", "POSIXt"))

Retain names:

as.Date.numeric(datesn)
#          ind          nyd
# "2015-07-04" "2016-01-01"
as.Date.POSIXct(datesp)
#          ind          nyd
# "2015-07-04" "2016-01-01"

Destroy names:

as.Date.POSIXlt(datesl)
# [1] "2015-07-04" "2016-01-01"
as.Date.character(datesc)
# [1] "2015-07-04" "2016-01-01"

(unconfirmed, but I assume given a glance at the code that all of as.Date.date, as.Date.dates, as.Date.ts, as.Date.yearmon, and as.Date.yearqtr will also strip the names)

Regardless of the default behavior as regards keeping/destroying names/other attributes, it would seem that for the sake of consistency the above should be unified. Barring an overhaul of all as methods to retain names, this would mean the following changes (for example):

as.Date.numeric <- function (x, origin, ...)
{
    if (missing(origin)) origin <- "1970-01-01"
    if (identical(origin, "-00-00")) origin <- as.Date("-01-01", ...) - 1
    setNames(as.Date(origin, ...) + x, NULL)
}

as.Date.POSIXct <- function (x, tz = "UTC", ...)
{
    if (tz == "UTC") {
        z <- floor(unclass(x)/86400)
        attr(z, "tzone") <- NULL
        attr(z, "names") <- NULL
        structure(z, class = "Date")
    }
    else as.Date(as.POSIXlt(x, tz = tz))
}

Thank you in advance for your consideration and thank you as always for your time on this project.

Michael Chirico
PhD Candidate in Economics
University of Pennsylvania
3718 Locust Walk
Room 160 McNeil Building
Philadelphia, PA 19104
United States of America

[[alternative HTML version deleted]]
__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
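In the meantime, a user-level stopgap (a hypothetical helper, not a proposed base change) that makes the character, numeric, and POSIXct calls behave alike:

```r
# Reattach names after conversion, whichever as.Date method is dispatched.
# (For POSIXlt input this simple version does not apply, since there the
# names live on the list components rather than on the object itself.)
as_date_keep_names <- function(x, ...) {
  setNames(as.Date(x, ...), names(x))
}

datesc <- c(ind = "2015-07-04", nyd = "2016-01-01")
as_date_keep_names(datesc)  # names "ind"/"nyd" survive the conversion
```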