Re: [Rd] Curry: proposed new functional programming, er, function.
On 5/25/12 5:23 PM, Hadley Wickham wrote:
> On Fri, May 25, 2012 at 3:14 PM, Yike Lu <yikelu.h...@gmail.com> wrote:
>> So here's the way I'm reading this:
>> Original: curry_call is the function body you're constructing, which is itself just a one-liner which calls the symbol FUN with the appropriate substitutions.
> Yup. With a bit more infrastructure you could probably modify it so that multiple curries collapsed into the equivalent single curry.

Yes, I could see how one would do that - if match.call() detects a Curry as the first function being called, we would short-circuit the usual evaluation into a different path which properly collapses all the nesting.

It's interesting how R offers these facilities to override the usual evaluation order, but if one does that too much it could easily become confusing. I was looking at Rpipe the other day (https://github.com/slycoder/Rpipe) and the way he implements it is by defining his own Eval.

Cheers,

__
R-devel@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-devel
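[For readers following the thread, here is a minimal sketch of the kind of Curry helper under discussion - an assumed simplification for illustration, not the code actually proposed on the list:]

```r
# Minimal sketch of a Curry (partial application) helper -- an assumed
# simplification, not the exact function proposed in this thread.
Curry <- function(FUN, ...) {
  fixed <- list(...)                        # arguments bound now
  function(...) do.call(FUN, c(fixed, list(...)))
}

add3 <- Curry(`+`, 3)
add3(4)                                     # 7

# Nested curries work, but each layer adds a closure; the "collapsing"
# idea discussed above would merge them into a single curry:
f <- Curry(Curry(paste, "a"), "b")
f("c")                                      # "a b c"
```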
Re: [Rd] [patch] Behavior of .C() and .Fortran() when given double(0) or integer(0).
Dear Professor Ripley and R-devel,

Thank you for taking the time to look into this. I understand that the 2.15.0 behavior in question was undocumented and ambiguous (at least as of that release), and it should not have been relied upon, intuitive or not. My suggestion is that in the next release it ought to be the standard, documented behavior, not just because it's historical, but because it's more convenient and safer.

From the point of view of programmer convenience, having a 0-length vector on the R side always map to a NULL pointer on the C side provides a useful bit of information that the programmer can use, while a non-NULL pointer to no data isn't useful, and the current R-devel behavior requires the programmer to pass the information about whether it's empty through an additional argument (of which there is an upper limit). For example, if a procedure implemented in C takes optional weights, passing a double(0) that was translated to NULL could be used to signal that there are no weights. Also, while the .Call() interface allows an R vector passed to it to be resized, the .C() and .Fortran() interfaces don't, so a 0-length R vector passed via .C() or .Fortran() can be neither read from nor written to, and nothing is lost by passing it as NULL.

On the issue of safety, while dereferencing a NULL pointer is not an instant segfault on absolutely every system, it is on the overwhelming majority of modern systems on which anyone is likely to run a recent version of R. For those systems where it is not, the behavior is no worse than dereferencing a non-NULL pointer to no data. Moreover, while it's easy to check whether a pointer is NULL, there is no general way to check whether a non-NULL pointer is valid, so if the 0-length-NULL behavior is made standard and documented, package developers may be more likely to make use of it for checking.
On the issue of instrumentation and debugging, again, I think it comes down to programmer convenience. Segmentation faults caused by NULL dereferencing can be caught and debugged interactively with a debugger like GDB, while non-NULL memory errors have less predictable consequences and require tools like the slower and non-interactive Valgrind. Perhaps R's new guard bytes will change that somewhat, but, from what I've read, they only check for invalid writes, not invalid reads, which can cause almost as much trouble. On the other hand, both trying to read from a NULL pointer and trying to write to it will be detected on most systems. And having 0-length vectors be passed as NULL does not preclude using guard bytes on non-0-length vectors.

To summarize, I think that on both safety and convenience, standardizing on the 0-length-NULL behavior dominates the 0-length-invalid-pointer behavior: in each scenario that either of us has brought up so far, it behaves no worse and often better.

My patch does not include changes to documentation, and, if you like, I am willing to write one that does. If my patch can be improved in some other way, please let me know and I will try to improve it.

Sincerely,
Pavel Krivitsky

On Thu, 2012-05-17 at 10:46 +0100, Prof Brian Ripley wrote:
> On 04/05/2012 18:42, Pavel N. Krivitsky wrote:
>> Dear R-devel,
>> While tracking down some hard-to-reproduce bugs in a package I maintain, I stumbled on a behavior change between R 2.15.0 and the current R-devel (or SVN trunk). In 2.15.0 and earlier, if you passed a 0-length vector of the right mode (e.g., double(0) or integer(0)) as one of the arguments in a .C() call with DUP=TRUE (the default), the C routine would be passed NULL (the C pointer, not R NULL) in the corresponding argument. The current
> Where did you get that from? The documentation says it passes an (e.g.) double* pointer to a copy of the data area of the R vector. There is no change in the documented behaviour. Now, of course a zero-length area can be at any address, but none is stated anywhere that I am aware of.
>> development version instead passes it a pointer to what appears to be the memory location immediately following the SEXP that holds the metadata for the argument. If the argument has length 0, this is often memory belonging to a different R object. (DUP=FALSE in 2.15.0 appears to have the same behavior as R-devel.) The .C() documentation and Writing R Extensions don't explicitly specify a behavior for 0-length vectors, so I don't know if this change is intentional, or whether it was a side-effect of the following news item:
>>> .C() and .Fortran() do less copying: arguments which are raw, logical, integer, real or complex vectors and are unnamed are not copied before the call, and (named or not) are not copied after the call. Lists are no longer copied (they are supposed to be used read-only in the C code).
>> Was the change in the empty vector
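[To make the optional-weights example above concrete, here is a hedged R-side sketch. "C_wmean" is a hypothetical routine name - it is not part of R or of the patch - so the .C() call is illustrative only:]

```r
# Hedged illustration of the optional-weights pattern under discussion.
# "C_wmean" is a hypothetical compiled routine, not part of R.  Under the
# proposed 0-length-maps-to-NULL behavior, passing double(0) would let
# the C side test "weights == NULL" to detect the unweighted case.
wmean <- function(x, w = double(0)) {
  stopifnot(length(w) == 0L || length(w) == length(x))
  .C("C_wmean",
     as.double(x), as.integer(length(x)),
     as.double(w),            # length 0 -> NULL on the C side (proposed)
     mean = double(1))$mean
}
```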
Re: [Rd] Curry: proposed new functional programming, er, function.
On Sat, May 26, 2012 at 12:30 PM, Yike Lu <yikelu.h...@gmail.com> wrote:
> On 5/25/12 5:23 PM, Hadley Wickham wrote:
>> On Fri, May 25, 2012 at 3:14 PM, Yike Lu <yikelu.h...@gmail.com> wrote:
>>> So here's the way I'm reading this:
>>> Original: curry_call is the function body you're constructing, which is itself just a one-liner which calls the symbol FUN with the appropriate substitutions.
>> Yup. With a bit more infrastructure you could probably modify it so that multiple curries collapsed into the equivalent single curry.
> Yes, I could see how one would do that - if match.call() detects a Curry as the first function being called, we would short-circuit the usual evaluation into a different path which properly collapses all the nesting.
> It's interesting how R offers these facilities to override the usual evaluation order, but if one does that too much it could easily become confusing. I was looking at Rpipe the other day (https://github.com/slycoder/Rpipe) and the way he implements it is by defining his own Eval.

The proto package does currying on proto methods by defining $.proto appropriately:

library(proto)
p <- proto(a = 1, b = 2)
p$ls()   # same as ls(p) - output is c("a", "b")

Here ls() is _not_ a special proto method but is just the ordinary ls() provided by R. $.proto calls ls() sticking in p as the first argument. A proto object is an environment, and ls() with a first argument that is an environment lists the names of the objects in that environment. Similarly:

p$as.list()
p$str()
p$parent.env()
p$print()
p$eapply(length)

are the same as as.list(p), str(p), parent.env(p), print(p) and eapply(p, length). Although this might seem like it's just syntax, in proto it does allow one to override the method. For example,

p2 <- proto(a = 1, b = 2,
            print = function(.) with(., cat("a:", a, "b:", b, "\n")))
p2$print()  # uses p2's print
print(p2)   # uses R's print

etc.

--
Statistics Software Consulting
GKX Group, GKX Associates Inc.
tel: 1-877-GKX-GROUP
email: ggrothendieck at gmail.com
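[A hedged sketch of how a "$" method can implement this kind of dispatch - a simplification for illustration; proto's actual implementation differs:]

```r
# Hedged sketch of how a "$" method can curry the object in as the first
# argument (a simplification for illustration -- not proto's actual code).
"$.myproto" <- function(obj, name) {
  if (exists(name, envir = obj, inherits = FALSE)) {
    val <- get(name, envir = obj)
    if (!is.function(val)) return(val)      # plain field such as a or b
    return(function(...) val(obj, ...))     # method stored in the object
  }
  f <- get(name, mode = "function")         # ordinary R function, e.g. ls
  function(...) f(obj, ...)                 # obj becomes the first argument
}

p <- new.env()
p$a <- 1
p$b <- 2
class(p) <- "myproto"

p$ls()    # same as ls(p): c("a", "b")
p$a       # 1
```

Because the object is an environment, falling through to an ordinary function like ls() and inserting the object as its first argument is all the "currying" needed here.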
Re: [Rd] [patch] Behavior of .C() and .Fortran() when given double(0) or integer(0).
On May 26, 2012, at 1:02 PM, Pavel N. Krivitsky wrote:
> Dear Professor Ripley and R-devel,
> Thank you for taking the time to look into this. I understand that the 2.15.0 behavior in question was undocumented and ambiguous (at least as of that release), and it should not have been relied upon, intuitive or not. My suggestion is that in the next release it ought to be the standard, documented behavior, not just because it's historical, but because it's more convenient and safer.

That is bogus - .C is inherently unsafe wrt vector lengths, so talking about safety here is IMHO nonsensical. Your safety relies on bombing the program - that is arguably much less safe than using the checks that Brian was talking about, because they are recoverable. You can argue either way, but there is no winner - the real answer is to use .Call() instead.

> From the point of view of programmer convenience, having a 0-length vector on the R side always map to a NULL pointer on the C side provides a useful bit of information that the programmer can use, while a non-NULL pointer to no data isn't useful, and the current R-devel behavior requires the programmer to pass the information about whether it's empty through an additional argument (of which there is an upper limit). For example, if a procedure implemented in C takes optional weights, passing a double(0) that was translated to NULL could be used to signal that there are no weights.

That would be just plain wrong use that certainly should not be encouraged - you *have* to pass the length along with any vectors passed to .C (that's why you should not be even thinking of using .C in the first place!), so it is much safer to check that the length you passed is 0 rather than relying on special-casing into NULL pointers.
Cheers,
Simon

> Also, while the .Call() interface allows an R vector passed to it to be resized, the .C() and .Fortran() interfaces don't, so a 0-length R vector passed via .C() or .Fortran() can be neither read from nor written to, and nothing is lost by passing it as NULL.
>
> On the issue of safety, while dereferencing a NULL pointer is not an instant segfault on absolutely every system, it is on the overwhelming majority of modern systems on which anyone is likely to run a recent version of R. For those systems where it is not, the behavior is no worse than dereferencing a non-NULL pointer to no data. Moreover, while it's easy to check whether a pointer is NULL, there is no general way to check whether a non-NULL pointer is valid, so if the 0-length-NULL behavior is made standard and documented, package developers may be more likely to make use of it for checking.
>
> On the issue of instrumentation and debugging, again, I think it comes down to programmer convenience. Segmentation faults caused by NULL dereferencing can be caught and debugged interactively with a debugger like GDB, while non-NULL memory errors have less predictable consequences and require tools like the slower and non-interactive Valgrind. Perhaps R's new guard bytes will change that somewhat, but, from what I've read, they only check for invalid writes, not invalid reads, which can cause almost as much trouble. On the other hand, both trying to read from a NULL pointer and trying to write to it will be detected on most systems. And having 0-length vectors be passed as NULL does not preclude using guard bytes on non-0-length vectors.
>
> To summarize, I think that on both safety and convenience, standardizing on the 0-length-NULL behavior dominates the 0-length-invalid-pointer behavior: in each scenario that either of us has brought up so far, it behaves no worse and often better.
>
> My patch does not include changes to documentation, and, if you like, I am willing to write one that does.
> If my patch can be improved in some other way, please let me know and I will try to improve it.
>
> Sincerely,
> Pavel Krivitsky
>
> On Thu, 2012-05-17 at 10:46 +0100, Prof Brian Ripley wrote:
>> On 04/05/2012 18:42, Pavel N. Krivitsky wrote:
>>> Dear R-devel,
>>> While tracking down some hard-to-reproduce bugs in a package I maintain, I stumbled on a behavior change between R 2.15.0 and the current R-devel (or SVN trunk). In 2.15.0 and earlier, if you passed a 0-length vector of the right mode (e.g., double(0) or integer(0)) as one of the arguments in a .C() call with DUP=TRUE (the default), the C routine would be passed NULL (the C pointer, not R NULL) in the corresponding argument. The current
>> Where did you get that from? The documentation says it passes an (e.g.) double* pointer to a copy of the data area of the R vector. There is no change in the documented behaviour. Now, of course a zero-length area can be at any address, but none is stated anywhere that I am aware of.
>>> development version instead passes it a pointer to what appears to be the memory location
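[Simon's pass-the-length-explicitly pattern might look like the following R-side sketch. "C_fit" is a hypothetical routine name, so the .C() call is illustrative only:]

```r
# Hedged sketch of the length-check pattern Simon describes.  "C_fit" is a
# hypothetical compiled routine, not part of R.  The wrapper always passes
# the length explicitly, so emptiness can be tested on either side via the
# length argument, regardless of how the 0-length pointer comes out.
fit <- function(y, w = double(0)) {
  stopifnot(length(w) %in% c(0L, length(y)))
  .C("C_fit",
     as.double(y), as.integer(length(y)),
     as.double(w), as.integer(length(w)),  # test nw == 0 in C, not w == NULL
     coef = double(1))$coef
}
```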
Re: [Rd] [patch] Behavior of .C() and .Fortran() when given double(0) or integer(0).
On 26 May 2012 at 14:00, Simon Urbanek wrote:
| [...] the real answer is use .Call() instead.

Maybe Kurt could add something to that effect to the R FAQ?

Dirk

--
Dirk Eddelbuettel | e...@debian.org | http://dirk.eddelbuettel.com
Re: [Rd] [patch] Behavior of .C() and .Fortran() when given double(0) or integer(0).
Dear Simon,

On Sat, 2012-05-26 at 14:00 -0400, Simon Urbanek wrote:
>> My suggestion is that in the next release, it ought to be the standard, documented behavior, not just because it's historical, but because it's more convenient and safer.
> That is bogus - .C is inherently unsafe wrt vector lengths, so talking about safety here is IMHO nonsensical. Your safety relies on bombing the program -

IMHO, not all memory errors are created equal. From the safety perspective, an error that immediately bombs the program is preferable to one that corrupts memory, producing subtle problems much later, or one that reads the wrong memory area and goes into an infinite loop or allocates gigabytes of RAM, etc.

> that is arguably much less safe than using the checks that Brian was talking about, because they are recoverable.

While undoubtedly useful for debugging, I don't think they are particularly recoverable in practice. At best, they tell you that some memory before or after that allocated has been overwritten. They cannot tell you how much memory, or whether R is now in an inconsistent state (which may occur if the write is off by more than 64 bytes, I believe) and should be restarted immediately, only taking the time to save the data and history --- which is what a caught segfault in R does anyway, at least on UNIX-alikes. Furthermore, the guard bytes only trigger after the C routine exits, so the error is only caught some time after it occurs, which makes debugging it more difficult. (In contrast, a debugger like GDB can tell exactly which C statement caused a segmentation fault.) The one advantage guard bytes might have over NULL (for a 0-length vector) is that an error caught by a guard byte might allow the developer to browse (via options(error = recover)) the R function that made the .C() call, but even that relies on the bug not overwriting more than a few bytes, and it cannot detect improper reads.
> You can argue either way, but there is no winner - the real answer is use .Call() instead.

It seems to me that the 0-length-NULL approach still dominates on the matter of safety and debugging, with a possible exception in what I am pretty sure is a relatively rare scenario: when the developer has passed a 0-length vector via .C() _and_ it was written to _and_ the developer wants to browse (using error = recover) the R code leading up to the problematic .C() call, rather than browse (via GDB) the C code that triggered the segfault. In that scenario, the developer can still easily infer which argument was passed as an empty vector and via which .C() call. (Standardizing on 0-length-NULL does not preclude putting guard bytes on non-empty vectors, of course.)

>> From the point of view of programmer convenience, having a 0-length vector on the R side always map to a NULL pointer on the C side provides a useful bit of information that the programmer can use, while a non-NULL pointer to no data isn't useful, and the current R-devel behavior requires the programmer to pass the information about whether it's empty through an additional argument (of which there is an upper limit). For example, if a procedure implemented in C takes optional weights, passing a double(0) that was translated to NULL could be used to signal that there are no weights.
> That would be just plain wrong use that certainly should not be encouraged - you *have* to pass the length along with any vectors passed to .C (that's why you should not be even thinking of using .C in the first place!), so it is much safer to check that the length you passed is 0 rather than relying on special-casing into NULL pointers.

Not necessarily. In the weighted-data scenario, the length of the data vector would, presumably, be passed in a different argument, and, if weights exist, their length would equal it. The NULL here could be a binary signal not to use weights.
While I understand that the .Call() interface has many advantages over .C(), .C() remains a simple and convenient interface that doesn't require the developer to learn too much about R's internals, and, either way, as long as the .C() interface is not being deprecated, I think it ought to be made as safe and as useful as possible.

Best,
Pavel