Re: [Rd] S4 and connection slot [Sec=Unclassified]
Martin Morgan wrote: [...] ## Attempt two -- initialize setClass(Element, representation=representation(conn=file)) setMethod(initialize, Element, function(.Object, ..., conn=file()) { callNextMethod(.Object, ..., conn=conn) }) new(Element) ## oops, connection created but not closed; gc() closes (eventually) ## but with an ugly warning ## gc() ##used (Mb) gc trigger (Mb) max used (Mb) ## Ncells 717240 38.41166886 62.4 1073225 57.4 ## Vcells 3795 284.9 63274729 482.8 60051033 458.2 ## gc() ##used (Mb) gc trigger (Mb) max used (Mb) ## Ncells 715906 38.31166886 62.4 1073225 57.4 ## Vcells 37335626 284.9 63274729 482.8 60051033 458.2 ## Warning messages: ## 1: closing unused connection 3 () setClass(ElementX, contains=Element) ## oops, two connections opened (!) yes, that's because of the nonsense double call to the initializer while creating a subclass. the conceptual bug in the s4 system leads to this ridiculous behaviour in your essentially correct and useful pattern. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Show location of workspace image on quit?
Barry Rowlingson wrote: Would something like q() Save workspace image (/home/me/workspace/.RData)? [y/n/c]: be useful to anyone else? Just thought I'd ask before I dive into internals or wrap the q function for myself. yes, it would be very useful to me. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Show location of workspace image on quit?
Mathieu Ribatet wrote: I guess that having something like this q() Save workspace image (/home/me/workspace/.RData)? [y/n/c/e]: where e means Editing the path should be clear enought, isn't it? good idea; maybe 'o' for 'other path'; or 'a' for 'alternative path'; or 'd' for 'different path'; or 'm' for 'modify path'; or 'p' for 'path'; or... ? vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] bug tracker
the post 13613 has been classified as featuresfaq and annotated with As documented in the Warning section!. however, the bug has actually been fixed. may i kindly suggest that the annotation be changed to a more appropriate note? regards, vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] reference counting bug related to break and next in loops
William Dunlap wrote: One of our R users here just showed me the following problem while investigating the return value of a while loop. I added some information on a similar bug in for loops. I think he was using 2.9.0 but I see the same problem on today's development version of 2.10.0 (svn 48703). Should the semantics of while and for loops be changed slightly to avoid the memory buildup that fixing this to reflect the current docs would entail? S+'s loops return nothing useful - that change was made long ago to avoid memory buildup resulting from semantics akin the R's present semantics. Bill Dunlap TIBCO Software Inc - Spotfire Division wdunlap tibco.com Forwarded (and edited) message below--- -- I think I have found another reference counting bug. If you type in the following in R you get what I think is the wrong result. i = 1; y = 1:10; q = while(T) { y[i] = 42; if (i == 8) { break }; i = i + 1; y}; q [1] 42 42 42 42 42 42 42 42 9 10 I had expected [1] 42 42 42 42 42 42 42 8 9 10 which is what you get if you add 0 to y in the last statement in the while loop: a simplified example may help to get a clear picture: i = 1; y = 1:3; (while(TRUE) { y[i] = 0 if (i == 2) break i = i + 1 y }) # 0 0 3 i = 1; y = 1:3; (while(TRUE) { y[i] = 0 if (i == 2) break i = i + 1 y + 0 }) # 0 2 3 the test on i is done after the assignment to y[i]. when the loop breaks, y is 0 0 3, and one might expect this to be the final result. it looks like the result is the value of y from the previous iteration, and it does not seem particularly intuitive to me. (using common sense, i mean; an informed expert on the copy-when-scared semantics may have a different opinion, but why should a casual user ever suspect such magic.) anyway, i'd rather expect NULL to be returned. for the oracle, ?'while', says: 'for', 'while' and 'repeat' return the value of the last expression evaluated (or 'NULL' if none was), invisibly. [...] 'if' returns the value of the expression evaluated, or 'NULL' if none was. [...] 'break' and 'next' have value 'NULL', although it would be strange to look for a return value. when i is 2, i == 2 is TRUE. hence, if (i == 2) break evaluates to break. break evaluates to NULL, breaks the loop, and the return value should be NULL. while it is, following the docs, strange to have q = while(...) ... in the code, the result above is not compliant with the docs at all -- seems like a plain bug. there is no reason for while to return the value of y, be it 0 0 3 or 0 2 3. one might naively suspect that it is the syntactically last expression in the body of while that provides the return value, but the docs explicitly say the last expression evaluated. and indeed, (while (TRUE) { break; 'foo' }) # NULL however, i = FALSE (while (TRUE) { if (i) break; i = !i; i }) # TRUE which again reveals the bug. one could suspect that the last expression evaluated is actually the whole body of the while loop; so in the above, the value of { if (i) break; i = !i; i } should be returned, even if the loop breaks in the middle. hence, the result should be TRUE (or maybe FALSE?). however, (while (TRUE) { break; while(TRUE) { 'foo' } }) # NULL has no problem with returning NULL -- obviously, so to speak. it seems to me that the bug is not in reference counting, but in that the while loop incorrectly returns the value of the *previous* iteration while executing a break, instead of the break's NULL. likewise, (for (i in 1:2) { if (i == 2) break i }) # 1 instead of the specification-promised NULL. i = 1; y = 1:10; q = while(T) { y[i] = 42; if (i == 8) { break }; i = i + 1; y + 0}; q [1] 42 42 42 42 42 42 42 8 9 10 Also, i = 1; y = 1:10; q = while(T) { y[i] = 42; if (i == 8) { break }; i-i+1 ; if (i=8i3)next ; cat(Completing iteration, i, \n); y}; q Completing iteration 2 Completing iteration 3 [1] 42 42 42 42 42 42 42 42 9 10 but if the last statement in the while loop is y+0 instead of y I get the expected result: i = 1; y = 1:10; q = while(T) { y[i] = 42; if (i == 8) { break }; i-i+1 ; if (i=8i3)next ; cat(Completing iteration, i, \n); y+0L}; q Completing iteration 2 Completing iteration 3 [1] 42 42 3 4 5 6 7 8 9 10 A background to the problem is that in R a while-loop returns the value of the last iteration. not according to the docs; the last expression evaluated. specifically, not the value of the last non-break-broken iteration. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Print bug for matrix(list(NA_complex_, ...))
Stavros Macrakis wrote: In R 2.8.0 on Windows (tested both under ESS and under R Console in case there was an I/O issue) There is a bug in printing val - matrix(list(NA_complex_,NA_complex_),1). dput(val) structure(list(NA_complex_, NA_complex_), .Dim = 1:2) print(val) [,1] [1,] [,2] [1,] Note that a large number of spaces are printed instead of NA. on ubuntu 8.04 with r 2.10.0 r48703 there is almost no problem (still some unnecessary spaces): [,1] [,2] [1,]NANA Compare the unproblematic real case: print(structure(list(NA_real_, NA_real_), .Dim = 1:2)) [,1] [,2] [1,] NA NA Also, when printed in the read-eval-print loop, printing takes a very very long time: proc.time(); matrix(list(NA_complex_,NA_complex_),1); proc.time() user system elapsed 74.350.09 329.45 [,1] [1,] [,2] [1,] user system elapsed 92.630.15 347.86 18 seconds runtime! user system elapsed 0.648 0.056 155.843 [,1] [,2] [1,] NA NA user system elapsed 0.648 0.056 155.843 vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] reference counting bug related to break and next in loops
Wacek Kusnierczyk wrote: a simplified example may help to get a clear picture: i = 1; y = 1:3; (while(TRUE) { y[i] = 0 if (i == 2) break i = i + 1 y }) # 0 0 3 i = 1; y = 1:3; (while(TRUE) { y[i] = 0 if (i == 2) break i = i + 1 y + 0 }) # 0 2 3 the test on i is done after the assignment to y[i]. when the loop breaks, y is 0 0 3, and one might expect this to be the final result. it looks like the result is the value of y from the previous iteration, and it does not seem particularly intuitive to me. (using common sense, i mean; an informed expert on the copy-when-scared semantics may have a different opinion, but why should a casual user ever suspect such magic.) anyway, i'd rather expect NULL to be returned. for the oracle, ?'while', says: 'for', 'while' and 'repeat' return the value of the last expression evaluated (or 'NULL' if none was), invisibly. [...] 'if' returns the value of the expression evaluated, or 'NULL' if none was. [...] 'break' and 'next' have value 'NULL', although it would be strange to look for a return value. when i is 2, i == 2 is TRUE. hence, if (i == 2) break evaluates to break. break evaluates to NULL, breaks the loop, and the return value should be NULL. while it is, following the docs, strange to have q = while(...) ... in the code, the result above is not compliant with the docs at all -- seems like a plain bug. there is no reason for while to return the value of y, be it 0 0 3 or 0 2 3. somewhat surprising to learn, i = 1 y = 1:3 (while (TRUE) { y[i] = 0 if (i == 2) { 2*y; break } i = i + 1 y }) # 0 0 3 where clearly the last expression evaluated (before the break, that is) is 2*y -- or? vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] reference counting bug related to break and next in loops
William Dunlap wrote: help('while') says: Usage: for(var in seq) expr while(cond) expr repeat expr break next Value: 'for', 'while' and 'repeat' return the value of the last expression evaluated (or 'NULL' if none was), invisibly. 'for' sets 'var' to the last used element of 'seq', or to 'NULL' if it was of length zero. 'break' and 'next' have value 'NULL', although it would be strange to look for a return value. Does the 'the last expression evaluated' mean (a) the value from evaluating 'expr' the last time it was completely evaluated or does it mean (b) the value of the last element of a {} expr that was evaluated? it's interesting (if not obvious) that i = 1; y = 1:3 (while (TRUE) { y[i] = 0 if (i==2) break i = i +1 y + 0 }) # 0 2 3 does not reflect in the final value the modification made to y in the second, incomplete iteration, and that i = 1; y = 1:3 (while (TRUE) { y[i] = 0 if (i==2) break i = i +1 y }) # 0 0 3 does reflect this modification, yet i = 1; y = 1:3 (while (TRUE) { y[i] = 0 if (i==2) { y = 1:3; break } i = i +1 y }) # 0 0 3 makes a copy of y on y = 1:3 and returns the previous value. again, this surely has a straightforward explanation in the copy-when-scared mechanics, yet, intuitively, the returned value seems completely out of place. R currently follow interpretation (a), modulo reference counting bugs. My suggestion is to move to interpretation (b), so that the fact that break and next return NULL would mean that a broken-out-of loop would have value NULL. (Personally, I'm happy with S+'s return value for all loops being NULL in all cases, but that might break existing R code.) i'm truly impressed by s+'s superiority over r. Of course, if the reference counting bug can be fixed without degrading performance in ordinary situations (does anyone look at the return value of a loop, particularly one that is broken out of?), then I'm happy retaining the current semantics. ... with the current lousy documentation improved to match the actual semantics. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] reference counting bug: overwriting for loop 'seq' variable
William Dunlap wrote: It looks like the 'seq' variable to 'for' can be altered from within the loop, leading to incorrect answers. E.g., in the following I'd expect 'sum' to be 1+2=3, but R 2.10.0 (svn 48686) gives 44.5. x = c(1,2); sum = 0; for (i in x) { x[i+1] = i + 42.5; sum = sum + i }; sum [1] 44.5 or, with a debugging cat()s, x = c(1,2); sum = 0; for (i in x) { cat(before, i=, i, \n); x[i+1] = i + 42.5; cat(after, i=, i,\n); sum = sum + i }; sum before, i= 1 after, i= 1 before, i= 43.5 after, i= 43.5 [1] 44.5 If I force the for's 'seq' to be a copy of x by adding 0 to it, then I do get the expected answer. x = c(1,2); sum = 0; for (i in x+0) { x[i+1] = i + 42.5; sum = sum + i }; sum b[1] 3 It looks like an error in reference counting. indeed; seems like you've hit the issue of when r triggers data duplication and when it doesn't, discussed some time ago in the context of names() etc. consider: x = 1:2 for (i in x) x[i+1] = i-1 x # 1 0 1 y = c(1, 2) for (i in y) y[i+1] = i-1 y # -1 0 vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] setdiff bizarre
Stavros Macrakis wrote: '1:3' %in% data.frame(a=2:4,b=1:3) # TRUE utterly weird. so what would x have to be so that x %in% data.frame('a') # TRUE hint: '1' %in% data.frame(1) # TRUE vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] setdiff bizarre
William Dunlap wrote: %in% is a thin wrapper on a call to match(). match() is not a generic function (and is not documented to be one), so it treats data.frames as lists, as their underlying representation is a list of columns. match is documented to convert lists to character and to then run the character version of match on that character data. match does not bail out if the types of the x and table arguments don't match (that would be undesirable in the integer/numeric mismatch case). yes, i understand that this is documented behaviour, and that it's not a bug. nevertheless, the example is odd, and hints that there's a design flaw. i also do not understand why the following should be useful and desirable: as.character(list('a')) # a as.character(data.frame('a')) # 1 and hence 'a' %in% list('a') # TRUE while 'a' %in% data.frame('a') # FALSE '1' %in% data.frame('a') # TRUE there is a mechanistic explanation for how this works, but is there one for why this works this way? Hence '1' %in% data.frame(1) # - TRUE is acting consistently with match(as.character(pi), c(1, pi, exp(1))) # - 2 and 1L %in% c(1.0, 2.0, 3.0) # - TRUE The related functions, duplicated() and unique(), do have row-wise data.frame methods. E.g., duplicated(data.frame(x=c(1,2,2,3,3),y=letters[c(1,1,2,2,2)])) [1] FALSE FALSE FALSE FALSE TRUE Perhaps match() ought to have one also. S+'s match is generic and has a data.frame method (which is row-oriented) so there we get: match(data.frame(x=c(1,3,5), y=letters[c(1,3,5)]), data.frame(x=1:10,y=letters[1:10])) [1] 1 3 5 is.element(data.frame(x=1:10,y=letters[1:10]), data.frame(x=c(1,3,5), y=letters[c(1,3,5)])) [1] TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE FALSE FALSE I think that %in% and is.element() ought to remain calls to match() and that if you want them to work row-wise on data.frames then match should get a data.frame method. sounds good to me. how is 'a' %in% data.frame('a') in S+? thanks for the response. regards, vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] setdiff bizarre
Barry Rowlingson wrote: [...] I suspect it's using 'deparse()' to get the character representation. This function is mentioned in ?as.character, but as.character.default disappears into the infernal .Internal and I don't have time to chase source code - it's sunny outside! on the side, as.character triggers do_ascharacter, which in turn calls DispatchOrEval, a function with the following beautiful comment: To call this an ugly hack would be to insult all existing ugly hacks at large in the world. a fortune? vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
Martin Maechler wrote: PS == Petr Savicky savi...@cs.cas.cz on Sun, 31 May 2009 10:29:41 +0200 writes: [] PS I appreciate the current version, which contains static PS const char* dropTrailing0(char *s, char cdec) ... PS mkChar(dropTrailing0((char *)EncodeReal(x, w, d, e, PS OutDec), ... PS Here, is better visible that the cast (char *) is used PS than if it was hidden inside dropTrailing0(). Also, it PS makes dropTrailing0() more consistent. PS I would like to recall the already discussed PS modification if (replace != p) while((*(replace++) = PS *(p++))) ; which saves a few instructions in the more PS frequent case that there are no trailing zeros. Yes, thank you. This already was in my working version, and I had managed to lose it again. Will put i back still hoping this topic would be closed now ... i would rather hope for the EncodeReal flaw to be repaired... vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
Martin Maechler wrote: Hi Waclav (and other interested parties), I have committed my working version of src/main/coerce.c so you can prepare your patch against that. some further investigation and reflections on the code in StringFromReal (henceforth SFR), src/main/coerce.c:315 (as in the patched version, now in r-devel). petr's elim_trailing (renamed to dropTrailing, henceforth referred to as DT) takes as input a const char*, and returns a const char*. const-ness of the return is not a problem; it is fed into mkChar, which (via mkCharLenCE) makes a local memcpy of the string, and there's no violation of the contract here. const-ness of the input is a consequence of the return type of EncodeReal (henceforth EC). however, it is hardly ever, in principle, a good idea to destructively modify const input (as DT does) if it comes from a function that explicitly provides it as const (as ER does). the first question is, why does ER return the string as const? it appears that the returned pointer provides the address of a buffer used internally in ER, which is allocated *statically*. that is, each call to ER operates on the same memory location, and each call to ER returns the address of that same location. i suspect this is intended to be a smart optimization, to avoid heap- or stack-allocating a new buffer in each call to ER, and deallocating it after use. however, this appraoch is problematic, in that any two calls to ER return the address of the same piece of memory, and this may easily lead to data corruption. under the assumption that the content of this piece of memory is copied before any destructive use, and that after the string is copied the address is not further distributed, the hack is relatively harmless. this is what mkChar (via mkCharLenCE) does; in SFR it copies the content of s with memcpy, and wraps it into a SEXP that becomes the return value from SFR. the original author of this hack seems to have had some concern about exporting (from ER) the address of a static buffer, hence the returned buffer is const. in principle, this should prevent corruption of the buffer's content in situations such as // hypothetical char *p1 = ER(...); // p1 is some string returned from ER char p2 = ER(...); // p2 is some other string returned from ER // some modifications performed on the string referred to by p1 p1[0] = 'x'; // p2[0] is 'x' -- possible data corruption still worse in a scenario with concurrent calls to ER. however, since the output from ER is const, this is no longer possible -- at least, not without a deconstifying cast the petr style. the problem with petr's solution is not only that it modifies shared memory purposefully qualified as const (by virtue of ER's return type), but also that it effectively distributes the address for further use. unfortunately, like most of the r source code, ER is not appropriately commented at the declaration and the definition, and without looking at the code, one can hardly have any clue that ER always return the same address of a static location. while the original developer might be careful enough not to misuse ER, in a large multideveloper project it's hard expect that from others. petr's function is precisely an example of such misuse, and as it adds (again, without an appropriate comment) a step of indirection; any use of petr's function other than what you have in SFR (and can you guarantee no one will ever use DT for other purposes?) is even more likely to end up in data corruption. one simple way to improve the code is as follows; instead of (simplified) const char* dropTrailing(const char* s, ...) { const char *p = s; char *replace; ... replace = (char*) p; ... return s; } ...mkChar(dropTrailing(EncodeReal(...), ...) ... you can have something like const char* dropTrailing(char* s, ...) { char *p = s, *replace; ... replace = p; ... return s; } ...mkChar(dropTrailing((char*)EncodeReal(...), ...) ... where it is clear, from DT's signature, that it may (as it purposefully does, in fact) modify the content of s. that is, you drop the promise-not-to-modify contract in DT, and move the need for deconstifying ER's return out of DT, making it more explicit. however, this is still an ad hoc hack; it still breaks the original developer's assumption (if i'm correct) that the return from ER (pointing to its internal buffer) should not be destructively modified outside of ER. another issue is that even making the return from ER const does not protect against data corruption. for example, const char *p1 = ER(...) // p1 is some string returned from ER const char *p2 = ER(...) // p2 is some other string returned from ER // but p1 == p2 if p1 is used after the second call to ER, it's likely to lead to data corruption problems. frankly, i'd consider the design of ER
Re: [Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
Martin Maechler wrote: [...] vQ the first question is, why does ER return the string as const? it vQ appears that the returned pointer provides the address of a buffer used vQ internally in ER, which is allocated *statically*. that is, each call vQ to ER operates on the same memory location, and each call to ER returns vQ the address of that same location. i suspect this is intended to be a vQ smart optimization, to avoid heap- or stack-allocating a new buffer in vQ each call to ER, and deallocating it after use. however, this appraoch vQ is problematic, in that any two calls to ER return the address of the vQ same piece of memory, and this may easily lead to data corruption. Well, that would be ok if R could be used threaded / parallel / ... this can cause severe problems even without concurrency, as one of my examples hinted. and we all know that there are many other pieces of code {not just R's own, but also in Fortran/C algorithms ..} that are not thread-safe. absolutely. again, ER is unsafe even in a sequential execution environment. Yes, of course, R looks like a horrible piece of software telepathy? to some, because of that vQ under the assumption that the content of this piece of memory is copied vQ before any destructive use, and that after the string is copied the vQ address is not further distributed, the hack is relatively harmless. vQ this is what mkChar (via mkCharLenCE) does; in SFR it copies the vQ content of s with memcpy, and wraps it into a SEXP that becomes the vQ return value from SFR. exactly. but it should be made clear, by means of a comment, that ER is supposed to be used in this way. there is no hint at the interface level. vQ the original author of this hack seems to have had some concern about vQ exporting (from ER) the address of a static buffer, hence the returned vQ buffer is const. in principle, this should prevent corruption of the vQ buffer's content in situations such as vQ // hypothetical vQ char *p1 = ER(...); vQ // p1 is some string returned from ER vQ char p2 = ER(...); vQ // p2 is some other string returned from ER vQ // some modifications performed on the string referred to by p1 vQ p1[0] = 'x'; vQ // p2[0] is 'x' -- possible data corruption vQ still worse in a scenario with concurrent calls to ER. (which will not happen in the near future) unless you know a powerful and willing magician. vQ however, since the output from ER is const, this is no longer possible vQ -- at least, not without a deconstifying cast the petr style. the vQ problem with petr's solution is not only that it modifies shared memory vQ purposefully qualified as const (by virtue of ER's return type), but vQ also that it effectively distributes the address for further use. vQ unfortunately, like most of the r source code, ER is not appropriately vQ commented at the declaration and the definition, and without looking at vQ the code, one can hardly have any clue that ER always return the same vQ address of a static location. while the original developer might be vQ careful enough not to misuse ER, in a large multideveloper project it's vQ hard expect that from others. petr's function is precisely an example vQ of such misuse, and as it adds (again, without an appropriate comment) a vQ step of indirection; any use of petr's function other than what you have vQ in SFR (and can you guarantee no one will ever use DT for other vQ purposes?) is even more likely to end up in data corruption. you have a point here, and as a consequence, I'm proposing to put the following version of DT into the source : /* Note that we modify a 'const char*' which is unsafe in general, * but ok in the context of filtering an Encode*() value into mkChar(): */ static const char* dropTrailing0(char *s, char cdec) { char *p = s; for (p = s; *p; p++) { if(*p == cdec) { char *replace = p++; while ('0' = *p*p = '9') if(*(p++) != '0') replace = p; while((*(replace++) = *(p++))) ; break; } } return s; } the first line appears inessential; to an informed programmer, taking a string as char* (as opposed to const char*) means that it *may* be modified within the call, irrespectively of whether it actually is, and on what occasions, and one should not assume the string is not destructively modified. i think it is much more appropriate to comment (a) ER, with a warning to the effect that it always returns the same address, hence the output should be used immediately and never written to, (b) the use of ER in SFR where
[Rd] bug in strsplit?
src/main/character.c:435-438 (do_strsplit) contains the following code: for (i = 0; i tlen; i++) if (getCharCE(STRING_ELT(tok, 0)) == CE_UTF8) use_UTF8 = TRUE; for (i = 0; i len; i++) if (getCharCE(STRING_ELT(x, 0)) == CE_UTF8) use_UTF8 = TRUE; since both loops iterate over loop-invariant expressions and statements, either the loops are redundant, or the fixed index '0' was meant to actually be the variable i. i guess it's the latter, hence 'bug?' in the subject. it also appears that if *any* element of tok (or x) positively passes the test, use_UTF8 is set to TRUE; in such a case, further checks make no sense. the following rewrite cuts the inessential computation: for (i = 0; i tlen; i++) if (getCharCE(STRING_ELT(tok, i)) == CE_UTF8) { use_UTF8 = TRUE; break; } for (i = 0; i len; i++) if (getCharCE(STRING_ELT(x, i)) == CE_UTF8) { use_UTF8 = TRUE; break; } since the pattern is repetitive, the following generic approach would help (and the macro could possibly be reused in other places): #define CHECK_CE(CHARACTER, LENGTH, USEUTF8) \ for (i = 0; i (LENGTH); i++) \ if (getCharCE(STRING_ELT((CHARACTER), i)) == CE_UTF8) { \ (USEUTF8) = TRUE; \ break; } CHECK_CE(tok, tlen, use_UTF8) CHECK_CE(x, len, use_UTF8) if you like it, i can provide a patch. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
Petr Savicky wrote: On Fri, May 29, 2009 at 03:53:02PM +0200, Martin Maechler wrote: my version of *using* the function was 1 SEXP attribute_hidden StringFromReal(double x, int *warn) 2 { 3 int w, d, e; 4 formatReal(x, 1, w, d, e, 0); 5 if (ISNA(x)) return NA_STRING; 6 else return mkChar(dropTrailing0(EncodeReal(x, w, d, e, OutDec), OutDec)); 7 } where you need to consider that mkChar() expects a 'const char*' and EncodeReal(.) returns one, and I am pretty sure this was the main reason why Petr had used the two 'const char*' in (the now-named) dropTrailing0() definition. Yes, the goal was to accept the output of EncodeReal() with exactly the same type, which EncodeReal() produces. A question is, whether the output type of EncodeReal() could be changed to (char *). Then, changing the output string could be done without casting const to non-const. exactly. my suggestion was to modify your function so that no modify a constant string-cheating is done, by either (a) keeping the const but returning a *new* string (hence no const-to-nonconst cast would be needed), or (b) modify your function to accept a non-const string *and* modify the code that connects to your function via the input and output strings. note, if a solution in which your function serves as a destructive filter is just fine (martin seems to have accepted it already), then EncodeReal probably can produce just a string, with no const qualifier, and analogously for mkChar. on the other hand, if EncodeReal is purposefully designed to return a const string (i.e., there is an important reason for doing so), and analogously for mkChar, then your function violates the assumptions and can potentially be harmful to the rest of the code. This solution may be in conflict with the structure of the rest of R code, so i cannot evaluate, whether this is possible. well, either the rest of the code does *not* need const, and it can be safely removed, or it *does* rely on const, and your solution ciolates the expectation. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
Martin Maechler wrote: [...] vQ you return s, which should be the same pointer value (given the actual vQ code that does not modify the local variable s) with the same pointed-to vQ string value (given the signature of the function). vQ was perhaps vQ char *elim_trailing(char* const s, char cdec) vQ intended? yes that would seem slightly more logical to my eyes, and in principle I also agree with the other remarks you make above, what does ' in principle ' mean, as opposed to 'in principle'? (is it emphasis, or sneer quotes?) ... vQ anyway, having the pointer s itself declared as const does vQ make sense, as the code seems to assume that exactly the input pointer vQ value should be returned. or maybe the argument to elim_trailing should vQ not be declared as const, since elim_trailing violates the declaration. vQ one way out is to drop the violated const in both the actual argument vQ and in elim_trailing, which would then be simplified by removing all vQ const qualifiers and (char*) casts. I've tried that, but ``it does not work'' later: {after having renamed 'elim_trailing' to 'dropTrailing0' } my version of *using* the function was 1 SEXP attribute_hidden StringFromReal(double x, int *warn) 2 { 3 int w, d, e; 4 formatReal(x, 1, w, d, e, 0); 5 if (ISNA(x)) return NA_STRING; 6 else return mkChar(dropTrailing0(EncodeReal(x, w, d, e, OutDec), OutDec)); 7 } where you need to consider that mkChar() expects a 'const char*' and EncodeReal(.) returns one, and I am pretty sure this was the main reason why Petr had used the two 'const char*' in (the now-named) dropTrailing0() definition. If I use your proposed signature char* dropTrailing0(char *s, char cdec); line 6 above gives warnings in all of several incantations I've tried including this one : else return mkChar((const char *) dropTrailing0((char *)EncodeReal(x, w, d, e, OutDec), OutDec)); which (the warnings) leave me somewhat clue-less or rather unmotivated to dig further, though I must say that I'm not the expert on the subject char* / const char* .. of course, if the input *is* const and the output is expected to be const, you should get an error/warning in the first case, and at least a warning in the other (depending on the level of verbosity/pedanticity you choose). but my point was not to light-headedly change the signature/return of elim_trailing and its implementation and use it in the original context; it was to either modify the context as well (if const is inessential), or drop modifying the const string if the const is in fact essential. vQ another way out is to make vQ elim_trailing actually allocate and return a new string, keeping the vQ input truly constant, at a performance cost. yet another way is to vQ ignore the issue, of course. vQ the original (martin/petr) version may quietly pass -Wall, but the vQ compiler would complain (rightfully) with -Wcast-qual. hmm, yes, but actually I haven't found a solution along your proposition that even passes -pedantic -Wall -Wcast-align (the combination I've personally been using for a long time). one way is to return from elim_trailing a new, const copy of the const string. using memcpy should be efficient enough. care should be taken to deallocate s when no longer needed. (my guess is that using the approach suggested here, s can be deallocated as soon as it is copied, which means pretty much that it does not really have to be const.) Maybe we can try to solve this more esthetically in private e-mail exchange? sure, we can discuss aesthetics offline. as long as we do not discuss aesthetics (do we?), it seems appropriate to me to keep the discussion online. i will experiment with a patch to solve this issue, and let you know when i have something reasonable. best, vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
Martin Maechler wrote: Hi Waclav (and other interested parties), I have committed my working version of src/main/coerce.c so you can prepare your patch against that. Hi Martin, One quick reaction (which does not resolve my original complaint): you can have p non-const, and cast s to char* on the first occasion its value is assigned to p, thus being able to copy from p to replace without repetitive casts. make check-ed patch atatched. vQ Index: src/main/coerce.c === --- src/main/coerce.c (revision 48689) +++ src/main/coerce.c (working copy) @@ -297,13 +297,13 @@ const char* dropTrailing0(const char *s, char cdec) { -const char *p; -for (p = s; *p; p++) { +char *p; +for (p = (char *)s; *p; p++) { if(*p == cdec) { - char *replace = (char *) p++; + char *replace = p++; while ('0' = *p*p = '9') if(*(p++) != '0') - replace = (char *) p; + replace = p; while((*(replace++) = *(p++))) ; break; __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Why change data type when dropping to one-dimension?
Stavros Macrakis wrote: This is another example of the general preference of the designers of R for convenience over consistency. In my opinion, this is a design flaw even for non-programmers, because I find that inconsistencies make the system harder to learn. Yes, the naive user may stumble over the difference between m[[1,1]] and m[1,1] a few times before getting it, but once he or she understands the principle, it is general. +1 vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [R] split strings
(diverted to r-devel, a source code patch attached) Wacek Kusnierczyk wrote: Allan Engelhardt wrote: Immaterial, yes, but it is always good to test :) and your solution *is* faster and it is even faster if you can assume byte strings: :) indeed; though if the speed is immaterial (and in this case it supposedly was), it's probably not worth risking fixed=TRUE removing '.tif' from the middle of the name, however unlikely this might be (cf murphy's laws). but if you can assume that each string ends with a '.tif' (or any other \..{3} substring), then substr is marginally faster than sub, even as a three-pass approach, while avoiding the risk of removing '.tif' from the middle: strings = sprintf('f:/foo/bar//%s.tif', replicate(1000, paste(sample(letters, 10), collapse=''))) library(rbenchmark) benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL, substr={basenames=basename(strings); substr(basenames, 1, nchar(basenames)-4)}, sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE)) # test elapsed # 1 substr 3.176 # 2sub 3.296 btw., i wonder why negative indices default to 1 in substr: substr('foobar', -5, 5) # fooba # substr('foobar', 1, 5) substr('foobar', 2, -2) # # substr('foobar', 2, 1) this does not seem to be documented in ?substr. there are ways to make negative indices meaningful, e.g., by taking them as indexing from behind (as in, e.g., perl): # hypothetical substr('foobar', -5, 5) # ooba # substr('foobar', 6-5+1, 5) substr('foobar', 2, -2) # ooba # substr('foobar', 2, 6-2+1) there is a trivial fix to src/main/character.c that gives substr the extended functionality -- see the attached patch. the patch has been created and tested as follows: svn co https://svn.r-project.org/R/trunk r-devel cd r-devel # modifications made to src/main/character.c svn diff character.c.diff svn revert -R . patch -p0 character.c.diff ./configure make make check-all # no problems reported with the patched substr, the original problem can now be solved more concisely, using a two-pass approach, with performance still better than the sub/fixed/bytes one, as follows: strings = sprintf('f:/foo/bar//%s.tif', replicate(1000, paste(sample(letters, 10), collapse=''))) library(rbenchmark) benchmark(columns=c('test', 'elapsed'), replications=1000, order=NULL, substr=substr(basename(strings), 1, -5), 'substr-nchar'={ basenames=basename(strings) substr(basenames, 1, nchar(basenames)-4) }, sub=sub('.tif', '', basename(strings), fixed=TRUE, useBytes=TRUE)) # test elapsed # 1 substr 2.981 # 2 substr-nchar 3.206 # 3 sub 3.273 if this sounds interesting, i can update the docs accordingly. vQ Index: src/main/character.c === --- src/main/character.c (revision 48667) +++ src/main/character.c (working copy) @@ -244,7 +244,12 @@ ss = CHAR(el); slen = strlen(ss); /* FIXME -- should handle embedded nuls */ buf = R_AllocStringBuffer(slen+1, cbuff); - if (start 1) start = 1; + if (start == 0) + start = 1; + else if (start 0) + start = slen + start + 1; + if (stop 0) + stop = slen + stop + 1; if (start stop || start slen) { buf[0] = '\0'; } else { __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [R] split strings
William Dunlap wrote: Would your patched code affect the following use of regexpr's output as input to substr, to pull out the matched text from the string? x-c(ooo,good food,bad) r-regexpr(o+, x) substring(x,r,attr(r,match.length)+r-1) [1] ooo oo no; same output substr(x,r,attr(r,match.length)+r-1) [1] ooo oo no; same output r [1] 1 2 -1 attr(,match.length) [1] 3 2 -1 attr(r,match.length)+r-1 [1] 3 3 -3 attr(,match.length) [1] 3 2 -1 for the positive indices there is no change, as you might expect. if i understand your concern, the issue is that regexpr returns -1 (with the corresponding attribute -1) where there is no match. in this case, you expect as the substring. if there is no match, we have: start = r = -1 (the start you index provide) stop = attr(r) + r - 1 = -1 + -1 -1 = -3 (the stop index you provide) for a string of length n, my patch computes the final indices as follows: start' = n + start - 1 stop' = n + stop - 1 whatever the value of n, stop' - start' = stop - start = -3 - 1 = -4. that is, stop' start', hence an empty string is returned, by virtue of the original code. (see the sources for details.) does this answer your question? vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [R] split strings
Wacek Kusnierczyk wrote: William Dunlap wrote: Would your patched code affect the following use of regexpr's output as input to substr, to pull out the matched text from the string? x-c(ooo,good food,bad) r-regexpr(o+, x) substring(x,r,attr(r,match.length)+r-1) [1] ooo oo no; same output substr(x,r,attr(r,match.length)+r-1) [1] ooo oo no; same output r [1] 1 2 -1 attr(,match.length) [1] 3 2 -1 attr(r,match.length)+r-1 [1] 3 3 -3 attr(,match.length) [1] 3 2 -1 for the positive indices there is no change, as you might expect. if i understand your concern, the issue is that regexpr returns -1 (with the corresponding attribute -1) where there is no match. in this case, you expect as the substring. if there is no match, we have: start = r = -1 (the start you index provide) stop = attr(r) + r - 1 = -1 + -1 -1 = -3 (the stop index you provide) for a string of length n, my patch computes the final indices as follows: start' = n + start - 1 stop' = n + stop - 1 whatever the value of n, stop' - start' = stop - start = -3 - 1 = -4. except for that stop - start = -3 - -1 = -2, but that's still negative, i.e., stop' start'. silly me, sorry. vQ that is, stop' start', hence an empty string is returned, by virtue of the original code. (see the sources for details.) does this answer your question? __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] minor correction to the r internals manual
sec. 1.1 says: both types of node structure have as their first three fields a 32-bit sxpinfo header and then three pointers [...] that's *four* fields, as seen in src/include/Rinternals.h:208+: #define SEXPREC_HEADER \ struct sxpinfo_struct sxpinfo; \ struct SEXPREC *attrib; \ struct SEXPREC *gengc_next_node, *gengc_prev_node vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] as.numeric(levels(factor(x))) may be a decreasing sequence
Martin Maechler wrote: I have very slightly modified the changes (to get rid of -Wall warnings) and also exported the function as Rf_dropTrailing0(), and tested the result with 'make check-all' . As the change seems reasonable and consequent, and as it seems not to produce any problems in our tests, I'm hereby proposing to commit it (my version of it), [to R-devel only] within a few days, unless someone speaks up. i may be misunderstanding the code, but: Martin Maechler, ETH Zurich PS --- R-devel/src/main/coerce.c 2009-04-17 17:53:35.0 +0200 PS +++ R-devel-elim-trailing/src/main/coerce.c 2009-05-23 08:39:03.914774176 +0200 PS @@ -294,12 +294,33 @@ PS else return mkChar(EncodeInteger(x, w)); PS } PS +const char *elim_trailing(const char *s, char cdec) the first argument is const char*, which usually means a contract promising not to change the content of the pointed-to object PS +{ PS +const char *p; PS +char *replace; PS +for (p = s; *p; p++) { PS +if (*p == cdec) { PS +replace = (char *) p++; const char* p is cast to non-const char* replace PS +while ('0' = *p *p = '9') { PS +if (*(p++) != '0') { PS +replace = (char *) p; likewise PS +} PS +} PS +while (*(replace++) = *(p++)) { the char* replace is assigned to -- effectively, the content of the promised-to-be-constant string s is modified, and the modification may involve any character in the string. (it's a no-compile-error contract violation; not an uncommon pattern, but not good practice either.) PS +; PS +} PS +break; PS +} PS +} PS +return s; you return s, which should be the same pointer value (given the actual code that does not modify the local variable s) with the same pointed-to string value (given the signature of the function). was perhaps char *elim_trailing(char* const s, char cdec) intended? anyway, having the pointer s itself declared as const does make sense, as the code seems to assume that exactly the input pointer value should be returned. or maybe the argument to elim_trailing should not be declared as const, since elim_trailing violates the declaration. one way out is to drop the violated const in both the actual argument and in elim_trailing, which would then be simplified by removing all const qualifiers and (char*) casts. another way out is to make elim_trailing actually allocate and return a new string, keeping the input truly constant, at a performance cost. yet another way is to ignore the issue, of course. the original (martin/petr) version may quietly pass -Wall, but the compiler would complain (rightfully) with -Wcast-qual. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Qs: The list of arguments, wrapping functions...
Kynn Jones wrote: Hi. I'm pretty new to R, but I've been programming in other languages for some time. I have a couple of questions regarding programming with function objects. 1. Is there a way for a function to refer generically to all its actual arguments as a list? I'm thinking of something like the @_ array in Perl or the arguments variable in JavaScript. (By actual I mean the ones that were actually passed, as opposed to its formal arguments, as returned by formals()). a quick shot from a naive r user: f = function(a=1, b, ...) as.list(match.call()[-1]) f(2) f(b=2) f(1,2,3) 2. I have a package in which most of the functions have the form: the.function - function(some, list, of, params) { return( some.other.function(the.list.of.params.to.this.function)); } Is there a way that I can use a loop to define all these functions? what do you mean, precisely? In general, I'm looking for all the information I can find on the subject of dynamic function definition (i.e. using code to automate the definition of functions at runtime). I'm most interested in introspection facilities and dynamic code generation. E.g. is it possible to write a module that redefines itself when sourced? Or can a function redefine itself when first run? Or how can a function find out about how it was called? another quick shot from a naive r user: f = function() assign( as.character(match.call()[[1]]), function() evil(), envir=parent.frame()) f f() f you can then use stuff like formals, body, match.call, parent.frame, etc. to have your function reimplement itself based on how and where it is called. FWIW, Some of the things I'd like to do are in the spirit of a decorator in Python, which is a function that take a function f an argument and return another function g that is somehow based on f. For example, this makes it very easy to write functions as wrappers to other simpler functions. recall that decorators, when applied using the @syntax, do not just return a new function, but rather redefine the one to which they are applied. so in r it would not be enough to write a function that takes a function and returns another one; it'd have to establish the input function's name and the environment it resides in, and then replace that entry in that environment with the new function. yet another quick shot from the same naive r user: # the decorator operator '%...@%' = function(decorator, definition) { definition = substitute(definition) name = definition[[2]][[2]] definition = definition[[2]][[3]] assign( as.character(name), decorator(eval(definition, envir=parent.frame())), envir=parent.frame()) } # a decorator twice = function(f) function(...) do.call(f, as.list(f(...))) # a function inv = function(a, b) c(b, a) inv(1,2) # 2 1 twice(inv)(1,2) # 1 2 # a decorated function twice %...@% { square = function(x) x^2 } square(2) # 16 # another decorator verbose = function(f) function(...) { cat('computing...\n') f(...) } # another decorated function verbose %...@% { square = function(x) x^2 } square(2) # computing... # 4 there is certainly a lot of space for improvements, and there are possibly bugs in the code above, but i hope it helps a little. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Qs: The list of arguments, wrapping functions...
Wacek Kusnierczyk wrote: Kynn Jones wrote: In general, I'm looking for all the information I can find on the subject of dynamic function definition (i.e. using code to automate the definition of functions at runtime). I'm most interested in introspection facilities and dynamic code generation. E.g. is it possible to write a module that redefines itself when sourced? Or can a function redefine itself when first run? Or how can a function find out about how it was called? another quick shot from a naive r user: f = function() assign( as.character(match.call()[[1]]), function() evil(), envir=parent.frame()) or maybe f = function() body(f) - expression(evil()) f f() f vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Qs: The list of arguments, wrapping functions...
Wacek Kusnierczyk wrote: Wacek Kusnierczyk wrote: Kynn Jones wrote: In general, I'm looking for all the information I can find on the subject of dynamic function definition (i.e. using code to automate the definition of functions at runtime). I'm most interested in introspection facilities and dynamic code generation. E.g. is it possible to write a module that redefines itself when sourced? Or can a function redefine itself when first run? Or how can a function find out about how it was called? another quick shot from a naive r user: f = function() assign( as.character(match.call()[[1]]), function() evil(), envir=parent.frame()) or maybe f = function() body(f) - expression(evil()) though, 'of course', these two versions are not effectively equivalent; try g = f f() c(g, f) with both definitions. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] View() crashy on Ubuntu 9.04
Ben Bolker wrote: It's my vague impression that View() is workable on Windows and maybe on MacOS, but on Ubuntu Linux 9.04 (intrepid) it seems completely unstable. I can reliably crash R by trying to look at a very small, simple data frame ... on my 8.04, r is reliable at crashing with, e.g., View(1) with a subsequent attempt to move through the spreadsheet with an arrow key. this always causes a segfault. I was going to try to run with debug turned on, but my installed version (2.9.0) doesn't have debugging symbols, and I'm having trouble building the latest SVN version (./configure gives checking for recommended packages... ls: cannot access ./src/library/Recommended/boot_*.tar.gz: No such file or directory) tools/rsync-recommended vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] unsplit list of data.frames with one column
Peter Dalgaard wrote: Will Gray wrote: Perhaps this is the intended behavior, but I discovered that unsplit throws an error when it tries to set rownames of a variable that has no dimension. This occurs when unsplit is passed a list of data.frames that have only a single column. An example: df - data.frame(letters[seq(25)]) fac - rep(seq(5), 5) unsplit(split(df, fac), fac) For reference, I'm using R version 2.9.0 (2009-04-17), subversion revision 48333, on Ubuntu 8.10. That's a bug. The line x - value[[1L]][rep(NA, len), ] should be x - value[[1L]][rep(NA, len), , drop=FALSE] looks like someone got caught by the drop=TRUE design...? vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] proposed changes to RSiteSearch
Romain Francois wrote: txt - grep( '^tr.*td align=right.*a', readLines( url ), value = TRUE ) rx - '^.*?a href=(.*?)(.*?)/a.*td(.*?)/td.*$' out - data.frame( url = gsub( rx, \\1, txt ), group = gsub( rx, \\2, txt ), description = gsub( rx, \\3, txt ), looking at this bit of your code, i wonder why gsub is not vectorized for the pattern and replacement arguments, although it is for the x argument. the three lines above could be collapsed to just one with a vectorized gsub: gsubm = function(pattern, replacement, x, ...) mapply(USE.NAMES=FALSE, SIMPLIFY=FALSE, gsub, pattern=pattern, replacement=replacement, x=x, ...) for example, given the sample data txt = 'foofoo/foobarbar/bar' rx = '(.*?)(.*?)/(.*?)' the sequence open = gsub(rx, '\\1', txt, perl=TRUE) content = gsub(rx, '\\2', txt, perl=TRUE) close = gsub(rx, '\\3', txt, perl=TRUE) print(list(open, content, close)) could be replaced with data = structure(names=c('open', 'content', 'close'), gsubm(rx, paste('\\', 1:3, sep=''), txt, perl=TRUE)) print(data) surely, a call to mapply does not improve performance, but a source-level fix should not be too difficult; unfortunately, i can't find myself willing to struggle with r sources right now. note also that .*? does not work as a non-greedy .* with the default regex engine, e.g., txt = foo='FOO' bar='BAR' gsub((.*?)='(.*?)', '\\1', txt) # foo='FOO' bar gsub((.*?)='(.*?)', '\\2', txt) # BAR because the first .*? matches everyithng up to and exclusive of the second, *not* the first, '='. for a non-greedy match, you'd need pcre (and using pcre generally improves performance anyway): txt = foo='FOO' bar='BAR' gsub((.*?)='(.*?)', '\\1', txt, perl=TRUE) # foo bar gsub((.*?)='(.*?)', '\\2', txt, perl=TRUE) # FOO BAR vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] proposed changes to RSiteSearch
Romain Francois wrote: strapply in package gsubfn brings elegance here: txt - 'foobar/foo' rx - (.*?)(.*?)/(.*?) strapply( txt, rx, c , perl = T ) [[1]] [1] foo bar foo sure, but this does not, in any way, make it less strange that gsub is not vectorized. Too bad you have to pay this on performance: txt - rep( 'foobar/foo', 1000 ) rx - (.*?)(.*?)/(.*?) system.time( out - strapply( txt, rx, c , perl = T ) ) user system elapsed 2.923 0.005 3.063 system.time( out2 - sapply( paste('\\', 1:3, sep=''), function(x){ + gsub(rx, x, txt, perl=TRUE) + } ) ) user system elapsed 0.011 0.000 0.011 strapply and you know why. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] proposed changes to RSiteSearch
hadley wickham wrote: On Fri, May 8, 2009 at 10:11 AM, Romain Francois romain.franc...@dbmail.com wrote: strapply in package gsubfn brings elegance here: txt - 'foobar/foo' rx - (.*?)(.*?)/(.*?) strapply( txt, rx, c , perl = T ) [[1]] [1] foo bar foo Too bad you have to pay this on performance: txt - rep( 'foobar/foo', 1000 ) rx - (.*?)(.*?)/(.*?) system.time( out - strapply( txt, rx, c , perl = T ) ) user system elapsed 2.923 0.005 3.063 system.time( out2 - sapply( paste('\\', 1:3, sep=''), function(x){ + gsub(rx, x, txt, perl=TRUE) + } ) ) user system elapsed 0.011 0.000 0.011 Not sure what the right play i For me: system.time( out - strapply( txt, rx, c , perl = T ) ) user system elapsed 0.004 0.000 0.004 system.time( out2 - sapply( paste('\\', 1:3, sep=''), function(x){ + gsub(rx, x, txt, perl=TRUE) + } ) ) user system elapsed 0 0 0 for me: txt - 'foobar/foo' rx - '(.*?)(.*?)/(.*?)' library(rbenchmark) benchmark(replications=1000, columns=c('test', 'elapsed'), order='elapsed', sapply=sapply(paste('\\', 1:3, sep=''), function(x) gsub(rx, x, txt, perl=TRUE)), mapply=mapply(gsub, rx, paste('\\', 1:3, sep=''), txt, perl=TRUE), strapply=strapply(txt, rx, c, perl=TRUE)) # 2 mapply 0.151 # 1 sapply 0.166 # 3 strapply 1.917 vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Some extensions to class inheritance and method selection
Stavros Macrakis wrote: These look like important improvements. As a relative newcomer to the R community, I'm not sure I understand what the procedures are for such changes. In particular, does the fact that the changes were committed to R-devel mean that the changes have already been reviewed and approved by R Core? Are R Core's discussions / deliberations archived somewhere? What is the role of the larger R community in reviewing and approving changes like this? How is documentation handled? Who is responsible for developing and maintaining a definitive reference manual (not just man pages) which includes all the cumulative changes and describes them comprehensively and in black-box way (not referring to history and implementation details)? as another newcommer, i admit the procedures mentioned above are quite opaque to me, too. from my perspective, it seems like quite many, if not most, improvements (changes, at least) to r code are committed in an ad hoc fashion, by a single developer, without any publicly visible discussion. this is likely to lead, and in certain circumstances does lead, to bizarre, eclectic patches visible in the sources. it would be indeed interesting and desirable to make the process more open, at least for review, by users. or is r not *that* open? vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] incorrect output and segfaults from sprintf with %*d (PR#13667)
Gabor Grothendieck wrote: On Fri, Apr 24, 2009 at 6:45 AM, maech...@stat.math.ethz.ch wrote: Yes, the documentation will also have to be amended, but apart from that, would people see a big problem with the 8192 limit which now is suddenly of greater importance {{as I said all along; hence my question to Wacek (and the R-develers) if anybody found that limit too low}} I haven't been following all this but in working with strings for the gsubfn package my own usage of the package was primarily for small strings but then I discovered that others wanted to use it for much larger strings of 25,000 characters, say, and it was necessary to raise the limits (and there are also performance implications which could be addressed too). I don't know what the situation is particularly here but cases where very large strings can be used include linguistic analysis and computer generated R code. in principle, instead of the quite arbitrary and not justified constant size limit 8192 [1], one could use dynamic arrays. this would allow strings of arbitrary length without adding much performance penalty for strings shorter than 8193 bytes. [1] src/include/Defn.h:60 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] incorrect output and segfaults from sprintf with %*d (PR#13667)
maech...@stat.math.ethz.ch wrote: vQ sprintf has a documented limit on strings included in the output using the vQ format '%s'. It appears that there is a limit on the length of strings included vQ with, e.g., the format '%d' beyond which surprising things happen (output vQ modified for conciseness): vQ ... and this limit is *not* documented. MM well, it is basically (+ a few bytes ?) MM the same 8192 limit that *is* documented. indeed, I was right with that.. hmm, i'd guess this limit is valid for all strings included in the output with any format? not just %s (and, as it appears, undocumentedly %d)? vQ while snprintf would help avoid buffer overflow, it may not be a vQ solution to the issue of confused output. MM I think it would / will. We would be able to give warnings and MM errors, by checking the snprintf() return codes. My current working code gives an error for all the above examples, e.g., sprintf('%d', 1) Error in sprintf(%d, 1) : required resulting string length is maximal 8191 it passes 'make check-devel' and I am inclined to commit that code to R-devel (e.g. tomorrow). Yes, the documentation will also have to be amended, but apart from that, would people see a big problem with the 8192 limit which now is suddenly of greater importance {{as I said all along; hence my question to Wacek (and the R-develers) if anybody found that limit too low}} i didn't find the limit itself problematic. (so far?) btw. (i do know what that means ;)), after your recent fix: sprintf('%q%s', 1) # Error in sprintf(%q%s, 1) : # use format %f, %e, %g or %a for numeric objects sprintf('%s', 1) # [1] 1 you may want to add '%s' (and '%x', and ...) to the error message. or perhaps make it say sth like 'invalid format: ...'. the problem is not that %q is not applicable to numeric, but that it is not a valid format at all. there's also an issue with the additional arguments supplied after the format: any superfluous arguments are ignored (this is not documented, as far as i can see), but they *are* evaluated nevertheless, e.g.: sprintf('%d', 0, {print(1)}) # 1 # [1] 0 it might be a good idea to document this behaviour. best, vQ vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] incorrect output and segfaults from sprintf with %*d (PR#13667)
maech...@stat.math.ethz.ch wrote: vQ sprintf has a documented limit on strings included in the output using the vQ format '%s'. It appears that there is a limit on the length of strings included vQ with, e.g., the format '%d' beyond which surprising things happen (output vQ modified for conciseness): ... and this limit is *not* documented. vQ gregexpr('1', sprintf('%9000d', 1)) vQ # [1] 9000 9801 vQ gregexpr('1', sprintf('%9000d', 1)) vQ # [1] 9000 9801 10602 vQ gregexpr('1', sprintf('%9000d', 1)) vQ # [1] 9000 9801 10602 11403 vQ gregexpr('1', sprintf('%9000d', 1)) vQ # [1] 9000 9801 10602 11403 12204 vQ ... vQ Note that not only more than one '1' is included in the output, but also that vQ the same functional expression (no side effects used beyond the interface) gives vQ different results on each execution. Analogous behaviour can be observed with vQ '%nd' where n 8200. vQ The actual output above is consistent across separate sessions. vQ With sufficiently large field width values, R segfaults: vQ sprintf('%*d', 10^5, 1) vQ # *** caught segfault *** vQ # address 0xbfcfc000, cause 'memory not mapped' vQ # Segmentation fault Thank you, Wacek. That's all ``interesting'' ... unfortunately, my version of 'man 3 sprintf' contains BUGS Because sprintf() and vsprintf() assume an arbitrarily long string, callers must be careful not to overflow the actual space; this is often impossible to assure. Note that the length of the strings produced is locale-dependent and difficult to predict. Use snprintf() and vsnprintf() instead (or asprintf() and vasprintf). yes, but this is c documentation, not r documentation. it's applicable to a degree, since ?sprintf does say that sprintf is a wrapper for the C function 'sprintf'. however, in c you use a buffer and you usually have control over it's capacity, while in r this is a hidden implementational detail, which should not be visible to the user, or should cause an attempt to overflow the buffer to fail more gracefully than with a segfault. in r, sprintf('%9000d', 1) will produce a confused output with a count of 1's variable (!) across runs (while sprintf('%*d', 9000, 1) seems to do fine): gregexpr('1', sprintf('%*d', 9000, 1)) # [1] 9000 gregexpr('1', sprintf('%9000d', 1)) # [1] 9000 9801 ..., variable across executions on one execution in a series i actually got this: Warning message: In gregexpr(1, sprintf(%9000d, 1)) : input string 1 is invalid in this locale while the very next execution, still in the same session, gave # [1] 9000 9801 10602 with sprintf('%*d', 1, 1) i got segfaults on some executions but correct output on others, while sprintf('%1d', 1) is confused again. (note the impossible part above) yes, but it does also say must be careful, and it seems that someone has not been careful enough. and we haven't used snprintf() yet, probably because it requires the C99 C standard, and AFAIK, we have only relatively recently started to more or less rely on C99 in the R sources. while snprintf would help avoid buffer overflow, it may not be a solution to the issue of confused output. More precisely, I see that some windows-only code relies on snprintf() being available whereas in at least on non-Windows section, I read /* we cannot assume snprintf here */ Now such platform dependency issues and corresponding configure settings I do typically leave to other R-corers with a much wider overview about platforms and their compilers and C libraries. it looks like src/main/sprintf.c is just buggy, and it's plausible that the bug could be repaired in a platform-independent manner. BTW, 1) sprintf(%n %g, 1,1) also seg.faults as do sprintf('%n%g', 1, 1) sprintf('%n%') etc., while sprintf('%q%g', 1, 1) sprintf('%q%') work just fine. strange, because per ?sprintf 'n' is not recognized as a format specifier, so the output from the first two above should be as from the last two above, respectively. (and likewise in the %S case, discussed and bug-reported earlier.) 2) Did you have a true use case where the 8192 limit was an undesirable limit? how does it matter? if you set a limit, be sure to consistently enforce it and warn the user on attempts to exceed it. or write clearly in the docs that such attempts will cause the output to be silently truncated. examples such as sprintf('%9000d', 1) do not contribute to the reliability of r, and neither to the user's confidence in it. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] sprintf limits output string length with no warning/error message
sprintf has a limit on the length of a string produced with a '%s' specification: nchar(sprintf('%1s', '')) # 8191 nchar(sprintf('%*s', 1, '')) # 8191 This is sort of documented in ?sprintf: There is a limit of 8192 bytes on elements of 'fmt' and also on strings included by a '%s' conversion specification. but it should be a good idea for sprintf to at least warn when the output is shorter than specified. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [R] Definition of = vs. -
Peter Dalgaard wrote: Wacek Kusnierczyk wrote: Stavros Macrakis wrote: `-` Error: object - not found that's weird! Why??? partly because it was april fools. but more seriously, it's because one could assume that in any syntactic expression with an operator involved, the operator maps to a semantic object. it has been claimed on this list (as far as i recall; don't ask me for reference, but if pressed, i'll find it) that any expression of the form lhs op rhs is a syntactic variant for `op`(lhs, rhs) (which would, following that argumentation, make r a lisp-like language) but this apparently does not apply to '-'. i would (naively, perhaps) expect that `-` is a function, which, internally, may well just invert the order of arguments and imemdiately call `-`. the fact that expressions involving '-' are converted, at the parse time, into ones using '-' is far from obvious to me (it is now, but not a priori): quote(1-a) # a - 1 # why not: 1 - a # why not: `-`(1, a) and btw. the following is also weird: quote(a=1) # 1 not because '=' works as named argument specifier (so that the result would be something like `=`(a, 1)), but because quote has no parameter named 'a', and i would expect an error to be raised: # hypothetical quote(a=1) # error: unused argument(s): (a = 1) as in, say vector(mode='list', i=1) # error: unused argument(s): (i = 1) it appears that, in fact, quite many r functions will gladly match a *named* argument with a *differently named* parameter. it is weird to the degree that it is *wrong* wrt. the 'r language definition', sec. 4.3.2 'argument matching', which says: The first thing that occurs in a function evaluation is the matching of formal to the actual or supplied arguments. This is done by a three-pass process: 1. Exact matching on tags. For each named supplied argument the list of formal arguments is searched for an item whose name matches exactly. It is an error to have the same formal argument match several actuals or vice versa. 2. Partial matching on tags. Each remaining named supplied argument is compared to the remaining formal arguments using partial matching. If the name of the supplied argument matches exactly with the first part of a formal argument then the two arguments are con- sidered to be matched. It is an error to have multiple partial matches. Notice that if f - function(fumble, fooey) fbody, then f(f = 1, fo = 2) is illegal, even though the 2nd actual argument only matches fooey. f(f = 1, fooey = 2) is legal though since the second argument matches exactly and is removed from consideration for partial matching. If the formal arguments contain ‘...’ then partial matching is only applied to arguments that precede it. 3. Positional matching. Any unmatched formal arguments are bound to unnamed supplied arguments, in order. If there is a ‘...’ argument, it will take up the remaining arguments, tagged or not. If any arguments remain unmatched an error is declared. if you now consider the example of quote(a=1), with quote having *one* formal argument (parameter) named 'expr' (see ?quote), we see that: 1. there is no exact match between the formal 'expr' and the actual 'a' 2. there is no partial match between the formal 'expr' and the actual 'a' 3a. there is an unmatched formal argument ('expr'), but no unnamed actual argument. hence, 'expr' remains unmatched. 3b. there is no argument '...' (i think the r language definition is lousy and should say 'formal argument' here, as you can have it as an actual, too, as in quote('...'=1)). hence, the actual argument named 'a' will not be 'taken up'. there remain unmatched arguments (i guess the r language definition is lousy and should say 'unmatched actual arguments', as you can obviously have unmatched formals, as in eval(1)), hence an error should be 'declared' (i guess 'raised' is more appropriate). this does not happen in quote(a=1) (and many, many other cases), and this makes me infer that there is a *bug* in the implementation of argument matching, since it clearly does not conform to the definiton. hence, i cc: to r-devel, and will also report a bug in the usual way. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [R] Definition of = vs. -
Wacek Kusnierczyk wrote: and btw. the following is also weird: quote(a=1) # 1 not because '=' works as named argument specifier (so that the result would be something like `=`(a, 1)), i meant to write: not because '=' does not work as an assignment operator (or otherwise the result would be ...) but because quote has no parameter named 'a', and i would expect an error to be raised: # hypothetical quote(a=1) # error: unused argument(s): (a = 1) as in, say vector(mode='list', i=1) # error: unused argument(s): (i = 1) it appears that, in fact, quite many r functions will gladly match a *named* argument with a *differently named* parameter. it is weird to the degree that it is *wrong* wrt. the 'r language definition', sec. 4.3.2 'argument matching', which says: The first thing that occurs in a function evaluation is the matching of formal to the actual or supplied arguments. This is done by a three-pass process: 1. Exact matching on tags. For each named supplied argument the list of formal arguments is searched for an item whose name matches exactly. It is an error to have the same formal argument match several actuals or vice versa. 2. Partial matching on tags. Each remaining named supplied argument is compared to the remaining formal arguments using partial matching. If the name of the supplied argument matches exactly with the first part of a formal argument then the two arguments are con- sidered to be matched. It is an error to have multiple partial matches. Notice that if f - function(fumble, fooey) fbody, then f(f = 1, fo = 2) is illegal, even though the 2nd actual argument only matches fooey. f(f = 1, fooey = 2) is legal though since the second argument matches exactly and is removed from consideration for partial matching. If the formal arguments contain ‘...’ then partial matching is only applied to arguments that precede it. 3. Positional matching. Any unmatched formal arguments are bound to unnamed supplied arguments, in order. If there is a ‘...’ argument, it will take up the remaining arguments, tagged or not. If any arguments remain unmatched an error is declared. if you now consider the example of quote(a=1), with quote having *one* formal argument (parameter) named 'expr' (see ?quote), we see that: 1. there is no exact match between the formal 'expr' and the actual 'a' 2. there is no partial match between the formal 'expr' and the actual 'a' 3a. there is an unmatched formal argument ('expr'), but no unnamed actual argument. hence, 'expr' remains unmatched. 3b. there is no argument '...' (i think the r language definition is lousy and should say 'formal argument' here, as you can have it as an actual, too, as in quote('...'=1)). hence, the actual argument named 'a' will not be 'taken up'. there remain unmatched arguments (i guess the r language definition is lousy and should say 'unmatched actual arguments', as you can obviously have unmatched formals, as in eval(1)), hence an error should be 'declared' (i guess 'raised' is more appropriate). this does not happen in quote(a=1) (and many, many other cases), and this makes me infer that there is a *bug* in the implementation of argument matching, since it clearly does not conform to the definiton. hence, i cc: to r-devel, and will also report a bug in the usual way. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Assignment to string
Stavros Macrakis wrote: On Wed, Apr 1, 2009 at 5:11 PM, Wacek Kusnierczyk waclaw.marcin.kusnierc...@idi.ntnu.no wrote: Stavros Macrakis wrote: ... i think this concords with the documentation in the sense that in an assignment a string can work as a name. note that `foo bar` = 1 is.name(`foo`) # FALSE the issue is different here in that in is.name(foo) foo evaluates to a string (it works as a string literal), while in is.name(`foo`) `foo` evaluates to the value of the variable named 'foo' (with the quotes *not* belonging to the name). Wacek, surely you are joking here. The object written `foo` (a name) *evaluates to* its value. yes, which is the value of a variable named 'foo' (quotes not included in the name), or with other words, the value of the variable foo. The object written foo (a string) evaluates to itself. This has nothing to do with the case at hand, since the left-hand side of an assignment statement is not evaluated in the normal way. yes. i did support your point that the documentation is confusing wrt. foo = 1 because foo is not a name (and in particular, not a quoted name). ...with only a quick look at the sources (src/main/envir.c:1511), i guess the first element to an assignment operator (i mean the left-assignment operators) is converted to a name Yes, clearly when the LHS of an assignment is a string it is being coerced to a name. I was simply pointing out that that is not consistent with the documentation, which requires a name on the LHS. ... but there is probably something going on in do_set (in src/main/eval.c) before do_assign is called. - maclisp was designed by computer scientists in a research project, - r is being implemented by statisticians for practical purposes. Well, I think it is overstating things to say that Maclisp was designed at all. Maclisp grew out of PDP-6 Lisp, with new features being added regularly. Maclisp itself wasn't a research project -- didn't say that; it was, as far as i know (and that's little) developed as part, or in support of, the MIT research project MAC. there are vanishingly few papers about it in the academic literature, unlike contemporary research languages like Planner, EL/1, CLU, etc. In fact, there are many parallels with R -- it was in some sense a service project supporting AI and symbolic algebra research, with ad hoc features (a.k.a. hacks) that's a parallel to r, i guess? being added regularly to support some new idea in AI or algebra. To circle back to the current discussion, Maclisp didn't even have strings as a data type until the mid-70's -- before that, atoms ('symbols' in more modern terminology) were the only way to represent strings. (And that lived on in Maxima for many decades...) See http://www.softwarepreservation.org/projects/LISP/ for documentation on the history of many different Lisps. interesting, thanks. We learned many lessons with Maclisp. Well, actually two different sets of lessons were learned by two different communities. The Scheme community learned the importance of minimalist, clean, principled design. and scheme is claimed to be the inspiration for r... The Common Lisp community learned the importance of large, well-designed libraries. Both learned the importance of standardization and clear specification. There is much to learn. yes... best, vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] actual argument matching does not conform to the definition (PR#13634)
Thomas Lumley wrote: The explanation is that quote() is a primitive function and that the argument matching rules do not apply to primitives. That section of the R Language definition should say that primitives are excluded; it is documented in ?.Primitive. thanks. indeed, the documentation -- the language *definition* -- should make this clear. so this is a bug in the definition, which does not match the implementation, which in turn is as intended (right?) ?.Primitive says: The advantage of '.Primitive' over '.Internal' functions is the potential efficiency of argument passing. However, this is done by ignoring argument names and using positional matching of arguments (unless arranged differently for specific primitives such as 'rep'), so this is discouraged for functions of more than one argument. what is discouraged? vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] duplicated.data.frame {was [R] which rows are duplicates?}
Martin Maechler wrote: WK i attach the patch post for reference. note that you need to fix all of WK the functions in duplicated.R that share the buggy code. (yes, this was WK another thread; i submitted a bug report, and then sent a follow-up WK post with a patch). Thank you; yes, in the mean time I have also seen your bug report and patch. Interestingly (or not), I have myself patched identically to what you propose, withOUT even having known about your bug report + patch. this means, the solution has greater chances to be correct. { hmmm, it seems your thinking can be very close to mine, so why can't you like R properly ;-b } actually, i think i *do* like r properly. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [R] variance/mean
Martin Maechler wrote: Your patch is basically only affecting the default method = pearson. For (most) other cases, 'y = NULL' would still remain *the* way to save computations, unless we'd start to use an R-level equivalent [which I think does not exist] of your C trick (DATAPTR(x) == DATAPTR(y)). yes, my patch was constrained to the c code, but i don't think it would be particularly difficult to fix the relevant r-level code as well. i did think about it, but didn't want to invest more time in this until (or unless) someone would respond. (thanks for the response.) Also, for S- and R- backcompatibility reasons, we'd need to continue allowing y = NULL (as your patch would, too), only in its current for -- indeed, the (unimplemented) intention was to detach from the old misdesign, and fix everything so that y=x by default anywhere. so currently I think this whole idea -- as slick as it is, I learned something! -- does not make sense applying here. i think it does, because the current state is somewhat funny, including both the difference in performance between var(x) and var(x,x) (with x being a matrix), and the respective comment in ?var. the attached patch suggests modifications to src/main/cov.c and src/library/stats/man/cor.Rd. BTW: since you didn't (and shouldn't , because of method != pearson !) change the R code, i would suggest it be done, though. the docs \usage{.} part should not have been changed either ! indeed, the change in the docs didn't match what i *have* actually fixed in the code. and as I mentioned: using 'y = NULL' in the function call must *MUST* ? continue to work, hence should also be documented as possibility == the docs would not really become more clear, I think no, of course, without the change in r code having the docs say y=x by default would be a nonsense. but again, this was a start, not a complete modification (and i admit i failed to acknowledge this). vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Gamma funtion(s) bug
Martin Maechler wrote: Using 'bug' (without any qualifying ? or possible ..) in the subject line is still a bit unfriendly... is suggesting that a poster includes 'excel bug' in the subject line [1] friendly?? vQ [1] https://stat.ethz.ch/pipermail/r-help/2009-March/190119.html __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Assignment to string
Stavros Macrakis wrote: The documentation for assignment says: In all the assignment operator expressions, 'x' can be a name or an expression defining a part of an object to be replaced (e.g., 'z[[1]]'). A syntactic name does not need to be quoted, though it can be (preferably by backticks). But the implementation allows assignment to a character string (i.e. not a name), which it coerces to a name: foo - 23; foo # returns 23 is.name(foo) [1] FALSE Is this a documentation error or an implementation error? i think this concords with the documentation in the sense that in an assignment a string can work as a name. note that `foo bar` = 1 is.name(`foo`) # FALSE the issue is different here in that in is.name(foo) foo evaluates to a string (it works as a string literal), while in is.name(`foo`) `foo` evaluates to the value of the variable named 'foo' (with the quotes *not* belonging to the name). with only a quick look at the sources (src/main/envir.c:1511), i guess the first element to an assignment operator (i mean the left-assignment operators) is converted to a name, so that in foo - 1 foo evaluates to a string and not a name (hence is.name(foo) is false), but internally it is sort of 'coerced' to a name, as in as.name(foo) # `foo` is.name(as.name(foo)) # TRUE The coercion is not happening at parse time: class(quote(foo-3)[[2]]) [1] character i think the internal assignment op really receives a string in a case like foo - 1, it knows it has to treat it as a name without the parser classifying the string as a name. (pure guesswork, again.) the documentation might avoid calling a plain string a 'quoted name', though, it is confusing. a quoted name is something like quote(name) or quote(`name`): is(quote(name)) # name language is(quote(`name`)) # name language but *not* something like name: is(name) # character vector data.frameRowLabels and *not* like quote(name): is(quote(name)) # character vector data.frameRowLabels In fact, bizarrely, not only does it coerce to a name, it actually *modifies* the parse tree: gg - quote(hij - 4) gg hij - 4 eval(gg) gg hij - 4 wow! that's called 'functional programming' ;) you're right: gg = quote({a = 1}) is(gg[[2]][[2]]) # character ... eval(gg) is(gg[[2]][[2]]) # name ... *** The cases below only come up with expression trees generated programmatically as far as I know, so are much more marginal cases. *** The - operator even allows the left-hand-side to be of length 1, though it just ignores the other elements, with the same side effect as before: that's clear from the sources; see src/main/envir.c:1521. it should be documented (maybe it is, i haven't investigated this issue). gg - quote(x-44) gg[[2]] - c(x,y) gg c(x, y) - 44 eval(gg) but also this: rm(list=ls()) do.call('=', list(letters, 1)) # just fine a # 1 b # error weird these work. i think it deserves a warning, at the very least, as in c('x', 'y') = 4 # error: assignment to non-language object c(x, y) = 4 # error: could not find function c- (provided that x and y are already there) btw., that's what you can do with rvalues (using the otherwise semantically void operator `:=`). these could seem equivalent, but they're (obviously) not: 'x' = 1 c('x') = 1 x = 1 c(x) = 1 x [1] 44 y Error: object y not found gg x - 44 None of this is documented in ? -, and it is rather a surprise that evaluating an expression tree can modify it. I admit we had a feature (performance hack) like this in MacLisp years ago, where expanded syntax macros replaced the source code of the macro, but it was a documented, general, and optional part of the macro mechanism. but - maclisp was designed by computer scientists in a research project, - r is being implemented by statisticians for practical purposes. almost every part differs here (and almost no pun intended). Another little glitch: gg - quote(x-44); gg[[2]] - character(0); eval(gg) Error in eval(expr, envir, enclos) : 'getEncChar' must be called on a CHARSXP This looks like an internal error that users shouldn't see. by no means the only example that the interface is no blood-brain barrier. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [R] incoherent conversions from/to raw
Martin Maechler wrote: (...) WK which shows that raw won't coerce to the four first types in the WK 'hierarchy' (excluding NULL), but it will to character, list, and WK expression. WK suggestion: improve the documentation, or adapt the implementation to WK a more coherent design. Thank you, Wacek. I've decided to adapt the implementation such that all the above c(raw , type) calls' implicit coercions will work. great! WK (2) WK incidentally, there's a bug somewhere there related to the condition WK system and printing: WK tryCatch(stop(), error=function(e) print(e)) WK # works just fine WK tryCatch(stop(), error=function(e) sprintf('%s', e)) WK # *** caught segfault *** WK # address (nil), cause 'memory not mapped' WK # Traceback: WK # 1: sprintf(%s, e) WK # 2: value[[3]](cond) WK # 3: tryCatchOne(expr, names, parentenv, handlers[[1]]) WK # 4: tryCatchList(expr, classes, parentenv, handlers) WK # 5: tryCatch(stop(), error = function(e) sprintf(%s, e)) WK # Possible actions: WK # 1: abort (with core dump, if enabled) WK # 2: normal R exit WK # 3: exit R without saving workspace WK # 4: exit R saving workspace WK # Selection: WK interestingly, it is possible to stay in the session by typing ^C. the WK session seems to work, but if the tryCatch above is tried once again, a WK segfault causes r to crash immediately: WK # ^C WK tryCatch(stop(), error=function(e) sprintf('%s', e)) WK # [whoe...@wherever] $ WK however, this doesn't happen if some other code is evaluated first: WK # ^C WK x = 1:10^8 WK tryCatch(stop(), error=function(e) sprintf('%s', e)) WK # Error in sprintf(%s, e) : 'getEncChar' must be called on a CHARSXP WK this can't be a feature. (tried in both 2.8.0 and r-devel; version WK info at the bottom.) WK suggestion: trace down and fix the bug. [not me, at least not now.] sure; i might try to find the bug in spare time, but can't promise. WK (3) WK the error argument to tryCatch is used in two examples in ?tryCatch, but WK it is not explained anywhere in the help page. one can guess that the WK argument name corresponds to the class of conditions the handler will WK handle, but it would be helpful to have this stated explicitly. the WK help page simply says: WK WK If a condition is signaled while evaluating 'expr' then WK established handlers are checked, starting with the most recently WK established ones, for one matching the class of the condition. WK When several handlers are supplied in a single 'tryCatch' then the WK first one is considered more recent than the second. WK WK which is uninformative in this respect -- what does 'one matching the WK class' mean? WK suggestion: improve the documentation. Patches to tryCatch.Rd are gladly accepted and quite possibly applied to the sources without much changes. ok, if you're willing to accept my suggestions i can try to suggest a patch to the rd. Thanks in advance! you're welcome. best, vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] duplicated.data.frame {was [R] which rows are duplicates?}
Martin Maechler wrote: WK what the documentation *fails* to tell you is that the parameter WK 'incomparables' is defunct No, not defunct, but the contrary of it, not yet implemented ! that's my bad english, again. sorry. WK # data as above, or any data frame WK duplicated(data, incomparables=NA) WK # Error in if (!is.logical(incomparables) || incomparables) WK .NotYetUsed(incomparables != FALSE) : WK # missing value where TRUE/FALSE needed WK the error message here is *confusing*. yes! ! WK the error is raised because the WK author of the code made a mistake and apparently haven't carefully ((plural or singular ??)) i guess hasn't was intended. i'd need to ask the author. WK examined and tested his product; the code goes: ((aah, ... singular ...)) my guesswork, anyway. WK duplicated.data.frame WK # function (x, incomparables = FALSE, fromLast = FALSE, ...) WK # { WK #if (!is.logical(incomparables) || incomparables) WK #.NotYetUsed(incomparables != FALSE) WK #duplicated(do.call(paste, c(x, sep = \r)), fromLast = fromLast) WK # } WK # environment: namespace:base WK clearly, the intention here is to raise an error with a (still hardly WK clear) message as in: WK .NotYetUsed(incomparables != FALSE) WK # Error: argument 'incomparables != FALSE' is not used (yet) WK but instead, if(NA) is evaluated (because '!is.logical(NA) || NA' WK evaluates, *obviously*, to NA) and hence the uninformative error message. WK take home point: rtfm, *but* don't believe it. and then be helpful to the R community and send a bug report *with* a patch if {as in this case} you are able to... Well, that' no longer needed here, I'll fix that easily myself. but i *have* sent a patch already! vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] as.data.frame peculiarities
Stavros Macrakis wrote: The documentation of as.data.frame is not explicit about how it generates column names for the simple vector case, but it seems to use the character form of the quoted argument, e.g. names(as.data.frame(1:3)) [1] 1:3 But there is a strange case: names(as.data.frame(c(a))) [1] if (stringsAsFactors) factor(x) else x gosh! you don't even need the c(): names(as.data.frame('')) # same as above i thought you don''t even need the '', but then you're served with the following highly informative message: names(as.data.frame()) # Error in as.data.frame() : # element 1 is empty; #the part of the args list of 'is.null' being evaluated was: #(x) which actually comes from as.data.frame(). I feel fairly comfortable calling this a bug, though there is no explicit specification. maybe there is none so that it can always be claimed that you deal with an intentional, but not (yet) documented feature, rather than a bug. let's investigate this feature. in names(as.data.frame('a')) as.data.frame is generic, 'a' is character, thus as.data.frame.character(x, ...) is called with x = 'a'. here's the code for as.data.frame.character: function (x, ..., stringsAsFactors = default.stringsAsFactors()) as.data.frame.vector(if (stringsAsFactors) factor(x) else x, ...) and the as.data.frame.vector it calls: function (x, row.names = NULL, optional = FALSE, ...) { nrows - length(x) nm - paste(deparse(substitute(x), width.cutoff = 500L), collapse = ) if (is.null(row.names)) { if (nrows == 0L) row.names - character(0L) else if (length(row.names - names(x)) == nrows !any(duplicated(row.names))) { } else row.names - .set_row_names(nrows) } names(x) - NULL value - list(x) if (!optional) names(value) - nm attr(value, row.names) - row.names class(value) - data.frame value } watch carefully: nm = paste(deparse(substitute(x)), width.cutoff=500L), that is: nm = paste(if (stringsAsFactors) factor(x) else x, width.cutoff=500L) x = factor('a'), row.names==NULL, names(x)==NULL, and nrows = 1, and thus row.names = .set_row_names(1) = c(NA, -1) (interesting; see .set_row_names). and then we have: x = factor('a') # the input names(x) = NULL value = list(x) # value == list(factor('a')) names(value) = if (stringsAsFactors) factor(x) else x # the value of nm attr(value, 'row.names') = c(NA, -1) # the value of row.names class(value) = 'data.frame' value here you go: as some say, the answer is always in the code. that's how ugly hacks with deparse/substitute lead r core developers to produce ugly bugs. very useful, indeed. There is another strange case which I don't understand. The specification of 'optional' is: optional: logical. If 'TRUE', setting row names and converting column names (to syntactic names: see 'make.names') is optional. I am not sure what this means and why it is useful. In practice, it seems to produce a structure of class data.frame which exhibits some very odd behavior: d - as.data.frame(c(a),optional=TRUE) class(d) [1] data.frame d structure(a, class = AsIs)where does this column name come from? 1 a' gosh... rtfc, again; code as above, but this time optional=TRUE so names(value) = nm does not apply: x = factor('a') # the input names(x) = NULL value = list(x) # value == list(factor('a')) attr(value, 'row.names') = c(NA, -1) # the value of row.names class(value) = 'data.frame' value here you go. names(d) NULL not from names() yes, because it was explicitly set to NULL, second line above. dput(d) structure(list(structure(1L, .Label = a, class = factor)), row.names = c(NA, -1L), class = data.frame) and it doesn't show up in dput yes, because there are no names there! it's format.data.frame, called from print.data.frame, called from print(value), that makes up this column name; rtfc. seems like there's a need for post-implementation design. for the desserts, here's another curious, somewhat related example: data = data.frame(1) row.names(data) = TRUE data # X1 # TRUE 1 as.data.frame(1, row.names=TRUE) # Error in attr(value, row.names) - row.names : # row names must be 'character' or 'integer', not 'logical' probably not a bug, because ?as.data.frame says: row.names: 'NULL' or a character vector giving the row names for the data frame. Missing values are not allowed. so it's rather a design flaw. much harder to fix in r. best, vQ __ R-devel@r-project.org mailing list
Re: [Rd] duplicated.data.frame {was [R] which rows are duplicates?}
Martin Maechler wrote: and then be helpful to the R community and send a bug report *with* a patch if {as in this case} you are able to... Well, that' no longer needed here, I'll fix that easily myself. WK but i *have* sent a patch already! Ok, I believe you. But I think you did not mention that during this thread, ... and/or I must have overlooked your patch. In any case the problem is now solved [well, a better solution of course would add the not-yet functionality..]; thank you for the contribution. i attach the patch post for reference. note that you need to fix all of the functions in duplicated.R that share the buggy code. (yes, this was another thread; i submitted a bug report, and then sent a follow-up post with a patch). vQ ---BeginMessage--- the bug seems to have a trivial solution; as far as i can see, it suffices to replace if (!is.logical(incomparables) || incomparables) with if(!identical(incomparables, FALSE)) in all its occurrences in src/library/base/R/duplicated.R attached is a patch created, successfully tested and installed on Ubuntu 8.04 Linux 32 bit as follows: svn co https://svn.r-project.org/R/trunk trunk cd trunk # edit src/library/base/R/duplicated.R svn diff duplicated.R.diff svn revert -R src patch -p0 duplicated.R.diff tools/rsync-recommended ./configure make make check and now duplicated(data.frame(), incomparables=NA) # error: argument 'incomparables != FALSE' is not used (yet) regards, vQ waclaw.marcin.kusnierc...@idi.ntnu.no wrote: Full_Name: Wacek Kusnierczyk Version: 2.8.0 and 2.10.0 r48242 OS: Ubuntu 8.04 Linux 32 bit Submission from: (NULL) (129.241.110.161) In the following code: duplicated(data.frame(), incomparables=NA) # Error in if (!is.logical(incomparables) || incomparables) .NotYetUsed(incomparables != FALSE) : # missing value where TRUE/FALSE needed the raised error is clearly not the one intended to be raised. ?duplicated says: incomparables: a vector of values that cannot be compared. 'FALSE' is a special value, meaning that all values can be compared, and may be the only value accepted for methods other than the default. It will be coerced internally to the same type as 'x'. (...) Values in 'incomparables' will never be marked as duplicated. This is intended to be used for a fairly small set of values and will not be efficient for a very large set. However, in duplicated.data.frame (which is called when duplicated is applied to a data frame, as above) the parameter 'incomparables' is defunct. The documentation fails to explain this, and it might be a good idea to improve it. In the code for duplicated.data.frame there is an attempt to intercept any use of the parameter 'incomparables' with a value other than FALSE and to raise an appropriate error, but this attempt fails with, e.g., incomparables=NA. Incidentally, the attempt to intercept incomparables != FALSE fails completely (i.e., the call to duplicated succeeds) with certain inputs: duplicated(data.frame(logical=c(TRUE, TRUE)), incomparables=c(FALSE, TRUE)) # [1] FALSE TRUE while duplicated(c(TRUE, TRUE), incomparables=c(FALSE, TRUE)) # [1] FALSE FALSE Regards, vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- --- Wacek Kusnierczyk, MD PhD Email: w...@idi.ntnu.no Phone: +47 73591875, +47 72574609 Department of Computer and Information Science (IDI) Faculty of Information Technology, Mathematics and Electrical Engineering (IME) Norwegian University of Science and Technology (NTNU) Sem Saelands vei 7, 7491 Trondheim, Norway Room itv303 Bioinformatics Gene Regulation Group Department of Cancer Research and Molecular Medicine (IKM) Faculty of Medicine (DMF) Norwegian University of Science and Technology (NTNU) Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway Room 231.05.060 --- Index: src/library/base/R/duplicated.R === --- src/library/base/R/duplicated.R (revision 48242) +++ src/library/base/R/duplicated.R (working copy) @@ -25,7 +25,7 @@ duplicated.data.frame - function(x, incomparables = FALSE, fromLast = FALSE, ...) { -if(!is.logical(incomparables) || incomparables) +if (!identical(incomparables, FALSE)) .NotYetUsed(incomparables != FALSE) duplicated(do.call(paste, c(x, sep=\r)), fromLast = fromLast) } @@ -33,7 +33,7 @@ duplicated.matrix - duplicated.array - function(x, incomparables = FALSE , MARGIN = 1L, fromLast = FALSE, ...) { -if(!is.logical(incomparables
Re: [Rd] duplicated fails to rise correct errors (PR#13632)
the bug seems to have a trivial solution; as far as i can see, it suffices to replace if (!is.logical(incomparables) || incomparables) with if(!identical(incomparables, FALSE)) in all its occurrences in src/library/base/R/duplicated.R attached is a patch created, successfully tested and installed on Ubuntu 8.04 Linux 32 bit as follows: svn co https://svn.r-project.org/R/trunk trunk cd trunk # edit src/library/base/R/duplicated.R svn diff duplicated.R.diff svn revert -R src patch -p0 duplicated.R.diff tools/rsync-recommended ./configure make make check and now duplicated(data.frame(), incomparables=NA) # error: argument 'incomparables != FALSE' is not used (yet) regards, vQ waclaw.marcin.kusnierc...@idi.ntnu.no wrote: Full_Name: Wacek Kusnierczyk Version: 2.8.0 and 2.10.0 r48242 OS: Ubuntu 8.04 Linux 32 bit Submission from: (NULL) (129.241.110.161) In the following code: duplicated(data.frame(), incomparables=NA) # Error in if (!is.logical(incomparables) || incomparables) .NotYetUsed(incomparables != FALSE) : # missing value where TRUE/FALSE needed the raised error is clearly not the one intended to be raised. ?duplicated says: incomparables: a vector of values that cannot be compared. 'FALSE' is a special value, meaning that all values can be compared, and may be the only value accepted for methods other than the default. It will be coerced internally to the same type as 'x'. (...) Values in 'incomparables' will never be marked as duplicated. This is intended to be used for a fairly small set of values and will not be efficient for a very large set. However, in duplicated.data.frame (which is called when duplicated is applied to a data frame, as above) the parameter 'incomparables' is defunct. The documentation fails to explain this, and it might be a good idea to improve it. In the code for duplicated.data.frame there is an attempt to intercept any use of the parameter 'incomparables' with a value other than FALSE and to raise an appropriate error, but this attempt fails with, e.g., incomparables=NA. Incidentally, the attempt to intercept incomparables != FALSE fails completely (i.e., the call to duplicated succeeds) with certain inputs: duplicated(data.frame(logical=c(TRUE, TRUE)), incomparables=c(FALSE, TRUE)) # [1] FALSE TRUE while duplicated(c(TRUE, TRUE), incomparables=c(FALSE, TRUE)) # [1] FALSE FALSE Regards, vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- --- Wacek Kusnierczyk, MD PhD Email: w...@idi.ntnu.no Phone: +47 73591875, +47 72574609 Department of Computer and Information Science (IDI) Faculty of Information Technology, Mathematics and Electrical Engineering (IME) Norwegian University of Science and Technology (NTNU) Sem Saelands vei 7, 7491 Trondheim, Norway Room itv303 Bioinformatics Gene Regulation Group Department of Cancer Research and Molecular Medicine (IKM) Faculty of Medicine (DMF) Norwegian University of Science and Technology (NTNU) Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway Room 231.05.060 --- Index: src/library/base/R/duplicated.R === --- src/library/base/R/duplicated.R (revision 48242) +++ src/library/base/R/duplicated.R (working copy) @@ -25,7 +25,7 @@ duplicated.data.frame - function(x, incomparables = FALSE, fromLast = FALSE, ...) { -if(!is.logical(incomparables) || incomparables) +if (!identical(incomparables, FALSE)) .NotYetUsed(incomparables != FALSE) duplicated(do.call(paste, c(x, sep=\r)), fromLast = fromLast) } @@ -33,7 +33,7 @@ duplicated.matrix - duplicated.array - function(x, incomparables = FALSE , MARGIN = 1L, fromLast = FALSE, ...) { -if(!is.logical(incomparables) || incomparables) +if (!identical(incomparables, FALSE)) .NotYetUsed(incomparables != FALSE) ndim - length(dim(x)) if (length(MARGIN) ndim || any(MARGIN ndim)) @@ -67,7 +67,7 @@ unique.data.frame - function(x, incomparables = FALSE, fromLast = FALSE, ...) { -if(!is.logical(incomparables) || incomparables) +if (!identical(incomparables, FALSE)) .NotYetUsed(incomparables != FALSE) x[!duplicated(x, fromLast = fromLast), , drop = FALSE] } @@ -75,7 +75,7 @@ unique.matrix - unique.array - function(x, incomparables = FALSE , MARGIN = 1, fromLast = FALSE, ...) { -if(!is.logical(incomparables) || incomparables) +if (!identical(incomparables, FALSE)) .NotYetUsed(incomparables != FALSE) ndim - length(dim(x)) if (length(MARGIN) 1L || any(MARGIN ndim
Re: [Rd] [R] [.data.frame and lapply
Romain Francois wrote: Wacek Kusnierczyk wrote: redirected to r-devel, because there are implementational details of [.data.frame discussed here. spoiler: at the bottom there is a fairly interesting performance result. Romain Francois wrote: Hi, This is a bug I think. [.data.frame treats its arguments differently depending on the number of arguments. you might want to hesitate a bit before you say that something in r is a bug, if only because it drives certain people mad. r is a carefully tested software, and [.data.frame is such a basic function that if what you talk about were a bug, it wouldn't have persisted until now. I did hesitate, and would be prepared to look the other way of someone shows me proper evidence that this makes sense. d - data.frame( x = 1:10, y = 1:10, z = 1:10 ) d[ j=1 ] x y z 1 1 1 1 2 2 2 2 3 3 3 3 4 4 4 4 5 5 5 5 6 6 6 6 7 7 7 7 8 8 8 8 9 9 9 9 10 10 10 10 If a single index is supplied, it is interpreted as indexing the list of columns. Clearly this does not happen here, and this is because NextMethod gets confused. obviously. it seems that there is a bug here, and that it results from the lack of clear design specification. I have not looked your implementation in details, but it misses array indexing, as in: yes; i didn't take it into consideration, but (still without detailed analysis) i guess it should not be difficult to extend the code to handle this. d - data.frame( x = 1:10, y = 1:10, z = 1:10 ) m - cbind( 5:7, 1:3 ) m [,1] [,2] [1,]51 [2,]62 [3,]73 d[m] [1] 5 6 7 subdf( d, m ) Error in subdf(d, m) : undefined columns selected this should be easy to handle by checking if i is a matrix and then indexing by its first column as i and the second as j. Matrix indexing using '[' is not recommended, and barely supported. For extraction, 'x' is first coerced to a matrix. For replacement a logical matrix (only) can be used to select the elements to be replaced in the same way as for a matrix. yes, here's how it's done (original comment): if(is.matrix(i)) return(as.matrix(x)[i]) # desperate measures and i can easily add this to my code, at virtually no additional expense. it's probably not a good idea to convert x to a matrix, x would often be much more data than the index matrix m, so it's presumably much more efficient, on average, to fiddle with i instead. there are some potentially confusing issues here: m = cbind(8:10, 1:3) d[m] # 3-element vector, as you could expect d[t(m)] # 6-element vector t(m) has dimensionality inappropriate for matrix indexing (it has 3 columns), so it gets flattened into a vector; however, it does not work like in the case of a single vector index where columns would be selected: d[as.vector(t(m))] # error: undefined columns selected i think it would be more appropriate to raise an error in a case like d[t(m)]. furthermore, if a matrix is used in a two-index form, the matrix is flattened again and is used to select rows (not elements, as in d[t(m)]). note also that the help page says that for extraction, 'x' is first coerced to a matrix. it fails to explain that if *two* indices are used of which at least one is a matrix, no coercion is done. that is, the matrix is again flattened into a vector, but here [.data.frame forgets that it was a matrix (unlike in d[t(m)]): is(d[m]) # a character vector, matrix indexing is(d[t(m)]) # a character vector, vector indexing of elements, not columns is(d[m,]) # a data frame, row indexing and finally, the fact that d[m] in fact converts x (i.e., d) to a matrix before the indexing means that the types of values in a some columns in d may get coerced to another type: d[,2] = as.character(d[,2]) is(d[,1]) # integer vector is(d[,2]) # character vector is(d[1:2, 1]) # integer vector is(d[cbind(1:2, 1)]) # character vector for all it's worth, i think matrix indexing of data frames should be dropped: d[m] # error: ... and if one needs it, it's as simple as as.matrix(d)[m] where the conversion of d to a matrix is explicit. on the side, [.data.frame is able to index matrices: '[.data.frame'(as.matrix(d), m) # same as as.matrix(d)[m] which is, so to speak, nonsense, since '[.data.frame' is designed specifically to handle data frames; i'd expect an error to be raised here (or a warning, at the very least). to summarize, the fact that subdf does not handle matrix indices is not an issue. anyway, thanks for the comment! best, vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [R] [.data.frame and lapply
redirected to r-devel, because there are implementational details of [.data.frame discussed here. spoiler: at the bottom there is a fairly interesting performance result. Romain Francois wrote: Hi, This is a bug I think. [.data.frame treats its arguments differently depending on the number of arguments. you might want to hesitate a bit before you say that something in r is a bug, if only because it drives certain people mad. r is a carefully tested software, and [.data.frame is such a basic function that if what you talk about were a bug, it wouldn't have persisted until now. treating the arguments differently depending on their number is actually (if clearly...) documented: if there is one index (the 'i'), it selects columns. if there are two, 'i' selects rows. however, not all seems fine, there might be a design flaw: # dummy data frame d = structure(names=paste('col', 1:3, sep='.'), data.frame(row.names=paste('row', 1:3, sep='.'), matrix(1:9, 3, 3))) d[1:2] # correctly selects two first columns # 1:2 passed to [.data.frame as i, no j given d[,1:2] # correctly selects two first columns # 1:2 passed to [.data.frame as j, i given the missing argument value (note the comma) d[,i=1:2] # correctly selects two first rows # 1:2 passed to [.data.frame as i, j given the missing argument value (note the comma) d[j=1:2,] # correctly selects two first columns # 1:2 passed to [.data.frame as j, i given the missing argument value (note the comma) d[i=1:2] # correctly (arguably) selects the first two columns # 1:2 passed to [.data.frame as i, no j given d[j=1:2] # wrong: returns the whole data frame # does not recognize the index as i because it is explicitly named 'j' # does not recognize the index as j because there is only one index i say this *might* be a design flaw because it's hard to judge what the design really is. the r language definition (!) [1, sec. 3.4.3 p. 18] says: The most important example of a class method for [ is that used for data frames. It is not be described in detail here (see the help page for [.data.frame, but in broad terms, if two indices are supplied (even if one is empty) it creates matrix-like indexing for a structure that is basically a list of vectors of the same length. If a single index is supplied, it is interpreted as indexing the list of columns—in that case the drop argument is ignored, with a warning. it does not say what happens when only one *named* index argument is given. from the above, it would indeed seem that there is a *bug* here: in the last example above only one index is given, and yet columns are not selected, even though the *language definition* says they should. (so it's not a documented feature, it's a contra-definitional misfeature -- a bug?) somewhat on the side, the 'matrix-like indexing' above is fairly misleading; just try the same patterns of indexing -- one index, two indices, named indices -- on a data frame and a matrix of the same shape: m = matrix(1:9, 3, 3) md = data.frame(m) md[1] # the first column m[1] # the first element (i.e., m[1,1]) md[,i=3] # third row m[,i=3] # third column the quote above refers to the ?'[.data.frame' for details. unfortunately, it the help page a lump of explanations for various '['-like operators, and it is *not* a definition of any sort. it does not provide much more detail on '[.data.frame' -- it is hardly as a design specification. in particular, it does not explain the issue of named arguments to '[.data.frame' at all. `[.data.frame` only is called with two arguments in the second case, so the following condition is true: if(Narg 3L) { # list-like indexing or matrix indexing And then, the function assumes the argument it has been passed is i, and eventually calls NextMethod([) which I think calls `[.listof`(x,i,...), since i is missing in `[.data.frame` it is not passed to `[.listof`, so you have something equivalent to as.list(d) []. I think we can replace the condition with this one: if(Narg 3L !has.j) { # list-like indexing or matrix indexing or this: if(Narg 3L) { # list-like indexing or matrix indexing if(has.j) i - j indeed, for a moment i thought a trivial fix somewhere there would suffice. unfortunately, the code for [.data.frame [2, lines 500-641] is so clean and readable that i had to give up reading it, forget fixing. instead, i wrote an new version of '[.data.frame' from scratch. it fixes (or at least seems to fix, as far as my quick assessment goes) the problem. the function subdf (see the attached dataframe.r) is the new version of '[.data.frame': # dummy data frame d = structure(names=paste('col', 1:3, sep='.'), data.frame(row.names=paste('row', 1:3, sep='.'), matrix(1:9, 3, 3))) d[j=1:2] # incorrect: the whole data frame subdf(d,
Re: [Rd] typo in sprintf format string segfaults R
Sklyar, Oleg (London) wrote: typo as simple as %S instead of %s segfaults R devel: not exactly: sprintf('%S', 'aa') # error: unrecognised format at end of string without a segfault. but with another format specifier behind, it will cause a segfault. interestingly, here's again the same problem i have reported recently: that you are given a number of options for how to leave the session, but you can type ^c and stay in a semi-working session. (and the next execution of the above will then cause a segfault with immediate exit.) vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [R] variance/mean
William Dunlap wrote: Doesn't Fortran still require that the arguments to a function not alias each other (in whole or in part)? what do you mean? the following works pretty fine: echo ' program foo implicit none integer, target :: a = 1 integer, pointer :: p1, p2, p3 integer :: gee p1 = a p2 = a p3 = a write(*,*) p1, p2, p3 call bar (p1, p2, p3) write(*,*) p1, p2, p3 a = gee(p1, p2, p3) write(*,*) p1, p2, p3 end program foo subroutine bar (p1, p2, p3) integer :: p1, p2, p3 p3 = p1 + p2 end subroutine bar function gee(p1, p2, p3) integer :: p1, p2, p3, gee p3 = p1 + p2 gee = p3 return end function gee ' foo.f95 gfortran foo.f95 -o foo ./foo # 1 1 1 # 2 2 2 # 4 4 4 clearly, p1, p2, and p3 are aliases of each other, and there is an assignment made in both the subroutine and the function. have i misunderstood what you said? vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] incoherent treatment of NULL
somewhat related to a previous discussion [1] on how 'names-' would sometimes modify its argument in place, and sometimes produce a modified copy without changing the original, here's another example of how it becomes visible to the user when r makes or doesn't make a copy of an object: x = NULL dput(x) # NULL class(x) = 'integer' # error: invalid (NULL) left side of assignment x = c() dput(x) # NULL class(x) = 'integer' dput(x) # integer(0) in both cases, x ends up with the value NULL (the no-value object). in both cases, dput explains that x is NULL. in both cases, an attempt is made to make x be an empty integer vector. the first fails, because it tries to modify NULL itself, the latter apparently does not and succeeds. however, the following has a different pattern: x = NULL dput(x) # NULL names(x) = character(0) # error: attempt to set an attribute on NULL x = c() dput(x) # NULL names(x) = character(0) # error: attempt to set an attribute on NULL and also: x = c() class(x) = 'integer' # fine class(x) = 'foo' # error: attempt to set an attribute on NULL how come? the behaviour can obviously be explained by looking at the source code (hardly surprisingly, because it is as it is because the source is as it is), and referring to the NAMED property (i.e., the sxpinfo.named field of a SEXPREC struct). but can the *design* be justified? can the apparent incoherences visible above the interface be defended? why should the first example above be unable to produce an empty integer vector? why is it possible to set a class attribute, but not a names attribute, on c()? why is it possible to set the class attribute in c() to 'integer', but not to 'foo'? why are there different error messages for apparently the same problem? vQ [1] search the rd archives for 'surprising behaviour of names-' __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] incoherent treatment of NULL
Martin Maechler wrote: more verbously, all NULL objects in R are identical, or as the help page says, there's only ``*The* NULL Object'' in R, i.e., NULL cannot get any attributes. WK yes, but that's not the issue. the issue is that names(x)- seems to WK try to attach an attribute to NULL, while it could, in principle, do the WK same as class(x)-, i.e., coerce x to some type (and hence attach the WK name attribute not to NULL, but to the coerced-to object). yes, it could; but really, the fact that 'class-' works is the exception. The other variants (with the error message) are the rule. ok. Also, note (here and further below), that Using class(.) - className is an S3 idiom and S3 classes ``don't really exist'', the class attribute being a useful hack, and many of us would rather like to work and improve working with S4 classes ( generics methods) than to fiddle with 'class-'. In S4, you'd use setClass(.), new(.) and setAs(.), typically, for defining and changing classes of objects. But maybe I have now lead you into a direction I will later regret, when you start telling us about the perceived inconsistencies of S4 classes, methods, etc. BTW: If you go there, please do use R 2.9.0 (or newer) using latest r-devel for the most part. i think you will probably not regret your words; from what i've seen already, s4 classes are the last thing i'd ever try to learn in r. but yes, there would certainly be lots of issues to complain about. i'll rather wait for s5. regards, vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] dput(as.list(function...)...) bug
Stavros Macrakis wrote: Tested in R 2.8.1 Windows ff - formals(function(x)1) ff1 - as.list(function(x)1)[1] # ff1 acts the same as ff in the examples below, but is a list rather than a pairlist dput( ff , control=c(warnIncomplete)) list(x = ) This string is not parsable, but dput does not give a warning as specified. same in 2.10.0 r48200, ubuntu 8.04 linux 32 bit dput( ff , control=c(all,warnIncomplete)) list(x = quote()) likewise. This string is parseable, but quote() is not evaluable, and again dput does not give a warning as specified. In fact, I don't know how to write out ff$x. It appears to be the zero-length name: is.name(ff$x) = TRUE as.character(ff$x) = but there is no obvious way to create such an object: as.name() = execution error quote(``) = parse error The above examples should either produce a parseable and evaluable output (preferable), or give a warning. interestingly, quote(NULL) # NULL as.name(NULL) # Error in as.name(NULL) : # invalid type/length (symbol/0) in vector allocation æsj. vQ -s PS As a matter of comparative linguistics, many versions of Lisp allow zero-length symbols/names. But R coerces strings to symbols/names in a way that Lisp does not, so that might be an invitation to obscure bugs in R where it is rarely problematic in Lisp. PPS dput(pairlist(23),control=all) also gives the same output as dput(list(23),control=all), but as I understand it, pairlists will become non-user-visible at some point. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel -- --- Wacek Kusnierczyk, MD PhD Email: w...@idi.ntnu.no Phone: +47 73591875, +47 72574609 Department of Computer and Information Science (IDI) Faculty of Information Technology, Mathematics and Electrical Engineering (IME) Norwegian University of Science and Technology (NTNU) Sem Saelands vei 7, 7491 Trondheim, Norway Room itv303 Bioinformatics Gene Regulation Group Department of Cancer Research and Molecular Medicine (IKM) Faculty of Medicine (DMF) Norwegian University of Science and Technology (NTNU) Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway Room 231.05.060 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [R] variance/mean
(this post suggests a patch to the sources, so i allow myself to divert it to r-devel) Bert Gunter wrote: x a numeric vector, matrix or data frame. y NULL (default) or a vector, matrix or data frame with compatible dimensions to x. The default is equivalent to y = x (but more efficient). bert points to an interesting fragment of ?var: it suggests that computing var(x) is more efficient than computing var(x,x), for any x valid as input to var. indeed: set.seed(0) x = matrix(rnorm(1), 100, 100) library(rbenchmark) benchmark(replications=1000, columns=c('test', 'elapsed'), var(x), var(x, x)) #test elapsed # 1var(x) 1.091 # 2 var(x, x) 2.051 that's of course, so to speak, unreasonable: for what var(x) does is actually computing the covariance of x and x, which should be the same as var(x,x). the hack is that if y is given, there's an overhead of memory allocation for *both* x and y when y is given, as seen in src/main/cov.c:720+. incidentally, it seems that the problem can be solved with a trivial fix (see the attached patch), so that set.seed(0) x = matrix(rnorm(1), 100, 100) library(rbenchmark) benchmark(replications=1000, columns=c('test', 'elapsed'), var(x), var(x, x)) #test elapsed # 1var(x) 1.121 # 2 var(x, x) 1.107 with the quick checks all.equal(var(x), var(x, x)) # TRUE all(var(x) == var(x, x)) # TRUE and for cor it seems to make cor(x,x) slightly faster than cor(x), while originally it was twice slower: # original benchmark(replications=1000, columns=c('test', 'elapsed'), cor(x), cor(x, x)) #test elapsed # 1cor(x) 1.196 # 2 cor(x, x) 2.253 # patched benchmark(replications=1000, columns=c('test', 'elapsed'), cor(x), cor(x, x)) #test elapsed # 1cor(x) 1.207 # 2 cor(x, x) 1.204 (there is a visible penalty due to an additional pointer test, but it's 10ms on 1000 replications with 1 data points, which i think is negligible.) This is as clear as I would know how to state. i believe bert is right. however, with the above fix, this can now be rewritten as: x: a numeric vector, matrix or data frame. y: a vector, matrix or data frame with dimensions compatible to those of x. By default, y = x. which, to my simple mind, is even more clear than what bert would know how to state, and less likely to cause the sort of confusion that originated this thread. the attached patch suggests modifications to src/main/cov.c and src/library/stats/man/cor.Rd. it has been prepared and checked as follows: svn co https://svn.r-project.org/R/trunk trunk cd trunk # edited the sources svn diff cov.diff svn revert -R src patch -p0 cov.diff tools/rsync-recommended ./configure make make check bin/R # subsequent testing within R if you happen to consider this patch for a commit, please be sure to examine and test it carefully first. vQ Index: src/library/stats/man/cor.Rd === --- src/library/stats/man/cor.Rd (revision 48200) +++ src/library/stats/man/cor.Rd (working copy) @@ -6,9 +6,9 @@ \name{cor} \title{Correlation, Variance and Covariance (Matrices)} \usage{ -var(x, y = NULL, na.rm = FALSE, use) +var(x, y = x, na.rm = FALSE, use) -cov(x, y = NULL, use = everything, +cov(x, y = x, use = everything, method = c(pearson, kendall, spearman)) cor(x, y = NULL, use = everything, @@ -32,9 +32,7 @@ } \arguments{ \item{x}{a numeric vector, matrix or data frame.} - \item{y}{\code{NULL} (default) or a vector, matrix or data frame with -compatible dimensions to \code{x}. The default is equivalent to -\code{y = x} (but more efficient).} + \item{y}{a vector, matrix or data frame with dimensions compatible to those of \code{x}. By default, y = x.} \item{na.rm}{logical. Should missing values be removed?} \item{use}{an optional character string giving a method for computing covariances in the presence Index: src/main/cov.c === --- src/main/cov.c (revision 48200) +++ src/main/cov.c (working copy) @@ -689,7 +689,7 @@ if (ansmat) PROTECT(ans = allocMatrix(REALSXP, ncx, ncy)); else PROTECT(ans = allocVector(REALSXP, ncx * ncy)); sd_0 = FALSE; -if (isNull(y)) { +if (isNull(y) || (DATAPTR(x) == DATAPTR(y))) { if (everything) { /* NA's are propagated */ PROTECT(xm = allocVector(REALSXP, ncx)); PROTECT(ind = allocVector(LGLSXP, ncx)); __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] gsub('(.).(.)(.)', '\\3\\2\\1', 'gsub')
there seems to be something wrong with r's regexing. consider the following example: gregexpr('a*|b', 'ab') # positions: 1 2 # lengths: 1 1 gsub('a*|b', '.', 'ab') # .. where the pattern matches any number of 'a's or one b, and replaces the match with a dot, globally. the answer is correct (assuming a dfa engine). however, gregexpr('a*|b', 'ab', perl=TRUE) # positions: 1 2 # lengths: 1 0 gsub('a*|b', '.', 'ab', perl=TRUE) # .b. where the pattern is identical, but the result is wrong. perl uses an nfa (if it used a dfa, the result would still be wrong), and in the above example it should find *four* matches, collectively including *all* letters in the input, thus producing *four* dots (and *only* dots) in the output: perl -le ' $input = qq|ab|; print qq|match: $_| foreach $input =~ /a*|b/g; $input =~ s/a*|b/./g; print qq|output: $input|;' # match: a # match: # match: b # match: # output: since with perl=TRUE both gregexpr and gsub seem to use pcre, i've checked the example with pcretest, and also with a trivial c program (available on demand) using the pcre api; there were four matches, exactly as in the perl bit above. the results above are surprising, and suggest a bug in r's use of pcre rather than in pcre itself. possibly, the issue is that when an empty sting is matched (with a*, for example), the next attempt is not trying to match a non-empty string at the same position, but rather an empty string again at the next position. for example, gsub('a|b|c', '.', 'abc', perl=TRUE) # ..., correct gsub('a*|b|c', '.', 'abc', perl=TRUE) # .b.c., wrong gsub('a|b*|c', '.', 'abc', perl=TRUE) # ..c., wrong (but now only 'c' remains) gsub('a|b*|c', '.', 'aba', perl=TRUE) # ..., incidentally correct without detailed analysis of the code, i guess the bug is located somewhere in src/main/pcre.c, and is distributed among the do_p* functions, so that multiple fixes may be needed. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] sprintf causes a segfault (PR#13613)
strangely enough, the way r handles the same sequence of expressions on different occasions varies: # fresh session 1 e = simpleError('foo') sprintf('%s', e) # segfault: address 0x202, cause memory not mapped # ^c sprintf('%s', e) # error in sprintf(%s, e) : 'getEncChar' must be called on a CHARSXP # fresh session 2 e = simpleError('foo') sprintf('%s', e) # segfault: address (nil), cause memory not mapped # ^c sprintf('%s', e) # segfault, exit note the difference in the address and how this relates to the outcome of the second execution of sprintf('%s', e) vQ waclaw.marcin.kusnierc...@idi.ntnu.no wrote: the following code illustrates a problem with sprintf which consistently causes a segfault when applied to certain type of arguments. it also shows inconsistent consequences of the segfault: (e = tryCatch(stop(), error=identity)) # e is an error object sprintf('%d', e) # error in sprintf(%d, e) : unsupported type sprintf('%f', e) # error in sprintf(%f, e) : (list) object cannot be coerced to type 'double' sprintf('%s', e) # segfault reported, with a choice of options for how to exit the session it is possible not to leave the session, by simply typing ^c (ctrl-c). (which should probably be prohibited.) if one stays in the session, then trying to evaluate sprintf('%s', e) will cause a segfault with immediate crash (r is silently closed), but not necessarily if some other code is executed first. in the latter case, there may be no segfault, but an error message might be printed instead: e = tryCatch(stop(), error=identity) sprintf('%s', e) # segfault, choice of options # ^c, stay in the session e = tryCatch(stop(), error=identity) sprintf('%s', e) # segfault, immediate exit e = tryCatch(stop(), error=identity) sprintf('%s', e) # segfault, choice of options # ^c, stay in the session e = tryCatch(stop(), error=identity) x = 1 # possibly, whatever code would do sprintf('%s', e) # [1] Error in doTryCatch(return(expr), name, parentenv, handler): \n # [2] Error in doTryCatch(return(expr), name, parentenv, handler): \n sprintf('%s', e) # segfault, immediate exit in the second code snippet above, on some executions the error message was printed. on others a segfault caused immediate exit. (the pattern seems to differ between 2.8.0 and 2.10.0-devel.) __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] [R] incoherent conversions from/to raw
Wacek Kusnierczyk wrote: interestingly, c(1, as.raw(1)) # error: type 'raw' is unimplemented in 'RealAnswer' three more comments. (1) the above is interesting in the light of what ?c says: The output type is determined from the highest type of the components in the hierarchy NULL raw logical integer real complex character list expression. which seems to suggest that raw components should be coerced to whatever the highest type among all arguments to c, which clearly doesn't happen: test = function(type) c(as.raw(1), get(sprintf('as.%s',type))(1)) for (type in c('null', 'logical', 'integer', 'real', 'complex', 'character', 'list', 'expression')) tryCatch(test(type), error = function(e) cat(sprintf(raw won't coerce to %s type\n, type))) which shows that raw won't coerce to the four first types in the 'hierarchy' (excluding NULL), but it will to character, list, and expression. suggestion: improve the documentation, or adapt the implementation to a more coherent design. (2) incidentally, there's a bug somewhere there related to the condition system and printing: tryCatch(stop(), error=function(e) print(e)) # works just fine tryCatch(stop(), error=function(e) sprintf('%s', e)) # *** caught segfault *** # address (nil), cause 'memory not mapped' # Traceback: # 1: sprintf(%s, e) # 2: value[[3]](cond) # 3: tryCatchOne(expr, names, parentenv, handlers[[1]]) # 4: tryCatchList(expr, classes, parentenv, handlers) # 5: tryCatch(stop(), error = function(e) sprintf(%s, e)) # Possible actions: # 1: abort (with core dump, if enabled) # 2: normal R exit # 3: exit R without saving workspace # 4: exit R saving workspace # Selection: interestingly, it is possible to stay in the session by typing ^C. the session seems to work, but if the tryCatch above is tried once again, a segfault causes r to crash immediately: # ^C tryCatch(stop(), error=function(e) sprintf('%s', e)) # [whoe...@wherever] $ however, this doesn't happen if some other code is evaluated first: # ^C x = 1:10^8 tryCatch(stop(), error=function(e) sprintf('%s', e)) # Error in sprintf(%s, e) : 'getEncChar' must be called on a CHARSXP this can't be a feature. (tried in both 2.8.0 and r-devel; version info at the bottom.) suggestion: trace down and fix the bug. (3) the error argument to tryCatch is used in two examples in ?tryCatch, but it is not explained anywhere in the help page. one can guess that the argument name corresponds to the class of conditions the handler will handle, but it would be helpful to have this stated explicitly. the help page simply says: If a condition is signaled while evaluating 'expr' then established handlers are checked, starting with the most recently established ones, for one matching the class of the condition. When several handlers are supplied in a single 'tryCatch' then the first one is considered more recent than the second. which is uninformative in this respect -- what does 'one matching the class' mean? suggestion: improve the documentation. vQ version _ platform i686-pc-linux-gnu arch i686 os linux-gnu system i686, linux-gnu status major 2 minor 8.0 year 2008 month 10 day20 svn rev46754 language R version.string R version 2.8.0 (2008-10-20) version _ platform i686-pc-linux-gnu arch i686 os linux-gnu system i686, linux-gnu status Under development (unstable) major 2 minor 9.0 year 2009 month 03 day 19 svn rev 48152 language R version.string R version 2.9.0 Under development (unstable) (2009-03-19 r48152) __ R-devel@r
Re: [Rd] Match .3 in a sequence
Petr Savicky wrote: On Mon, Mar 16, 2009 at 07:39:23PM -0400, Stavros Macrakis wrote: ... Let's look at the extraordinarily poor behavior I was mentioning. Consider: nums - (.3 + 2e-16 * c(-2,-1,1,2)); nums [1] 0.3 0.3 0.3 0.3 Though they all print as .3 with the default precision (which is normal and expected), they are all different from .3: nums - .3 = -3.885781e-16 -2.220446e-16 2.220446e-16 3.885781e-16 When we convert nums to a factor, we get: fact - as.factor(nums); fact [1] 0.300 0.3 0.3 0.300 Levels: 0.300 0.3 0.3 0.300 Not clear what the difference between 0.300 and 0.3 is supposed to be, nor why some 0.300 are .3 and others are ... When creating a factor from numeric vector, the list of levels and the assignment of original elements to the levels is done using double precision. Since the four elements in the vector are distinct, we get four distinct levels. After this is done, the levels attribute is formed using as.character(). This can map different numbers to the same string, so in the example above, this leads to a factor, which contains repeated levels. This part of the problem may be avoided using fact - as.factor(as.character(nums)); fact [1] 0.300 0.3 0.3 0.300 Levels: 0.3 0.300 The reason for having 0.300 and 0.3 is that as.character() works the same as printing with digits=15. The R printing mechanism works in two steps. In the first step it tries to determine the shortest format needed to achieve the required relative precision of the output. This step uses an algorithm, which need not provide an accurate result. The next step is that the number is printed using C function sprintf with the chosen format. This step is accurate, so we cannot get wrong digits. We only can get wrong number of digits. In order to avoid using 15 digits in as.character(), we can use round(,digits), with digits argument appropriate for the current situation. fact - as.factor(round(nums,digits=1)); fact [1] 0.3 0.3 0.3 0.3 Levels: 0.3 with the examples above, it looks like a design flaw that factor levels and their *labels* are messed up into one clump. if, in the above, levels were the numbers, and their labels were produced with as.character, as you show, but kept separately (or generated on the fly, when displaying the factor), the problem would have been solved. you would then have something like: nums - (.3 + 2e-16 * c(-2,-1,1,2)); nums # [1] 0.3 0.3 0.3 0.3 sum(nums[rep(1:4, each=4)] == nums[rep(1:4, 4)]) # 4 fact - as.factor(nums); fact # [1] 0.300 0.3 0.3 0.300 # Levels: 0.300 0.3 0.3 0.300 sum(fact[rep(1:4, each=4)] == fact[rep(1:4, 4)]) # 4 (currently, it's 8) there's one more curiosity about factors, in particular, ordered factors: ord - as.ordered(nums); ord # [1] 0.300 0.3 0.3 0.300 # Levels: 0.300 0.3 0.3 0.300 ord[1] ord[4] # TRUE ord[1] == ord[4] # TRUE vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Match .3 in a sequence
Wacek Kusnierczyk wrote: there's one more curiosity about factors, in particular, ordered factors: ord - as.ordered(nums); ord # [1] 0.300 0.3 0.3 0.300 # Levels: 0.300 0.3 0.3 0.300 ord[1] ord[4] # TRUE ord[1] == ord[4] # TRUE as a corollary, the warning printed when comparing elements of a factor is misleading: f = factor(1:2) f[1] f[2] # [1] NA # Warning message: # In Ops.factor(f[1], f[2]) : not meaningful for factors g = as.ordered(f) is.factor(g) # TRUE g[1] g[2] # TRUE *is* meaningful for factors, though not for unordered ones. the warning is generated in Ops.factor, src/library/base/all.R:7162, and with my limited knowledge of the r internals i can't judge how easy it is to fix the problem. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Berwin A Turlach wrote: '*tmp*' = 0 `*tmp*` # 0 x = 1 names(x) = 'foo' `*tmp*` # error: object *tmp* not found `*ugly*` I agree, and I am a bit flabbergasted. I had not expected that something like this would happen and I am indeed not aware of anything in the documentation that warns about this; but others may prove me wrong on this. hopefully. given that `*tmp*`is a perfectly legal (though some would say 'non-standard') name, it would be good if somewhere here a warning were issued -- perhaps where i assign to `*tmp*`, because `*tmp*` is not just any non-standard name, but one that is 'obviously' used under the hood to perform black magic. Now I wonder whether there are any other objects (with non-standard) names) that can be nuked by operations performed under the hood. any such risk should be clearly documented, if not with a warning issued each time the user risks h{is,er} workspace corrupted by the under-the-hood. I guess the best thing is to stay away from non-standard names, if only to save the typing of back-ticks. :) agree. but then, there may be -- and probably are -- other such 'best to stay away' things in r, all of which should be documented so that a user know what may happen on the surface, *without* having to peek under the hood. Thanks for letting me know, I have learned something new today. wow. most of my fiercely truculent ranting is meant to point out things that may not be intentional, or if they are, they seem to me design flaws rather than features -- so that either i learn that i am ignorant or wrong, or someone else does, pro bono. hopefully. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Definition of [[
somewhat one the side, l = list(1) l[[2]] # error, index out of bounds l[2][[1]] # NULL that is, we can't extract from l any element at an index exceeding the list's length (if we could, it would have been NULL or some sort of _NA_list), but we can extract a sublist at an index out of bounds, and from that sublist extract the element (which is NULL, 'the _NA_list'). that's not necessarily wrong, but the item at index i (l[[i]]) is not equivalent to the item in the sublist at index i. vQ Thomas Lumley wrote: On Sun, 15 Mar 2009, Stavros Macrakis wrote: The semantics of [ and [[ don't seem to be fully specified in the Reference manual. In particular, I can't find where the following cases are covered: cc - c(1); ll - list(1) cc[3] [1] NA OK, RefMan says: If i is positive and exceeds length(x) then the corresponding selection is NA. dput(ll[3]) list(NULL) ? i is positive and exceeds length(x); why isn't this list(NA)? I think some of these are because there are only NAs for character, logical, and the numeric types. There isn't an NA of list type. This one shouldn't be list(NA) - which NA would it use? It should be some sort of list(_NA_list_) type, and list(NULL) is playing that role. ll[[3]] Error in list(1)[[3]] : subscript out of bounds ? Why does this return NA for an atomic vector, but give an error for a generic vector? Again, because there isn't an NA of generic vector type. cc[[3]] - 34; dput(cc) c(1, NA, 34) OK ll[[3]] - 34; dput(ll) list(1, NULL, 34) Why is second element NULL, not NA? And why is it OK to set an undefined ll[[3]], but not to get it? Same reason for NULL vs NA. The fact that setting works may just be an inconsistency -- as you can see from previous discussions, R often does not effectively forbid code that shouldn't work -- or it may be bug-compatibility with some version of S or S-PLUS. __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Thomas Lumley wrote: Wacek, In this case I think the *tmp* dates from the days before backticks, when it was not a legal name (it still isn't) and it was much, much harder to use illegal names, so the collision issue really didn't exist. thanks for the explanation. You're right about the documentation. thanks for the acknowledgement. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Match .3 in a sequence
Duncan Murdoch wrote: On 3/16/2009 9:36 AM, Daniel Murphy wrote: Hello:I am trying to match the value 0.3 in the sequence seq(.2,.3). I get 0.3 %in% seq(from=.2,to=.3) [1] FALSE Yet 0.3 %in% c(.2,.3) [1] TRUE For arbitrary sequences, this invisible .3 has been problematic. What is the best way to work around this? Don't assume that computations on floating point values are exact. Generally computations on small integers *are* exact, so you could change that to 3 %in% seq(from=2, to=3) and get the expected result. You can divide by 10 just before you use the number, or if you're starting with one decimal place, multiply by 10 *and round to an integer* before doing the test. Alternatively, use some approximate test rather than an exact one, e.g. all.equal() (but you'll need a bit of work to make use of all.equal() in an expression like 0.3 %in% c(.2,.3)). there's also the problem that seq(from=0.2, to=0.3) does *not* include 0.3 (in whatever internal form), simply because the default step is 1. however, 0.3 %in% seq(from=.2,to=.3, by=0.1) # FALSE so it won't help anyway. (but in general be careful about using seq and the like.) vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Match .3 in a sequence
Petr Savicky wrote: On Mon, Mar 16, 2009 at 06:36:53AM -0700, Daniel Murphy wrote: Hello:I am trying to match the value 0.3 in the sequence seq(.2,.3). I get 0.3 %in% seq(from=.2,to=.3) [1] FALSE As others already pointed out, you should use seq(from=0.2,to=0.3,by=0.1) to get 0.3 in the sequence. In order to get correct %in%, it is also possible to use round(), for example 0.3 %in% round(seq(from=0.2,to=0.3,by=0.1),digits=1) [1] TRUE half-jokingly, there's another solution, which avoids rounding: 0.3 %in% (seq(0.4, 0.5, 0.1)-0.2) # TRUE vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Berwin A Turlach wrote: Obviously, assuming that R really executes *tmp* - x x - names-('*tmp*', value=c(a,b)) under the hood, in the C code, then *tmp* does not end up in the symbol table and does not persist beyond the execution of names(x) - c(a,b) to prove that i take you seriously, i have peeked into the code, and found that indeed there is a temporary binding for *tmp* made behind the scenes -- sort of. unfortunately, it is not done carefully enough to avoid possible interference with the user's code: '*tmp*' = 0 `*tmp*` # 0 x = 1 names(x) = 'foo' `*tmp*` # error: object *tmp* not found `*ugly*` given that `*tmp*`is a perfectly legal (though some would say 'non-standard') name, it would be good if somewhere here a warning were issued -- perhaps where i assign to `*tmp*`, because `*tmp*` is not just any non-standard name, but one that is 'obviously' used under the hood to perform black magic. it also appears that the explanation given in, e.g., the r language definition (draft, of course) sec. 3.4.4: Assignment to subsets of a structure is a special case of a general mechanism for complex assignment: x[3:5] - 13:15 The result of this commands is as if the following had been executed ‘*tmp*‘ - x x - [-(‘*tmp*‘, 3:5, value=13:15) is incomplete (because the final result is not '*tmp*' having the value of x, as it might seem, but rather '*tmp*' having been unbound). so the suggestion for the documenters is to add to the end of the section (or wherever else it is appropriate) a warning to the effect that in the end '*tmp*' will be removed, even if the user has explicitly defined it earlier in the same scope. or maybe have the implementation not rely on a user-forgeable name? for example, the '.Last.value' name is automatically bound to the most recently returned value, but it resides in package:base and does not collide with bindings using it made by the user: .Last.value = 0 1 .Last.value # 0, not 1 1 base::.Last.value # 1, not 0 why could not '*tmp*' be bound and unbound outside of the user's namespace? (i guess it's easier to update the docs -- or just ignore the issue.) on the margin, traceback('-') will pick only one of the uses of '-' suggested by the code above: x - 1:10 trace('-') x[3:5] - 13:15 # trace: x[3:5] - 13:15 # trace: x - `[-`(`*tmp*`, 3:5, value = 13:15) which is somewhat confusing, because then '*tmp*' appears in the trace somewhat ex machina. (again, the explanation is in the source code, but the traceback could have been more informative.) cheers, vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Definition of [[
Stavros Macrakis wrote: Well, that's one issue. But another is that there should be a specification addressed to users, who should not have to understand internals. this should really be taken seriously. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Berwin A Turlach wrote: On Sat, 14 Mar 2009 07:22:34 +0100 Wacek Kusnierczyk waclaw.marcin.kusnierc...@idi.ntnu.no wrote: [...] Well, I don't see any new object created in my workspace after x - 4 names(x) - foo Do you? of course not. that's why i'd say the two above are *not* equivalent. i haven't noticed the 'in the c code'; do you mean the r interpreter actually generates, in the c code, such r expressions for itself to evaluate? As I said before, I have little knowledge about how the parser works and what goes on under the hood; and I have also little time and inclination to learn about it. But if you are interested in these details, then by all means invest the time to investigate. berwin, you're playing radio erewan now. i talk about what the user sees at the interface, and you talk about c code. then you admit you don't know the code, and suggest i examine it if i'm interested. i incidentally am, but the whole point was that the user should not be forced to look under the hood to know the interface to a function. prefix 'names-' seems to have a certain behaviour that is not properly documented. Alternatively, you would hope that Simon eventually finishes the book that he is writing on programming in R; as I understand it, that book would explain part of these issues in details. Hopefully, along with the book he makes the tools that he has for introspection available. simon: i'd be happy to contribute in any way you might find useful. i guess you have looked under the hood; point me to the relevant code. No I did not, because I am not interested in knowing such intimate details of R, but it seems you were interested. yes, but then your claim about what happens under the hood, in the c code, is a pure stipulation. I made no claim about what is going on under the hood because I have no knowledge about these matters. But, yes, I was speculating of what might go on. owe me a beer. and you got the example from the r language definition sec. 10.2, which says the forms are equivalent, with no 'under the hood, in the c code' comment. Trying to figure out what a writer/painter actually means/says beyond the explicitly stated/painted, something that is summed up in Australia (and other places) under the term critical thinking, was not high in the curriculum of your school, was it? :-) sure, but probably not the way you seem to think about. have you incidentally read ferdydurke by gombrowicz? you're just showing that your statements cannot be taken seriously. Usually, my statement can be taken seriously, unless followed by some indication that I said them tongue-in-cheek. Of course, statements that I allegedly made but were in fact put into my mouth cannot, and should not, be taken seriously. i'm talking about your speculations about what the parser does (wrt. infix and prefix forms having exactly the same parse tree), rather vague statements such as 'names-'(x,'foo') should create (more or less) a parse tree equivalent to that expression, and other statements (surely, qualified with 'assuming', 'strongly suggests', and the like), coupled with your admitting that you in fact donæt know what happens there, is not particularly reassuring. yes, *if* you are able to predict the refcount of the object passed to 'names-' *then* you can predict what 'names-' will do, [...] I think Simon pointed already out that you seem to have a wrong picture of what is going on. [...] so what you quote effectively talks about a specific refcount mechanism. it's not refcount that would be used by the garbage collector, but it's a refcount, or maybe refflag. Fair enough, if you call this a refcount then there is no problem. Whenever I came across the term refcount in my readings, it was referring to different mechanisms, typically mechanisms that kept exact track on how often an object was referred too. So I would not call the value of the named field a refcount. And we can agree to call it from now on a refcount as long as we realise what mechanism is really used. the major point of the discussion was that 'names-' will sometimes modify and othertimes copy its argument. you chose to justify this by looking under the hood, and i suppose you were pretty clear what i meant by refcount, because it should have been clear from the context. yes, that's my opinion: the effects of implementation tricks should not be observable by the user, because they can lead to hard to explain and debug behaviour in the user's program. you surely don't suggest that all users consult the source code before writing programs in r. Indeed, I am not suggesting this. Only users who use/rely on features that are not sufficiently documented would have to study the source code to find out what the exact
Re: [Rd] surprising behaviour of names-
Berwin A Turlach wrote: foo = function(arg) arg$foo = foo e = new.env() foo(e) e$foo are you sure this is pass by value? But that is what environments are for, aren't they? might be. And it is documented behaviour. sure! Read section 2.1.10 (Environments) in the R Language Definition, haven't objected to that. i object to your 'r uses pass by value', which is only partially correct. in particular the last paragraph: Unlike most other R objects, environments are not copied when passed to functions or used in assignments. Thus, if you assign the same environment to several symbols and change one, the others will change too. In particular, assigning attributes to an environment can lead to surprises. [..] and actually, in the example we discuss, 'names-' does *not* return an updated *tmp*, so there's even less to entertain. How do you know? Are you sure? Have you by now studied what goes on under the hood? yes, a bit. but in this example, it's enough to look into *tmp* to see that it hasn't got the names added, and since x does have names, names- must have returned a copy of *tmp* rather than *tmp* changed: x = 1 tmp = x x = 'names-'(tmp, 'foo') names(tmp) # NULL you suggested that One reads the manual, (...) one reflects and investigates, ... -- had you done it, you wouldn't have asked the question. for fun and more guesswork, the example could have been: x = x x = 'names-'(x, value=c('a', 'b')) But it is manifestly not written that way in the manual; and for good reasons since 'names-' might have side effects which invokes in the last line undefined behaviour. Just as in the equivalent C snippet that I mentioned. i just can't get it why the manual does not manifestly explain what 'names-' does, and leaves you doing the guesswork you suggest. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Berwin A Turlach wrote: sure! Glad to see that we agree on this. owe you a beer. Read section 2.1.10 (Environments) in the R Language Definition, haven't objected to that. i object to your 'r uses pass by value', which is only partially correct. Well, I used qualifiers and did not stated it categorically. indeed, you said R supposedly uses call-by-value (though we know how to circumvent that, don't we?). in that vain, R supposedly can be used to do valid statistical computations (though we know how to circumvent it) ;) and actually, in the example we discuss, 'names-' does *not* return an updated *tmp*, so there's even less to entertain. How do you know? Are you sure? Have you by now studied what goes on under the hood? yes, a bit. but in this example, it's enough to look into *tmp* to see that it hasn't got the names added, and since x does have names, names- must have returned a copy of *tmp* rather than *tmp* changed: x = 1 tmp = x x = 'names-'(tmp, 'foo') names(tmp) # NULL Indeed, if you type these two commands on the command line, then it is not surprising that a copy of tmp is returned since you create a temporary object that ends up in the symbol table and persist after the commands are finished. what does command line have to do with it? Obviously, assuming that R really executes *tmp* - x x - names-('*tmp*', value=c(a,b)) under the hood, in the C code, then *tmp* does not end up in the symbol table no? and does not persist beyond the execution of names(x) - c(a,b) no? i guess you have looked under the hood; point me to the relevant code. This looks to me as one of the situations where a value of 1 is used for the named field of some of the objects involves so that a copy can be avoided. That's why I asked whether you looked under the hood. anyway, what happens under the hood is much less interesting from the user's perspective that what can be seen over the hood. what i can see, is that 'names-' will incoherently perform in-place modification or copy-on-assignment. yes, *if* you are able to predict the refcount of the object passed to 'names-' *then* you can predict what 'names-' will do, but in general you may not have the chance. and in general, this should not matter because it should be unobservable, but it isn't. back to your i += i++ example, the outcome may differ from a compiler to a compiler, but, i guess, compilers will implement the order coherently, so that whatever version they choose, the outcome will be predictable, and not dependent on some earlier code. (prove me wrong. or maybe i'll do it myself.) you suggested that One reads the manual, (...) one reflects and investigates, ... Indeed, and I am not giving up hope that one day you will master this art. well, this time i meant you. -- had you done it, you wouldn't have asked the question. Sorry, I forgot that you have a tendency to interpret statements extremely verbatim yes, i have two hooks installed: one says \begin{verbatim}, the other says \end{verbatim}. and with little reference to the context in which they are made. not that you're trying to be extremely accurate or polite here... I will try to be more explicit in future. it will certainly do good to you. i just can't get it why the manual does not manifestly explain what 'names-' does, and leaves you doing the guesswork you suggest. As I said before, patched to documentation are also welcome. i'll give it a try. Best wishes, hope you mean it. likewise, vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
William Dunlap wrote: Would it make anyone any happier if the manual said that the replacement functions should not be called in the form xNew - `func-` (xOld, value) and should only be used as func(xToBeChanged) - value surely better than guesswork. ? The explanation names(x) - c(a,b) is equivalent to '*tmp*' - x x - names-('*tmp*', value=c(a,b)) could also be extended a bit, adding a line like rm(`*tmp*`) Those 3 lines should be considered an atomic operation: the value that `*tmp*` or `x` may have or what is in the symbol table at various points in that sequence is not defined. (Letting details be explicitly undefined is important: it gives developers room to improve the efficiency of the interpreter and tells users where not to go.) there is a difference between letting things be undefined and explicitly stating that things are unspecified. the c99 standard [1], for example, is explicit about the non-determinism of expressions that involve side effects, as it is about that some expressions may actually not be evaluated if the optimizer decides so. berwin has already suggested that one reads from what docs do *not* say; it's a very bad idea. it's best that the documentation *does* say that, for example, a particular function should be used only in the infix form because the semantics of the prefix form are not guaranteed and may change in future versions. if the current state is that 'names-' will modify the object it is given as an argument in some situations, but not in others, and this is visible to the user, the best thing to do is to give an explicit warning -- perhaps with an annotation that things may change, if they may. best, vQ [1] http://www.open-std.org/JTC1/SC22/WG14/www/docs/n1256.pdf __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Tony Plate wrote: Wacek Kusnierczyk wrote: [snip] i just can't get it why the manual does not manifestly explain what 'names-' does, and leaves you doing the guesswork you suggest. I'm having trouble understanding the point of this discussion. Someone is calling a replacement function in a way that it's not meant to be used, and is them complaining about it not doing what he thinks it should, or about the documentation not describing what happens when one does that? where is it written that the function is not meant to be used this way? you get an example in the man page, showing precisely how it could be used that way. it also explains the value of 'names-': For 'names-', the updated object. (Note that the value of 'names(x) - value' is that of the assignment, 'value', not the return value from the left-hand side.) it does speak of 'names-' used in prefix form, and does not do it in any negative (discouraging) way. Is there anything incorrect or missing in the help page for normal usage of the replacement function for 'names'? (i.e., when used in an expression like 'names(x) - ...') what is missing here in the first place is a specification of what 'normal' means. as far as i can see from the man page, 'normal' does not exclude prefix use. and if so, what is missing in the help page is a clear statement what an application of 'names-' will do, in the sense of what a user may observe. R does give one the ability to use its facilities in non-standard ways. However, I don't see much value in the help page for 'gun' attempting to describe the ways in which the bones in your foot will be shattered should you choose to point the gun at your foot and pull the trigger. Reminds me of the story of the guy in New York, who after injuring his back in refrigerator-carrying race, sued the manufacturer of the refrigerator for not having a warning label against that sort of use. very funny. little relevant. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Tony Plate wrote: Wacek Kusnierczyk wrote: Tony Plate wrote: Is there anything incorrect or missing in the help page for normal usage of the replacement function for 'names'? (i.e., when used in an expression like 'names(x) - ...') what is missing here in the first place is a specification of what 'normal' means. as far as i can see from the man page, 'normal' does not exclude prefix use. and if so, what is missing in the help page is a clear statement what an application of 'names-' will do, in the sense of what a user may observe. Fair enough. I looked at the help page for names after sending my email, and was surprised to see the following in the DETAILS section: It is possible to update just part of the names attribute via the general rules: see the examples. This works because the expression there is evaluated as |z - names-(z, [-(names(z), 3, c2))|. To me, this paragraph is far more confusing than enlightening, especially as also gives the impression that it's OK to use a replacement function in a functional form. In my own personal opinion it would be a enhancement to remove that example from the documentation, and just say you can do things like 'names(x)[2:3] - c(a,b)'. i must say that this part of the man page does explain things to me. much less the code [1] berwin suggested as a piece to read and investigate (slightly modified): tmp = x x = 'names-'(tmp, 'foo') berwin's conclusion seemed to be that this code hints/suggests/fortune-tells the user that 'names-' might be doing side effects. this code illustrates what names(x) = 'foo' (the infix form) does -- that it destructively modifies x. now, if the code were to illustrate that the prefix form does perform side effects too, then the following would be enough: 'names-'(x, 'foo') if the code were to illustrate that the prefix form, unlike the infix form, does not perform side effects, then the following would suffice for a discussion: x = 'names-'(x, 'foo') if the code wee to illustrate that the prefix form may or may not do side effects depending on the situation, then it surely fails to show that, unless the user performs some sophisticated inference which i am not capable of, or, more likely, unless the user already knows that this was to be shown. without a discussion, the example is simply an unworked rubbish. and it's obviously wrong; it says that (slightly and irrelevantly simplified) names(x) = 'foo' is equivalent to tmp = x x = 'names-'(tmp, 'foo') which is nonsense, because in the latter case you either have an additional binding that you don't have in the former case, or, worse, you rebind, possibly with a different value, a name that has had a binding already. it's a gritty-nitty detail, but so is most of statistics based on nitty-gritty details which non-statisticians are happy to either ignore or be ignorant about. [1] http://stat.ethz.ch/R-manual/R-devel/doc/manual/R-lang.html#Comments I often use name replacement functions in a functional way, and because one can't use 'names-' etc in this way, note, this 'because' does not follow in any way from the man page, or the section of 'r language definition' referred to above. I define my own functions like the following: set.names - function(n,x) {names(x) - n; x} it appears that set.names = function(n, x) 'names-'(x, n) would do the job (guess why). (and similarly for set.rownames(), set colnames(), etc.) I would highly recommend you do this rather than try to use a call like names-(x, ...). i'm almost tempted to extend your recommendation to 'define your own function for about every function already in r' ;) vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Berwin A Turlach wrote: On Wed, 11 Mar 2009 20:31:18 +0100 Wacek Kusnierczyk waclaw.marcin.kusnierc...@idi.ntnu.no wrote: Simon Urbanek wrote: On Mar 11, 2009, at 10:52 , Simon Urbanek wrote: Wacek, Peter gave you a full answer explaining it very well. If you really want to be able to trace each instance yourself, you have to learn far more about R internals than you apparently know (and Peter hinted at that). Internally x=1 an x=c(1) are slightly different in that the former has NAMED(x) = 2 whereas the latter has NAMED(x) = 0 which is what causes the difference in behavior as Peter explained. The reason is that c(1) creates a copy of the 1 (which is a constant [=unmutable] thus requiring a copy) and the new copy has no other references and thus can be modified and hence NAMED(x) = 0. Errata: to be precise replace NAMED(x) = 0 with NAMED(x) = 1 above -- since NAMED(c(1)) = 0 and once it's assigned to x it becomes NAMED(x) = 1 -- this is just a detail on how things work with assignment, the explanation above is still correct since duplication happens conditional on NAMED == 2. i guess this is what every user needs to know to understand the behaviour one can observe on the surface? Nope, only users who prefer to write '+'(1,2) instead of 1+2, or 'names-'(x, 'foo') instead of names(x)='foo'. well, as far as i remember, it has been said on this list that in r the infix syntax is equivalent to the prefix syntax, so no one wanting to use the form above should be afraid of different semantics; these two forms should be perfectly equivalent. after all, x = 1 names(x) = 'foo' names(x) should return NULL, because when the second assignment is made, we need to make a copy of the value of x, so it is the copy that should have changed names, not the value of x (which would still be the original 1). on the other hand, the fact that names(x) = 'foo' is (or so it seems) a shorthand for x = 'names-'(x, 'foo') is precisely why i'd think that the prefix 'names-' should never do destructive modifications, because that's what x = 'names-'(x, 'foo'), and thus also names(x) = 'foo', is for. i guess the above is sort of blasphemy. Attempting to change the name attribute of x via 'names-'(x, 'foo') looks to me as if one relies on a side effect of the function 'names-'; which, in my book would be a bad thing. indeed; so, for coherence, 'names-' should always do the modification on a copy. it would then have semantics different from the infix form of 'names-', but at least consistently so. I.e. relying on side effects of a function, or writing functions with side effects which are then called for their side-effects; this, of course, excludes functions like plot() :) I never had the need to call 'names-'() directly and cannot foresee circumstances in which I would do so. Plenty of users, including me, are happy using the latter forms and, hence, never have to bother with understanding these implementation details or have to bother about them. Your mileage obviously varies, but that is when you have to learn about these internal details. If you call functions because of their side-effects, you better learn what the side-effects are exactly. well, i can imagine a user using the prefix 'names-' precisely under the assumption that it will perform functionally; i.e., 'names-'(x, 'foo') will always produce a copy of x with the new names, and never change the x. that there will be a destructive modification made to x on some, but not all, occasions, is hardly a good thing in this context -- and it's not a situation where a user wants to use the function because of its side effects, quite to the contrary. this was actually the situation i had when i first discovered the surprizing behaviour of 'names-'; i thought 'names-' did *not* have side effects. cheers, and thanks for the discussion. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Berwin A Turlach wrote: On Wed, 11 Mar 2009 20:29:14 +0100 Wacek Kusnierczyk waclaw.marcin.kusnierc...@idi.ntnu.no wrote: Simon Urbanek wrote: Wacek, Peter gave you a full answer explaining it very well. If you really want to be able to trace each instance yourself, you have to learn far more about R internals than you apparently know (and Peter hinted at that). Internally x=1 an x=c(1) are slightly different in that the former has NAMED(x) = 2 whereas the latter has NAMED(x) = 0 which is what causes the difference in behavior as Peter explained. The reason is that c(1) creates a copy of the 1 (which is a constant [=unmutable] thus requiring a copy) and the new copy has no other references and thus can be modified and hence NAMED(x) = 0. simon, thanks for the explanation, it's now as clear as i might expect. now i'm concerned with what you say: that to understand something visible to the user one needs to learn far more about R internals than one apparently knows. your response suggests that to use r without confusion one needs to know the internals, Simon can probably speak for himself, but according to my reading he has not suggested anything similar to what you suggest he suggested. :) so i did not say *he* suggested this. 'your response suggests' does not, on my reading, imply any intention from simon's side. but it's you who is an expert in (a dialect of) english, so i won't argue. and this would be a really bad thing to say.. No problems, since he did not say anything vaguely similar to what you suggest he said. let's not depart from the point. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Berwin A Turlach wrote: Whoever said that must have been at that moment not as precise as he or she could have been. Also, R does not behave according to what people say on this list (which is good, because some times people they wrong things on this list) but according to how it is documented to do; at least that is what people on this list (and others) say. :) well, ?'names-' says: Value: For 'names-', the updated object. which is only partially correct, in that the value will sometimes be an updated *copy* of the object. And the R Language manual (ignoring for the moment that it is a draft and all that), since we must... clearly states that names(x) - c(a,b) is equivalent to '*tmp*' - x x - names-('*tmp*', value=c(a,b)) ... and? does this say anything about what 'names-'(...) actually returns? updated *tmp*, or a copy of it? [...] well, i can imagine a user using the prefix 'names-' precisely under the assumption that it will perform functionally; You mean y - 'names-'(x, foo) instead of y - x names(y) - foo ? what i mean is, rather precisely, that 'names-'(x, 'foo') will produce a *new* object with a copy of the value of x and names as specified, and will *not*, under any circumstances, modify x. the first line above does not quite address this, e.g.: x = c(1) y = 'names-'(x, 'foo') names(x) # foo, 'should' be NULL Fair enough. But I would still prefer the latter version this it is (for me) easier to read and to decipher the intention of the code. you're welcome to use it. but this is personal preference, and i'm trying to discuss the semantics of r here. what you show is a way to clutter the code, and you need to explicitly name the new object, while, in functional programming, it is typical to operate on anonymous objects passed from one function to another, e.g. f('names-'(x, 'foo')) which would have to become y = x names(y) = 'foo' f(y) or f({y = x; names(y) = 'foo'; y}) with 'y' being a nuissance name. i.e., 'names-'(x, 'foo') will always produce a copy of x with the new names, and never change the x. I am not sure whether R ever behaved in that way, but as Peter pointed out, this would be quite undesirable from a memory management and performance point of view. why? you can still use the infix names- with destructive semantics to avoid copying. Image that every time you modify a (name) component of a large object a new copy of that object is created. see above. besides, r has been several times claimed here (but see your remark above) to be a functional language, and in this context it is surprising that the smart (i mean it) copy-on-assignment mechanism, which is an implementational optimization, not only becomes visible, but also makes functions (hmm, procedures?) such as 'names-' non-functional -- in some, but not all, cases. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Wacek Kusnierczyk wrote: is precisely why i'd think that the prefix 'names-' should never do destructive modifications, because that's what x = 'names-'(x, 'foo'), and thus also names(x) = 'foo', is for. to make the point differently, i'd expect the following two to be equivalent: x = c(1); 'names-'(x, 'foo'); names(x) # foo x = c(1); do.call('names-', list(x, 'foo')); names(x) # NULL but they're obviously not. and of course, just that i'd expect it is not a strong argument. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Berwin A Turlach wrote: On Thu, 12 Mar 2009 10:53:19 +0100 Wacek Kusnierczyk waclaw.marcin.kusnierc...@idi.ntnu.no wrote: well, ?'names-' says: Value: For 'names-', the updated object. which is only partially correct, in that the value will sometimes be an updated *copy* of the object. But since R supposedly *supposedly* uses call-by-value (though we know how to circumvent that, don't we?) we know how a lot of built-ins hack around this, don't we, and we also know that call-by-value is not really the argument passing mechanism in r. wouldn't you always expect that a copy of the object is returned? indeed! that's what i have said previously, no? there is still space for the smart (i mean it) copy-on-assignment behaviour, but it should not be visible to the user, in particular, not in that 'names-' destructively modifies the object it is given when the refcount is 1. in my humble opinion, there is either a design flaw or a bug here. And the R Language manual (ignoring for the moment that it is a draft and all that), since we must... clearly states that names(x) - c(a,b) is equivalent to '*tmp*' - x x - names-('*tmp*', value=c(a,b)) ... and? This seems to suggest seems to suggest? is not the purpose of documentation to clearly, ideally beyond any doubt, specify what is to be specified? that in this case the infix and prefix syntax is not equivalent as it does not say that are you suggesting fortune telling from what the docs do *not* say? names(x) - c(a,b) is equivalent to x - names-(x, value=c(a,b)) and I was commenting on the claim that the infix syntax is equivalent to the prefix syntax. does this say anything about what 'names-'(...) actually returns? updated *tmp*, or a copy of it? Since R uses pass-by-value, since? it doesn't! you would expect the latter, wouldn't you? yes, that's what i'd expect in a functional language. If you entertain the idea that 'names-' updates *tmp* and returns the updated *tmp*, then you believe that 'names-' behaves in a non-standard way and should take appropriate care. i got lost in your argumentation. i have given examples of where 'names-' destructively modifies and returns the updated object, not a copy. what is your point here? And the fact that a variable *tmp* is used hints to the fact that 'names-' might have side-effect. are you suggesting fortune telling from the fact that a variable *tmp* is used? If 'names-' has side effects, then it might not be well defined with what value x ends up with if one executes: x - 'names-'(x, value=c(a,b)) not really, unless you mean the returned object in the referential sense (memory location) versus value conceptually. here x will obviously have the value of the original x plus the names, *but* indeed you cannot tell from this snippet whether after the assignment x will be the same, though updated, object or will rather be an updated copy: x = c(1) x = 'names-'(x, 'foo') # x is the same object x = c(1) y = x x = 'names-'(x, 'foo') # x is another object so, as you say, it is not well defined with what object will x end up as its value, though the value of the object visible to the user is well defined. rewrite the above and play: x = c(1) y = 'names-'(x, 'foo') names(x) what are the names of x? is y identical (sensu refernce) with x, is y different (sensu reference) but indiscernible (sensu value) from x, or is y different (sensu value) from x in that y has names and x doesn't? This is similar to the discussion what value i should have in the following C snippet: i = 0; i += i++; nonsense, it's a *completely* different issue. here you touch the issue of the order of evaluation, and not of whether an object is copied or modified; above, the inverse is true. in fact, your example is useless because the result here is clearly specified by the semantics (as far as i know -- prove me wrong). you lookup i (0) and i (0) (the order does not matter here), add these values (0), assign to i (0), and increase i (1). i have a better example for you: int i = 0; i += ++i - ++i which will give different final values for i in c (2 with gcc 4.2, 1 with gcc 3.4), c# and java (-1), perl (2) and php (1). again, this has nothing to do with the above. [..] I am not sure whether R ever behaved in that way, but as Peter pointed out, this would be quite undesirable from a memory management and performance point of view. why? you can still use the infix names- with destructive semantics to avoid copying. I guess that would require a rewrite (or extension) of the parser. To me, Section 10.1.2 of the Language Definition manual suggests that once an expression is parsed, you cannot distinguish any more whether
Re: [Rd] surprising behaviour of names-
Wacek Kusnierczyk wrote: Berwin A Turlach wrote: This is similar to the discussion what value i should have in the following C snippet: i = 0; i += i++; in fact, your example is useless because the result here is clearly specified by the semantics (as far as i know -- prove me wrong). you lookup i (0) and i (0) (the order does not matter here), add these values (0), assign to i (0), and increase i (1). i'm happy to prove myself wrong. the c programming language, 2nd ed. by ritchie and kernigan, has the following discussion: One unhappy situation is typified by the statement a[i] = i++; The question is whether the subscript is the old value of i or the new. Compilers can interpret this in different ways, and generate different answers depending on their interpretation. The standard intentionally leaves most such matters unspecified. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] E``rrors in recursive default argument references
l...@stat.uiowa.edu wrote: Thanks to Stavros for the report. This should now be fixed in R-devel. indeed, though i find some of the error messages strange: (function(a=a) -a)() # Error in (function(a = a) -a)() : # element 1 is empty; # the part of the args list of '-' being evaluated was: # (a) (function(a=a) c(a))() # Error in (function(a = a) c(a))() : # promise already under evaluation: recursive default argument reference or earlier problems? why are they different? vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Simon Urbanek wrote: On Mar 11, 2009, at 10:52 , Simon Urbanek wrote: Wacek, Peter gave you a full answer explaining it very well. If you really want to be able to trace each instance yourself, you have to learn far more about R internals than you apparently know (and Peter hinted at that). Internally x=1 an x=c(1) are slightly different in that the former has NAMED(x) = 2 whereas the latter has NAMED(x) = 0 which is what causes the difference in behavior as Peter explained. The reason is that c(1) creates a copy of the 1 (which is a constant [=unmutable] thus requiring a copy) and the new copy has no other references and thus can be modified and hence NAMED(x) = 0. Errata: to be precise replace NAMED(x) = 0 with NAMED(x) = 1 above -- since NAMED(c(1)) = 0 and once it's assigned to x it becomes NAMED(x) = 1 -- this is just a detail on how things work with assignment, the explanation above is still correct since duplication happens conditional on NAMED == 2. there is an interesting corollary. self-assignment seems to increase the reference count: x = 1; 'names-'(x, 'foo'); names(x) # NULL x = 1; x = x; 'names-'(x, 'foo'); names(x) # foo vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Berwin A Turlach wrote: On Thu, 12 Mar 2009 15:21:50 +0100 Wacek Kusnierczyk waclaw.marcin.kusnierc...@idi.ntnu.no wrote: seems to suggest? is not the purpose of documentation to clearly, ideally beyond any doubt, specify what is to be specified? The R Language Definition manual is still a draft. :) this is indeed a good explanation for all sorts of nonsense. worse if stuff tends to persist despite critique. that in this case the infix and prefix syntax is not equivalent as it does not say that are you suggesting fortune telling from what the docs do *not* say? My experience is that sometimes you have to realise what is not stated. in general, yes. in r, this often ends up with 'have you seen the documentation saying that??' in response. I remember a discussion with somebody who asked why he could not run, on windows, R CMD INSTALL on a *.zip file. I pointed out to him that the documentation states that you can run R CMD INSTALL on *.tar.gz or *.tgz files and, thus, there should be no expectation that it can be run on *.zip file. yes, that's a good point. this reminds me of a (possibly anectodal) lady who sued the manufacturer of her microwave after she had dried in it her cat after a bath. YMMV, but when I read a passage like this in R documentation, I start to wonder why it is stated that names(x) - c(a,b) is equivalent to *tmp* - x x - names-('*tmp*', value=c(a,b)) and the simpler construct x - names-(x, value=c(a, b)) is not used. There must be a reason, got an explanation: because it probably is as drafty as the aforementioned document. nobody likes to type unnecessarily long code. And, after thinking about this for a while, the penny might drop. that's cool. instead of stating what 'names-' does or does not, one expresses it in a convoluted way an makes you guess from a *tmp* variable. a nice exercise, i like it. [...] does this say anything about what 'names-'(...) actually returns? updated *tmp*, or a copy of it? Since R uses pass-by-value, since? it doesn't! For all practical purposes it is as long as standard evaluation is used. One just have to be aware that some functions evaluate their arguments in a non-standard way. it's maybe a bit of hairsplitting, but what you have in r is not exactly what is called 'pass by value'. here's a relevant quote from [1], p. 309: In the call-by-name (CBN) mechanism, a formal parameter names the computation designated by an unevaluated argument expression. In the call-by-value (CBV) mechanism, a formal parameter names the value of an evaluated argument expression. In the call-by-need or lazy evaluation (CBL), the formal parameter name can be bound to a location that originally stores the computation of the argument expression. The first time the parameter is referenced, the computation is performed, but the resulting value is cached at the location and is used on every subsequent reference. Thus, the argument expression is evaluated at most once and is never evaluated at all if the parameter is never referenced. note the 'unevaluated' and 'evaluated'. you're free to have your pick. but it is possible to send an argument to a function that makes an assignment to the argument, and yet the assignment is made to the original, not to a copy: foo = function(arg) arg$foo = foo e = new.env() foo(e) e$foo are you sure this is pass by value? it appears that r has a pass-by-need mechanism that dispatches to pass-by-value or pass-by-reference depending on the type of the object. with this semantics, all sorts of mess are possible, and 'names-' provides one example. [1] design concepts in programming languages, turbak and gifford, mit press 2008 [...] If you entertain the idea that 'names-' updates *tmp* and returns the updated *tmp*, then you believe that 'names-' behaves in a non-standard way and should take appropriate care. i got lost in your argumentation. [..] I was commenting on does this say anything about what 'names-'(...) actually returns? updated *tmp*, or a copy of it? As I said, if you entertain the idea that 'names-' returns an updated *tmp*, then you believe that 'names-' behaves in a non-standard way and appropriate care has to be taken. i can check, by experimentation, whether 'names-' returns a copy or the original; even if i can establish that it returns the original after having modified it, it's not something to entertain. maybe you entertain the idea of your users performing the guesswork instead of reading an unambiguous specification. you have already said that you don't care if your users get confused, it would fit the image. and actually, in the example we discuss, 'names-' does *not* return an updated *tmp*, so there's even less to entertain. for fun and more guesswork, the example could
Re: [Rd] surprising behaviour of names-
Simon Urbanek wrote: On Mar 12, 2009, at 11:12 , Wacek Kusnierczyk wrote: Simon Urbanek wrote: On Mar 11, 2009, at 10:52 , Simon Urbanek wrote: Wacek, Peter gave you a full answer explaining it very well. If you really want to be able to trace each instance yourself, you have to learn far more about R internals than you apparently know (and Peter hinted at that). Internally x=1 an x=c(1) are slightly different in that the former has NAMED(x) = 2 whereas the latter has NAMED(x) = 0 which is what causes the difference in behavior as Peter explained. The reason is that c(1) creates a copy of the 1 (which is a constant [=unmutable] thus requiring a copy) and the new copy has no other references and thus can be modified and hence NAMED(x) = 0. Errata: to be precise replace NAMED(x) = 0 with NAMED(x) = 1 above -- since NAMED(c(1)) = 0 and once it's assigned to x it becomes NAMED(x) = 1 -- this is just a detail on how things work with assignment, the explanation above is still correct since duplication happens conditional on NAMED == 2. there is an interesting corollary. self-assignment seems to increase the reference count: x = 1; 'names-'(x, 'foo'); names(x) # NULL x = 1; x = x; 'names-'(x, 'foo'); names(x) # foo Not for me, at least in current R: not for me either. i messed up the example, sorry. here's the intended version: x = c(1); 'names-'(x, 'foo'); names(x) # foo x = c(1); x = x; 'names-'(x, 'foo'); names(x) # NULL x = 1; 'names-'(x, 'foo'); names(x) foo 1 NULL x = 1; x = x; 'names-'(x, 'foo'); names(x) foo 1 NULL (both R 2.8.1 and R-devel 3/11/09, darwin 9.6) In addition, you still got it backwards - your output suggests that the assignment created a new, clean copy. Functional call of `names-` (whose side-effect on x is undefined BTW) is destructive when you get a clean copy (e.g. as a result of the c function) and non-destructive when the object was referenced. It is left as an exercise to the reader to reason why constants such as 1 are referenced. all true, again because of my mistake. anyway, it may be suprising that with all its smartness (i mean it) about copy-on-assingment, r does not see that it makes no sense to increase refcount here. of course, you can't judge from just the syntactic form 'x=x', but still it should not be very difficult to have the interpreter see when it finds an object named 'x' in the same environment where it attempts the assignment. (of course, who'd do self-assignments in practical code?) cheers, vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
G. Jay Kerns wrote: Wacek Kusnierczyk wrote: I am prompted to imagine someone pointing out to the volunteers of the International Red Cross - on the field of a natural disaster, no less - that their uniforms are not an acceptably consistent shade of pink... or that the screws on their tourniquets do not have the appropriate pitch as to minimize the friction for the turner... not that it is very accurate, because unintuitive and confusing semantics may lead to hidden and dangerous errors in users' code. wrong shade of a uniform might lead to the person being shot, for example, but then your point vanishes. As a practicing statistician I am simply thankful that the bleeding is stopped. :-) when it is stopped, not turned to an internal bleeding, which you simply don't see. Cheers to R-Core (and the hundreds of other volunteers). absolutely. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Simon Urbanek wrote: Wacek, Peter gave you a full answer explaining it very well. If you really want to be able to trace each instance yourself, you have to learn far more about R internals than you apparently know (and Peter hinted at that). Internally x=1 an x=c(1) are slightly different in that the former has NAMED(x) = 2 whereas the latter has NAMED(x) = 0 which is what causes the difference in behavior as Peter explained. The reason is that c(1) creates a copy of the 1 (which is a constant [=unmutable] thus requiring a copy) and the new copy has no other references and thus can be modified and hence NAMED(x) = 0. simon, thanks for the explanation, it's now as clear as i might expect. now i'm concerned with what you say: that to understand something visible to the user one needs to learn far more about R internals than one apparently knows. your response suggests that to use r without confusion one needs to know the internals, and this would be a really bad thing to say.. i have long been concerned with that r unnecessarily exposes users to its internals, and here's one more example of how the interface fails to hide the guts. (and peter did not give me a full answer, but a vague hint.) vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Simon Urbanek wrote: On Mar 11, 2009, at 10:52 , Simon Urbanek wrote: Wacek, Peter gave you a full answer explaining it very well. If you really want to be able to trace each instance yourself, you have to learn far more about R internals than you apparently know (and Peter hinted at that). Internally x=1 an x=c(1) are slightly different in that the former has NAMED(x) = 2 whereas the latter has NAMED(x) = 0 which is what causes the difference in behavior as Peter explained. The reason is that c(1) creates a copy of the 1 (which is a constant [=unmutable] thus requiring a copy) and the new copy has no other references and thus can be modified and hence NAMED(x) = 0. Errata: to be precise replace NAMED(x) = 0 with NAMED(x) = 1 above -- since NAMED(c(1)) = 0 and once it's assigned to x it becomes NAMED(x) = 1 -- this is just a detail on how things work with assignment, the explanation above is still correct since duplication happens conditional on NAMED == 2. i guess this is what every user needs to know to understand the behaviour one can observe on the surface? thanks for further clarifications. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
[Rd] surprising behaviour of names-
playing with 'names-', i observed the following: x = 1 names(x) # NULL 'names-'(x, 'foo') # c(foo=1) names(x) # NULL where 'names-' has a functional flavour (does not change x), but: x = 1:2 names(x) # NULL 'names-'(x, 'foo') # c(foo=1, 2) names(x) # foo NA where 'names-' seems to perform a side effect on x (destructively modifies x). furthermore: x = c(foo=1) names(x) # foo 'names-'(x, NULL) names(x) # NULL 'names-'(x, 'bar') names(x) # bar !!! x = c(foo=1) names(x) # foo 'names-'(x, 'bar') names(x) # bar !!! where 'names-' is not only able to destructively remove names from x, but also destructively add or modify them (quite unlike in the first example above). analogous code but using 'dimnames-' on a matrix performs a side effect on the matrix even if it initially does not have dimnames: x = matrix(1,1,1) dimnames(x) # NULL 'dimnames-'(x, list('foo', 'bar')) dimnames(x) # list(foo, bar) this is incoherent with the first example above, in that in both cases the structure initially has no names or dimnames attribute, but the end result is different in the two examples. is there something i misunderstand here? there is another, minor issue with names: 'names-'(1, c('foo', 'bar')) # error: 'names' attribute [2] must be the same length as the vector [1] 'names-'(1:2, 'foo') # no error since ?names says that If 'value' is shorter than 'x', it is extended by character 'NA's to the length of 'x' (where x is the vector and value is the names vector), the error message above should say that the names attribute must be *at most*, not *exactly*, of the length of the vector. regards, vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Peter Dalgaard wrote: Wacek Kusnierczyk wrote: playing with 'names-', i observed the following: x = 1 names(x) # NULL 'names-'(x, 'foo') # c(foo=1) names(x) # NULL where 'names-' has a functional flavour (does not change x), but: x = 1:2 names(x) # NULL 'names-'(x, 'foo') # c(foo=1, 2) names(x) # foo NA where 'names-' seems to perform a side effect on x (destructively modifies x). furthermore: x = c(foo=1) names(x) # foo 'names-'(x, NULL) names(x) # NULL 'names-'(x, 'bar') names(x) # bar !!! x = c(foo=1) names(x) # foo 'names-'(x, 'bar') names(x) # bar !!! where 'names-' is not only able to destructively remove names from x, but also destructively add or modify them (quite unlike in the first example above). analogous code but using 'dimnames-' on a matrix performs a side effect on the matrix even if it initially does not have dimnames: x = matrix(1,1,1) dimnames(x) # NULL 'dimnames-'(x, list('foo', 'bar')) dimnames(x) # list(foo, bar) this is incoherent with the first example above, in that in both cases the structure initially has no names or dimnames attribute, but the end result is different in the two examples. is there something i misunderstand here? Only the ideology/pragmatism... In principle, R has call-by-value semantics and a function does not destructively modify its arguments(*), and foo(x)-bar behaves like x - foo-(x, bar). HOWEVER, this has obvious performance repercussions (think x - rnorm(1e7); x[1] - 0), so we do allow destructive modification by replacement functions, PROVIDED that the x is not used by anything else. On the least suspicion that something else is using the object, a copy of x is made before the modification. So (A) you should not use code like y - foo-(x, bar) because (B) you cannot (easily) predict whether or not x will be modified destructively that's fine, thanks, but i must be terribly stupid as i do not see how this explains the examples above. where is the x used by something else in the first example, so that 'names-'(x, 'foo') does *not* modify x destructively, while it does in the other cases? i just can't see how your explanation fits the examples -- it probably does, but i beg you show it explicitly. thanks. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Stavros Macrakis wrote: (B) you cannot (easily) predict whether or not x will be modified destructively that's fine, thanks, but i must be terribly stupid as i do not see how this explains the examples above. where is the x used by something else in the first example, so that 'names-'(x, 'foo') does *not* modify x destructively, while it does in the other cases? i just can't see how your explanation fits the examples -- it probably does, but i beg you show it explicitly. I think the following shows what Peter was referring to: In this case, there is only one pointer to the value of x: x - c(1,2) names-(x,foo) foo NA 12 x foo NA 12 In this case, there are two: x - c(1,2) y - x names-(x,foo) foo NA 12 x [1] 1 2 y [1] 1 2 that is and was clear to me, but none of my examples was of the second form, and hence i think peter's answer did not answer my question. what's the difference here: x = 1 'names-'(x, 'foo') names(x) # NULL x = c(foo=1) 'names-'(x, 'foo') names(x) # foo certainly not something like what you show. what's the difference here: x = 1 'names-'(x, 'foo') names(x) # NULL x = 1:2 'names-'(x, c('foo', 'bar')) names(x) # foo bar certainly not something like what you show. It seems as though `names-` and the like cannot be treated as R functions (which do not modify their arguments) but as special internal routines which do sometimes modify their arguments. they seem to behave somewhat like macros: 'names-'(a, b) with the destructive 'names-' is sort of replaced with a = 'names-'(a, b) with a functional 'names-'. but this still does not explain the incoherence above. my problem was and is not that 'names-' is not a pure function, but that it sometimes is, sometimes is not, without any obvious explanation. that is, i suspect (not claim) that the behaviour is not a design feature, but an incident. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
Peter Dalgaard wrote: (*) unless you mess with match.call() or substitute() and the like. But that's a different story. different or not, it is a story that happens quite often -- too often, perhaps -- to the degree that one may be tempted to say that the semantics of argument passing in r is a mess. which of course is not true, but since it is possible to mess with match.call co, people (including r core) do mess with them, and the result is obviously a mess. on top of the clear call-by-need semantics -- and on the surface, you cannot tell how the arguments of a function will be taken (by value? by reference? not at all?), which in effect looks like a messy semantics. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] logical comparison of functions (PR#13588)
Duncan Murdoch wrote: On 10/03/2009 4:35 PM, michael_ka...@earthlink.net wrote: Full_Name: Michael Aaron Karsh Version: 2.8.0 OS: Windows XP Submission from: (NULL) (164.67.71.215) When I try to say if (method==f), where f is a function, it says that the comparison is only possible for list and atomic types. I tried saying if (method!=f), and it gave the same error message. Would it be possible to repair it say that == and != comparisons would be possible for functions? This is not a bug. Please don't report things as bugs when they aren't. == and != are for atomic vectors, as documented. Use identical() for more general comparisons, as documented on the man page for ==. note that in most programming languages comparing function objects is either not supported or returns false unless you compare a function object to itself. r is a notable exception: identical(function(a) a, function(a) a) # TRUE which would be false in all other languages i know; however, identical(function(a) a, function(b) b) # FALSE though they are surely identical functionally. btw. it's not necessarily intuitive that == works only for atomic vectors. vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] surprising behaviour of names-
i got an offline response saying that my original post may have not been clear as to what the problem was, essentially, and that i may need to restate it in words, in addition to code. the problem is: the performance of 'names-' is incoherent, in that in some situations it acts in a functional manner, producing a copy of its argument with the names changed, while in others it changes the object in-place (and returns it), without copying first. your explanation below is of course valid, but does not seem to address the issue. in the examples below, there is always (or so it seems) just one reference to the object. why are the following functional: x = 1; 'names-'(x, 'foo'); names(x) x = 'foo'; 'names-'(x, 'foo'); names(x) while these are destructive: x = c(1); 'names-'(x, 'foo'); names(x) x = c('foo'); 'names-'(x, 'foo'); names(x) it is claimed that in r a singular value is a one-element vector, and indeed, identical(1, c(1)) # TRUE all.equal(is(1), is(c(1))) # TRUE i also do not understand the difference here: x = c(1); 'names-'(x, 'foo'); names(x) # foo x = c(1); names(x); 'names-'(x, 'foo'); names(x) # foo x = c(1); print(x); 'names-'(x, 'foo'); names(x) # NULL x = c(1); print(c(x)); 'names-'(x, 'foo'); names(x) # foo does print, but not names, increase the reference count for x when applied to x, but not to c(x)? if the issue is that there is, in those examples where x is left unchanged, an additional reference to x that causes the value of x to be copied, could you please explain how and when this additional reference is created? thanks, vQ Peter Dalgaard wrote: is there something i misunderstand here? Only the ideology/pragmatism... In principle, R has call-by-value semantics and a function does not destructively modify its arguments(*), and foo(x)-bar behaves like x - foo-(x, bar). HOWEVER, this has obvious performance repercussions (think x - rnorm(1e7); x[1] - 0), so we do allow destructive modification by replacement functions, PROVIDED that the x is not used by anything else. On the least suspicion that something else is using the object, a copy of x is made before the modification. So (A) you should not use code like y - foo-(x, bar) because (B) you cannot (easily) predict whether or not x will be modified destructively (*) unless you mess with match.call() or substitute() and the like. But that's a different story. -- --- Wacek Kusnierczyk, MD PhD Email: w...@idi.ntnu.no Phone: +47 73591875, +47 72574609 Department of Computer and Information Science (IDI) Faculty of Information Technology, Mathematics and Electrical Engineering (IME) Norwegian University of Science and Technology (NTNU) Sem Saelands vei 7, 7491 Trondheim, Norway Room itv303 Bioinformatics Gene Regulation Group Department of Cancer Research and Molecular Medicine (IKM) Faculty of Medicine (DMF) Norwegian University of Science and Technology (NTNU) Laboratory Center, Erling Skjalgsons gt. 1, 7030 Trondheim, Norway Room 231.05.060 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] E``rrors in recursive default argument references
Stavros Macrakis wrote: Tested in: R version 2.8.1 (2008-12-22) / Windows Recursive default argument references normally give nice clear errors. In the first set of examples, you get the error: Error in ... : promise already under evaluation: recursive default argument reference or earlier problems? (function(a = a) a ) () (function(a = a) c(a) ) () (function(a = a) a[1] ) () (function(a = a) a[[1]] ) () (function(a = a) a$x) () (function(a = a) mean(a) ) () (function(a = a) sort(a) ) () (function(a = a) as.list(a) ) () But in the following examples, R seems not to detect the 'promise already under evaluation' condition and instead gets a stack overflow, with the error message: Error: C stack usage is too close to the limit when i run these examples, the execution seems to get into an endless loop with no error messages whatsoever. how much time does it take before you get the error? (using r 2.8.0 and also the latest r-devel). vQ (function(a = a) (a)) () (function(a = a) -a ) () btw. ?'-' talks about '-' as a *binary* operator, but the only example given there which uses '-' uses it as a *unary* operator. since '-'() complains that '-' takes 1 or 2 arguments, it might be a good idea to acknowledge it in the man page. (function(a = a) var(a) ) () (function(a = a) sum(a) ) () (function(a = a) is.vector(a) ) () (function(a = a) as.numeric(a) ) () I don't understand why the two sets of examples behave differently. a bug in excel? vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] question
ivo...@gmail.com wrote: Gentlemen---these are all very clever workarounds, but please forgive me for voicing my own opinion: IMHO, returning multiple values in a statistical language should really be part of the language itself. there should be a standard syntax of some sort, whatever it may be, that everyone should be able to use and which easily transfers from one local computer to another. It should not rely on clever hacks in the .Rprofile that are different from user to user, and which leave a reader of end user R code baffled at first by all the magic that is going on. Even the R tutorials for beginners should show a multiple-value return example right at the point where function calls and return values are first explained. hi again, i was playing a bit with the idea of multiple assignment, and came up with a simple codebit [1] that redefines the operator '='. it hasn't been extensively tested and is by no means foolproof, but allows various sorts of tricks with multiple assignments: source('http://miscell.googlecode.com/svn/rvalues/rvalues.r', local=TRUE) a = function(n) 1:n # a is a function b = a(3) # b is c(1, 2, 3) c(c, d) = a(1) # c is 1, d is NULL c(a, b) = list(b, a) # swap: a is 1:3, b is a function # these are equivalent: c(a, b) = 1:2 {a; b} = 1:2 list(a, b) = 1:2 a = data.frame(x=1:3, y=3) # a is a 2-column data frame c(a, b) = data.frame(x=1:3, b=3) # a is c(1, 2, 3), b is c(3, 3, 3) and so on. this is sort of pattern matching as in some functional languages, but only sort of: it does not do recursive matching, for example: c(c(a, b), c) = list(1:2, 3) # error # not: a = 1, b = 2, c = 3 anyway, it's just a toy for which there is no need. vQ [1] svn checkout */http/*://miscell.googlecode.com/svn/rvalues __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] question
mark.braving...@csiro.au wrote: The syntax for returning multiple arguments does not strike me as particularly appealing. would it not possible to allow syntax like: f= function() { return( rnorm(10), rnorm(20) ) } (a,d$b) = f() FWIW, my own solution is to define a multi-assign operator: '%-%' - function( a, b){ # a must be of the form '{thing1;thing2;...}' a - as.list( substitute( a))[-1] e - sys.parent() stopifnot( length( b) == length( a)) for( i in seq_along( a)) eval( call( '-', a[[ i]], b[[i]]), envir=e) NULL } you might want to have the check less stringent, so that rhs may consist of more values that the lhs has variables. or even skip the check and assign NULL to a[i] for i length(b). another idea is to allow %-% to be used with just one variable on the lhs. here's a modified version: '%-%' - function(a, b){ a - as.list( substitute(a)) if (length(a) 1) a - a[-1] if (length(a) length(b)) b - c(b, rep(list(NULL), length(a) - length(b))) e - sys.parent() for( i in seq_along( a)) eval( call( '-', a[[ i]], b[[i]]), envir=e) NULL } {a; b} %-% 1:2 # a = 1; b = 2 a %-% 3:4 # a = 3 {a; b} %-% 5 # a = 5; b = NULL vQ __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] question
ivo...@gmail.com wrote: Gentlemen---these are all very clever workarounds, hacks around the lack of a feature but please forgive me for voicing my own opinion: IMHO, returning multiple values in a statistical language should really be part of the language itself. returning multiple values is supported by many programming languages, in particular scripting languages. while in r you can use the %-% hack or have functions return lists of values, it could indeed be useful to have such a feature in a statistical language like r. there should be a standard syntax of some sort, if you mean that r should have such a syntax, you're likely to learn more about saying 'should' soon. whatever it may be, that everyone should be able to use and which easily transfers from one local computer to another. It should not rely on clever hacks in the .Rprofile that are different from user to user, and which leave a reader of end user R code baffled at first by all the magic that is going on. Even the R tutorials for beginners should show a multiple-value return example right at the point where function calls and return values are first explained. as gabor says in another post, you probably should first show why having multiple value returns would be useful in r. however, i don't think there are good counterarguments anyway, and putting on you the burden of proving a relatively obvious (or not so?) thing is a weak escape. to call for a reference, sec. 9.2.3, p. 450+ in [1] provides some discussion and examples. vQ [1] Design Concepts in Programming Languages, Turbak and Gifford with Sheldon, MIT 2008 __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel