Re: [Rd] Improvement of SignRank functions
Hi Ivo, IU == Ivo Ugrina [EMAIL PROTECTED] on Fri, 14 Dec 2007 23:03:37 +0100 writes: IU I took some time and liberty and tried to improve IU existing implementation of SignRank functions IU in R. (dsignrank, ...) IU As I have seen they've been based on csignrank. IU So I modified csignrank and, I believe, IU improved calculation time and memory efficiency. do you have evidence for your belief? i.e. a set of system.time(.) calls where you see the difference? IU The idea is basically the same. I use the same recursion IU as original author (Kurt Hornik) IU used with one slight modification. IU I am generating Wilcoxon SignRank density from the IU beginning (for n=2,3,...) with the help of recursion formula. IU What is changed? IU There is no need for SINGRANK_MAX in src/nmath/nmath.h anymore. IU Only functions: IU static void w_free() IU void signrank_free() IU static void w_init_maybe(int n) IU static double csignrank(int k, int n) IU in src/nmath/signrank.c have been altered. IU There was no change to dsignrank, psignrank, ... IU I've tried to make as little changes as possible. IU So, to compile with new functions only src/nmath/signrank.c IU needs to be changed with the one given in attachment. IU I hope it really is an improvement. Me too :-) {Waiting for your test results} The code does look slightly simpler, I agree. BTW: If you had a smart idea to *not* use a static 'w' and still be memory efficient, that could lead to make that code thread-safe, but I am not at all sure this is possible without using thread-library C code. Martin __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Improvement of SignRank functions
Martin Maechler wrote: do you have evidence for your belief? i.e. a set of system.time(.) calls where you see the difference? system.time(dsignrank(17511, 400)) user system elapsed 1.010 0.120 1.145 system.time(dsignrank((0:17511), 400)) user system elapsed 1.250.131.40 system.time(dsignrank((0:17511), 500)) user system elapsed 2.040 0.220 2.296 system.time(psignrank((0:17511), 600)) user system elapsed 20.670 0.580 21.403 system.time(qsignrank(0.56, 300)) user system elapsed 0.700 0.050 0.753 == system.time(dsignrank(17511, 400)) user system elapsed 0.070 0.000 0.078 system.time(dsignrank((0:17511), 400)) user system elapsed 0.100 0.000 0.104 system.time(dsignrank((0:17511), 500)) user system elapsed 0.160 0.000 0.164 system.time(psignrank((0:17511), 600)) user system elapsed 16.330 0.370 16.729 system.time(qsignrank(0.56, 300)) user system elapsed 0.020 0.010 0.029 system.time(dsignrank((0:2), 600)) user system elapsed 3.470 0.280 3.745 RAM: ~130MB == system.time(dsignrank((0:2), 600)) user system elapsed 0.250 0.010 0.26 RAM: ~1MB BTW: If you had a smart idea to *not* use a static 'w' and still be memory efficient, that could lead to make that code thread-safe, but I am not at all sure this is possible without using thread-library C code. I'll look into it. With respect, -- Ivo Ugrina ICQ: 47508335 | www.iugrina.com --- baza matematickih pojmova http://baza.iugrina.com --- anime, manga, Japan fanzin http://yoshi.iugrina.com __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Improvement of SignRank functions
Hi Ivo, IU == Ivo Ugrina [EMAIL PROTECTED] on Sat, 15 Dec 2007 14:13:10 +0100 writes: IU Martin Maechler wrote: do you have evidence for your belief? i.e. a set of system.time(.) calls where you see the difference? IU system.time(dsignrank(17511, 400)) IU user system elapsed IU 1.010 0.120 1.145 IU system.time(dsignrank((0:17511), 400)) IU user system elapsed IU 1.250.131.40 IU system.time(dsignrank((0:17511), 500)) IU user system elapsed IU 2.040 0.220 2.296 IU system.time(psignrank((0:17511), 600)) IU user system elapsed IU 20.670 0.580 21.403 IU system.time(qsignrank(0.56, 300)) IU user system elapsed IU 0.700 0.050 0.753 IU == IU system.time(dsignrank(17511, 400)) IU user system elapsed IU 0.070 0.000 0.078 IU system.time(dsignrank((0:17511), 400)) IU user system elapsed IU 0.100 0.000 0.104 IU system.time(dsignrank((0:17511), 500)) IU user system elapsed IU 0.160 0.000 0.164 IU system.time(psignrank((0:17511), 600)) IU user system elapsed IU 16.330 0.370 16.729 IU system.time(qsignrank(0.56, 300)) IU user system elapsed IU 0.020 0.010 0.029 IU system.time(dsignrank((0:2), 600)) IU user system elapsed IU 3.470 0.280 3.745 IU RAM: ~130MB IU == IU system.time(dsignrank((0:2), 600)) IU user system elapsed IU 0.250 0.010 0.26 IU RAM: ~1MB that's quite convincing; thank you! and I can verify part of it on my computer. I think I'd just commit your signrank.c (with a few cosmetic changes) to the sources, right? *Not* using a static with all the previously computed counts is probably not possible without a (CPU time) efficiency loss; and to make this thread-safe one could use a thread-global array, but how to do that would really depend on the threading system used, and that's not at all given. Thank you for your contribution! Martin BTW: If you had a smart idea to *not* use a static 'w' and still be memory efficient, that could lead to make that code thread-safe, but I am not at all sure this is possible without using thread-library C code. IU I'll look into it. IU With respect, IU -- IU Ivo Ugrina IU ICQ: 47508335 | www.iugrina.com IU --- IU baza matematickih pojmova IU http://baza.iugrina.com IU --- IU anime, manga, Japan fanzin IU http://yoshi.iugrina.com __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] S4 class extending data.frame?
Thanks, Martin. In the short term (a) seems best. In the long run we may try (c), because there are other things that data.frame doesn't do that we want it to do (i.e., allow arbitrary objects with [ methods, print methods, and the same length to be bound together, rather than being restricted to atomic vectors + Date/factor). cheers Ben Martin Morgan wrote: Ben, Oleg -- Some solutions, which you've probably already thought of, are (a) move the data.frame into its own slot, instead of extending it, (b) manage the data.frame attributes yourself, or (c) reinvent the data.frame from scratch as a proper S4 class (e.g., extending 'list' with validity constraints on element length and homogeneity of element content). (b) places a lot of dependence on understanding the data.frame implementation, and is probably too tricky (for me) to get right,(c) is probably also tricky, and probably caries significant performance overhead (e.g., object duplication during validity checking). (a) means that you don't get automatic method inheritance. On the plus side, you still get the structure. It is trivial to implement methods like [, [[, etc to dispatch on your object and act on the appropriate slot. And in some sense you now know what methods i.e., those you've implemented, are supported on your object. Oleg, here's my cautionary tale for extending list, where manually subsetting the .Data slot mixes up the names (callNextMethod would have done the right thing, but was not appropriate). This was quite a subtle bug for me, because I hadn't been expecting named lists in my object; the problem surfaced when sapply used the (incorrectly subset) names attribute of the list. My solution in this case was to make sure 'names' were removed from lists used to construct objects. As a consequence I lose a nice little bit of sapply magic. setClass('A', 'list') [1] A setMethod('[', 'A', function(x, i, j, ..., drop=TRUE) { + [EMAIL PROTECTED] - [EMAIL PROTECTED] + x + }) [1] [ names(new('A', list(x=1, y=2))[2]) [1] x Martin Oleg Sklyar [EMAIL PROTECTED] writes: I had the same problem. Generally data.frame's behave like lists, but while you can extend list, there are problems extending a data.frame class. This comes down to the internal representation of the object I guess. Vectors, including list, contain their information in a (hidden) slot .Data (see the example below). data.frame's do not seem to follow this convention. Any idea how to go around? The following example is exactly the same as Ben's for a data.frame, but using a list. It works fine and one can see that the list structure is stored in .Data * ~: R R version 2.6.1 (2007-11-26) setClass(c3,representation(comment=character),contains=list) [1] c3 l = list(1:3,2:4) z3 = new(c3,l,comment=hello) z3 An object of class “c3” [[1]] [1] 1 2 3 [[2]] [1] 2 3 4 Slot comment: [1] hello [EMAIL PROTECTED] [[1]] [1] 1 2 3 [[2]] [1] 2 3 4 Regards, Oleg On Thu, 2007-12-13 at 00:04 -0500, Ben Bolker wrote: -BEGIN PGP SIGNED MESSAGE- Hash: SHA1 I would like to build an S4 class that extends a data frame, but includes several more slots. Here's an example using integer as the base class instead: setClass(c1,representation(comment=character),contains=integer) z1 = new(c1,55,comment=hello) z1 z1+10 z1[1] [EMAIL PROTECTED] -- in other words, it behaves exactly as an integer for access and operations but happens to have another slot. If I do this with a data frame instead, it doesn't seem to work at all. setClass(c2,representation(comment=character),contains=data.frame) d = data.frame(1:3,2:4) z2 = new(c2,d,comment=goodbye) z2 ## data all gone!! z2[,1] ## Error ... object is not subsettable [EMAIL PROTECTED] ## still there I can achieve approximately the same effect by adding attributes, but I was hoping for the structure of S4 classes ... Programming with Data and the R Language Definition contain 2 references each to data frames, and neither of them has allowed me to figure out this behavior. (While I'm at it: it would be wonderful to have a rich data frame that could include as a column any object that had an appropriate length and [ method ... has anyone done anything in this direction? ?data.frame says the allowable types are (numeric, logical, factor and character and so on), but I'm having trouble sorting out what the limitations are ...) hoping for enlightenment (it would be lovely to be shown how to make this work, but a definitive statement that it is impossible would be useful too). cheers Ben Bolker -BEGIN PGP SIGNATURE- Version: GnuPG v1.4.6 (GNU/Linux) Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org iD8DBQFHYL1pc5UpGjwzenMRAqErAJ9jj1KgVVSGIf+DtK7Km/+JBaDu2QCaAkl/ eMi+WCEWK6FPpVMpUbo+RBQ= =huvz -END PGP SIGNATURE-
Re: [Rd] Improvement of SignRank functions
Hi Martin, Martin Maechler wrote: that's quite convincing; thank you! and I can verify part of it on my computer. :D I think I'd just commit your signrank.c (with a few cosmetic changes) to the sources, right? Right! There is no need for SIGNRANK_MAX in src/nmath/nmath.h anymore. Thank you for your contribution! You're welcome. With respect, -- Ivo Ugrina ICQ: 47508335 | www.iugrina.com --- baza matematickih pojmova http://baza.iugrina.com --- anime, manga, Japan fanzin http://yoshi.iugrina.com __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] List comprehensions for R
Gabor, Thank you for drawing this previous work to my attention. I've attached below code that extends the list comprehension to include logical 'guard' expressions, as in leap.years - .[ x ~ x - 1900:2100 | (x %% 400 == 0 || x %% 100 != 0 x %% 4 == 0) ] leap.years [1] 1904 1908 1912 1916 1920 1924 1928 1932 1936 1940 1944 1948 1952 1956 1960 [16] 1964 1968 1972 1976 1980 1984 1988 1992 1996 2000 2004 2008 2012 2016 2020 [31] 2024 2028 2032 2036 2040 2044 2048 2052 2056 2060 2064 2068 2072 2076 2080 [46] 2084 2088 2092 2096 I wonder, would many (most?) R users be mathematically-trained statisticians first, and programmers second, and therefore find a mathematical notation like the list comprehension more natural than less declarative programming constructs? I would be genuinely interested in your (and others') thoughts on that question, based on your knowledge of the R user community. Regards, David Gabor Grothendieck wrote: That seems quite nice. Note that there has been some related code posted. See: http://tolstoy.newcastle.edu.au/R/help/03b/6406.html which discusses some R idioms for list comprehensions. Also the gsubfn package has some functionality in this direction. We preface any function with fn$ to allow functions in its arguments to be specified as formulas. Its more R-ish than your code and applies to more than just list comprehensions while your code is more faithful to list comprehensions. ## Updated to include logical guards in list comprehensions ## ## Define syntax for list/vector/array comprehensions ## . - structure(NA, class=comprehension) comprehend - function(expr, vars, seqs, guard, comprehension=list()){ if(length(vars)==0){ # base case of recursion if(eval(guard)) comprehension[[length(comprehension)+1]] - eval(expr) } else { for(elt in eval(seqs[[1]])){ assign(vars[1], elt, inherits=TRUE) comprehension - comprehend(expr, vars[-1], seqs[-1], guard, comprehension) } } comprehension } ## List comprehensions specified by close approximation to set-builder notation: ## ## { x+y | 0x9, 0yx, x*y30 } --- .[ x+y ~ {x-0:9; y-0:x} | x*y30 ] ## [.comprehension - function(x, f){ f - substitute(f) ## First, we pluck out the optional guard, if it is present: if(is.call(f) is.call(f[[3]]) f[[3]][[1]]=='|'){ guard - f[[3]][[3]] f[[3]] - f[[3]][[2]] } else { guard - TRUE } ## To allow omission of braces around a lone comprehension generator, ## as in 'expr ~ var - seq' we make allowances for two shapes of f: ## ## (1)(`-` (`~` expr ## var) ## seq) ## and ## ## (2)(`~` expr ## (`{` (`-` var1 seq1) ## (`-` var2 seq2) ## ... ## (`-` varN - seqN))) ## ## In the former case, we set gens - list(var - seq), unifying the ## treatment of both shapes under the latter, more general one. syntax.error - Comprehension expects 'expr ~ {x1 - seq1; ... ; xN - seqN}'. if(!is.call(f) || (f[[1]]!='-' f[[1]]!='~')) stop(syntax.error) if(is(f,'-')){ # (1) lhs - f[[2]] if(!is.call(lhs) || lhs[[1]] != '~') stop(syntax.error) expr - lhs[[2]] var - as.character(lhs[[3]]) seq - f[[3]] gens - list(call('-', var, seq)) } else { # (2) expr - f[[2]] gens - as.list(f[[3]])[-1] if(any(lapply(gens, class) != '-')) stop(syntax.error) } ## Fill list comprehension .LC vars - as.character(lapply(gens, function(g) g[[2]])) seqs - lapply(gens, function(g) g[[3]]) .LC - comprehend(expr, vars, seqs, guard) ## Provided the result is rectangular, convert it to a vector or array ## TODO: Extend to handle .LC structures more than 2-deep. ## TODO: Avoid rectangularizing nested comprehensions along guarded dimensions? if(!length(.LC)) return(.LC) dim1 - dim(.LC[[1]]) if(is.null(dim1)){ lengths - sapply(.LC, length) if(all(lengths == lengths[1])){ # rectangular .LC - unlist(.LC) if(lengths[1] 1) # matrix dim(.LC) - c(lengths[1], length(lengths)) } else { # ragged # leave .LC as a list } } else { # elements of .LC have dimension dim - c(dim1, length(.LC)) .LC - unlist(.LC) dim(.LC) - dim } .LC } __ R-devel@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-devel
Re: [Rd] Wrong length of POSIXt vectors (PR#10507)
TP == Tony Plate [EMAIL PROTECTED] on Fri, 14 Dec 2007 13:58:30 -0700 writes: TP Duncan Murdoch wrote: On 12/13/2007 1:59 PM, Tony Plate wrote: Duncan Murdoch wrote: On 12/11/2007 6:20 AM, [EMAIL PROTECTED] wrote: Full_Name: Petr Simecek Version: 2.5.1, 2.6.1 OS: Windows XP Submission from: (NULL) (195.113.231.2) Several times I have experienced that a length of a POSIXt vector has not been computed right. Example: tv-structure(list(sec = c(50, 0, 55, 12, 2, 0, 37, NA, 17, 3, 31 ), min = c(1L, 10L, 11L, 15L, 16L, 18L, 18L, NA, 20L, 22L, 22L ), hour = c(12L, 12L, 12L, 12L, 12L, 12L, 12L, NA, 12L, 12L, 12L), mday = c(13L, 13L, 13L, 13L, 13L, 13L, 13L, NA, 13L, 13L, 13L), mon = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, NA, 5L, 5L, 5L), year = c(105L, 105L, 105L, 105L, 105L, 105L, 105L, NA, 105L, 105L, 105L), wday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L), yday = c(163L, 163L, 163L, 163L, 163L, 163L, 163L, NA, 163L, 163L, 163L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, -1L, 1L, 1L, 1L)), .Names = c(sec, min, hour, mday, mon, year, wday, yday, isdst ), class = c(POSIXt, POSIXlt)) print(tv) # print 11 time points (right) length(tv) # returns 9 (wrong) tv is a list of length 9. The answer is right, your expectation is wrong. I have tried that on several computers with/without switching to English locales, i.e. Sys.setlocale(LC_TIME, en). I have searched a help pages but I cannot imagine how that could be OK. See this in ?POSIXt: Class 'POSIXlt' is a named list of vectors... You could define your own length measurement as length.POSIXlt - function(x) length(x$sec) and you'll get the answer you expect, but be aware that length.XXX methods are quite rare, and you may surprise some of your users. On the other hand, isn't the fact that length() currently always returns 9 for POSIXlt objects likely to be a surprise to many users of POSIXlt? The back of The New S Language says Easy-to-use facilities allow you to organize, store and retrieve all sorts of data. ... S functions and data organization make applications easy to write. Now, POSIXlt has methods for c() and vector subsetting [ (and many other vector-manipulation methods - see methods(class=POSIXlt)). Hence, from the point of view of intending to supply easy-to-use facilities ... [for] all sorts of data, isn't it a little incongruous that length() is not also provided -- as 3 functions (any others?) comprise a core set of vector-manipulation functions? Would it make sense to have an informal prescription (e.g., in R-exts) that a class that implements a vector-like object and provides at least of one of functions 'c', '[' and 'length' should provide all three? It would also be easy to describe a test-suite that should be included in the 'test' directory of a package implementing such a class, that had some tests of the basic vector-manipulation functionality, such as: # at this point, x0, x1, x3, x10 should exist, as vectors of the # class being tested, of length 0, 1, 3, and 10, and they should # contain no duplicate elements length(x0) [1] 1 length(c(x0, x1)) [1] 2 length(c(x1,x10)) [1] 11 all(x3 == x3[seq(len=length(x3))]) [1] TRUE all(x3 == c(x3[1], x3[2], x3[3])) [1] TRUE length(c(x3[2], x10[5:7])) [1] 4 It would also be possible to describe a larger set of vector manipulation functions that should be implemented together, including e.g., 'rep', 'unique', 'duplicated', '==', 'sort', '[-', 'is.na', head, tail ... (many of which are provided for POSIXlt). Or is there some good reason that length() cannot be provided (while 'c' and '[' can) for some vector-like classes such as POSIXlt? What you say sounds good in general, but the devil is in the details. Changing the meaning of length(x) for some objects has fairly widespread effects. Are they all positive? I don't know. Adding a prescription like the one you suggest would be good if it's easy to implement, but bad if it's already widely violated. How many base or CRAN or Bioconductor packages violate it currently? Do the ones that provide all 3 methods do so in a consistent way, i.e. does length(x) mean the same thing in all of them? TP I'm not sure doing something like this would be so bad even if it is TP already widely violated. R has evolved significantly over time, and TP many rough edges have been cleaned up, sometimes in ways that were not TP backward compatible.
Re: [Rd] Wrong length of POSIXt vectors (PR#10507)
If it were simply deprecated and then changed then everyone using it would get a warning during the period of deprecation so it would not be so bad. Given that its current behavior is not very useful I suspect its not widely used anyways. | haven't followed the whole discussion so sorry if these points have already been made. On Dec 15, 2007 5:17 PM, Martin Maechler [EMAIL PROTECTED] wrote: TP == Tony Plate [EMAIL PROTECTED] on Fri, 14 Dec 2007 13:58:30 -0700 writes: TP Duncan Murdoch wrote: On 12/13/2007 1:59 PM, Tony Plate wrote: Duncan Murdoch wrote: On 12/11/2007 6:20 AM, [EMAIL PROTECTED] wrote: Full_Name: Petr Simecek Version: 2.5.1, 2.6.1 OS: Windows XP Submission from: (NULL) (195.113.231.2) Several times I have experienced that a length of a POSIXt vector has not been computed right. Example: tv-structure(list(sec = c(50, 0, 55, 12, 2, 0, 37, NA, 17, 3, 31 ), min = c(1L, 10L, 11L, 15L, 16L, 18L, 18L, NA, 20L, 22L, 22L ), hour = c(12L, 12L, 12L, 12L, 12L, 12L, 12L, NA, 12L, 12L, 12L), mday = c(13L, 13L, 13L, 13L, 13L, 13L, 13L, NA, 13L, 13L, 13L), mon = c(5L, 5L, 5L, 5L, 5L, 5L, 5L, NA, 5L, 5L, 5L), year = c(105L, 105L, 105L, 105L, 105L, 105L, 105L, NA, 105L, 105L, 105L), wday = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, NA, 1L, 1L, 1L), yday = c(163L, 163L, 163L, 163L, 163L, 163L, 163L, NA, 163L, 163L, 163L), isdst = c(1L, 1L, 1L, 1L, 1L, 1L, 1L, -1L, 1L, 1L, 1L)), .Names = c(sec, min, hour, mday, mon, year, wday, yday, isdst ), class = c(POSIXt, POSIXlt)) print(tv) # print 11 time points (right) length(tv) # returns 9 (wrong) tv is a list of length 9. The answer is right, your expectation is wrong. I have tried that on several computers with/without switching to English locales, i.e. Sys.setlocale(LC_TIME, en). I have searched a help pages but I cannot imagine how that could be OK. See this in ?POSIXt: Class 'POSIXlt' is a named list of vectors... You could define your own length measurement as length.POSIXlt - function(x) length(x$sec) and you'll get the answer you expect, but be aware that length.XXX methods are quite rare, and you may surprise some of your users. On the other hand, isn't the fact that length() currently always returns 9 for POSIXlt objects likely to be a surprise to many users of POSIXlt? The back of The New S Language says Easy-to-use facilities allow you to organize, store and retrieve all sorts of data. ... S functions and data organization make applications easy to write. Now, POSIXlt has methods for c() and vector subsetting [ (and many other vector-manipulation methods - see methods(class=POSIXlt)). Hence, from the point of view of intending to supply easy-to-use facilities ... [for] all sorts of data, isn't it a little incongruous that length() is not also provided -- as 3 functions (any others?) comprise a core set of vector-manipulation functions? Would it make sense to have an informal prescription (e.g., in R-exts) that a class that implements a vector-like object and provides at least of one of functions 'c', '[' and 'length' should provide all three? It would also be easy to describe a test-suite that should be included in the 'test' directory of a package implementing such a class, that had some tests of the basic vector-manipulation functionality, such as: # at this point, x0, x1, x3, x10 should exist, as vectors of the # class being tested, of length 0, 1, 3, and 10, and they should # contain no duplicate elements length(x0) [1] 1 length(c(x0, x1)) [1] 2 length(c(x1,x10)) [1] 11 all(x3 == x3[seq(len=length(x3))]) [1] TRUE all(x3 == c(x3[1], x3[2], x3[3])) [1] TRUE length(c(x3[2], x10[5:7])) [1] 4 It would also be possible to describe a larger set of vector manipulation functions that should be implemented together, including e.g., 'rep', 'unique', 'duplicated', '==', 'sort', '[-', 'is.na', head, tail ... (many of which are provided for POSIXlt). Or is there some good reason that length() cannot be provided (while 'c' and '[' can) for some vector-like classes such as POSIXlt? What you say sounds good in general, but the devil is in the details. Changing the meaning of length(x) for some objects has fairly widespread effects. Are they all positive? I don't know. Adding a prescription like the one you suggest would be good if it's easy to implement, but bad if it's already widely violated. How many base or CRAN or Bioconductor packages violate it currently? Do the ones that provide all 3 methods do so in a