Re: [R] computing distance in miles or km between 2 street addresses
On 06-Sep-07 18:42:32, Philip James Smith wrote:
> Hi R-ers:
>
> I need to compute the distance between 2 street addresses in
> either km or miles. I do not care if the distance is a "shortest
> driving route" or if it is "as the crow flies."
>
> Does anybody know how to do this? Can it be done in R? I have
> thousands of addresses, so I think that Mapquest is out of the
> question!
>
> Please reply to: [EMAIL PROTECTED]
>
> Thank you!
> Phil Smith

That's a somewhat ill-posed question! For a start, you will need a
database of some kind: either of geographical locations (coordinates)
of street addresses, or of the metric of the road network, with the
capability to identify the street addresses in the database.

If it's just "as the crow flies", then the distance can be
straightforwardly computed in R, either by Pythagoras (when the points
are not too far apart) or using a function which takes account of the
shape of the Earth. There are many R packages which have to do with
mapping data. Search for "map" through the list of R packages at

  http://finzi.psych.upenn.edu/R/library/maptools/html/00Index.html

-- maptools in particular. Also look at (for instance) aspace.

For "shortest driving route" you need to find the shortest distance
through a network. You may find some hints in the package optim --
but there must be some R experts out there on this sort of thing!

However, the primary need is for the database which gives the distance
information in one form or another. What were you proposing to use for
this? As far as I know, R has no database relevant to street addresses!

Best wishes,
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 06-Sep-07  Time: 23:17:57
-- XFMail --

__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide
http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
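For the "as the crow flies" case, a function which takes account of the (approximately spherical) shape of the Earth can be sketched in a few lines. This haversine() helper is a made-up name, not from any package, and assumes coordinates are given in decimal degrees:

```r
## Great-circle ("as the crow flies") distance between two points given
## in decimal degrees, assuming a spherical Earth of radius 6371 km.
## haversine() is a made-up helper, not from any package.
haversine <- function(lat1, lon1, lat2, lon2, R = 6371) {
  to.rad <- pi/180                      # degrees -> radians
  lat1 <- lat1*to.rad; lon1 <- lon1*to.rad
  lat2 <- lat2*to.rad; lon2 <- lon2*to.rad
  a <- sin((lat2 - lat1)/2)^2 +
       cos(lat1)*cos(lat2)*sin((lon2 - lon1)/2)^2
  2*R*asin(sqrt(pmin(1, a)))            # pmin() guards against rounding > 1
}
haversine(0, 0, 0, 90)   # a quarter of the equator: pi*6371/2, about 10008 km
```

Being vectorized, it can be applied to thousands of address pairs at once -- once the coordinates have been obtained from some geocoding database, which remains the hard part.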
Re: [R] How to do ANOVA with fractional values and overcome the
On 06-Sep-07 06:55:36, Emre Sevinc wrote:
> I have exported a CSV file from my EXCEL worksheet and its last column
> contained decimal values:
>
> Subject;Group;Side;Difference;PercentError
> M3;1;1;;
> M5;1;1;375;18,75
> M8;1;1;250;14,58
> M10;1;1;500;12,50
> M12;1;1;375;25,00
> .
> .
> .
>
> When I tried to do an ANOVA test on it, R complained by giving an
> error:
>
>> Anova3LongAuditoryFemaleError.data <- read.csv("C:\\Documents and
>> Settings\\Administrator\\My
>> Documents\\CogSci\\tez\\Anova3LongAuditoryFemaleError.csv",
>> header = TRUE, sep = ";")

Perhaps you need to specify the "dec" option to read.csv():

  read.csv(file, header = TRUE, sep = ",", quote="\"", dec=".",
           fill = TRUE, ...)

i.e.

  Anova3LongAuditoryFemaleError.data <-
    read.csv("C:\\Documents and Settings\\Administrator\\My
Documents\\CogSci\\tez\\Anova3LongAuditoryFemaleError.csv",
             header = TRUE, sep = ";", dec=",")

Or else use read.csv2(), where dec = "," is the default. See ?read.csv
and read about both read.csv and read.csv2.

Otherwise, "18,75" is likely to be treated as text, and not as a
numerical value, and so PercentError will be converted into a factor
with character-valued levels.

Ted.

>> Anova3LongAuditoryFemaleError.aov = aov(PercentError ~ (Group * Side),
>> data = Anova3LongAuditoryFemaleError.data)
>
> Error in `storage.mode<-`(`*tmp*`, value = "double") :
>   invalid to change the storage mode of a factor
> In addition: Warning message:
> using type="numeric" with a factor response will be ignored in:
> model.response(mf, "numeric")
>
> What must I do in order to make the ANOVA test on these fractional
> data?
>
> Regards.
>
> --
> Emre Sevinc
>
> [[alternative HTML version deleted]]
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 06-Sep-07  Time: 09:39:04
-- XFMail --
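A self-contained illustration of the dec-option point, using textConnection() to stand in for the CSV file (made-up rows in the same layout as the post):

```r
## Reading ";"-separated data that uses "," as the decimal mark.
## The inline text stands in for the CSV file.
txt <- "Subject;Group;Side;Difference;PercentError
M5;1;1;375;18,75
M8;1;1;250;14,58"
d1 <- read.csv(textConnection(txt), sep = ";", dec = ",")
## read.csv2() has sep = ";" and dec = "," as its defaults:
d2 <- read.csv2(textConnection(txt))
d1$PercentError   # 18.75 14.58 -- numeric, not a factor
```

With dec="," the comma columns parse as numeric, so aov() will treat PercentError as a numeric response rather than a factor.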
Re: [R] length of a string
On 05-Sep-07 13:50:57, João Fadista wrote:
> Dear all,
>
> I would like to know how can I compute the length of a string in a
> dataframe. Example:
>
> SEQUENCE              ID
> TGCTCCCATCTCCACGG     HR04FS00645
> ACTGAACTCCCATCTCCAAT  HR0595847847
>
> I would like to know how to compute the length of each SEQUENCE.
>
> Best regards,
> João Fadista

  nchar("ACTGAACTCCCATCTCCAAT")
  [1] 20

seems to work. Find it, and related functions, with

  help.search("character")

As it happens, help.search("string") will not help!

Best wishes,
Ted.

--------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 05-Sep-07  Time: 15:05:22
-- XFMail --
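Applied to a whole data frame column, as the question asked (the two example rows from the post, column names as given there):

```r
## The two example rows from the question, placed in a data frame.
dna <- data.frame(SEQUENCE = c("TGCTCCCATCTCCACGG", "ACTGAACTCCCATCTCCAAT"),
                  ID       = c("HR04FS00645", "HR0595847847"),
                  stringsAsFactors = FALSE)
nchar(dna$SEQUENCE)   # 17 20
```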
Re: [R] integers
On 04-Sep-07 09:53:43, Christoph Heibl wrote:
> Hi list,
>
> The function is.integer tests if an object is of type integer:
>
> see e.g.:
>
> is.integer(12)  # FALSE
> is.real(12)     # TRUE
> mode(12)        # "numeric"
>
> But how can I test if a number is actually an integer? R search is
> difficult in this case because it mainly yields entries about the
> integer()-function family.
>
> Thanks for any hint!
> Christoph Heibl

I infer you want to test whether the real x has an integer value.

  ((x - floor(x)) == 0)

would do it (covers negative integers as well), though there may well
be a smarter way.

Best wishes,
Ted.

--------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 04-Sep-07  Time: 11:11:46
-- XFMail --
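One caveat worth adding: an exact `== 0` test can be defeated by floating-point error, so a version with a tolerance is safer. The name is.wholenumber() is made up here (a similar example appears in ?integer):

```r
## Testing whether a number is integer-valued, as opposed to being of
## type integer.  The tolerance guards against floating-point error;
## is.wholenumber() is a made-up name (a similar example appears in
## ?integer).
is.wholenumber <- function(x, tol = .Machine$double.eps^0.5) {
  abs(x - round(x)) < tol
}
is.wholenumber(12)         # TRUE
is.wholenumber(-3)         # TRUE
is.wholenumber(12.5)       # FALSE
is.wholenumber(sqrt(2)^2)  # TRUE, where the exact test (x - floor(x)) == 0 fails
```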
Re: [R] Derivative of a Function Expression
On 03-Sep-07 21:45:40, Alberto Monteiro wrote:
> Rory Winston wrote:
>>
>> I am currently (for pedagogical purposes) writing a simple numerical
>> analysis library in R. I have come unstuck when writing a simple
>> Newton-Raphson implementation, that looks like this:
>>
>> f <- function(x) { 2*cos(x)^2 + 3*sin(x) + 0.5 }
>>
>> root <- newton(f, tol=0.0001, N=20, a=1)
>>
>> My issue is calculating the symbolic derivative of f() inside the
>> newton() function.
>>
> If it's pedagogical, maybe returning to basics could help.
>
> What is f'(x)?
>
> It's the limit of (f(x + h) - f(x)) / h when h tends to zero.
>
> So, do it numerically: take a sufficiently small h and compute the
> limit. h must be small enough that h^2 f''(x) is much smaller than
> h f'(x), but big enough that f(x+h) is not f(x).
>
> numerical.derivative <- function(f, x, h = 0.0001)
> {
>   (f(x + h) - f(x)) / h
> }
>
> Ex:
>
> numerical.derivative(cos, pi)   # 5e-05, should be -sin(pi) = 0
> numerical.derivative(sin, pi)   # -1, ok
> numerical.derivative(exp, 0)    # 1.00005, close enough
> numerical.derivative(sqrt, 0)   # 100, should be Inf
>
> Alberto Monteiro

If you want to "go back to basics", it's worth noting that for a
function which has a derivative at x (which excludes your sqrt(x) at
x=0, despite the result you got above, since only the 1-sided limit as
h-->0+ exists):

  (f(x+h/2) - f(x-h/2))/h

is generally a far better approximation to f'(x) than is

  (f(x+h) - f(x))/h

since the term in h^2 * f''(x) in the expansion is automatically
eliminated! So the accuracy is O(h^2), not O(h) as with your
definition.

  num.deriv <- function(f, x, h=0.001){ (f(x + h/2) - f(x - h/2))/h }

  num.deriv(cos, pi)
  ## [1] 0
  num.deriv(sin, pi)
  ## [1] -1

but of course num.deriv(sqrt, 0) won't work, since it requires
evaluation of sqrt(-h/2). However, other comparisons with your
definition are possible:

  numerical.derivative(sin, pi/3) - 0.5
  ## [1] -4.33021e-05
  num.deriv(sin, pi/3) - 0.5
  ## [1] -2.083339e-08

Best wishes,
Ted.
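Putting the pieces together: a minimal newton() of the kind Rory described, using the central-difference derivative instead of a symbolic one. This is a sketch only; the newton(f, tol, N, a) interface is assumed from the post, not taken from any library. It starts from a = 4 rather than a = 1, since Newton's method can shoot off wildly where f'(x) is close to zero, as it is near x = 1 for this f:

```r
## A minimal Newton-Raphson using the central-difference derivative --
## a sketch of what the newton() in the post might look like.
num.deriv <- function(f, x, h = 0.001) (f(x + h/2) - f(x - h/2))/h
newton <- function(f, tol = 0.0001, N = 20, a = 1) {
  x <- a
  for (i in 1:N) {
    step <- f(x)/num.deriv(f, x)   # classical Newton-Raphson update
    x <- x - step
    if (abs(step) < tol) return(x)
  }
  warning("no convergence in N iterations")
  x
}
f <- function(x) 2*cos(x)^2 + 3*sin(x) + 0.5
root <- newton(f, a = 4)   # a root of f lies near x = 3.78
```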
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 04-Sep-07  Time: 00:10:42
-- XFMail --
Re: [R] Excel
[Apologies if you sometimes get this twice. The first mailing has not
been delivered to the list after more than 10 hours.]

On 31-Aug-07 10:38:07, Jim Lemon wrote:
> Rolf Turner wrote:
>> On 31/08/2007, at 9:10 AM, Antony Unwin wrote:
>>
>>> Erich's more important point
>>> is that you need to speak the language of the people you cooperate
>>> with and often that language includes Excel.
>>
>> So if the people you have to deal with are into astrology you should
>> learn astrology?
>>
> Yes, Rolf, if you want them to understand you. I once wrote a very
> successful parody of those astrology character profiles simply by
> borrowing an acquaintance's book and making dreadfully irreverent use
> of it.
>
> Jim

Now that you mention that, Jim, I feel a POMO coming on.

  if( ! you.know.what.i.mean() ){
    browseURL("http://www.elsewhere.org/pomo")
  } else {
    stop("You know what I mean!")
  }

The structural features of horoscope and astrology texts would readily
lend themselves to implementation in such an engine. Mind you, from my
occasional browsings of these texts, I suspect it has already been done
and is widely used.

Best wishes to all,
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 31-Aug-07  Time: 22:54:32
-- XFMail --
Re: [R] Incomplete Gamma function
in cases like the Incomplete Gamma and Beta functions, where there can
be dispute over what is meant (see above), it is surely wise to spell
it out!

Best wishes to all,
Ted.

>> On 31 Aug 2007, at 00:29, [EMAIL PROTECTED] wrote:
>>
>>> Hello
>>>
>>> I am trying to evaluate an Incomplete gamma function
>>> in R. Library Zipfr gives the Igamma function. From
>>> Mathematica, I have:
>>>
>>> "Gamma[a, z] is the incomplete gamma function."
>>>
>>> In[16]: Gamma[9,11.1]
>>> Out[16]: 9000.5
>>>
>>> Trying the same in R, I get
>>>
>>>> Igamma(9,11.1)
>>> [1] 31319.5
>>> OR
>>>> Igamma(11.1,9)
>>> [1] 1300998
>>>
>>> I know I have to understand the theory and the math
>>> behind it rather than just ask for help, but while I
>>> am trying to do that (and only taking baby steps, I
>>> must admit), I was hoping someone could help me out.
>>>
>>> Regards,
>>>
>>> Kris.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 01-Sep-07  Time: 15:49:57
-- XFMail --
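The ambiguity can in fact be settled inside base R via pgamma(), which gives the regularized incomplete gamma; multiplying by gamma(a) yields the unnormalized versions. Mathematica's Gamma[a, z] is the upper incomplete gamma, and the value the poster got from zipfR's Igamma(9, 11.1) evidently corresponds to the lower one:

```r
## The two conventions for the (unnormalized) incomplete gamma function,
## both expressed through base R's pgamma():
a <- 9; z <- 11.1
upper <- gamma(a) * pgamma(z, a, lower.tail = FALSE)  # approx 9000.5
lower <- gamma(a) * pgamma(z, a)                      # approx 31319.5
c(upper, lower)   # the two tails sum to gamma(9) = 40320
```

The two quoted numbers, 9000.5 and 31319.5, sum to exactly Gamma(9) = 40320, which confirms that the disagreement is purely one of convention.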
Re: [R] Excel (off-topic, sort of)
On 30-Aug-07 14:43:12, Thomas Lumley wrote:
> On Wed, 29 Aug 2007, Alberto Monteiro wrote:
>>
>> Do you know what's in my wish list?
>>
>> I wish spreadsheets and computer languages had gone one
>> step further.
>>
>> I mean, it's nice to define Cell X to be "equal" to
>> Cell Y + 10, and then when we change Cell Y, magically we
>> see Cell X change.
>>
>> But why can't it be the reverse? Why can't I change Cell X
>> _and see the change in Cell Y_?
>
> Well, most statistical calculations are data-reducing, i.e.,
> many-to-one. I don't think you are likely to find a language where
> you can change the confidence limits for the sample mean and have
> the data change in response.

Well, I have seen *people* do this (mentioning no names), and with the
assistance of software (but only using the software in the direction
Data --> CLs). However, there does exist software which is capable of
learning people's behavioural criteria, and reacting according to their
estimated needs, so probably Thomas's speculation is not beyond reach.

Best wishes to all,
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 30-Aug-07  Time: 20:07:03
-- XFMail --
Re: [R] OT: distribution of a pathological random variate
On 29-Aug-07 17:39:17, Horace Tso wrote:
> Folks,
>
> I wonder if anything could be said about the distribution of a random
> variate x, where
>
> x = N(0,1)/N(0,1)
>
> Obviously x is pathological because it could be 0/0. If we exclude
> this point, so the set is {x/(0/0)}, does x have a well defined
> distribution? Or does there exist a distribution that approximates x?
>
> (The case could be generalized of course to N(mu1, sigma1)/N(mu2,
> sigma2) and one still couldn't get away from the singularity.)
>
> Any insight or reference to related discussion is appreciated.
>
> Horace Tso

A good question -- but it has a long-established answer. X has the
Cauchy distribution, whose density function is

  f(x) = 1/(pi*(1 + x^2))

Have a look at ?dcauchy. It is also the distribution of t with 1
degree of freedom. See also ?dt.

You don't need to exclude the point (0,0) explicitly, since it has
zero probability of occurring. But the chance that the denominator
could be small enough to give a very large value of X is quite
perceptible. Try

  X <- rcauchy(1000)
  max(X)

and similar. Play around!

Best wishes,
Ted.

------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 29-Aug-07  Time: 19:02:32
-- XFMail --
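A quick empirical check of the claim (the seed is arbitrary, for reproducibility only; the standard Cauchy has quartiles -1, 0, +1):

```r
## An empirical check that N(0,1)/N(0,1) behaves like a standard Cauchy.
set.seed(42)
X <- rnorm(1e5)/rnorm(1e5)
quantile(X, c(0.25, 0.5, 0.75))   # close to -1, 0, +1
qcauchy(c(0.25, 0.5, 0.75))       # the exact Cauchy quartiles: -1 0 1
```

Comparing quartiles rather than moments is deliberate: the Cauchy has no mean or variance, so mean(X) would tell you nothing stable.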
Re: [R] Interpreting the eigen value of a population matrix
On 28-Aug-07 14:12:22, Marie Anouk Simard wrote:

It would seem (from the headers) that Marie sent her message within a
"TNEF" attachment, which the R-help server has duly stripped off! I
would suggest that she re-sends her message in plain text, so that we
can all read it (and it will not get stripped).

Ted.

> __
> R-help@stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 28-Aug-07  Time: 15:29:12
-- XFMail --
[R] R "old versions" archive anywhere?
Hi Folks,

Does anyone know if there is any archive of older versions of R
packages (apparently there is none on CRAN)? Or is it the case that
not only are old versions on CRAN humanely destroyed, but all older
versions out in the wild are hunted down and shot?

(I can find older versions of the R "core", but not of other packages.)

With thanks,
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 23-Aug-07  Time: 19:37:30
-- XFMail --
Re: [R] How to fit an linear model withou intercept
On 23-Aug-07 09:56:33, Michal Kneifl wrote:
> Please could anyone help me?
> How can I fit a linear model where an intercept has no sense?
> Thanks in advance.
>
> Michael

Presumably you mean "where a non-zero intercept has no sense"?
Suppose the model you want to fit is like Response ~ B+C+X+Y. Then

  [A] lm(Response ~ B+C+X+Y)

will give you an intercept. But:

  [B] lm(Response ~ B+C+X+Y - 1)

will in effect force a zero intercept -- it removes the "Intercept"
term (which is implicit in the form [A]) from the model, thus in
effect giving it coefficient zero.

In "real life", the model at [A] corresponds to the equation

  Response = a*1 + b*B + c*C + x*X + y*Y + error

where a,b,c,x,y are the coefficients, so the "terms in the model" are

  1 , B , C , X , Y

The model at [B] corresponds to

  Response = b*B + c*C + x*X + y*Y + error

so the "terms in the model" are

  B , C , X , Y

-- i.e. the same as before but with the term "1" removed, which is
what is effected by adding the term "-1" to the model formula.

Hoping this helps,
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 23-Aug-07  Time: 12:21:24
-- XFMail --
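A minimal check of the "- 1" device on made-up data:

```r
## Checking that "- 1" removes the intercept term (toy data, arbitrary seed).
set.seed(1)
X <- runif(20); Y <- runif(20)
Response <- 2*X + 3*Y + rnorm(20, sd = 0.1)
fitA <- lm(Response ~ X + Y)       # implicit intercept term
fitB <- lm(Response ~ X + Y - 1)   # intercept forced to zero
names(coef(fitA))   # "(Intercept)" "X" "Y"
names(coef(fitB))   # "X" "Y"
```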
Re: [R] Does anyone.... worth a warning?!? No warning at all
On 20-Aug-07 19:55:44, Rolf Turner wrote:
> On 20/08/2007, at 9:54 PM, Tom Willems wrote:
>> dear Mathew
>>
>> mean is a Generic function
>>
>> mean(x, ...)
>>
>> in which x is a data object, like a data frame, a list,
>> a numeric vector...
>>
>> so in your example it only reads the first character
>> and then reports it.
>>
>> try x = c(1,1,2)
>> mean(x)
>
> I think you've completely missed the point. I'm sure Mathew
> now understands the syntax of the mean function. His point
> was that it would be very easy for someone to use this
> function incorrectly --- and he indicated very clearly *why*,
> by giving an example using max().
>
> If mean() could be made safer to use by incorporating a warning,
> without unduly adding to overheads, then it would seem sensible
> to incorporate such a warning. Or to change the mean()
> function so that mean(1,2,3) returns ``2'' --- just as
> max(1,2,3) returns ``3'' --- as Mathew *initially* (and quite
> reasonably) expected it to do.
>
> cheers,
> Rolf Turner

I think Rolf makes a very important point. There are a lot of
idiosyncrasies in R, which in time we get used to; but learning about
them is something of a "sociological" exercise, just as one learns
that when one's friend A says "X Y Z" it may not mean the same as when
one's friend B says it.

Another example is in the use of %*% for matrix multiplication when
one or both of the factors is a vector. If you came to R from
matlab/octave, where every vector is already either a row vector or a
column vector, you knew where you stood. But in R the semantics of the
syntax depend on the context in a more complicated way.

In R, x <- c(-1,1) is called a "vector", but it does not have
dimensions:

  x <- c(-1,1)
  dim(x)
  NULL

So its relationship to matrix multiplication is ambiguous.
For example:

  M <- matrix(c(1,2,3,4), nrow=2); M
       [,1] [,2]
  [1,]    1    3
  [2,]    2    4

  x %*% M
       [,1] [,2]
  [1,]    1    1

and x is now coerced into a "row vector", which now (for that
immediate purpose) does have dimensions (just as a row vector would
have in matlab/octave). Similarly,

  M %*% x
       [,1]
  [1,]    2
  [2,]    2

coerces it into a column vector. But now (asks the beginner who has
not yet got round to looking up ?"%*%") what happens with x %*% x?
Will we get column vector times row vector (a 2x2 matrix) or row times
column (a scalar)? In fact we get the latter:

  x %*% x
       [,1]
  [1,]    2

All this is in accordance with ?"%*%":

  Description:
  Multiplies two matrices, if they are conformable. If one argument
  is a vector, it will be coerced to either a row or column matrix
  to make the two arguments conformable. If both are vectors it
  will return the inner product.

But now suppose y <- c(1,2,3), with x <- c(-1,1) as before.

  x %*% y
  Error in x %*% y : non-conformable arguments

because it is trying to make the inner product of vectors of unequal
length. Whereas someone who had got as far as the second sentence of
the Description, and did not take the third sentence as strictly
literally as intended, might expect that x would be coerced into a
column, and y into a row, so that they were conformable for
multiplication, giving a 2x3 matrix result (perhaps on the grounds
that "it will return the inner product" means that it will do this if
they are conformable, otherwise doing the coercions described in the
second sentence). That misunderstanding could be avoided if the last
sentence read:

  "If both are vectors it will return the inner product provided both
  are the same length; otherwise it is an error and nothing is
  returned."

Or perhaps x or y should not be called a "vector" -- in linear algebra
people are used to "vector" being another name for a 1-dimensional
matrix, being either "row" or "column". The R entity is not that sort
of thing at all.
The closest that the "R Language Definition" comes to defining it is:

  "Vectors can be thought of as contiguous cells containing
  homogeneous data. Cells are accessed through indexing operations
  such as x[5]."

  x <- matrix(x)

will, of course, turn x into a paid-up column vector (as you might
guess from ?matrix, if you copy the "byrow=FALSE" from the "as.matrix"
explanation to the "matrix" explanation; though in fact that is
irrelevant, since e.g. "byrow=TRUE" has no effect in matrix() -- so in
fact there is no specification in ?matrix as to whether to expect a
row or column result).

Just a few thoughts. As I say, we all get used to this stuff in the
end, but it can be bewildering (and a trap) for beginners.

Best wishes to all,
Ted.

E-Mail: (
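The coercion behaviour discussed above, condensed into a few checkable lines:

```r
## The coercion rules for %*% described in the post, as checkable facts.
x <- c(-1, 1)
M <- matrix(c(1, 2, 3, 4), nrow = 2)
is.null(dim(x))   # TRUE: a "vector" has no dim attribute
dim(x %*% M)      # 1 2 -> x was coerced to a row matrix
dim(M %*% x)      # 2 1 -> x was coerced to a column matrix
x %*% x           # 1x1 matrix holding the inner product, 2
```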
Re: [R] Interweaving cell entries from 2 columns
On 20-Aug-07 09:49:08, jenny wrote:
> Hi, how do I interweave 2 columns of data into a single column? See
> illustration below. Thx-jenny I have 2 columns of data Col1 Col21 >2
> 3 45 6 7 89 10 I want to create a new column with interlocking cell
> centries12345678910
>
> [[alternative HTML version deleted]]

Re-structuring your example as received [please see to your mailer's
formatting!] into what I think you may mean:

  Col1 Col2
     1    2
     3    4
     5    6
     7    8
     9   10

You haven't stated in what form these two columns are available to you
at the point where you want to "merge" them, so I'll give two answers
(there may be others, depending on the source of Col1 and Col2).

1. Suppose they are two columns of a matrix M:

  M
       c1 c2
  [1,]  1  2
  [2,]  3  4
  [3,]  5  6
  [4,]  7  8
  [5,]  9 10

Then:

  as.vector(t(M))
  [1]  1  2  3  4  5  6  7  8  9 10

2. Suppose they are available as separate vectors Col1 and Col2:

  as.vector(rbind(Col1, Col2))
  [1]  1  2  3  4  5  6  7  8  9 10

Best wishes,
Ted.

------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 20-Aug-07  Time: 12:27:54
-- XFMail --
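Both recipes, verified on the example data:

```r
## Interleaving two columns into one vector, both ways from the reply.
M <- cbind(Col1 = c(1, 3, 5, 7, 9), Col2 = c(2, 4, 6, 8, 10))
as.vector(t(M))               # 1 2 3 4 5 6 7 8 9 10
Col1 <- M[, "Col1"]; Col2 <- M[, "Col2"]
as.vector(rbind(Col1, Col2))  # the same interleaved result
```

Both work for the same reason: R stores matrices column-by-column, so transposing (or row-binding) puts the paired values adjacent in storage before as.vector() flattens them.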
Re: [R] an easy way to construct this special matirx
On 16-Aug-07 03:10:17, [EMAIL PROTECTED] wrote:
> Hi,
> Sorry if this is a repost. I searched but found no results.
> I am wondering if it is an easy way to construct the following matrix:
>
> r    1    0    0    0
> r^2  r    1    0    0
> r^3  r^2  r    1    0
> r^4  r^3  r^2  r    1
>
> where r could be any number. Thanks.
> Wen

I dare say there's an even simpler way (and I feel certain someone
will post one ... ); but (example):

  r <- 0.1
  r1 <- r^((-3):0); r2 <- r^(3:0)
  R <- r1 %*% t(r2)
  R
  ##       [,1]  [,2]  [,3] [,4]
  ## [1,] 1.000 10.00 100.0 1000
  ## [2,] 0.100  1.00  10.0  100
  ## [3,] 0.010  0.10   1.0   10
  ## [4,] 0.001  0.01   0.1    1

  R[upper.tri(R)] <- 0
  R
  ##       [,1] [,2] [,3] [,4]
  ## [1,] 1.000 0.00  0.0    0
  ## [2,] 0.100 1.00  0.0    0
  ## [3,] 0.010 0.10  1.0    0
  ## [4,] 0.001 0.01  0.1    1

Best wishes,
Ted.

--------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 16-Aug-07  Time: 09:37:28
-- XFMail --
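And indeed here is one arguably simpler way, building the lower-triangular power matrix directly with outer() over the row and column indices (makeR() is a made-up name):

```r
## Entry (i,j) of the target matrix is r^(i-j) for i >= j, 0 above the
## diagonal, so outer() over the indices does it in one pass.
makeR <- function(r, n = 4) {
  R <- outer(1:n, 1:n, function(i, j) r^(i - j))  # r^(i-j) everywhere
  R[upper.tri(R)] <- 0                            # zero out above the diagonal
  R
}
makeR(0.1)
```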
Re: [R] Covariance of data with missing values.
On 15-Aug-07 21:16:32, Rolf Turner wrote:
>
> I have a data matrix X (n x k, say) each row of which constitutes
> an observation of a k-dimensional random variable which I am willing,
> if not happy, to assume to be Gaussian, with mean ``mu'' and
> covariance matrix ``Sigma''. Distinct rows of X may be assumed to
> correspond to independent realizations of this random variable.
>
> Most rows of X (all but 240 out of 6000+ rows) contain one or more
> missing values.
> [...]

One question, Rolf: How big is k (the number of columns)? If it's
greater than 30, you may have problems with 'norm', since the function
prelim.norm() builds up its image of the places where there are
missing values as "packed integers" with code:

  r <- 1 * is.na(x)
  mdp <- as.integer((r %*% (2^((1:ncol(x)) - 1))) + 1)

i.e. 'r' would be n x k and have 1s where your X had missing values,
0s elsewhere. Then each row of 'r' is converted into a 32-bit integer
whose "1" bits correspond to the 1s in 'r'. You'll get "NA" warnings
if k>30, and things could go wrong! In that case, I hope Chuck's
suggestion works!

Best wishes,
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 16-Aug-07  Time: 00:10:33
-- XFMail --
Re: [R] Covariance of data with missing values.
Hi Rolf!

Have a look at the 'norm' package. This does just what you're asking
for (assuming multivariate normality, and allowing MAR missingness --
i.e. the probability of being missing may depend on observed values,
but must not depend on unobserved ones). Read the documentation for
the various functions *very* carefully!

Drop me a line if you want more info.

Best wishes,
Ted.

On 15-Aug-07 21:16:32, Rolf Turner wrote:
>
> I have a data matrix X (n x k, say) each row of which constitutes
> an observation of a k-dimensional random variable which I am willing,
> if not happy, to assume to be Gaussian, with mean ``mu'' and
> covariance matrix ``Sigma''. Distinct rows of X may be assumed to
> correspond to independent realizations of this random variable.
>
> Most rows of X (all but 240 out of 6000+ rows) contain one or more
> missing values. If I am willing to assume that missing entries are
> missing completely at random (MCAR) then I can estimate the
> covariance matrix Sigma via maximum likelihood, by employing the EM
> algorithm. Or so I believe.
>
> Has this procedure been implemented in R in an accessible form?
> I've had a bit of a scrounge through the searching facilities,
> and have gone through the FAQ, and have found nothing that I could
> discern to be directly relevant.
>
> Thanks for any pointers that anyone may be able to give.
>
> cheers,
>
> Rolf Turner

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 15-Aug-07  Time: 23:17:48
-- XFMail --
Re: [R] Possible to "import" histograms in R?
On 15-Aug-07 18:17:27, Greg Snow wrote:
> Ted Harding wrote:
> [snip]
>
>> So if you want the density plot, you would need to calculate
>> this for yourself. E.g.
>>
>> H1$density <- counts/sum(counts)
>> plot.histogram(H1, freq=FALSE)
>
> shouldn't that be:
>
>   H1$density <- counts/sum(counts)/diff(brkpts)
>
> ?

Of course! Thank you.
Ted.

--------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 15-Aug-07  Time: 20:23:13
-- XFMail --
[R] Negative Binomial: glm.nb
Hi Folks,

I'm playing with glm.nb() in MASS. Reference: the negative binomial
distribution

  P(y) = (Gamma(theta+y)/(Gamma(theta)*y!)) * (p^theta) * (1-p)^y
  y = 0,1,2,...

in the notation of the MASS book (Section 7.4), where

  p = theta/(mu + theta)  so  (1-p) = mu/(mu + theta)

and mu is the expected value of Y. It seems from ?glm.nb that an
initial value of theta is either supplied, or obtained by a method of
moments after an initial Poisson fit to the data; then it alternates
back and forth: given theta, glm is applied to fit the linear model,
with log link, for mu; then theta is re-estimated, then the linear
model, and so on. I hope I've understood right!

However, this procedure is based on the assumption that the same value
of theta applies across the board, regardless of the values of any
covariates or factors.

The MASS book (Section 7.4) illustrates the use of glm.nb() on the
Quine data (numbers of days absence from school in a school year by
Australian children). There are four factors in addition to the
"outcome" Days:

  Eth: Ethnicity (Aboriginal "A" / Non-Aboriginal "N")
  Lrn: Learning Ability (Slow Learner "SL", Average Learner "AL")
  Age: Age Group (Primary "F0", First/Second/Third Forms "F1/2/3")
  Sex: Gender ("M"/"F")

This dataset is earlier analysed thoroughly in the MASS book (Section
6.8), primarily with reference to transforming Days by a Box-Cox
transformation. The negative binomial does not play a role in that
part. When it comes to glm.nb(), however, the assumption of "constant
theta" is implicit in Section 7.4. But, if you look at the data, this
does not seem all that reasonable.

Preamble: Get the variables out of 'quine' into convenient names.
  library(MASS)
  D <- quine$Days; E <- quine$Eth; L <- quine$Lrn
  A <- quine$Age;  S <- quine$Sex
  D.A <- D[E=="A"]; L.A <- L[E=="A"]; A.A <- A[E=="A"]; S.A <- S[E=="A"]
  D.N <- D[E=="N"]; L.N <- L[E=="N"]; A.N <- A[E=="N"]; S.N <- S[E=="N"]

Now, if you look at the histograms of D.A and D.N separately:

  hist(D.A, breaks=-0.5+4*(0:22))
  hist(D.N, breaks=-0.5+4*(0:22))

you can see that there is a major difference in shape, such as would
arise if theta > 1 for Eth=="A", and theta < 1 for Eth=="N". This is
confirmed by a separate covariate-free fit of the mean to each group,
from which the fitted theta is returned:

  summary(glm.nb(D.A ~ 1)) --> Theta: 1.499  Std. Err.: 0.260
  summary(glm.nb(D.N ~ 1)) --> Theta: 0.919  Std. Err.: 0.159

  (1.499-0.919)/sqrt(0.260^2 + 0.159^2)
  ## [1] 1.903113
  1-pnorm(1.903113)
  ## [1] 0.0285129

so we have a 1-sided P-value of 0.029, a 2-sided of 0.057. That's a
fairly strong indication that theta is not the same for the two
groups; and the shapes of the histograms show that the difference has
an important effect on the distribution. Of course, this dataset is
notoriously unbalanced, so maybe this is a reflection of the mixtures
of categories.

A simple linear model on each subgroup eases the above situation a
bit:

  summary(glm.nb(D.A ~ S.A+A.A+L.A)) --> Theta: 1.767  Std. Err.: 0.318
  summary(glm.nb(D.N ~ S.N+A.N+L.N)) --> Theta: 1.107  Std. Err.: 0.201

  1-pnorm((1.767-1.107)/sqrt(0.318^2+0.201^2)) = 0.0397

and throwing everything into the pot eases it a bit more:

  summary(glm.nb(D.A ~ S.A*A.A*L.A)) --> Theta: 2.317  Std. Err.: 0.438
  summary(glm.nb(D.N ~ S.N*A.N*L.N)) --> Theta: 1.581  Std. Err.: 0.318

  1-pnorm((2.317-1.581)/sqrt(0.438^2+0.318^2)) = 0.0870

But, while the theta-values have now moved well up into the ">1" range
(where they have less effect on the shape of the distribution),
nevertheless the differences of the "A" and "N" estimates have not
changed much:

  0.580 --> 0.660 --> 0.736

even though their P-values have eased off somewhat (and, with 69 and
77 data for the "A" and "N" groups respectively, we don't have huge
power here, I think).

I've gone through the above in detail, since it's an illustration
which you can all work through yourselves in R, to illustrate the
issue about variation of theta in a negative binomial model, when you
want to get at the effects of covariates on the outcome Y (the counts
of events). In real life I'm considering other data where the "theta"
effect is more marked, and probably more important.

Which finally leads on to my QUERY: Is there anything
implement{ed|able} in R which can address the issue that theta may
vary according to factors or covariates, as well as the level of mu
given theta? Or is one reduced to running glm.nb() (or the like) on
subsets?
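As a sanity check on the "theta varies" point, a small simulation (all numbers made up, merely chosen to mimic the A/N contrast) shows that separate glm.nb() fits do recover group-specific thetas:

```r
## Two simulated groups with the same mean but different true thetas,
## recovered by separate intercept-only glm.nb() fits.
library(MASS)
set.seed(123)                          # arbitrary seed, for reproducibility
n  <- 1000
yA <- rnbinom(n, size = 2.0, mu = 15)  # "A"-like group: larger theta
yN <- rnbinom(n, size = 0.9, mu = 15)  # "N"-like group: smaller theta
thA <- glm.nb(yA ~ 1)$theta
thN <- glm.nb(yN ~ 1)$theta
c(thA, thN)   # near 2.0 and 0.9 respectively
```

A single pooled glm.nb() fit of c(yA, yN) would be forced to compromise on one theta for both groups, which is exactly the restriction the query is about.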
Re: [R] Possible to "import" histograms in R?
On 15-Aug-07 08:30:08, Nick Chorley wrote:
> Hi,
>
> I have a large amount of data that I would like to create a
> histogram of and plot and do things with in R. It is pretty
> much impossible to read the data into R, so I have written a
> program to bin the data and now have a list of counts in each
> bin. Is it possible to somehow import this into R and use
> hist(), so I can, for instance, plot the probability density?
> I have looked at the help page for hist(), but couldn't find
> anything related to this there.
>
> Regards,
>
> Nicky Chorley

Presumably you now have (or can readily generate) files on your system whose contents are (or are equivalent to) something like:

brkpts.dat:  0.0 0.5 1.0 ... 9.5 10.0
counts.dat:  10 7 38 ... 7 0

where there is one more line in brkpts.dat than in counts.dat. Now simply read both files into R, creating variables 'brkpts', 'counts'. Now create a histogram template (any silly old data will do):

H1 <- hist(c(1,2))

Next, attach your variables to it:

H1$breaks <- brkpts
H1$counts <- counts

and you have your histogram in R. Also, you can use the data in the variables 'brkpts', 'counts' to feed into any other procedure which can accept data in this form. Example (simulating the above in R):

brkpts <- 0.5*(0:20)
counts <- rpois(20, 7.5)
H1 <- hist(c(1,2))
H1$breaks <- brkpts
H1$counts <- counts
plot(H1)

Or, if you want a more "realistic-looking" one, follow on with:

midpts <- (brkpts[1:20]+brkpts[2:21])/2
counts <- rpois(20, 100*dnorm(midpts, mean=5, sd=3))
H1$breaks <- brkpts
H1$counts <- counts
plot(H1)

In other words, you've already done R's work for it with your program which bins the data. All you need to do in R is to get these results into the right place in R. Hoping this helps, Ted.
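[Editor's sketch of the reading step: scan() is the natural tool for plain files of whitespace-separated numbers. The file names are the hypothetical ones from the example above.]

```r
# read the pre-binned results into R; scan() returns a numeric vector
brkpts <- scan("brkpts.dat")
counts <- scan("counts.dat")

# sanity check: one more breakpoint than there are bins
stopifnot(length(brkpts) == length(counts) + 1)
```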
E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 15-Aug-07 Time: 10:38:05 -- XFMail --
Re: [R] Possible to "import" histograms in R?
On 15-Aug-07 10:15:13, Nick Chorley wrote:
>> [...]
>> Now create a histogram template (any silly old data will do):
>>
>> H1 <- hist(c(1,2))
>>
>> Next, attach your variables to it:
>>
>> H1$breaks <- brkpts
>> H1$counts <- counts
>>
>> and you have your histogram in R. Also, you can use the data
>> in the variables 'brkpts', 'counts' to feed into any other
>> procedure which can accept data in this form.
>
> This is precisely what I wanted to do, except I didn't realise
> that you could assign to the variables in the histogram object
> like this.
> Normally when constructing the histogram, I'd use hist(x, prob=T)
> to plot the probability density against x, but obviously if you're
> just assigning values to the variables in the object, you can't do
> that. I tried putting "prob=T" in the call to hist when making the
> dummy object, but that didn't help.

Followup: The histogram object I constructed in the example from my previous reply contains the following:

H1
$breaks
 [1]  0.0  0.5  1.0  1.5  2.0  2.5  3.0  3.5  4.0  4.5  5.0
[12]  5.5  6.0  6.5  7.0  7.5  8.0  8.5  9.0  9.5 10.0
$counts
 [1]  3  6  3  7  7  8 19 10 16 12 16 12 13 12 11 13 11  7  3  6
$intensities
[1] 0.992 1.000
$density
[1] 0.992 1.000
$mids
[1] 1.25 1.75
$xname
[1] "c(1, 2)"
$equidist
[1] TRUE

All these things are calculated when hist() is called for raw data. The "$breaks" and "$counts" are what I assigned to it with

H1$breaks <- brkpts
H1$counts <- counts

"$intensities", "$density", "$mids", "$xname" and "$equidist" are what was set up by the initial call H1 <- hist(c(1,2)). Previously, I used "plot(H1)" to illustrate that the above assignments put the breakpoints and the counts into the histogram object. For a more thorough approach, you need to use plot.histogram() instead. Here you can, in fact, set the "prob=TRUE" condition in the plot by setting "freq=FALSE" in the call to plot.histogram(). But then it will look at "$density" for the values to plot.
So if you want the density plot, you would need to calculate this for yourself, dividing by the bin widths as well as the total count so that the densities integrate to 1. E.g.

H1$density <- counts/(sum(counts)*diff(brkpts))
plot.histogram(H1, freq=FALSE)

And so on ... there are many relevant details in the help pages ?hist and ?plot.histogram. Best wishes, Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 15-Aug-07 Time: 11:53:14 -- XFMail --
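[Editor's sketch pulling the thread together: a complete "histogram" object can be built from pre-binned data in one step, without the dummy hist() call. Assumes equal-width bins; the simulated brkpts/counts are the ones from the example above.]

```r
# build a "histogram" object directly from pre-binned data
brkpts <- 0.5*(0:20)
counts <- rpois(20, 7.5)
H <- structure(list(breaks   = brkpts,
                    counts   = counts,
                    density  = counts/(sum(counts)*diff(brkpts)),
                    mids     = (brkpts[-1] + brkpts[-length(brkpts)])/2,
                    xname    = "binned data",
                    equidist = TRUE),
               class = "histogram")
plot(H, freq = FALSE)   # dispatches to plot.histogram(); plots the densities
```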
Re: [R] Linear Regression with slope equals 0
On 14-Aug-07 11:36:37, [EMAIL PROTECTED] wrote:
>
> Hi there, am trying to run a linear regression with a slope of 0.
>
> I have a dataset as follows
>
>  t   d
>  1 303
>  2 302
>  3 304
>  4 306
>  5 307
>  6 303
>
> I would like to test the significance that these points would lie on a
> horizontal straight line.
>
> The standard regression lm(d~t) doesn't seem to allow the slope to be
> set.

The model d~1 will fit a constant (the mean), i.e. a regression with slope = 0. The model d~t will fit the usual linear regression. The two can be compared with anova(), as well as getting the details of the individual fits with summary(). E.g. (with your example; note that 207 below is a mistyping of 307, which all the subsequent output reflects):

d <- c(303,302,304,306,207,303)
t <- c(1,2,3,4,5,6)
lm0 <- lm(d~1); lm1 <- lm(d~t); anova(lm0,lm1)
## Analysis of Variance Table
## Model 1: d ~ 1
## Model 2: d ~ t
##   Res.Df    RSS Df Sum of Sq      F Pr(>F)
## 1      5 7785.5
## 2      4 6641.4  1    1144.1 0.6891 0.4531

summary(lm0)
## Call: lm(formula = d ~ 1)
## Residuals:
##     1     2     3     4     5     6
##  15.5  14.5  16.5  18.5 -80.5  15.5
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)   287.50      16.11   17.85 1.01e-05 ***
## Residual standard error: 39.46 on 5 degrees of freedom

mean(d)
## [1] 287.5

summary(lm1)
## Call: lm(formula = d ~ t)
## Residuals:
##       1       2       3       4       5       6
##  -4.714   2.371  12.457  22.543 -68.371  35.714
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  315.800     37.934   8.325  0.00114 **
## t             -8.086      9.740  -0.830  0.45314
## Residual standard error: 40.75 on 4 degrees of freedom
## Multiple R-Squared: 0.147, Adjusted R-squared: -0.0663
## F-statistic: 0.6891 on 1 and 4 DF, p-value: 0.4531

The P-value for the slope in lm1 is the same as the P-value returned by anova(). If you want to force a particular non-zero slope (e.g. s0) for comparison with the data, you can use lm0 <- lm(d - s0*t ~ 1), compared with lm1 <- lm(d - s0*t ~ t), for instance. Hoping this helps, Ted.
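[Editor's sketch of the closing suggestion, made concrete: shifting the response turns "slope = s0" into "slope = 0", so the same anova() comparison applies. Uses the poster's data as originally given (with 307); s0 is an arbitrary hypothesised slope.]

```r
# test whether the slope equals a hypothesised value s0
d  <- c(303, 302, 304, 306, 307, 303)
t  <- 1:6
s0 <- 0.5                    # hypothesised slope (any fixed value)
lm0 <- lm(d - s0*t ~ 1)      # null model: slope exactly s0
lm1 <- lm(d - s0*t ~ t)      # alternative: slope free
anova(lm0, lm1)              # F test of slope == s0
```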
---- E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 14-Aug-07 Time: 13:16:05 -- XFMail --
Re: [R] Help with matrices
On 10-Aug-07 18:05:50, Lanre Okusanya wrote:
> Hello all,
>
> I am working with a 1000x1000 matrix, and I would like to return a
> 1000x1000 matrix that tells me which value in the matrix is greater
> than a threshold value (1 or 0 indicator).
> I have tried
>     mat2 <- as.matrix(as.numeric(mat1>0.25))
> but that returns a 1000000x1 matrix.
> I have also tried for loops, but they are grossly inefficient.
>
> Thanks for all your help in advance.
>
> Lanre

Simple-minded, but:

> S <- matrix(rnorm(25), nrow=5)
> S
           [,1]        [,2]       [,3]       [,4]       [,5]
[1,] -0.9283624 -0.44418487  1.1174555  1.9040999 -0.4675796
[2,]  0.2658770 -0.28492642 -1.2271013 -0.5713291  1.8036235
[3,]  0.7010885 -0.42972262  0.7576021  0.3407972 -1.0628487
[4,] -0.2003087  0.87006841  0.6233792 -0.9974902 -0.9104270
[5,]  0.2729014  0.09781886 -1.0004486  1.5987385 -0.4747125
> T <- 0*S
> T[S>0.25] <- 1 + 0*S[S>0.25]
> T
     [,1] [,2] [,3] [,4] [,5]
[1,]    0    0    1    1    0
[2,]    1    0    0    0    1
[3,]    1    0    1    1    0
[4,]    0    1    1    0    0
[5,]    1    0    0    1    0

Does this work OK for your big matrix? HTH Ted.

------------ E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 10-Aug-07 Time: 19:50:37 -- XFMail --
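[Editor's note: the indicator matrix can also be produced in one vectorised step. Comparison on a matrix returns a logical matrix of the same dimensions, so no template matrix is needed.]

```r
mat1 <- matrix(rnorm(1000*1000), nrow = 1000)
mat2 <- (mat1 > 0.25) * 1   # logical matrix coerced to 0/1, dims preserved
dim(mat2)                   # still 1000 x 1000
```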
[R] Subject: Re: how to include bar values in a barplot?
ful" means "thinking about it" -- one thing that spreadsheets inhibit, because a) Even if you do think about it, you're not going to find it easy to implement the results of your thoughts (if they're any good); b) Spreadsheets readily induce the naive (especially beginning) user into the habit of trusting that the writers of spreadsheet software have thought through all those nasty implementation technicalities and have created an "expert system" which looks after drawing the graph according to best practice and with all necessary sophistication. Look! Isn't it clever!! This habit, once (all too easily) acquired, is difficult to kick. Patrick Burns's deliberate use of "addiction" is apt. Best wishes, Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 08-Aug-07 Time: 22:40:19 -- XFMail --
[R] Test (please ignore)
Please excuse this -- I need to test whether I can get through to R-help! (Have failed repeatedly today). Ted.
Re: [R] Interaction factor and numeric variable versus separate
On 07-Aug-07 15:34:13, Gabor Grothendieck wrote:
> In the single model all three levels share the same intercept which
> means that the slope must change to accommodate it
> whereas in the three separate models they each have their own
> intercept.

I think this arose because of the formulation of the "model with interaction" as:

summary(lm(y~x:f, data=d))

If it had been formulated as

summary(lm(y~x*f, data=d))

there would be three separate intercepts, and three different slopes (and the differences would be the same as the differences for the separate models). Ted.

> Try looking at it graphically and note how the black dotted lines
> are all forced to go through the same intercept, i.e. the same point
> on the y axis, whereas the red dashed lines are each able to
> fit their portion of the data using both the intercept and the slope.
>
> y.lm <- lm(y~x:f, data=d)
> plot(y ~ x, d, col = as.numeric(d$f), xlim = c(-5, 20))
> for(i in 1:3) {
>   abline(a = coef(y.lm)[1], b = coef(y.lm)[1+i], lty = "dotted")
>   abline(lm(y ~ x, d[as.numeric(d$f) == i,]), col = "red", lty = "dashed")
> }
> grid()
>
> On 8/7/07, Sven Garbade <[EMAIL PROTECTED]> wrote:
>> Dear list members,
>>
>> I have problems to interpret the coefficients from a lm model
>> involving the interaction of a numeric and factor variable compared
>> to separate lm models for each level of the factor variable.
>>
>> ## data:
>> y1 <- rnorm(20) + 6.8
>> y2 <- rnorm(20) + (1:20*1.7 + 1)
>> y3 <- rnorm(20) + (1:20*6.7 + 3.7)
>> y <- c(y1,y2,y3)
>> x <- rep(1:20,3)
>> f <- gl(3,20, labels=paste("lev", 1:3, sep=""))
>> d <- data.frame(x=x, y=y, f=f)
>>
>> ## plot
>> # xyplot(y~x|f)
>>
>> ## lm model with interaction
>> summary(lm(y~x:f, data=d))
>>
>> Call:
>> lm(formula = y ~ x:f, data = d)
>>
>> Residuals:
>>     Min      1Q  Median      3Q     Max
>> -2.8109 -0.8302  0.2542  0.6737  3.5383
>>
>> Coefficients:
>>             Estimate Std.
>> Error t value Pr(>|t|)
>> (Intercept)  3.68799    0.41045    8.985 1.91e-12 ***
>> x:flev1      0.20885    0.04145    5.039 5.21e-06 ***
>> x:flev2      1.49670    0.04145   36.109  < 2e-16 ***
>> x:flev3      6.70815    0.04145  161.838  < 2e-16 ***
>> ---
>> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
>>
>> Residual standard error: 1.53 on 56 degrees of freedom
>> Multiple R-Squared: 0.9984, Adjusted R-squared: 0.9984
>> F-statistic: 1.191e+04 on 3 and 56 DF, p-value: < 2.2e-16
>>
>> ## separate lm fits
>> lapply(by(d, d$f, function(x) lm(y ~ x, data=x)), coef)
>> $lev1
>> (Intercept)           x
>>  6.77022860 -0.01667528
>>
>> $lev2
>> (Intercept)        x
>>    1.019078 1.691982
>>
>> $lev3
>> (Intercept)        x
>>    3.274656 6.738396
>>
>> Can anybody give me a hint why the coefficients for the slopes
>> (especially for lev1) are so different and how the coefficients from
>> the lm model with interaction are related to the separate fits?
>>
>> Thanks, Sven

E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 07-Aug-07 Time: 17:01:33 -- XFMail --
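[Editor's sketch making the relation explicit: with y ~ x*f, the per-level intercepts and slopes recover the separate fits. Reuses Sven's simulated data; the seed is an editor's assumption (his post did not set one), so the numbers will differ from his, but the coefficient identities hold.]

```r
set.seed(1)  # assumption: any seed; the original post did not set one
y1 <- rnorm(20) + 6.8
y2 <- rnorm(20) + (1:20*1.7 + 1)
y3 <- rnorm(20) + (1:20*6.7 + 3.7)
d  <- data.frame(x = rep(1:20, 3), y = c(y1, y2, y3),
                 f = gl(3, 20, labels = paste("lev", 1:3, sep="")))

cf <- coef(lm(y ~ x*f, data = d))
# lev1: intercept cf["(Intercept)"],             slope cf["x"]
# lev2: intercept cf["(Intercept)"]+cf["flev2"], slope cf["x"]+cf["x:flev2"]
# lev3: intercept cf["(Intercept)"]+cf["flev3"], slope cf["x"]+cf["x:flev3"]
lapply(split(d, d$f), function(s) coef(lm(y ~ x, data = s)))  # same numbers
```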
Re: [R] R and excell differences in calculation F distributionfu
On 07-Aug-07 10:27:06, Luis Ridao Cruz wrote:
> R-help,
>
> I'm trying to work out the density function by doing:
>
>> df(5.22245, 3, 22)
> [1] 0.005896862
>
>> df(15.20675, 6, 4)
> [1] 0.001223825
>
> In Excel the result is:
>
> 0.0071060464 <--> FDIST(5.22245,3,22)
> 0.011406     <--> FDIST(15.20675,6,4)
>
> From my point of view the differences in the second case
> are substantial.
>
> Can anyone give me a hint on this?
>
> Thanks in advance

It would seem that Excel is giving you the upper tail of the F-distribution (cumulative distribution, not density), i.e. what R would calculate as

pf(5.22245, 3, 22, lower.tail=FALSE)
[1] 0.007106046
pf(15.20675, 6, 4, lower.tail=FALSE)
[1] 0.0114

so in effect using "pf" rather than "df". If you want the density function (corresponding to "df") from Excel, then that's not something I know how to do (nor want to know ...). Best wishes, Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 07-Aug-07 Time: 11:47:52 -- XFMail --
Re: [R] About grep
On 07-Aug-07 04:20:27, Shao wrote:
> Hi, everyone.
>
> I have a problem when using grep.
> For example:
> a <- c("aa","aba","abac")
> b <- c("ab","aba")
>
> I want to match the whole word, so
> grep("^aba$", a)
> it returns 2
>
> but when I used it in a more useful way:
> grep("^b[2]$", a)
> it doesn't work at all, it can't find it, returning integer(0).
>
> How can I change the format in the second way?
>
> Thanks.
>
> --
> Shao

The problem is that in the string "^b[2]$" the element b[2] of b is not evaluated; the successive characters ^ b [ 2 ] $ are simply passed to grep as a character string, which of course is not found. You can construct a character string with the value of b[2] in it by using paste():

grep(paste("^", b[2], "$", sep=""), a)
[1] 2

Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 07-Aug-07 Time: 07:43:14 -- XFMail --
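[Editor's note: for whole-string matching there are also alternatives that avoid building a regular expression by hand, sketched here with the same data.]

```r
a <- c("aa", "aba", "abac")
b <- c("ab", "aba")

which(a == b[2])             # positions of exact whole-string matches
match(b[2], a)               # position of the first exact match
grep(b[2], a, fixed = TRUE)  # substring match, regex metacharacters ignored
```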
Re: [R] data analysis
On 06-Aug-07 19:26:59, lamack lamack wrote:
> Dear all, I have a factorial design where the
> response is an ordered categorical response.
>
> treatment (two levels: 1 and 2)
> time (four levels: 30, 60, 90 and 120)
> ordered response (0,1,2,3)
>
> could someone suggest a correct analysis or some references?

For your data below, I would be inclined to start from here, which gives the counts for the different responses:

                  Response
  Trt    Time    0   1   2   3  | Total
  Tr1      30    0   1   0   3  |   4
           60    0   2   1   1  |   4
           90    0   3   1   0  |   4
          120    0   3   1   0  |   4
  Tr2      30    0   2   0   2  |   4
           60    0   3   1   0  |   4
           90    0   3   0   1  |   4
          120    1   2   1   0  |   4
  -------------------------------------
  Tr1 total      0   9   3   4  |  16
  Tr2 total      1  10   2   3  |  16

This suggests that, if anything is happening there at all, it is a tendency for high response to occur at shorter times, and low response at longer times, with little if any difference between the treatments. To approach this formally, I would consider adopting a "re-randomisation" approach: re-allocating the outcomes at random in such a way as to preserve the marginal totals, and evaluating a statistic T, defined in terms of the counts and such as to be sensitive to the kind of effect you seek. Then situate the value of T obtained from the above counts within the distribution of T obtained by this re-randomisation. There must be, somewhere in R, routines which can perform this kind of constrained re-randomisation, but I'm not sufficiently familiar with that area of R to know for sure about them. I hope other readers who know about this area in R can come up with suggestions! Best wishes, Ted.
> subject treatment time response
>       1         1   30        3
>       2         1   30        3
>       3         1   30        1
>       4         1   30        3
>       5         1   60        3
>       6         1   60        1
>       7         1   60        1
>       8         1   60        2
>       9         1   90        2
>      10         1   90        1
>      11         1   90        1
>      12         1   90        1
>      13         1  120        2
>      14         1  120        1
>      15         1  120        1
>      16         1  120        1
>      17         2   30        3
>      18         2   30        3
>      19         2   30        1
>      20         2   30        1
>      21         2   60        1
>      22         2   60        2
>      23         2   60        1
>      24         2   60        1
>      25         2   90        1
>      26         2   90        1
>      27         2   90        1
>      28         2   90        3
>      29         2  120        1
>      30         2  120        2
>      31         2  120        0
>      32         2  120        1

E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 07-Aug-07 Time: 00:30:19 -- XFMail --
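[Editor's sketch of the re-randomisation suggestion above, using base R's r2dtable(), which samples two-way tables with the given margins. The statistic T here is a simple chi-squared-style distance from independence -- one arbitrary choice among many; the 2x4 table is the treatment-by-response totals from the reply.]

```r
# response counts (0,1,2,3) by treatment, from the totals above
tab <- matrix(c(0,  9, 3, 4,
                1, 10, 2, 3), nrow = 2, byrow = TRUE)

T.stat <- function(m) {                        # distance from independence
  e <- outer(rowSums(m), colSums(m)) / sum(m)  # expected counts (all > 0 here)
  sum((m - e)^2 / e)
}

set.seed(42)
sim <- sapply(r2dtable(5000, rowSums(tab), colSums(tab)), T.stat)
mean(sim >= T.stat(tab))                       # Monte-Carlo P-value
```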
Re: [R] test the significances of two regression lines
On 06-Aug-07 10:32:50, Luis Ridao Cruz wrote:
> R-help,
>
> I'm trying to test the significance of two regression lines,
> i.e. the significance of the slopes from two samples
> originated from the same population.
>
> Is it correct if I fit a linear model for each sample and
> then test the slope significance with 'anova'. Something like this:
>
> lm1 <- lm(Y ~ a1 + b1*X)    # sample 1
> lm2 <- lm(Y ~ a2 + b2*X)    # sample 2
>
> anova(lm1, lm2)

No, this will not work. From "?anova":

  Warning: The comparison between two or more models will only be
  valid if they are fitted to the same dataset.

which is not the case in your example. One way to proceed is to merge the two datasets, and introduce a factor which identifies the dataset. For example:

x1 <- rnorm(100); x2 <- rnorm(100)
y1 <- 0.2 + 0.1*x1 + 0.05*rnorm(100)
y2 <- 0.2 + 0.12*x2 + 0.05*rnorm(100)
x <- c(x1,x2)
y <- c(y1,y2)
S <- factor(c(rep(0,100), rep(1,100)))
lm12 <- lm(y ~ x*S)

First look at the fit of y1~x1:

summary(lm(y1~x1))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.206042   0.004647   44.34   <2e-16 ***
x1          0.091382   0.004768   19.16   <2e-16 ***

Then the fit of y2~x2:

summary(lm(y2~x2))
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.208216   0.005171   40.26   <2e-16 ***
x2          0.118840   0.005009   23.73   <2e-16 ***

so the estimated slopes differ by 0.118840 - 0.091382 = 0.027458. But what is the "significance" of this difference? Now:

summary(lm12)
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 0.206042   0.004923  41.852  < 2e-16 ***
x           0.091382   0.005052  18.088  < 2e-16 ***
S1          0.002174   0.006953   0.313 0.754926
x:S1        0.027457   0.006939   3.957 0.000106 ***

so the "x:S1" value is the same as the difference in slopes as estimated from lm1 and lm2; but now we have a standard error and a P-value for it.
You can also use anova now:

anova(lm12)
Response: y
           Df  Sum Sq Mean Sq  F value    Pr(>F)
x           1 2.26537 2.26537 946.2702 < 2.2e-16 ***
S           1 0.00015 0.00015   0.0614 0.8045253
x:S         1 0.03749 0.03749  15.6599 0.0001060 ***
Residuals 196 0.46922 0.00239

so you get the same P-value, though with anova() you do not see the actual estimate of the difference between the slopes. Hoping this helps, Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 06-Aug-07 Time: 12:16:06 -- XFMail --
[R] Any "special interest" in R/pic interface?
a) 'pic' code to specify the basic primitives of the plot (analogous to the "define bar{...}" in my histogram example)

b) Layout information

c) Access to the numerical (and any textual) information which determines the position and value of plotted objects

d) Export to a file which can then be readily edited to "tune" the appearance of the result to the user's satisfaction (as in the choice of size and style for the printed histogram counts).

I'm well aware of R's xfig() device, which exports to XFig files which the xfig program can in turn export to 'pic' language. However, the resulting 'pic' code is horrendous, and essentially impossible to edit for a diagram of any complexity (it's near impossible for a mere scatter-plot of 2 variables). The 'pic' code written by a human is much more transparent and easy to edit (see my histogram example), and is the kind of objective I have in mind. I envisage that in complex plots produced by R, such as the above xyplot, one can obtain the generic information one needs by

XY <- xyplot()
str(XY)

then extracting XY$whatever, and wrapping this stuff in 'pic' code. Looking forward to hearing if anyone is interested in joining in an exploration of this territory! Best wishes to all, Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 05-Aug-07 Time: 13:01:40 -- XFMail --
Re: [R] Invert Likert-Scale Values
On 04-Aug-07 22:02:33, William Revelle wrote:
> Alexis and John,
>
> To reverse a Likert-like item, subtract the item from the maximum
> acceptable value + the minimum acceptable value.
> That is, if
> x <- 1:8
> xreverse <- 9 - x
>
> Bill

A few of us have suggested this, but Alexis's welcome for the recode() suggestion indicates that by the time he gets round to this his Likert scale values have already become levels of a factor. Levels "1", "2", ... of a factor may look like integers, but they're not; and R will not let you do arithmetic on them:

> x <- factor(c(1,1,1,2,2,2))
> x
[1] 1 1 1 2 2 2
Levels: 1 2
> y <- (3 - x)
Warning message:
"-" not meaningful for factors in: Ops.factor(3, x)
> y
[1] NA NA NA NA NA NA

However, you can turn them back into integers, reverse, and then turn the results back into a factor:

> y <- factor(3 - as.integer(x))
> y
[1] 2 2 2 1 1 1
Levels: 1 2

So, even for factors, the insight underlying our suggestion of "-" is still valid! :) Best wishes, Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 05-Aug-07 Time: 00:09:58 -- XFMail --
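[Editor's sketch of an alternative that stays within factor land: reversing the level labels maps each value to its mirror, with no integer conversion.]

```r
x <- factor(c(1, 1, 1, 2, 2, 2), levels = 1:2)
y <- x
levels(y) <- rev(levels(y))  # old level "1" relabelled "2" and vice versa
y                            # 2 2 2 1 1 1 (levels now listed as 2, 1)
```

For a 7-point scale this reverses 1<->7, 2<->6, and so on, in one step.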
Re: [R] - round() strange behaviour
On 04-Aug-07 21:57:28, John Logsdon wrote:
> I must admit I had never realised this so thanks to Monica for
> raising it.
>
> Round-to-even is used for statistical reasons and there is some
> point as it obviously reduces bias. But why should the decision
> be to round to the nearest even number? Rounding to the nearest
> odd number would be equally valid. (signif(x,0) is the same as
> round().)

A fair point! But see below.

> There is a debate and sometimes you need symmetric rounding.
> Perhaps there should be an option in round that defaults to
> round-to-even for compatibility but includes the various other
> rounding approaches as seen for example in
> http://en.wikipedia.org/wiki/Rounding

As Wikipedia says: "Round-to-even is used rather than round-to-odd as the latter rule would prevent rounding to a result of zero."

And there's also stochastic rounding -- toss a penny. But you wouldn't get the same answer next time. And I can recall -- many years ago -- being in an environment where the practice was to round alternately up and down. (These were engineers.)

John's comparisons between FORTRAN, octave and (implicitly) R are interesting. As a general reflection: any method of rounding involves throwing away some of the information in the data, and this will have consequences. So the question is: what consequences are you happiest with?

One consequence of rounding to nearest even (or nearest odd) is that this can give the worst possible outcome in terms of (rounded X - rounded Y) compared with (unrounded X - unrounded Y). For example, rounding to even integer,

X = 1.5 --> 2.0 and Y = 0.5 --> 0.0

gives X - Y = 1.0 but rounded X - rounded Y = 2.0, whereas the difference is 1.0 if you always round up, or always down, and this is also exactly the difference between the unrounded values.
Rounding to nearest odd would give X = 1.5 --> 1.0, Y = 0.5 --> 1.0, thus difference 0.0 in this case, but again difference 2.0 for X = 2.5 --> 3.0, Y = 1.5 --> 1.0. And alternate rounding would give either 2.0 or 0.0 depending on the phase of X and Y.

Thus "always up" or "always down" means that the difference between rounded numbers is always the same as between the unrounded numbers, while for the other methods the differences can differ by as much as 2.0. This may or may not matter, but it is something to think about when choosing a rounding method.

> [...]

Best wishes to all, Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 04-Aug-07 Time: 23:53:50 -- XFMail --
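[Editor's note: R's round-half-to-even behaviour is easy to check directly. Integer-plus-0.5 values are exactly representable, so the ties below are genuine; for other decimals, floating-point representation can make results look inconsistent, as ?round warns.]

```r
round(c(0.5, 1.5, 2.5, 3.5))  # 0 2 2 4 -- ties go to the even neighbour
```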
Re: [R] Invert Likert-Scale Values
On 04-Aug-07 16:42:23, John Kane wrote:
> Will ?recode in the car package do what you want?
> x <- 1:4
> recode(x, "1='4';2='3';3='2';4='1'")

Is there a problem with just using

New <- (8 - Old)

?? Ted.

> --- Alexis Delevett <[EMAIL PROTECTED]> wrote:
>
>> Hi!
>>
>> I am using R to process some community survey data.
>> Several item responses are recorded via a 7-point
>> Likert-Scale. As I have coded the responses, 1
>> represents high agreement, and 7 high disagreement.
>> This of course impacts the coefficients in a linear
>> regression (for example agreement to self-perception
>> measures on housing satisfaction). For some
>> purposes, in order to make the coefficients more
>> accessible to the reader, I would like to invert the
>> item values, i.e. to arrive at 1 for high
>> disagreement, and 7 for high agreement (such that
>> the linear regression would express something like
>> "the higher the agreement on A, the greater the B").
>>
>> Is there an already existing function for this, or
>> do I use a custom replace loop in R?
>>
>> Thank you, Alexis

E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 04-Aug-07 Time: 18:06:38 -- XFMail --
Re: [R] question about logistic models (AIC)
On 03-Aug-07 16:42:33, Kyle. wrote:
> Tom:
>
> That's a good question. AIC, as I'm sure you know, is usually
> calculated as 2k - 2ln(L), where k is the number of free parameters
> in the model, and L is the likelihood of the model. The goal of
> AIC--and the reason one normally tries to select a model with
> minimal AIC--is to minimize the Kullback-Leibler distance between
> your data's density and the theoretical "true" density as defined
> by the model. More concisely, the AIC is an index of the amount of
> information regarding your data that is lost when your model is
> used to describe it. To get back to your question, I can't say
> without a little more information why the AICs you're referring to
> are negative (but perhaps it's an issue of using the log-likelihood
> instead of the negative log-likelihood), but my first instinct is
> that it doesn't matter. I would go with the AIC that is closest to
> zero. I hope that helps.

That could be wrong! Don't forget that ln(L) is indeterminate to within an additive constant (for given data), so one man's negative AIC could be another man's positive AIC -- but if each compared their AICs with different models (the same different models for each) then they should get the same *difference* of AIC.

The correct approach is to compare AICs on the basis of algebraic difference: AIC1 - AIC2 in comparing Model1 with Model2. If this is positive then Model2 is better than Model1 (on the AIC criterion). Taking "the AIC that is closest to zero" would be the wrong way round for negative AICs. Ted.

> On Aug 3, 2007, at 8:41 AM, Tom Willems wrote:
>
>> Dear fellow R-users,
>> I have a question about the Akaike Information Criterion in the R
>> output.
>> Normally you would want it to be as small as possible, yet when I
>> check up the information in my study books, the AIC is usually
>> displayed as a negative number. In the exercises it is given as
>> " - AIC ".
>> R displays it as a positive number, does this mean that a large "AIC"
>> gives a small " - AIC ", so the bigger the better?
>>
>> Kind regards,
>> Tom.
>>
>> Tom Willems
>> CODA-CERVA-VAR
>> Department of Virology
>> Epizootic Diseases Section
>> Groeselenberg 99
>> B-1180 Ukkel
>>
>> Tel.: +32 2 3790522
>> Fax: +32 2 3790666
>> E-mail: [EMAIL PROTECTED]
>>
>> www.var.fgov.be

E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 03-Aug-07 Time: 18:31:51 -- XFMail --
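[Editor's sketch of the point about differences, using any pair of fits to the same data -- here the built-in cars dataset; the quadratic term is an arbitrary illustrative choice.]

```r
fit1 <- lm(dist ~ speed, data = cars)
fit2 <- lm(dist ~ speed + I(speed^2), data = cars)

# only the difference matters: a positive value favours fit2 (lower AIC);
# the sign of the individual AICs is irrelevant to the comparison
AIC(fit1) - AIC(fit2)
```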
Re: [R] problem with reading data files with different numbers o
On 02-Aug-07 21:14:20, Tom Cohen wrote:
> Dear List,
>
> I have 30 data files with different numbers of lines (31 and 33) that
> I want to skip before reading the files. If I use the skip option I can
> only choose either to skip 31 or 33 lines. The data files with 31 lines
> have no blank rows between the lines and the header row. How can I read
> the files without manually checking which files have 31 respectively 33
> lines? The only text line I want to keep is the header.
>
> Thanks for your help,
> Tom
>
> for (i in 1:num.files) {
>   a <- read.table(file=data[i],
>                   header=T, skip=31, sep='\t', na.strings="NA")
> }

Apologies, I misunderstood your description in my previous response (I thought that the total number of lines in one of your files was either 31 or 33, and you wanted to know which was which). I now think you mean that there are either 0 (you want to skip 31) or 2 (you want to skip 33) blank lines in the first 33, and then you want the remainder (as well as the header). Though it's still not really clear ...

You can find out how many blank lines there are in the first 33 with

sum(readLines("~/00_junk/temp.tr", 33) == "")

and then choose how many lines to skip. Best wishes, Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 03-Aug-07 Time: 00:11:21 -- XFMail --
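[Editor's sketch combining the two ideas into one loop: pick the skip count per file from the blank lines among its first 33 lines. num.files and the 'data' vector of file names are the poster's objects from the code above.]

```r
# choose skip = 31 or 33 per file, depending on its blank lines
read.one <- function(fname) {
  nblank <- sum(readLines(fname, n = 33) == "")   # 0 or 2 expected
  read.table(fname, header = TRUE, skip = 31 + nblank,
             sep = "\t", na.strings = "NA")
}
results <- lapply(data[1:num.files], read.one)
```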
Re: [R] problem with reading data files with different numbers o
On 02-Aug-07 21:14:20, Tom Cohen wrote: > Dear List, > > I have 30 data files with different numbers of lines (31 and 33) that > I want to skip before reading the files. If I use the skip option I can > only choose either to skip 31 or 33 lines. The data files with 31 lines > have no blank rows between the lines and the header row. How can I read > the files without manually checking which files have 31 respectively 33 > lines ? The only text line I want to keep is the header. > > Thamks for your help, > Tom > > > for (i in 1:num.files) { >a<-read.table(file=data[i], > ,header=T,skip=31,sep='\t',na.strings="NA") > > } If you're using a Unix/Linux system, you have the little command "wc" which can count characters or words or lines in a file. For example, I'm working at the moment on a file "mkSim.R" which has 53 lines. So, in R: > system("wc -l mkSim.R") 53 mkSim.R Hence the following returns the line-count as an integer: > as.integer(substring(system("wc -l mkSim.R",intern=TRUE),1,7)) [1] 53 (which will also work for files with hundreds, thousands, ... of lines, since the units digit is at position 7). Hoping this helps, Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 02-Aug-07 Time: 23:51:59 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
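For portability (no "wc" on a stock Windows system), the same count can be had in plain R; a sketch on the same example file:

```r
# Line count without any external command: read the lines and count them.
# (Fine for moderate files; for huge files the wc approach avoids holding
# every line in memory at once.)
n.lines <- length(readLines("mkSim.R"))
n.lines  # 53 for the file discussed above (assuming a final newline)
```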
Re: [R] t-distribution
On 01-Aug-07 19:18:05, [EMAIL PROTECTED] wrote: > Well, is "t = 1.11" all that accurate in the first place? :-) > > In fact, reading beween the lines of the original enquiry, what the > person probably wanted was something like > > ta <- pt(-1.11, 9) + pt(1.11, 9, lower.tail = FALSE) > > which is the two-sided t-test tail area. > > The teller of the parable will usually leave some things unexplained... > > Bill. However: "Those who have ears to hear, let them hear!" Ted. ------------ E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 01-Aug-07 Time: 20:43:53 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
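For the record: since the t distribution is symmetric, the two tail areas are equal, so Bill's two-sided area can be written more compactly. A quick sketch:

```r
# Two equivalent ways of computing the two-sided tail area for t = 1.11
# on 9 degrees of freedom:
ta <- pt(-1.11, 9) + pt(1.11, 9, lower.tail = FALSE)
tb <- 2 * pt(-abs(1.11), 9)
all.equal(ta, tb)  # TRUE
```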
Re: [R] write.csv
On 29-Jul-07 17:41:58, Dong GUO ¹ù¶« wrote: > Hi, > > I want to save an array(say, array[6,7,8]) write a cvs file. > How can I do that??? can I write in one file? > > if I could not write in one file, i want to use a loop to save > in different files (in the array[6,7,8], should be 8 csv files), > such as the filename structure should be: > file ="filename" +str(i)+"." +"csv" > > Many thanks. The following (illustrated on a smaller array) may help: A<-array((1:60),dim=c(3,4,5)) D<-dim(A) write.table(t(dim(A)),file="A.csv",sep=",", quote=FALSE,row.names=FALSE,col.names=FALSE) Z<-as.vector(A) write.table(t(Z),file="A.csv",append=TRUE,sep=",", quote=FALSE,row.names=FALSE,col.names=FALSE) Then the file A.csv contains two rows: 3,4,5 1,2,3,4,5,6,7,8,9,10,11,12, ... ,55,56,57,58,59,60 of which the first gives the dimension, the second the data for the array. You can then reconstruct another array, say B, as: dimB<-scan(file="A.csv",sep=",",nlines=1) dataB<-scan(file="A.csv",sep=",",skip=1,nlines=1) B<-array(dataB,dimB) That's a hard way to do things, perhaps, but since you said you wanted the array as a CSV file, this is one way to do that. Since a CSV text file is essentially a two-dimensional object (fields in rows, by rows), to store a higher-dimensional object in such a format you have to include the "meta-data" about the structure -- in this case the list of dimensions. Note, however, that, although it is a CSV file, read.csv("A.csv",header=FALSE) will not work nicely, since it will give you two rows of equal length, the first one padded out with (in this case 57) NAs, which you will then have to clean up; which you can do, but by the time you've done it you might as well have done it the above way! Hoping this helps, Ted. 
E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 30-Jul-07 Time: 22:24:12 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] generating symmetric matrices
On 28-Jul-07 03:28:25, Gregory Gentlemen wrote: > Greetings, > > I have a seemingly simple task which I have not been able to solve > today. I want to construct a symmetric matrix of arbtriray size w/o > using loops. The following I thought would do it: > > p <- 6 > Rmat <- diag(p) > dat.cor <- rnorm(p*(p-1)/2) > Rmat[outer(1:p, 1:p, "<")] <- Rmat[outer(1:p, 1:p, ">")] <- dat.cor > > However, the problem is that the matrix is filled by column and so the > resulting matrix is not symmetric. > > I'd be grateful for any adive and/or solutions. > > Gregory Would the fact that A + t(A) is symmetric be useful here? E.g.

p <- 6
A <- matrix(rnorm(p^2),ncol=p)
A <- (A + t(A))/sqrt(2)
diag(A) <- rep(1,p)
round(A,digits=2)
      [,1]  [,2]  [,3]  [,4]  [,5]  [,6]
[1,]  1.00  0.53 -0.20  1.27  0.34  0.83
[2,]  0.53  1.00 -0.99 -0.72  0.68 -1.21
[3,] -0.20 -0.99  1.00 -0.62 -0.36 -0.87
[4,]  1.27 -0.72 -0.62  1.00  2.40  0.33
[5,]  0.34  0.68 -0.36  2.40  1.00  0.20
[6,]  0.83 -1.21 -0.87  0.33  0.20  1.00

(Here, because each off-diagonal element of A is the sum of 2 independent N(0,1)s, divided by sqrt(2), the result is also N(0,1)). However, whether this is really useful for you depends on what you want the elements of A to be! Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 30-Jul-07 Time: 20:00:55 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
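If the off-diagonal values should be exactly the drawn `dat.cor` numbers (rather than sums of two normals), another route is to fill one triangle and mirror it; a sketch along the lines of the original attempt:

```r
# Fill the lower triangle column-by-column, then copy its transpose into
# the upper triangle, so symmetry holds by construction.
p <- 6
Rmat <- diag(p)
Rmat[lower.tri(Rmat)] <- rnorm(p * (p - 1) / 2)
Rmat[upper.tri(Rmat)] <- t(Rmat)[upper.tri(Rmat)]
isSymmetric(Rmat)  # TRUE
```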
Re: [R] 2nd R Console
On 28-Jul-07 08:19:10, Michael Janis wrote: > Hi, > > I was reading a thread: [R] "2nd R console" and had a similar > question regarding having more than one R console open at a time. > However, my question differs from that of the thread: > > Is it possible, or is there a wrapper that will allow one, to open > an arbitrary number of R consoles which access the same R session > (all objects in that session, etc.). This would be R on linux > accessed through a shell - kind of like using GNU screen multi-user > such that people could work collaboratively on a given session. > The problem with screen is that all commands are interleaved in > the same terminal, which is confusing and does not allow access > to the command prompt at the same time, rather it would be sequential. > I know there will be "why" questions but it is useful in an > academic environment. Basically we have a memory machine for large > genomic analysis - and we could set that up as an Rserver, but this > placing R into a multi-user engine is better suited for our immediate > needs. Does anybody have thoughts on this? You could have a look at XMX (X-protocol Multiplexer). This is now rather old, and I think it has not been further developed for many years. http://www.cs.brown.edu/software/xmx/README.patch7 Basically, you would start XMX from one machine (A) networked to others (B,C,...), all running X-windows. In the startup, you designate which machines are to share the session, and what rights they will have. On machine A, and all designated machines B, C, ... , appears a window. This in fact acts like a screen emulator, with its own X session inside it. The A user then starts up a program (say R) in this window, and what A sees is mirrored on the other machines. If A has granted input privileges to the other machines, then users of B, C, ... can do things themselves, and the effects of their actions are mirrored to all the other machines. 
Thus, for instance, it would be possible for different users to take turns at entering commands into the R session, and so on, from their own machines. This is a much better way of arranging things, than having all the people queue up to take turns at sitting in the one and only chair! It is certainly useful in a classroom situation, and even in a one-to-one tutorial. For instance, I've used it in the past to teach programming and system management, sitting more or less beside the other person but at different machines. We would both, for instance, be editing the same program file, and either could initiate a program run with the current file. (Of course, what's technically called a "race condition" can develop here ... ). I've even envisaged the scenario where two people, one in England and one in the US, could simultaneously be editing a joint paper (you'd also need an independent communication channel as well, but that's no problem using a "chat" program). The one constraint with XMX (at least in the past, I'm not sure to what extent it may have been relaxed) which can limit its use is that it depends on fairly close compatibility between the X resources of all the different machines. It's best if they're identical (same screen resolution e.g. 1024x768, same colour depths on all screens, ... ). Otherwise it's liable to not establish the necessary connections, and only some machines can join in. However, in a computing lab environment, it may well be that all machines are compatible. Suck it and see! Hoping this is useful, Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 30-Jul-07 Time: 19:27:48 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] the large dataset problem
On 30-Jul-07 11:40:47, Eric Doviak wrote: > [...] Sympathies for the constraints you are operating in! > The "Introduction to R" manual suggests modifying input files with > Perl. Any tips on how to get started? Would Perl Data Language (PDL) be > a good choice? http://pdl.perl.org/index_en.html I've not used SIPP files, but it seems that they are available in "delimited" format, including CSV. For extracting a subset of fields (especially when large datasets may stretch RAM resources) I would use awk rather than perl, since it is a much lighter program, transparent to code for, efficient, and it will do that job. On a Linux/Unix system (see below), say I wanted to extract fields 1, 1000, 1275, ..., 5678 from a CSV file. Then the 'awk' line that would do it would look like awk 'BEGIN{FS=","}{print $(1) "," $(1000) "," $(1275) "," ... $(5678)}' < sippfile.csv > newdata.csv Awk reads one line at a time, and does with it what you tell it to do. It will not be overcome by a file with an enormous number of lines. Perl would be similar. So long as one line fits comfortably into RAM, you would not be limited by file size (unless you're running out of disk space), and operation will be quick, even for very long lines (as an experiment, I just set up a file with 10,000 fields and 35 lines; awk output 6 selected fields from all 35 lines in about 1 second, on the 366MHz 128MB RAM machine I'm on at the moment. After transferring it to a 733MHz 512MB RAM machine, it was too quick to estimate; so I duplicated the lines to get a 363-line file, and now got those same fields out in a bit less than 1 second. So that's over 300 lines/second, some 20,000 lines a minute, a million lines in under an hour; and all on rather puny hardware.). In practice, you might want to write a separate script which would automatically create the necessary awk script (say if you supply the field names, having already coded the field positions corresponding to the field names). 
You could exploit R's system() command to run the scripts from within R, and then load in the filtered data. > I wrote a script which loads large datasets a few lines at a time, > writes the dozen or so variables of interest to a CSV file, removes > the loaded data and then (via a "for" loop) loads the next few lines > I managed to get it to work with one of the SIPP core files, > but it's SLW. See above ... > Worse, if I discover later that I omitted a relevant variable, > then I'll have to run the whole script all over again. If the script worked quickly (as with awk), presumably you wouldn't mind so much? Regarding Linux/Unix versus Windows. It is general experience that Linux/Unix works faster, more cleanly and efficiently, and often more reliably, for similar tasks; and can do so on low grade hardware. Also, these systems come with dozens of file-processing utilities (including perl and awk; also many others), each of which has been written to be efficient at precisely the repertoire of tasks it was designed for. A lot of Windows software carries a huge overhead of either cosmetic dross, or a pantechnicon of functionality of which you are only going to need 0.01% at any one time. The Unix utilities have been ported to Windows, long since, but I have no experience of using them in that environment. Others, who have, can advise! But I'd seriously suggest getting hold of them. Hoping this helps, Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 30-Jul-07 Time: 18:24:41 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
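The awk filter can be driven from inside R via system(); a sketch, in which the file names and field numbers are stand-ins for the real ones:

```r
# Build the awk program from a vector of wanted field numbers, run it,
# then read the (much smaller) result back into R.
fields <- c(1, 1000, 1275, 5678)
prog <- paste0("BEGIN{FS=OFS=\",\"} {print ",
               paste0("$", fields, collapse = ", "), "}")
system(paste("awk", shQuote(prog), "< sippfile.csv > newdata.csv"))
newdata <- read.csv("newdata.csv", header = FALSE)
```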
Re: [R] Slightly OT - use of R
On 30-Jul-07 08:28:15, John Logsdon wrote: > I am trying to get a measure of how R compares in usage as a > statistical platform compared to other software. I would guess > it is the most widely used among statisticians at least by > virtue of it being open source. > > But is there any study to which I can refer? By asking this > list I am not exactly adopting a rigorous approach! I don't know about that -- my own expectation would be that serious users of R are likely to be subscribers to the list. So maybe a good answer to your question would be the number of subscribers (which I'm sure Martin Maechler can find out). Of course, some people will have subscribed under more than one email address, so that would somewhat over-estimate the number of people who subscribe. But it can be traded off (to a somewhat unknown extent) against R users who do not subscribe. More to the point, though, is what you mean by "usage". If you simply mean "people who use", that's a matter of counting (one way or another). But there's "use" and "use". There's a lot of what I call "SatNav Statistics" being done, and I would guess that "SatNav statisticians" tend to go for the commercial products, since these have bigger and brighter displays, and the more mellifluous and reassuring voice-overs. (And never mind that the voice instructs you to turn left, at the level-crossing, onto the railway line). Most serious R users, I tend to think, are more likely to pull into a layby and unfold large-scale maps. And, when the need arises, they will get out and push. So, in "widely used among statisticians", it depends on what you mean by "statisticians". Where you will probably get extra value from the R list is that many of our people will have extensive and very professional experience, not only with R, but with many of the other available packages, and be best placed to provide serious and thoughtful comparisons. Best wishes, Ted. 
E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 30-Jul-07 Time: 10:18:21 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Fitting exponential curve to data points
On 24-Jul-07 01:09:06, Andrew Clegg wrote: > Hi folks, > > I've looked through the list archives and online resources, but I > haven't really found an answer to this -- it's pretty basic, but I'm > (very much) not a statistician, and I just want to check that my > solution is statistically sound. > > Basically, I have a data file containing two columns of data, call it > data.tsv: > > year count > 1999 3 > 2000 5 > 2001 9 > 2002 30 > 2003 62 > 2004 154 > 2005 245 > 2006 321 > > These look exponential to me, so what I want to do is plot these > points on a graph with linear axes, and add an exponential curve over > the top. I also want to give an R-squared for the fit. > > The way I did it was like so: > > ># Read in the data, make a copy of it, and take logs > data = read.table("data.tsv", header=TRUE) > log.data = data > log.data$count = log(log.data$count) > ># Fit a model to the logs of the data > model = lm(log.data$count ~ year, data = log.data) > ># Plot the original data points on a graph > plot(data) > ># Draw in the exponents of the model's output > lines(data$year, exp(fitted(model))) > > > Is this the right way to do it? log-ing the data and then exp-ing the > results seems like a bit of a long-winded way to achieve the desired > effect. Is the R-squared given by summary(model) a valid measurement > of the fit of the points to an exponential curve, and should I use > multiple R-squared or adjusted R-squared? > > The R-squared I get from this method (0.98 multiple) seems a little > high going by the deviation of the last data point from the curve -- > you'll see what I mean if you try it. I just did. From the plot of log(count) against year, with the plot of the linear fit of log(count)~year superimposed, I see indications of a non-linear relationship. The departures of the data from the fit follow a rather systematic pattern. Initially the data increase more slowly than the fit, and lie below it. Then they increase faster and cross over above it. 
Then the data increase less fast than the fit, and the final data point is below the fit. There are not enough data to properly identify the non-linearity, but the overall appearance of the data plot suggests to me that you should be considering one of the "growth curve" models. Many such models start off with an increasing rate of growth, which then slows down, and typically levels off to an asymptote. The apparent large discrepancy of your final data point could be compatible with this kind of behaviour. At this point, knowledge of what kind of thing is represented by your "count" variable might be helpful. If, for instance, it is the count of the numbers of individuals of a species in an area, then independent knowledge of growth mechanisms may help to narrow down the kind of model you should be trying to fit. As to your question about "Is this the right way to do it" (i.e. fitting an exponential curve by doing a linear fit of the logarithm), generally speaking the answer is "Yes". But of course you need to be confident that "exponential" is the right curve to be fitting in the first place. If it's the wrong type of curve to be considering, then it's not "the right way to do it"! Hoping this helps, Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 24-Jul-07 Time: 10:08:33 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
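If a growth-curve model is worth a try, R's self-starting logistic makes the fit almost as short as the exponential one. A sketch on the posted data (convergence is not guaranteed with so few points):

```r
# Logistic growth fit via nls(); SSlogis supplies its own starting values
# for the asymptote (Asym), midpoint (xmid) and scale (scal).
dat <- data.frame(year  = 1999:2006,
                  count = c(3, 5, 9, 30, 62, 154, 245, 321))
fit <- nls(count ~ SSlogis(year, Asym, xmid, scal), data = dat)
plot(dat$year, dat$count)
lines(dat$year, fitted(fit))
```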
Re: [R] using R for a reaction-time experiment
On 22-Jul-07 16:38:02, Seth Roberts wrote: > > I want to use R to run a reaction-time experiment: > Something appears on the screen, I respond by typing something > (one keystroke), the system measures the speed of my response. > R would be great for this if only I didn't have to hit Enter > to enter that keystroke. I am doing such experiments now but > they require two actions per trial: hit keystroke, hit Enter. > > Is there some way that R can be made to respond to a single > keystroke (other than Enter)? What operating system are you using? If it's Linux, there would be ways to detect the keystroke (say "A") and immediately send "A\n" (i.e. "A" followed by "enter") to a FIFO which R could be watching. Then R would receive your single keystroke as if you had followed it by pressing "Enter". If you're using Windows, then unfortunately I haven't a clue. Best wishes, Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 22-Jul-07 Time: 18:11:43 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
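A sketch of the FIFO idea on Linux ("/tmp/keyfifo" is a hypothetical path, created beforehand with `mkfifo`; some separate process must push each keystroke, followed by a newline, into it):

```r
# R blocks on the FIFO until a keystroke arrives, then measures the
# elapsed time since the stimulus was presented.
con <- fifo("/tmp/keyfifo", open = "r", blocking = TRUE)
t0  <- proc.time()["elapsed"]
key <- readLines(con, n = 1)       # returns as soon as one line arrives
rt  <- proc.time()["elapsed"] - t0
close(con)
```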
Re: [R] Can I test if there are statistical significance between
On 21-Jul-07 12:46:46, zhijie zhang wrote: > Dear Uwe Ligges, [restructuring the given layout of the data; the bracketed column pools Grp2 and Grp3]

          Grp1   Grp2  [2+3]   Grp3 |  Sum
Better      16      0  [  1]      1 |   17
Good        71      4  [ 10]      6 |   81
Bad         37     61  [118]     57 |  155
------------------------------------+-----
           124     65  [129]     64 |  253

> My hypothesis is if the three groups,that is group1, group2,and > group3, have the same distributions on coloumns? If not, which > one is difference from which one? It is clear that there is no discernible difference between Group 2 and Group 3. If they are pooled together (totals in [...] above), it is also clear that there is a very large difference between [Group 2 + Group 3], or either separately, and Group 1. In summary, and in round figures, Groups 2 and 3 have about 90% "Bad", about 10% "Good" and hardly any "Better". Group 1 has only about 30% "Bad", about 60% "Good", and 10% "Better". Formally:

A <- cbind(c(16,71,37), c(1,10,118))
A
     [,1] [,2]
[1,]   16    1
[2,]   71   10
[3,]   37  118

chisq.test(A)
        Pearson's Chi-squared test
data:  A
X-squared = 101.4434, df = 2, p-value < 2.2e-16

cbind(round(chisq.test(A)$expected,1), A)
     [,1] [,2] [,3] [,4]
[1,]  8.3  8.7   16    1
[2,] 39.7 41.3   71   10
[3,] 76.0 79.0   37  118

B <- cbind(c(0,4,61), c(1,6,57))
B
     [,1] [,2]
[1,]    0    1
[2,]    4    6
[3,]   61   57

chisq.test(B, simulate.p.value=TRUE)
        Pearson's Chi-squared test with simulated p-value (based on 2000 replicates)
data:  B
X-squared = 1.5279, df = NA, p-value = 0.3215

fisher.test(B)
        Fisher's Exact Test for Count Data
data:  B
p-value = 0.4247
alternative hypothesis: two.sided

That, with possible refinements for more careful statements about the proportions in the three outcome classes, is about all one can say about these data; and it is definite enough! With the totally non-committal P-value for Group 2 vs Group 3, and the absolutely decisive P-value for Group 1 vs Groups 2&3, there is no need whatever to bother with "multiple comparison" complications. Best wishes, Ted. 
> On 7/20/07, Uwe Ligges <[EMAIL PROTECTED]> wrote:
>> zhijie zhang wrote:
>> > Dear friends,
>> > My R*C table is as follow:
>> >
>> >          better  good  bad
>> > Goup1        16    71   37
>> > Group2        0     4   61
>> > Group3        1     6   57
>> >
>> > Can I test if there are statistical significant between Group1
>> > and Group2, Group2 and Group3, Group1 and Group2, taking into the
>> > multiple comparisons?
>>
>> So what is you hypothesis? Statistical significance of what it to be tested?
>>
>> Uwe Ligges
>>
>> > The table can be set up using the following program:
>> > a<-matrix(data=c(16,71,37,0,4,61,1,6,57),nrow=3,byrow=TRUE)
>> > Thanks very much.
>
> --
> With Kind Regards,
> Zhi Jie, Zhang, PHD
> Tel: 86-21-54237149
> Dept. of Epidemiology, School of Public Health, Fudan University
> Address: No. 138 Yi Xue Yuan Road, Shanghai, China
> Postcode: 200032
> Email: [EMAIL PROTECTED]
> Website: www.statABC.com

E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 21-Jul-07 Time: 14:51:46 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
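Should all three pairwise comparisons nevertheless be wanted, they can be done explicitly with a multiplicity correction; a sketch using Holm-adjusted Fisher exact tests on the posted table:

```r
# Pairwise 2x3 Fisher tests between the three groups, with Holm
# adjustment for the three comparisons.
a <- matrix(c(16, 71, 37,  0, 4, 61,  1, 6, 57), nrow = 3, byrow = TRUE,
            dimnames = list(paste0("Group", 1:3),
                            c("better", "good", "bad")))
pairs <- combn(3, 2)                      # the three group pairings
p.raw <- apply(pairs, 2, function(ij) fisher.test(a[ij, ])$p.value)
p.adjust(p.raw, method = "holm")
```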
Re: [R] automatically jpeg output
On 20-Jul-07 15:34:00, Ding, Rebecca wrote: > Dear R users, > > I used R to draw many histograms and I would like to automatically save > them into a jpeg file. I tried the following code since I know .ps file > could be saved like this way: > > postscript("AYA_ELA.jpeg",horizontal=F,onefile=T) > ..#some funtions inside here > dev.off() > > There was a jpeg file, however, there is no pictures inside. Any > suggestion? > > Thanks. > > Rebecca If your "some functions inside here" do draw histograms, then there will indeed be pictures inside, but they will be in PostScript, since that is what is created when you use the postscript() device. The fact that you used ".jpeg" in the filename has nothing to do with it -- it will simply store the output, as PostScript, in the file whose name you give to it. It trusts you. You would have seen the PostScript pictures if you had used a PostScript viewer which reacts to the content of the files, whereas presumably your system expects a file whose name ends in ".jpeg" or ".jpg" to be a JPEG file; and therefore will use the wrong interpreter and fail to recognise the picture (as you have just proved). If you want to create JPEG files in a similar way, use the jpeg() function -- read ?jpeg for details. Best wishes, Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 20-Jul-07 Time: 17:07:20 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
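A minimal sketch of the jpeg() route (the file name is a stand-in; the "%03d" in it numbers the output files when more than one plot is drawn):

```r
# Each new page (here, each histogram) goes to its own numbered file.
jpeg("AYA_ELA_%03d.jpeg")
hist(rnorm(100))
hist(runif(100))
dev.off()
# Produces AYA_ELA_001.jpeg and AYA_ELA_002.jpeg
```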
Re: [R] presence/absence matrix
On 20-Jul-07 14:16:39, [EMAIL PROTECTED] wrote: > I have a table such that: > > sample # species > 1a > 1b > 2a > 2c > 3b > > and i would like to build a matrix with species names (i.e. a b and c) > as field names and samples (i.e. 1,2 and 3) as row names > and 1/0 as inner values > such as: > a b c > 1 1 1 0 > 2 1 0 1 > 3 0 1 0 > > I am currently using Rcmdr package for managing datasets but need a > function about I tried to use stack function but only with worst > results The following generates the sort of thing you show above: ## Identities of possible species and samples Species<-c("a","b","c","d") Samples<-c(1,2,3,4,5,6) ## Generate a set of samples and corresponding species: species<-sample(Species,20,replace=TRUE) samples=sample(Samples,20,replace=TRUE) ## Have a look: cbind(samples,species) samples species [1,] "5" "c" [2,] "2" "c" [3,] "4" "a" [4,] "5" "b" [5,] "5" "d" [6,] "1" "d" [7,] "2" "b" [8,] "2" "b" [9,] "3" "b" [10,] "2" "d" [11,] "4" "d" [12,] "1" "b" [13,] "3" "b" [14,] "2" "c" [15,] "2" "d" [16,] "6" "b" [17,] "5" "a" [18,] "4" "a" [19,] "2" "b" [20,] "5" "a" ## Now generate a table: T<-table(samples,species) T species samples a b c d 1 0 1 0 1 2 0 3 2 2 3 0 2 0 0 4 2 0 0 1 5 2 1 1 1 6 0 1 0 0 Now this is a "table" structure, and also gives counts of occurrences and not merely presence/absence. So we need to get these out as a matrix. U<-as.matrix(T) U species samples a b c d 1 0 1 0 1 2 0 3 2 2 3 0 2 0 0 4 2 0 0 1 5 2 1 1 1 6 0 1 0 0 U<-1*(U>0) U species samples a b c d 1 0 1 0 1 2 0 1 1 1 3 0 1 0 0 4 1 0 0 1 5 1 1 1 1 6 0 1 0 0 Is that what you want? Best wishes, Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 20-Jul-07 Time: 16:14:14 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
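Starting from the original two-column table itself (rather than simulated data), the whole thing collapses to one line; a sketch with the posted example hard-coded:

```r
# The sample/species pairs exactly as given in the question.
df <- data.frame(sample  = c(1, 1, 2, 2, 3),
                 species = c("a", "b", "a", "c", "b"))
# table() counts occurrences; "> 0" turns counts into presence/absence,
# and multiplying by 1 converts the logicals to 0/1.
M <- 1 * (table(df$sample, df$species) > 0)
M
```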
Re: [R] print tables to figure in R
On 16-Jul-07 22:11:31, Steve Powers wrote: > Doe anyone know of a way to print nice-looking, printer-friendly tables > directly in the R plot window? This would be very handy for examining > preliminary results from frequently used routines. I've hunted for this > with no luck. All anyone seems to do is export text files to excel for > formatting, or use the print(bla) function which just calls the numbers > at the command line. There must be a way to do this.---steve It would certainly be possible to write a function to include a table within a plot, since a table is no more than bits of text laid out positionally in an array, and there is already provision to place text at assigned positions on a plot -- the function text() does this. That being said, though, writing the code to lay out a particular table in the particular way you want would not be trivial. I don't know of an R function which implements a reasonable "default layout" for arbitrary tables in a plot. As an example of "doing it with your bare hands", run the following: x<-0.5*(0:20);y<-(x^2)/10;plot(x,y,pch="+",col="blue",ylim=c(-10,10)) lines(x,y,col="blue") A<-c(0,2.5,5,7.5,10); B<-(A^2)/10;points(A,B,pch=2,col="red") Ach<-as.character(round(A,digits=1)) Bch<-as.character(round(B,digits=3)) text(x=c(6.25,8.0),y=c(0,0),labels=c("A","B"),pos=4) for(i in (1:5)){text(x=6.125,y=(-1-i),labels=Ach[i],pos=4)} for(i in (1:5)){text(x=7.875,y=(-1-i),labels=Bch[i],pos=4)} You could augment this with grid lines, etc. according to taste. Best wishes, Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 17-Jul-07 Time: 09:23:33 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] logical operators priority
On 17-Jul-07 06:59:11, Delphine Fontaine wrote: > Dear R-users, > > I haven't found rules for logical operators. I need to select data > according the following rule: > Condition A & (Condition B | Condition C) > How should I write it? Is Condition A & Condition B | Condition C > correct or will R execute (Condition A & Condition B) | Condition C ? > > Thanks for your help. > > Delphine Fontaine ?Syntax will tell you about the precedence for operators. In particular you will find: '!' (negation), '&' and '&&' (and), '|' and '||' (or), indicating that "!" takes precedence over "&" and "&&", which take precedence over "|" and "||". With equal precedence, evaluation is from left to right. This shows that if you write A & B | C then "A & B" will be evaluated first, so the expression is equivalent to (A & B) | C. In any case, there is nothing whatever wrong with using "(...)" for grouping. It is always accepted, it forces the precedence you want, and furthermore (most important) you yourself can see exactly what will happen and will be much less likely to make a mistake. So write your condition in your code *explicitly* as ConditionA & (ConditionB | ConditionC) You can see what's happening, and R will do exactly what you want. Best wishes Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 17-Jul-07 Time: 09:34:18 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
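A three-line demonstration that the grouping really matters (the values are chosen so the two readings differ):

```r
A <- FALSE; B <- TRUE; C <- TRUE
A & B | C    # TRUE:  parsed as (A & B) | C
A & (B | C)  # FALSE: the parentheses change the result
```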
Re: [R] Errors in data frames from read.table
On 16-Jul-07 14:13:09, Pat Carroll wrote: > Hello, all. > > I am working on a project with a large (~350Mb, about 5800 rows) > insurance claims dataset. It was supplied in a tilde(~)-delimited > format. I imported it into a data frame in R by setting memory.limit to > maximum (4Gb) for my computer and using read.table. I had a similar problem put to me some time back, and eventually solved it by going in with a scalpel. It turned out that there was a problem with muddling "End-of-Line" with field delimiter in creating the file. And the file came out of Excel in the first place ... (did yours?). Quite why Excel made this particular mess of it remains a mystery. I note that your file size is "350Mb" and "about 5800 rows". Doing some arithmetic on that: 350 * 1024 * 1024 = 367,001,600 bytes 367001600/5800 = 63276.14 bytes per row. This (given your "about"s) looks to me dangerously close to 65536 = 64K, and this may be a limit on what Excel can handle? Just a thought ... Ted. > The resulting data frame had 10 bad rows. The errors appear due to > read.table missing delimiter characters, with multiple data being > imported into the same cell, then the remainder of the row and the next > run together and garbled due to the reading frame shift (example: a > single cell might contain: ~ ~ ~, after which all > the cells of the row and the next are wrong). > > To replicate, I tried the same import procedure on a smaller > demographics data set from the same supplier- only about 1Mb, and got > the same kinds of errors (5 bad rows in about 3500). I also imported as > much of the file as Excel would hold and cross-checked, Excel did not > produce the same errors but can't handle the entire file. I have used > read.table on a number of other formats (mainly csv and tab-delimited) > without such problems; so far it appears there's something different > about these files that produces the errors but I can't see what it would be. 
> > Does anyone have any thoughts about what is going wrong? And is there a > way, short of manual correction, for fixing it? > > Thanks for all help, > ~Pat. > > > Pat Carroll. > what matters most is how well you walk through the fire. > bukowski. > > __ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 16-Jul-07 Time: 15:59:48 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
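[One programmatic way, not suggested in the thread, to locate rows where a delimiter went missing is count.fields(), which scans the file without loading it via read.table. The file here is a small stand-in for the real tilde-delimited claims file:]

```r
# Sketch: find rows whose field count differs from the usual one.
# 'tmp' stands in for the real tilde-delimited claims file.
tmp <- tempfile()
writeLines(c("a~b~c", "d~e~f", "g~h~i~j", "k~l~m"), tmp)  # row 3 has an extra '~'
nf <- count.fields(tmp, sep = "~", quote = "", comment.char = "")
table(nf)                                  # most rows have 3 fields, one has 4
usual <- as.integer(names(which.max(table(nf))))
which(nf != usual)                         # the suspect row(s): 3
```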
Re: [R] mix package causes R to crash
On 12-Jul-07 23:47:39, Masanao Yajima wrote:
> Dear Professor Schaefer,
>
> I am experiencing a technical difficulty with your mix package.
> I would appreciate it if you could help me with this problem.
>
> When I run the following code, R 2.5.1 and R 2.6.0 crash.
> It's been tested on at least 2 Windows machines and the crash is
> consistent. The code itself was copied from the help file of
> imp.mix; the only thing I supplied was a fake dataset.

There are several things wrong with your approach to this.

1. The mix package is for imputation when the data consist of variables of two kinds: one or more continuous variables (which will be modelled as normally distributed), and one or more discrete (categorical) variables. The levels of the latter must be represented in the data as consecutive integers starting at 1. In your data, all the variables have been generated as normally distributed, and hence are not integers, except for x4, which has been converted to integer from a Normal sample using floor(). However, this generates negative integers, which are not acceptable to the mix package as levels of categorical variables.

2. The categorical variables must be the first columns in the data matrix. Your construction has made x4 (the only possible candidate) the last column; in any case your first 5 columns are continuous variables.

3. The data presented to mix must be a matrix, not a dataframe.

4. You invoke prelim.mix as prelim.mix(dat,3), which makes it treat the first 3 columns as the categorical variables, which they are not.

It is your "fake dataset" which is causing the trouble. All the points above are covered in the documentation for the functions in the mix package.

Hoping this helps,
Ted.
>
> library(mix)
> n <- 100
> x1 <- rnorm(n)
> x2 <- rnorm(n,2,1.2)
> x3 <- rnorm(n,1,2)
> x4 <- floor(rnorm(n)*3)
> y  <- rnorm(n, 1*x1+2*x2+3*x3+4*x4, 2)
> w  <- rnorm(n,3,1.2)
> ymis <- y
> ymis[floor(runif(10,1,n))] <- NA
> wmis <- w
> wmis[floor(runif(10,1,n))] <- NA
> dat <- as.data.frame(cbind(wmis,ymis,x1,x2,x3,x4))
> s <- prelim.mix(dat,3)      # do preliminary manipulations
> thetahat <- em.mix(s)       # ML estimate for unrestricted model
> rngseed(1234567)            # set random number generator seed
> newtheta <- da.mix(s,thetahat,steps=100)  # data augmentation
> ximp <- imp.mix(s, newtheta, dat)         # impute under newtheta
>
> Your mix package is an important part of our ongoing research on the
> missing data project and we would like to have it working.
> If you could point out to me what I am doing wrong, or some technical
> difficulty that I am not aware of, it will be highly appreciated.
>
> Thank you for your help in advance.
>
> Sincerely
>
> Masanao Yajima
> [EMAIL PROTECTED]

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 13-Jul-07 Time: 10:11:39 -- XFMail --
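[To make the points above concrete, here is a sketch (mine, not from the thread) of a fake dataset in the form mix expects: a categorical column with consecutive positive integer levels placed first, continuous columns after it, supplied as a matrix. The prelim.mix() call is left commented out since it needs the mix package installed.]

```r
# A fake dataset in the form the mix package expects (sketch, not from the thread):
# one categorical variable with consecutive integer levels 1,2,3, placed FIRST,
# followed by the continuous variables, all supplied as a matrix.
set.seed(1)
n  <- 100
g  <- sample(1:3, n, replace = TRUE)   # categorical: levels 1, 2, 3
x1 <- rnorm(n)
x2 <- rnorm(n, 2, 1.2)
y  <- rnorm(n, x1 + 2*x2 + g, 2)
y[sample(n, 10)] <- NA                 # introduce some missing values
dat <- cbind(g, x1, x2, y)             # a matrix, not a data frame
# s <- prelim.mix(dat, 1)              # '1' = one leading categorical column
                                       # (requires library(mix))
```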
Re: [R] calculating percent error from 2 vectors
On 12-Jul-07 17:32:03, Scott Bearer wrote:
> Hello,
>
> I believe this is an easy scripting problem, but one I am stumbling on.
>
> I have a "known" vector of 3 colors with nrow=10:
> known <- c("red", "blue", "red", "red", "yellow", "blue", "yellow",
>            "blue", "blue", "yellow")
>
> and a model output vector:
> modelout <- c("red", "red", "red", "blue", "yellow", "blue", "blue",
>               "red", "blue", "yellow")
>
> I would like to determine the proportion (in)correctly identified
> for each color. In other words:
> % correct "red"    =
> % correct "blue"   =
> % correct "yellow" =
>
> How would I code this (assuming the actual dataset is more complex)?

For your example:

> tbl <- table(known,modelout)
> tbl
        modelout
known    blue red yellow
  blue      2   2      0
  red       1   2      0
  yellow    1   0      2
> dim(tbl)
[1] 3 3
> for(i in (1:dim(tbl)[1])){print(sum(tbl[i,-i])/sum(tbl[i,]))}
[1] 0.5
[1] 0.3333333
[1] 0.3333333

(these are the proportions *incorrect* for each colour, in the row order blue, red, yellow), and you can modify the "print" command to produce a desired format, e.g. using rownames(tbl)[i] for the successive colour names.

Hoping this helps (as a start),
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 12-Jul-07 Time: 20:15:34 -- XFMail --
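[A vectorised alternative, not in the original reply, that gives the proportion *correct* per colour directly from the same confusion table:]

```r
known    <- c("red","blue","red","red","yellow","blue","yellow","blue","blue","yellow")
modelout <- c("red","red","red","blue","yellow","blue","blue","red","blue","yellow")
tbl <- table(known, modelout)
prop.correct <- diag(tbl) / rowSums(tbl)  # diagonal = matches, row sums = totals
round(prop.correct, 3)                    # blue 0.5, red 0.667, yellow 0.667
```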
Re: [R] eMail results out of R
On 12-Jul-07 16:10:46, Stéphane Dray wrote:
> Here is a small function that I used on Debian. It requires exim4:
>
> send.mail <- function(addr='[EMAIL PROTECTED]',
>                       subject='A message from R',
>                       text=paste("I have finished working",
>                                  Sys.time(), collapse="")){
>   # send an email
>   # it requires the reconfiguration of exim4
>   # you have to connect as root and
>   # then type dpkg-reconfigure exim4config

I'm a bit puzzled by this. On any Unix/Linux system (unless something has changed very recently which I haven't heard about), the 'mail' command simply works, for any user (without having to become root); does not require exim4 (or any particular version of any particular mail agent, so long as *something* has been set up so that email can be sent at all); and (for the purpose of using 'mail' from R) does not require exim4 or any other mail agent to be re-configured. The email will be sent "From:" the user who is running R.

In the example I posted just now, I just used 'mail' in R's system() command without doing anything special. The mail transfer agent in my case is 'sendmail', but it's a standard configuration and nothing special has been done.

>   mail.cmd <- paste("mail ",
>                     "-s \"",subject,"\" ",
>                     addr,
>                     " << EOT &\n",
>                     text,"\n",
>                     "EOT",
>                     sep="", collapse="")
>   system(mail.cmd, intern=FALSE)
> }
>
> Cheers,
>
> Romain Francois wrote:
>> Hi,
>>
>> There is a paper in the April 2007 issue of R News that might be of
>> help here.
>> http://##cran mirror##/doc/Rnews/Rnews_2007-1.pdf
>>
>> Romain
>>
>> Duncan Murdoch wrote:
>>> On 7/12/2007 9:52 AM, [EMAIL PROTECTED] wrote:
>>>> Hi everyone,
>>>>
>>>> I did my homework and read the posting guideline :-)
>>>>
>>>> I want to eMail the results of a computing automatically. So I get
>>>> the results (the parameters of a garch process) and I want to
>>>> eMail them to another person. How can I do that?
>>>>
>>> This will depend on the system you're using. If the command
>>> "emailit" would work from the command line on your system, then
>>>
>>> system("emailit")
>>>
>>> should work from within R. Writing that command is the hard part,
>>> of course.
>>>
>>> Duncan Murdoch
>
> --
> Stéphane DRAY ([EMAIL PROTECTED])
> Laboratoire BBE-CNRS-UMR-5558, Univ. C. Bernard - Lyon I
> 43, Bd du 11 Novembre 1918, 69622 Villeurbanne Cedex, France
> Tel: 33 4 72 43 27 57   Fax: 33 4 72 43 13 88
> http://biomserv.univ-lyon1.fr/~dray/

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 12-Jul-07 Time: 18:03:20 -- XFMail --
Re: [R] eMail results out of R
On 12-Jul-07 13:52:56, [EMAIL PROTECTED] wrote:
> Hi everyone,
>
> I did my homework and read the posting guideline :-)
>
> I want to eMail the results of a computing automatically. So I get
> the results (the parameters of a garch process) and I want to eMail
> them to another person. How can I do that?
>
> Thx

As well as the answers from John Scillieri and Duncan Murdoch: if it does happen that you're using a Unix/Linux system, then the appropriate variant of the following illustration will work (the 'mail' command should always be present on variants of these systems):

In the little project I'm working on in R at the moment, I have variables x1 and x2 (each length-50 numeric vectors). So I can, in R, do:

  sink(file="myoutput")
  cbind(x1,x2)
  sink()
  system("mail ted -s \"Test No 2\" < myoutput")

which has stored the 2-column output from cbind(x1,x2) in the file "myoutput", and then used the 'mail' command to mail its contents to 'ted' with subject 'Test No 2' (I used a multi-word subject to illustrate the quotes the command line will need, to avoid splitting the subject into separate tokens). See ?sink for how to use the sink() command.

Then 'ted' duly received an email with contents:

===
Date: Thu, 12 Jul 2007 17:10:44 +0100
From: Ted Harding <[EMAIL PROTECTED]>
To: [EMAIL PROTECTED]
Subject: Test No 2

               x1           x2
 [1,]  0.37282844  0.002743146
 [2,]  0.93293155 -0.108009247
 ...
[49,] -0.08681427  0.828313288
[50,] -0.23621908  0.385269729
===

You can of course encapsulate something like the above sequence of commands into an R function, depending on how you organise the storage of the results you want to email.

Hoping this helps,
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 12-Jul-07 Time: 17:25:05 -- XFMail --
Re: [R] system: Linux vs. Windows differences
On 11-Jul-07 18:28:19, Marc Schwartz wrote: > On Wed, 2007-07-11 at 16:16 -0200, Alberto Monteiro wrote: >> [I tried to send this messages two days ago, but I guess I mistyped >> the To: >> address...] >> >> Why "system" is different in Linux and Windows? Both in R 2.4.1, but >> in >> Windows there is an option: >> >> system(something, wait = FALSE) >> >> while on Linux (Fedora Core 4), there is no such option? >> >> Alberto Monteiro > > From: > > https://svn.r-project.org/R/trunk/NEWS > > for changes in R 2.5.0: > > o system() now takes the same set of arguments on all platforms, > with those which are not applicable being ignored with a > warning. Unix-alikes gain 'input' and 'wait', and Windows > gains 'ignore.stderr'. > > > Time to upgrade both your R and your FC installation. > > R is at 2.5.1. > > FC4 has been EOL (End of Life) for some time and FC5 hit EOL last > month. > > HTH, > > Marc Schwartz "End of Life" is a Nomenklatura categorisation. While we loyal Members welcome and applaud President-2.5.1, we do not forget old Comrades now air-brushed from the photographs who, now faceless, sturdily labour 24 hours a day in the fields and saltmines, and even in the dark attics of those who conceal and protect them still. There can be life in old dogs, and even strong teeth in some. Ted. [emailing from SuSE-5.2 (1997), logged in from Red Hat 9 (2003)] E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 11-Jul-07 Time: 22:32:13 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] EM algorithm for Missing Data.
On 09-Jul-07 02:20:47, Marcus Vinicius wrote:
> Dear all,
> I need to use the EM algorithm where data are missing.
> Example:
> x <- c(60.87, NA, 61.53, 72.20, 68.96, NA, 68.35, 68.11, NA, 71.38)
>
> May anyone help me?
>
> Thanks.
>
> Marcus Vinicius

The Dempster, Laird & Rubin reference given by Simon Blomberg is the classical account of the EM Algorithm for incomplete information, though there has been a lot more published since. However, more to the point in the present case: if the above is typical of your data, you had better state what you want to do with the data. Do you want to fit a distribution by estimating parameters? Are they observations of a "response" variable with covariates and you want to fit a linear model estimating the coefficients? Are they data from a time-series and you need to interpolate at the missing values? Etc.??

Depending on what you want to do, the way you apply the general EM Algorithm procedure may be very different; and a lot of applications are not covered by Dempster, Laird & Rubin (1977). And there may possibly be no point anyway: if all you want to do is estimate the mean of the distribution of the data, then the best procedure may simply be to ignore the missing data.

Best wishes,
Ted.

------------
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 09-Jul-07 Time: 09:18:56 -- XFMail --
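[To illustrate the last point above, here is a minimal sketch (mine, not from the thread) of EM for the mean and variance of a univariate normal sample with values missing completely at random. With no covariates, the E-step just fills each missing value with the current mean (and adds the current variance to the expected sum of squares), and the algorithm converges exactly to the observed-data estimates, so ignoring the missing values gives the same answer:]

```r
# EM sketch for a univariate normal with MCAR missing values.
x   <- c(60.87, NA, 61.53, 72.20, 68.96, NA, 68.35, 68.11, NA, 71.38)
obs <- !is.na(x)
mu <- 0; s2 <- 1                              # arbitrary starting values
for (i in 1:200) {
  xfill <- ifelse(obs, x, mu)                 # E-step: expected values of the data
  ss    <- sum(ifelse(obs, (x - mu)^2, s2))   # expected squared deviations
  mu <- mean(xfill)                           # M-step: update mean ...
  s2 <- ss / length(x)                        # ... and variance
}
c(EM = mu, observed = mean(x, na.rm = TRUE))  # the two means coincide
```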
Re: [R] change the "coeffcients approach" on an anova
On 08-Jul-07 00:36:46, vincent guyader wrote:
> hi everybody
>
> I have to do a lot of Anovas with R, and I would like to have another
> type of coefficient coding. I explain:
>
> By default, if I have 2 temperatures for an experiment, 100°C or
> 130°C, and I want to see the temperature effect on the pressure,
> I want to estimate the coefficient of each temperature.
>
> With the anova I will obtain just one coefficient, for example +3.56
> (for 100°C), and the other (for 130°C) is always zero.
>
> But I would prefer to obtain +1.78 for the first effect and -1.78
> for the second (the intercept is also different too).
>
> My script is (it's just an example):
>
> rinfo2 <- (lm(pression ~ temp, data=rebe))
> anova(rinfo2)
> summary(rinfo2)
>
> What can I change to obtain what I need?

Try

  rinfo2 <- (lm(pression ~ temp - 1, data=rebe))

Ted.

----
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Jul-07 Time: 02:18:24 -- XFMail --
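[A sketch of an alternative, not part of the reply above: Ted's `~ temp - 1` gives one mean per temperature, whereas sum-to-zero contrasts give the "+1.78 / -1.78" style of effects the poster describes. The poster's `rebe` data frame is not available, so a toy version is simulated here; the `contrasts=` argument is the point.]

```r
# Sum-to-zero contrasts: intercept = grand mean, temp1 = +effect for the
# first level, the second level implicitly gets -temp1.
set.seed(42)
rebe <- data.frame(temp     = factor(rep(c("100C", "130C"), each = 10)),
                   pression = c(rnorm(10, 10), rnorm(10, 13)))
rinfo2 <- lm(pression ~ temp, data = rebe,
             contrasts = list(temp = contr.sum))
coef(rinfo2)  # (Intercept) = grand mean; temp1 = effect of 100C
```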
Re: [R] How to calculate the index "the number of species combin
On 07-Jul-07 22:18:42, Zhang Jian wrote:
> I want to analyze the co-occurrence of some species. In some papers,
> the authors said that the index "the number of species combinations
> (COMBO)" is a good index. I try to calculate the index by R language.
> But I can not get the right value. I think that I do not understand
> the concept of the index because my english is not good.
>
> The concept:
> *The number of species combinations* This index scans the columns of
> the presence-absence matrix and keeps track of the number of unique
> species combinations that are represented in different sites. For an
> assemblage of n species, there are 2n [I think this should be 2^n]
> possible species combinations, including the combination of no
> species being present (Pielou and Pielou 1968). In most real
> matrices, the number of sites (= columns) is usually substantially
> less than 2n, which places an upper bound on the number of species
> combinations that can be found in both the observed and the
> simulated matrices.

English good or bad, I found the above description not easy to follow! But, since I could see only one thing it could mean if it was intended to give a unique definition, I decided to test on your data the hypothesis that the number of Species Combinations means the number of distinct columns in the data matrix.
I took your species-incidence data (not reproduced here) and converted it to a matrix S of 0 and 1 with 19 columns and 17 rows (though the following would work just as well if it is a dataframe):

S
   V1 V2 V3 V4 V5 V6 V7 V8 V9 V10 V11 V12 V13 V14 V15 V16 V17 V18 V19
1   0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0
2   0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0
3   0  0  0  0  0  0  0  0  0   0   1   0   0   0   0   0   0   0   0
4   0  0  0  0  0  0  0  0  0   0   1   0   0   0   0   0   0   0   0
5   1  0  0  0  0  0  0  0  0   0   0   0   0   1   0   0   0   0   0
6   0  0  0  1  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0
7   0  1  1  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0
8   0  0  0  0  1  1  1  1  1   1   1   1   0   0   1   1   1   1   0
9   0  0  0  0  0  0  0  1  0   0   0   0   0   0   0   0   0   0   0
10  1  1  1  1  0  0  0  0  0   0   0   0   0   1   0   0   0   0   0
11  0  1  1  1  1  1  1  1  1   1   1   1   1   0   1   1   1   1   1
12  1  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0
13  0  0  1  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0
14  0  0  0  0  1  1  1  1  0   0   0   0   0   0   0   0   0   0   0
15  1  0  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0
16  0  1  1  1  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0
17  0  1  0  0  0  0  0  0  0   0   0   0   0   0   0   0   0   0   0

There are 10 different columns in there, as can be found by

t(unique(t(S)))
   V1 V2 V3 V4 V5 V8 V9 V11 V13 V14
1   0  1  0  0  0  0  0   0   0   0
2   0  1  0  0  0  0  0   0   0   0
3   0  0  0  0  0  0  0   1   0   0
4   0  0  0  0  0  0  0   1   0   0
5   1  0  0  0  0  0  0   0   0   1
6   0  0  0  1  0  0  0   0   0   0
7   0  1  1  0  0  0  0   0   0   0
8   0  0  0  0  1  1  1   1   0   0
9   0  0  0  0  0  1  0   0   0   0
10  1  1  1  1  0  0  0   0   0   1
11  0  1  1  1  1  1  1   1   1   0
12  1  0  0  0  0  0  0   0   0   0
13  0  0  1  0  0  0  0   0   0   0
14  0  0  0  0  1  1  0   0   0   0
15  1  0  0  0  0  0  0   0   0   0
16  0  1  1  1  0  0  0   0   0   0
17  0  1  0  0  0  0  0   0   0   0

Of the "missing" columns, it can be seen that V6 and V7 are the same as V5; V10, V12, V15, V16, V17 and V18 are the same as V9; and V19 is the same as V13. Hence the interpretation that "COMBO" means the number of distinct columns is confirmed. If that is really the case, then a very simple R function can be written for it:

COMBO <- function(S){ ncol(t(unique(t(S)))) }
COMBO(S)
[1] 10

> About the data, I calculated the index "COMBO" by other software.
> The value of the index is 10.
> Can you help me to calculate, or give me some detailed explanation
> about the concept of the index?
See above! I hope it helps.

Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 08-Jul-07 Time: 00:46:08 -- XFMail --
Re: [R] Recursion in R ...
On 07-Jul-07 10:34:03, Uwe Ligges wrote: > Alberto Monteiro wrote: >> Ted Harding wrote: >>> So I slickly wrote a recursive definition: >>> >>> Nnk<-function(n,k){ >>> if(n==1) {return(k)} else { >>> R<-0; >>> for(r in (1:k)) R<-(R+Nnk(n-1,k-r+1)) # ,depth)) >>> } >>> return(R) >>> } >>> >> You are aware that this is equivalent to: >> >> Nnk1 <- function(n, k) { prod(1:(n+k-1)) / prod(1:n) / prod(1:(k-1)) } > > or > > Nnk2 <- function(n, k) { gamma(n+k) / gamma(n+1) / gamma(k) } > > or most easily: > > Nnk3 <- function(n, k) choose(n+k-1, n) > > Uwe Ligges OK, I can recognise a negative binomial coefficient when I see one! I'm wondering, though, what is the "natural" connection between choose(n+k-1, n) and the statement of the original question: What is the number of distinct non-decreasing sequences of length n which can be made using integers chosen from (1:k)? (repetition allowed, of course) (The fact that this leads to a recurrence relationship which is satisfied by choose(n+k-1,n) is not what I mean by "natural" -- I'm looking for a correspondence between such a sequence, and a choice of n out of (n+k-1) somethings). Ted. >> aren't you? >> >>> ON THAT BASIS: I hereby claim the all-time record for inefficient >>> programming in R. >>> >>> Challengers are invited to strut their stuff ... >>> >> No, I don't think I can bet you unintentionally, even though >> my second computer program that I ever wrote in life had to be >> aborted, because it consumed all the resources of the computer. >> >> Alberto Monteiro >> >> __ >> R-help@stat.math.ethz.ch mailing list >> https://stat.ethz.ch/mailman/listinfo/r-help >> PLEASE do read the posting guide >> http://www.R-project.org/posting-guide.html >> and provide commented, minimal, self-contained, reproducible code. 
> > __________ > R-help@stat.math.ethz.ch mailing list > https://stat.ethz.ch/mailman/listinfo/r-help > PLEASE do read the posting guide > http://www.R-project.org/posting-guide.html > and provide commented, minimal, self-contained, reproducible code. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 07-Jul-07 Time: 12:15:21 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
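[As a numerical footnote (not in the thread), the fast cumsum version of Nnk from the original posting can be checked against Uwe's closed form choose(n+k-1, n) over a grid of small cases, and against the value Ted eventually computed:]

```r
# Check that the cumsum formulation of Nnk agrees with choose(n+k-1, n).
Nnk <- function(n, k) {
  K <- rep(1, k)
  for (i in 1:n) K <- cumsum(K)   # K[r] becomes Nnk(i, r) at step i
  max(K)
}
agree <- outer(1:8, 1:8,
               Vectorize(function(n, k) Nnk(n, k) == choose(n + k - 1, n)))
all(agree)      # TRUE
Nnk(21, 20)     # 131282408400, i.e. choose(40, 21)
```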
[R] Recursion in R ...
Hi Folks,

R has known speed issues for recursive definitions. There was a thread "Extremely slow recursion in R?" in August 2006 (24 Aug, from Jason Liao), with some interesting comparisons between programming languages. I'm re-opening the topic, with an ulterior motive (stated at the end).

The starting point was the question "How many distinct increasing sequences of length n can be made from k integers (1:k)?", i.e. what is the size of the sample space of

  sort(sample((1:k),n,replace=TRUE))

Let this number be Nnk(n,k). I aptly observed that

  Nnk(1,k) = k
  Nnk(n,k) = sum[r = 1 to k](Nnk(n-1,r))

So I slickly wrote a recursive definition:

Nnk <- function(n,k){
  if(n==1) {return(k)} else {
    R <- 0
    for(r in (1:k)) R <- (R + Nnk(n-1,k-r+1))
  }
  return(R)
}

I then ran a few test cases, to check timing etc.:

  Nnk(21, 7)    34 seconds
  Nnk(21, 8)   120 seconds
  Nnk(21, 9)   384 seconds
  Nnk(21,10)  1141 seconds

I then complacently set it to work out the number I happened to really want: Nnk(21,20). 24 hours later, no result; and an extra "test line" (not shown above) at the start of the function had produced no output, so the function had not even returned to the top level for the first time. Then, with heavy thunderstorms rolling in, and my mains power beginning to flicker, I closed down the computer. A quick extrapolation based on the above results, and a few others, suggested that the time for completion of Nnk(21,20) would be about 2 years.

Then I thought about it. The recursion relation in fact shows that Nnk(n,k) is the maximum value in cumsum(Nnk(n-1,r)) over r = 1:k. Hence (for n>=1 and r>=1):

Nnk <- function(n,k){
  K <- rep(1,k)
  for(i in (1:n)){ K <- cumsum(K) }
  return(max(K))
}

This definition then gave me, in so brief a blink of the eye that I could make no estimate of it, the result:

  Nnk(21,20) = 131282408400

In fact, I had to go up to things like Nnk(1000,200) = 3.335367e+232 before I could sensibly perceive the delay (about 0.5 seconds in this case).
On the old recursive definition, the computation might have outlasted the Universe. ON THAT BASIS: I hereby claim the all-time record for inefficient programming in R. Challengers are invited to strut their stuff ... Best wishes to all, and happy end of week, Ted. ------------ E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 06-Jul-07 Time: 17:53:55 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] A More efficient method?
On 04-Jul-07 13:44:44, Keith Alan Chamberlain wrote:
> Dear Rhelpers,
>
> Is there a faster way than below to set a vector based on values from
> another vector? I'd like to call a pre-existing function for this,
> but one which can also handle an arbitrarily large number of
> categories. Any ideas?
>
> Cat=c('a','a','a','b','b','b','a','a','b')  # Categorical variable
> C1=vector(length=length(Cat))  # New vector for numeric values
>
> # Cycle through each column and set C1 to corresponding value of Cat.
> for(i in 1:length(C1)){
>   if(Cat[i]=='a') C1[i]=-1 else C1[i]=1
> }
>
> C1
> [1] -1 -1 -1  1  1  1 -1 -1  1
> Cat
> [1] "a" "a" "a" "b" "b" "b" "a" "a" "b"

> Cat=c('a','a','a','b','b','b','a','a','b')
> Cat=="b"
[1] FALSE FALSE FALSE  TRUE  TRUE  TRUE FALSE FALSE  TRUE
> (Cat=="b") - 0.5
[1] -0.5 -0.5 -0.5  0.5  0.5  0.5 -0.5 -0.5  0.5
> 2*((Cat=="b") - 0.5)
[1] -1 -1 -1  1  1  1 -1 -1  1

to give one example of a way to do it. But you don't say why you really want to do this. You may really want factors. And what do you want to see if there is "an arbitrarily large number of categories"? For instance:

> factor(Cat,labels=c(-1,1))
[1] -1 -1 -1 1  1  1  -1 -1 1

but this is not a vector, but a "factor" object. To get the vector, you need to convert Cat to an integer:

> as.integer(factor(Cat))
[1] 1 1 1 2 2 2 1 1 2

where (unless you've specified otherwise in factor()) the values will correspond to the elements of Cat in "natural" order, in this case first "a" (-> 1), then "b" (-> 2). E.g.

> Cat2 <- c("a","a","c","b","a","b")
> as.integer(factor(Cat2))
[1] 1 1 3 2 1 2

so, with C2 <- as.integer(factor(Cat2)), you get a vector of distinct integers (1,2,3) for the distinct levels ("a","b","c") of Cat2. If you want other values for these levels, you can write a function to change them.

Hoping this helps to break the ice!
Ted.
E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 04-Jul-07 Time: 16:44:20 -- XFMail --
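[Two vectorised alternatives for Keith's recoding task, not in the reply above: ifelse() for the two-category case, and a named lookup vector, which scales to arbitrarily many categories.]

```r
Cat <- c('a','a','a','b','b','b','a','a','b')
# ifelse() handles the two-category case directly:
C1 <- ifelse(Cat == 'a', -1, 1)
# a named lookup vector scales to any number of categories:
codes <- c(a = -1, b = 1)          # add more name = value pairs as needed
C2 <- unname(codes[Cat])
identical(C1, C2)                  # TRUE
```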
Re: [R] Substitution of Variables
On 02-Jul-07 23:01:07, Judith Flores wrote:
> Hi,
>    I need to run a script under the variable (that comes from a csv
> file) that is assigned to another variable. This is a very simplified
> version of what I want to do:
>
> data <- read.csv('name.csv')
> names(data) <- c("VA","VB","VC")
>
> v <- VA
>
> mn(v)
>
> Thank you in advance,
> Judith

read.csv() makes 'data' into a dataframe. This is essentially a list with named components. So VA is not yet available as a variable in your program. The data for the variable you want to call "VA" is stored as the component 'data$VA' of 'data'. The names "VA", "VB", "VC" are the names you have given to the *components* of 'data'; you have not created separate variables with these names.

So you can assign the data for VA directly to 'v' with:

  v <- data$VA

or (if you think you will need it later) create a variable VA using

  VA <- data$VA

and, if you want a variable called 'v' as well, then:

  v <- VA

Hoping this helps,
Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 03-Jul-07 Time: 00:32:40 -- XFMail --
Re: [R] Dominant eigenvector displayed as third (Marco Visser)
On 29-Jun-07 21:09:37, Peter Dalgaard wrote:
> Marco Visser wrote:
>> This is just a curiosity, I was wondering why the dominant
>> eigenvector and eigenvalue of the following matrix is given as the
>> third. I guess this could complicate automatic selection procedures.
>>
>> 0 0 0 0 0 5
>> 1 0 0 0 0 0
>> 0 1 0 0 0 0
>> 0 0 1 0 0 0
>> 0 0 0 1 0 0
>> 0 0 0 0 1 0
>> [...]
>> Comment: In Matlab the dominant eigenvector and eigenvalue of the
>> described matrix are given as the sixth. Again no idea why.
>
> I get
>
> > eigen(mat)$values
> [1] -0.65383+1.132467i -0.65383-1.132467i  0.65383+1.132467i
> [4]  0.65383-1.132467i
> [5] -1.30766+0.00i      1.30766+0.00i
> > Mod(eigen(mat)$values)
> [1] 1.307660 1.307660 1.307660 1.307660 1.307660 1.307660
>
> So all the eigenvalues are equal in modulus. What makes you think
> one of them is "dominant"?

When I run it I get eigenvectors 3 and 6 both purely real. It may be that Marco has confused this with "dominant". Also, the eigenvalues of these two are real, and have the largest real parts (+/- 1.3076605). All others have complex eigenvalues, of which the real parts are +/- 0.6538302. It may be that Marco has been misled by this, perceiving the real part rather than both real and complex parts, and being led to think that the largest real part corresponds to the largest eigenvalue. As has been clearly pointed out, this is not the way to look at it!

Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 29-Jun-07 Time: 22:47:01 -- XFMail --
Re: [R] Spectral Decomposition [OOPS]
On 29-Jun-07 13:23:05, Ted Harding wrote: > [Sorry -- a silly typo in my previous]: > If A is not symmetric, you have "left" eigenvectors: > > x'*A = lambda*x' > > and "right" eigenvectors: > > A*x = lambda*x > > and the "left" eigenvectors are not the same as the "right" > eigenvectors, though you have the same set of eigenvalues lambda > in each case. > > You then have > > A = L'*B*R Should be: A = R*B*L' in that L'*R = I (unit), so then A*R = R*B so each column (right eigenvector) of R is multiplied by the corresponding lambda; L'*A = B*L' so each row (left eigenvector) of L' is multiplied by the corresponding lambda. Apologies for the slip! Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 29-Jun-07 Time: 14:42:19 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Spectral Decomposition
On 29-Jun-07 12:29:31, Doran, Harold wrote: > All of my resources for numerical analysis show that the spectral > decomposition is > > A = CBC' > > Where C are the eigenvectors and B is a diagonal matrix of eigen > values. Now, using the eigen function in R > ># Original matrix > aa <- matrix(c(1,-1,-1,1), ncol=2) > > ss <- eigen(aa) > ># This results yields back the original matrix according to the > formula above > ss$vectors %*% diag(ss$values) %*% t(ss$vectors) > > However, for the following I do not get back the original matrix > using the general formula for the spectral decomposition: > > set.seed(123) > > aa <- matrix(rnorm(16), ncol=4) > > ss <- eigen(aa) > > ss$vectors %*% diag(ss$values) %*% t(ss$vectors) > > However, if I do the following > > A = CBC^{-1} > > I get back the original matrix A > > ss$vectors %*% diag(ss$values) %*% solve(ss$vectors) Harold, I think the key to the issue is whether your original matric is symmetric or not. For your formula A = C*B*C' to work, where B is a diagonal matrix (therefore essentially symmetric) you have -- bearing in mind the reversal of factors -- A' = ReverseFactorsIn(C'*B'*C) = C*B*C' = A so A would have to be symmetric. This was the case for your first example matrix(c(1,-1,-1,1), ncol=2). However, your second example will not be symmetric, so the formula will not work, and you will need A = C*B*C^(-1). If A is not symmetric, you have "left" eigenvectors: x'*A = lambda*x' and "right" eigenvectors: A*x = lambda*x and the "left" eigenvectors are not the same as the "right" eigenvectors, though you have the same set of eigenvalues lambda in each case. You then have A = L'*B*R Of course the most frequent occurrence of this kind of question in statistics is where A is a covariance or correlation matrix, which is symmetric by definition. Hoping this helps! Ted. 
> In my lit search I am not finding an explanation for why this works, so > I am seeking R-help since there may be a computational rationale that > can be explained by users (or authors) of the function. In my > experimentation with some computations it seems that the latter > approach is more general in that it yields back the matrix I began > with, but deviates from the formula I commonly see in texts. > > Thanks, > Harold E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 29-Jun-07 Time: 14:23:03 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
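A quick way to see in R why the textbook A = CBC' form needs symmetry: for a symmetric A, eigen() returns orthonormal eigenvectors, so t(C) and solve(C) coincide and the two formulas agree. The matrix here is just an arbitrary symmetric example of my own choosing:

```r
## For symmetric A, the eigenvector matrix C is orthogonal,
## so C' = C^(-1) and CBC' = CBC^(-1).
A <- matrix(c(2, 1, 1, 3), ncol = 2)   # an arbitrary symmetric matrix

s <- eigen(A)
C <- s$vectors
B <- diag(s$values)

max(abs(t(C) - solve(C)))              # ~ 0: C is orthogonal
max(abs(C %*% B %*% t(C) - A))         # ~ 0: CBC' recovers A
```

For a non-symmetric matrix, t(C) and solve(C) differ, which is exactly why only the C*B*C^(-1) form reproduces A there.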
Re: [R] Unix-like permissions to allow a user to update recommen
On 18-Jun-07 20:27:56, Patrick Connolly wrote: > On Mon, 18-Jun-2007 at 11:53AM +0100, Ted Harding wrote: > >|> On 18-Jun-07 10:11:43, Patrick Connolly wrote: >|> > I installed R from the tar.gz file (as root) in a directory under >|> > /usr/local. The recommended packages are installed in a library in >|> > that directory whereas additional packages I install in a directory >|> > under the /home directory as a user. >|> > >|> > Updating the additional packages is very easy with >|> > update.packages() >|> > as a non-root user, but the recommended packages cannot be done so >|> > readily because of file permissions. >|> > >|> > My question is how do I set the permissions or ownerships in the >|> > /usr/local/R-2.5.0 directory so that everything necessary can be >|> > writable by a user? Should I make a group for R users (total of >|> > one >|> > member) or is it simpler than that? >|> >|> Since you have root access, do you need to segregate the additional >|> packages to a particular user? > > It's handy to not have to reload packages that don't change between > versions of the basic installation. > >|> >|> Though I don't run R as root for general use, I always install/update >|> by running R CMD as root. This makes all of R ("recommended" and also >|> any extras) available system-wide, and no pemission problems arise. >|> >|> This of course does not stop you from setting up a special .Rprofile >|> for each user, since this by definition lives in the user's home >|> directory. >|> >|> Does this help? Or are there issues you haven't mentioned which make >|> such an approach not feasible? > > I don't exactly have issues. It's not a huge problem I'm dealing > with. It's simple enough for me to use update.packages() as a user > which will download the appropriate packages. Though they won't be > installed, they are all in the one place in the /tmp/ directory from > where I can install them as root. 
I just thought there must be a more > elegant way to set permissions so that users could write to the > subdirectories under /usr/local/R-2.xxx/. So much of the installation > process of R and its packages is so elegant, I'd like to retain some > of that elegance. On my Linux, all the places where components of R might normally be installed (/usr/lib or /usr/local/lib) are user=root, group=root, and when I look under them practically everything is writeable only by user=root. So you'd have to change a lot of permissions before any other user got rights to write to these directories. Even adding a user to group=root wouldn't change things, since group does not have write permissions (unless user=root too). I'm still wondering, though, why you don't just run the command update.packages() as root. You have root access, and you said (in the "adding user to group" context) that only one user is involved (presumably yourself?). In that case, why not start R as root and run update.packages()? Or am I missing something? Best wishes, Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 18-Jun-07 Time: 22:25:18 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
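For the record, one possible group-based setup along the lines Patrick asked about. This is only a sketch, to be run as root, and the paths, group name and user name are illustrative assumptions, not prescriptive:

```shell
# Sketch only: make the R site library group-writable so that a
# designated group of users can run update.packages() without root.
# Group name "rusers", user "patrick" and the R-2.5.0 path are all
# hypothetical -- adjust to the actual installation.
groupadd rusers
usermod -a -G rusers patrick

chgrp -R rusers /usr/local/R-2.5.0/lib/R/library
chmod -R g+w    /usr/local/R-2.5.0/lib/R/library

# setgid on the directories so newly installed packages inherit the group
find /usr/local/R-2.5.0/lib/R/library -type d -exec chmod g+s {} \;
```

Whether this is more "elegant" than simply running update.packages() as root is debatable; it does at least confine the permission changes to the library directory rather than the whole installation.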
Re: [R] source a specific function
On 18-Jun-07 18:53:50, Duncan Murdoch wrote: > > The exact syntax you list there won't work, but in any case, changing > the environment of a function in a package is a bad idea -- it may need > to reference things from the namespace of the package. Well, as I said before, "(assuming that you know about interdependencies)"! I tried Gabor's suggested syntax as follows, bearing in mind that mvrnorm in MASS is pure R code calling only "base" functions: library(MASS) environment(mvrnorm) <- .GlobalEnv mu=c(0,0) V<-matrix(c(1.0,0.5,0.5,1.0),ncol=2) detach() ls() [1] "%.+%" "%+.%" "mu" "mvrnorm" "V" mvrnorm(10,mu,V) [,1] [,2] [1,] -1.80466069 -1.8229928 [2,] 0.05565147 -1.6279434 [3,] -0.28505572 -0.8927696 [4,] -0.48919795 0.0750501 [5,] -0.08437832 0.1349296 [6,] 2.17399713 1.2881640 [7,] 1.59934824 1.3784665 [8,] 0.30555420 0.3835743 [9,] 0.11120527 -0.7287910 [10,] -0.77281783 -1.2265502 so that one seems to have worked. However, I'm not sure about the space used by MASS "going away" after detach(): Starting R from scratch: Type 'q()' to quit R. > gc() used (Mb) gc trigger (Mb) Ncells 412589 11.1 597831 16.0 Vcells 102417 0.8 786432 6.0 > library(MASS) > gc() used (Mb) gc trigger (Mb) Ncells 459471 12.3 667722 17.9 Vcells 87 0.9 786432 6.0 > environment(mvrnorm) <- .GlobalEnv > detach() > gc() used (Mb) gc trigger (Mb) Ncells 459297 12.3 741108 19.8 Vcells 05 0.9 786432 6.0 > gc() used (Mb) gc trigger (Mb) Ncells 459304 12.3 818163 21.9 Vcells 18 0.9 786432 6.0 so only about 170KB of the 2.2MB used by MASS has been recovered after detach(). Or am I looking at the wrong indicator of space used? On the other hand, if I first extract the code of mvrnorm() which is a few lines of pure R, I get: 1. Start R from scratch: gc() used (Mb) gc trigger (Mb) Ncells 412589 11.1 597831 16.0 Vcells 102417 0.8 786432 6.0 2. 
Paste in the code for mvrnorm, and then: gc() used (Mb) gc trigger (Mb) Ncells 412844 11.1 667722 17.9 Vcells 102591 0.8 786432 6.0 so mvrnorm on its own is only taking up about (412844-412589) + (102591-102417) = 429KB Hence I wonder whether the space first occupied by MASS (2.2MB) apart from mvrnorm gets freed up at all? Thanks for the comments. Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 18-Jun-07 Time: 22:00:33 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
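As an aside, a lighter-weight way to get at a single function without attaching the whole package is getFromNamespace() in the standard utils package. Note the caveat: this still loads the MASS namespace into memory, so it addresses the search-path clutter rather than the space question:

```r
## Pull just mvrnorm out of MASS's namespace, without library(MASS).
## The MASS namespace is still loaded behind the scenes, but nothing
## is attached to the search path, and the function keeps its own
## (namespace) environment, so internal dependencies remain intact.
mvrnorm <- getFromNamespace("mvrnorm", "MASS")

mu <- c(0, 0)
V  <- matrix(c(1, 0.5, 0.5, 1), ncol = 2)
x  <- mvrnorm(10, mu, V)
dim(x)    # a 10 x 2 matrix of correlated normals
```

This avoids the environment(mvrnorm) <- .GlobalEnv surgery entirely, which sidesteps Duncan's objection about breaking namespace references.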
Re: [R] Optimization
On 18-Jun-07 16:01:03, livia wrote: > > Hi, I would like to minimize the value of x1-x2, x2 is a fixed > value of 0.01, > x1 is the quantile of normal distribution (0.0032,x) with > probability of 0.7, and the changing value should be x. > Initial value for x is 0.0207. I'm a bit puzzled by the question. If I understand it right, we can ignore x2 (since it is a fixed value) and simply consider minimising x1 (instead of x1-x2). Then, denoting by P(u) the cumulative normal distribution function for mean=0 and variance=1 (i.e. in R: pnorm(u,0,1)), and by Q(p) its inverse, corresponding to qnorm(p,0,1), we have (again if I have understood right): P((x1 - 0.0032)/x) = 0.7 so x1 = 0.0032 + x*Q(0.7) and therefore, since Q(0.7) > 0 and x must be positive, the value of x1 can be made as close to 0.0032 as you please (but greater than 0.0032) by taking x small enough. Hence there is no strictly minimising value of x, but the greatest lower bound of all possible values of x1 is 0.0032. Then you can subtract x2. The fact that there is no positive value of x which gives this bound as the value probably explains the failure of your optim() attempt. Best wishes, Ted. -------- E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 18-Jun-07 Time: 17:46:01 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
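The argument can be illustrated numerically. Here x1() is just my shorthand for the 0.7-quantile of Normal(mean = 0.0032, sd = x) as a function of x:

```r
## x1 = 0.0032 + x * qnorm(0.7), which decreases towards the lower
## bound 0.0032 (but never reaches it) as x -> 0+.
x1 <- function(x) qnorm(0.7, mean = 0.0032, sd = x)

x1(0.0207)                          # at the stated initial value of x
sapply(c(1e-1, 1e-3, 1e-5), x1)     # approaches 0.0032 from above
```

So optim() is being asked to minimise a function whose infimum is attained only in the limit x -> 0, outside the feasible region, which is why it cannot settle on a minimiser.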
Re: [R] source a specific function
On 18-Jun-07 14:28:35, Gabor Grothendieck wrote: > This loads all the functions into an anonymous environment defined > by local and then exports f to the global environment. > > f <- local({ > source("/a.R", local = TRUE) > environment(f) <- .GlobalEnv > f > }) That looks neat! Two questions: 1. Would something similar work for extracting selected functions from a library (assuming that you know about interdependencies)? E.g. something like f <- local({ library(f.etc.lib) environment(f) <- .GlobalEnv f }) 2. Having done what you describe to extract just f from a source file, can one then "delete" the local environment used to load the source? I think what I'm basically asking is whether the exporting is done "by value" (local environment deletion OK) or "by reference" (deletion would destroy the exported object). Apologies, but for instance "?local" is a bit too deep for me! The underlying agenda behind these queries is the saving of memory space. With thanks, Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 18-Jun-07 Time: 17:11:15 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
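On question 2, a small self-contained experiment suggests the answer: once f's environment has been re-pointed at .GlobalEnv, f no longer refers to the local environment at all, so that environment becomes garbage-collectable when local() returns. The caveat is that f must not depend on anything left behind there. The temporary file below is just my stand-in for the /a.R of Gabor's example:

```r
## The sourced file defines f and a helper g; only f is exported.
src <- tempfile(fileext = ".R")
writeLines(c("g <- function(x) x * 2",
             "f <- function(x) x + 1"), src)

f <- local({
  source(src, local = TRUE)       # f and g land in this local environment
  environment(f) <- .GlobalEnv    # cut f loose from the local environment
  f
})

f(41)          # 42: f works on its own
exists("g")    # FALSE: g stayed behind in the (now discardable) environment
```

Had f called g, the re-environmenting would have broken it -- which is the single-function version of Duncan's warning about doing this to package functions.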
Re: [R] Unix-like permissions to allow a user to update recommen
On 18-Jun-07 10:11:43, Patrick Connolly wrote: > I installed R from the tar.gz file (as root) in a directory under > /usr/local. The recommended packages are installed in a library in > that directory whereas additional packages I install in a directory > under the /home directory as a user. > > Updating the additional packages is very easy with update.packages() > as a non-root user, but the recommended packages cannot be done so > readily because of file permissions. > > My question is how do I set the permissions or ownerships in the > /usr/local/R-2.5.0 directory so that everything necessary can be > writable by a user? Should I make a group for R users (total of one > member) or is it simpler than that? Since you have root access, do you need to segregate the additional packages to a particular user? Though I don't run R as root for general use, I always install/update by running R CMD as root. This makes all of R ("recommended" and also any extras) available system-wide, and no permission problems arise. This of course does not stop you from setting up a special .Rprofile for each user, since this by definition lives in the user's home directory. Does this help? Or are there issues you haven't mentioned which make such an approach not feasible? Best wishes, Ted. ------------ E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 18-Jun-07 Time: 11:53:19 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R vs. Splus in Pharma/Devices Industry
On 16-Jun-07 01:12:18, Marc Schwartz wrote: > On Sat, 2007-06-16 at 00:43 +0100, [EMAIL PROTECTED] wrote: >> On 15-Jun-07 19:43:55, John Kane wrote: >> > --- Douglas Bates <[EMAIL PROTECTED]> wrote: >> >> On 6/15/07, Nicholas Lewin-Koh <[EMAIL PROTECTED]> >> >> wrote: >> >> > [...] >> >> I think you mean A, not Z. First there was S, then >> >> there was R. >> > >> > We're regressing >> >> But not to mediocrity! [1] >> Ted. >> >> [1] Practical Regression and Anova using R >> Julian J. Faraway (July 2002), page 15. >> >> http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf > > > Which in turn is of course paraphrasing Galton's "Regression Toward > Mediocrity in Hereditary Stature". Journal of the Anthropological > Institute 15 : 246-63, 1886. > > http://galton.org/essays/1880-1889/galton-1886-jaigi-regression-stature. > pdf > >:-) > > Regards, > > Marc Indeed! My reason for referencing it that way was to indicate obliquely that we've moved on a lot since Galton, especially in R :-) :-) [ including one left over from last time :-) ] Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 16-Jun-07 Time: 09:28:11 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] [OT] 'gv' and fractional points
On 15-Jun-07 20:08:05, Gabor Grothendieck wrote: > Check out the engauge digitizer: > > http://digitizer.sourceforge.net/ Thanks, Gabor. This looks useful, as far as it goes. However, it doesn't deal directly with PS, so one would have to work to pixel resolution of screenshots as displayed by say 'gv'. In the case of an example such as the Fig 4 I mentioned in a previous post, this would involve taking several screenshots to "tile" over the entirety of the graph, since the pixel accuracy of a screenshot of the entire graph would be rather poor. This would introduce the necessity of keeping track of points from screenshot to screenshot, and also each screenshot would have its own coordinate system so these would have to be reconciled before the final dataset could be compiled. But thanks -- this could well be worth a try in a different kind of context (e.g. I have some screenshots of old maps which I want to digitise ... ). Best wishes, Ted. -------- E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 16-Jun-07 Time: 01:13:53 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R vs. Splus in Pharma/Devices Industry
On 15-Jun-07 19:43:55, John Kane wrote: > --- Douglas Bates <[EMAIL PROTECTED]> wrote: >> On 6/15/07, Nicholas Lewin-Koh <[EMAIL PROTECTED]> >> wrote: >> > [...] >> I think you mean A, not Z. First there was S, then >> there was R. > > We're regressing But not to mediocrity! [1] Ted. [1] Practical Regression and Anova using R Julian J. Faraway (July 2002), page 15. http://cran.r-project.org/doc/contrib/Faraway-PRA.pdf ------------ E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 16-Jun-07 Time: 00:43:39 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] R vs. Splus in Pharma/Devices Industry
On 15-Jun-07 19:30:45, Douglas Bates wrote: > On 6/15/07, Nicholas Lewin-Koh <[EMAIL PROTECTED]> wrote: >> Hi, >> >> Probably when the statistical community is using Z big pharma will be >> ready to use >> R. %P > > I think you mean A, not Z. First there was S, then there was R. And, furthermore, A is further from R than Z is, which seems fitting. Ted. -------- E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 15-Jun-07 Time: 23:57:13 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] [OT] 'gv' and fractional points
On 15-Jun-07 15:56:58, hadley wickham wrote: > This doesn't answer your original question, and isn't much help unless > you're on a mac, but there's a nice looking program that makes this > kind of graph scraping really easy: > http://www.arizona-software.ch/applications/graphclick/en/ > > Hadley Thanks, Hadley! But (as you implicitly surmise) I don't use a Mac (just non-psychedelic Linux). However, as a follow-up, I've since found that one can (somewhat tediously) do what I was asking with the GIMP. If you start the GIMP on a PostScript file, you initially get a "Load PostScript" window which asks you to choose (amongst other things) the "resolution", initially "100". If you wind this up to say "1000", then you get 10 times the positional precision in the numbers shown as below. When the image is displayed, there is again a little window which gives you the GIMP coordinates of the mouse position. With increased "Resolution", the numerical values vary correspondingly more rapidly with position. The tedious aspect (compared with 'gv') is that to get better visual resolution you need to zoom the image. This enlarges the whole image, with the result that you have to pan around to locate different parts, and you can get lost in a graph with lots of widely-spread points. With 'gv', on the other hand, you can select any small part to view in zoom in a sub-window, which is much easier to cope with; and you also get a "rubber-band" rectangle which enables you to readily align points here with points there -- e.g. if you want to read off a curve the y-value corresponding to say x=5.0. However, this is at least a partial answer to my own question! Ted. 
E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 15-Jun-07 Time: 17:29:49 ---------- XFMail -- E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 15-Jun-07 Time: 20:32:59 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] [OT] 'gv' and fractional points
On 15-Jun-07 16:29:53, Ted Harding wrote: > [...] > However, as a follow-up, I've since found that one can (somewhat > tediously) do what I was asking with the GIMP. As well as the awkwardness of doing it the GIMP way, I've discovered another disadvantage. I'd previously tried it on a rather small image (175x70 points, = 2.43x0.97 inches). I then tried it on a full A4 page. Even at a GIMP "Scale" factor of 300, this leads to a 50MB temporary file being created. At 1000, this would rise to some 550MB, as I found out after this attempt filled up the limited spare space I have on the disk drive in question ... No doubt "Scale"=300 (as opposed to the default of 100) may be ample for most purposes, but the overhead is still unpleasant! Hence I'm once again hankering after something which will display a PS file as efficiently as 'gv', but will report the cursor position in fractions of a point! Best wishes to all, Ted. ------------ E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 15-Jun-07 Time: 19:18:48 -- XFMail -- -------- E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 15-Jun-07 Time: 20:33:19 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] [OT] 'gv' and fractional points
Hi Folks, This is off-topic R-wise, but it may be close to the heart of many R-users, so I think it may be the best place to ask! Users of 'gv' (the "front end" to ghostscript) will be aware of the little window which gives you the x-y coordinates (in points = 1/72 inch) of the position of the "cross-hair" mouse cursor. These coordinates are those of the corresponding position on the printed page, relative to some origin. I have often used this to extract numerical values for data from graphs in Postscript files (also PDF files, after you have converted them to PS). Then (veering back on topic ... ) you can submit the numerical data to R and try your own analyses on these data, and compare with what the article does. However, this little window only gives the numbers in whole points. Say a smallish graphic may print out 3 inches wide or high. Then you get precision of 1/216 per 3 inches or 0.4% of full scale. This can be adequate on many occasions, but can be on the coarse side on other occasions. Even for a 6-inch-wide/high graph, you only get down to 0.2% of full scale. If it were possible to induce 'gv' to display these coordinates in tenths of a point, then much greater precision (as adequate as one can expect to hope for when, in effect, "measuring off the graph") could be obtained. Does anyone know: a) Whether it is possible to persuade 'gv' to give this display in fractional points (my own search of the documentation has not revealed anything); b) Of any alternative to 'gv' as PS viewer which would provide this capability? With thanks, and best wishes to all, Ted. -------- E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 15-Jun-07 Time: 16:13:21 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Tools For Preparing Data For Analysis
As a tangent to this thread, there is a very relevant article in the latest issue of the RSS magazine "Significance", which I have just received: Dr Fisher's Casebook The trouble with data Significance, Vol 4 (2007) Issue 2. Full current contents at http://www.blackwell-synergy.com/toc/sign/4/2 but unfortunately you can only read any of it by paying money to Blackwell (unless you're an RSS member). Best wishes to all, Ted. -------- E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 14-Jun-07 Time: 12:24:46 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Responding to a posting in the digest
On 14-Jun-07 07:26:26, Moshe Olshansky wrote: > Is there a convenient way to respond to a particular > posting which is a part of the digest? > I mean something that will automatically quote the > original message, subject, etc. > > Thank you! > > Moshe Olshansky > [EMAIL PROTECTED] This will depend on two things. 1. Whether the mail software you use has the capability; 2. Whether the digest format would permit it anyway. Regarding (2), if you are receiving R-help in "traditional" digest format (all the messages, each with its principal headers, as one single long message-body), then the only way to respond to a particular message is to start to compose a new message and copy what you need from the digest. While I've never received R-help in digest format myself, according to Martin Maechler: http://finzi.psych.upenn.edu/R/Rhelp02a/archive/59429.html Please open the URL at the end of every message https://stat.ethz.ch/mailman/listinfo/r-help go to the bottom and "log in" -- clicking the [Unsubscribe or Edit Options] field. You need your mailing list password sooner or later. The one you get sent every 1st of the month; or you can have it sent to you again. Then you are in a page entitled "R-help Membership Configuration for @ Fax-to-email: +44 (0)870 094 0861 Date: 14-Jun-07 Time: 09:53:58 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Awk and Vilno
On 13-Jun-07 01:24:41, Robert Wilkins wrote: > In clinical trial data preparation and many other data situations, > the statistical programmer needs to merge and re-merge multiple > input files countless times. A syntax for merging files that is > clear and concise is very important for the statistical programmer's > productivity. > > Here is how Vilno does it: > > inlist dataset1 dataset2 dataset3 ; > joinby variable1 variable2 where ( var3<=var4 ) ; > [...] Thanks to Robert for this more explicit illustration of what Vilno does. Its potential usefulness is clear. I broadly agree with the comments that have been made about the various approaches (with/without awk/sed/R etc). Is there any URL that leads to a fuller explicit exposition of Vilno? As I said previously, a web-search on "vilno" leads to very little that is relevant. What I did find didn't amount to much. Best wishes, Ted. ------------ E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 13-Jun-07 Time: 10:00:12 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Generating artificial datasets with a specific correlati
On 12-Jun-07 20:54:05, Ken Knoblauch wrote: > see mvrnorm in MASS and especially the empirical argument > > James Milks wright.edu> writes: > > >> I need to create artificial datasets with specific correlation >> coefficients (i.e. a dataset that returns r = 0.30, etc.) as examples >> for a lab I am teaching this summer. Is there a way to do that in R? >> >> Thanks. >> >> Jim Milks Alternatively, if you would prefer your datasets to have non-normal distributions, consider the fact that if X and Y are independent, each with mean 0 and variance 1, then the correlation coefficient between (X + a*Y) and (X - a*Y) is (1 - a^2)/(1 + a^2) so if you choose a = sqrt((1 - r)/(1 + r)) then these will have correlation coefficient r. So generate X and Y as you please, and then continue as above. Best wishes, Ted. ------------ E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 13-Jun-07 Time: 00:35:59 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
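For example, the construction can be checked with (decidedly non-normal) uniform variates standardised to mean 0 and variance 1:

```r
## Target correlation r = 0.30 via the (X + a*Y, X - a*Y) construction.
r <- 0.30
a <- sqrt((1 - r) / (1 + r))

set.seed(1)
n <- 1e6
X <- (runif(n) - 0.5) * sqrt(12)   # uniform, rescaled to mean 0, variance 1
Y <- (runif(n) - 0.5) * sqrt(12)

cor(X + a * Y, X - a * Y)          # close to 0.30
```

The sample correlation only approximates r, of course; for a dataset with an exact sample r = 0.30, Ken's suggestion of mvrnorm(..., empirical = TRUE) is the one to use.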
Re: [R] Appropriate regression model for categorical variables
On 12-Jun-07 17:45:44, Tirthadeep wrote: > > Dear users, > In my psychometric test i have applied logistic regression > on my data. My data consists of 50 predictors (22 continuous > and 28 categorical) plus a binary response. > > Using glm(), stepAIC() i didn't get satisfactory result as > misclassification rate is too high. I think categorical > variables are responsible for this debacle. Some of them have > more than 6 level (one has 10 level). > > Please suggest some better regression model for this situation. > If possible you can suggest some article. I hope you have a very large number of cases in your data! The minimal complexity of the 28 categorical variables compatible with your description is 1 factor at 10 levels 2 factors at 7 levels 25 factors at 2 levels which corresponds to (2^25)*(7^2)*10 = 16441671680 ~= 1.6e10 distinct possible combinations of levels of the factors. Your true factors may have far more than this. Unless you have more cases than this in your data, you are likely to fall into what is called "linear separation", in which the logistic regression will find a perfect predictor for your binary outcome. This perfect predictor may well not be unique (indeed if you have only a few hundred cases there will probably be millions of them). Therefore your logistic regression is likely to be meaningless. I can only suggest that you consider very closely how to a) reduce the numbers of levels in some of your factors, by coalescing levels together; b) defining new factors in terms of the old so as to reduce the total number of factors (which may include ignoring some factors altogether) so that you end up with new categorical variables whose total number of possible combinations is much smaller (say at most 1/5) of the number of cases in your data. In summary: you have too many explanatory variables. Best wishes, Ted. 
-------- E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 13-Jun-07 Time: 00:23:49 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
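The combination count above can be reproduced directly in R:

```r
## One 10-level factor, two 7-level factors, twenty-five 2-level factors:
prod(c(rep(2, 25), 7, 7, 10))   # 16441671680, i.e. about 1.6e10
```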
Re: [R] Tools For Preparing Data For Analysis
On 10-Jun-07 19:27:50, Stephen Tucker wrote: > > Since R is supposed to be a complete programming language, > I wonder why these tools couldn't be implemented in R > (unless speed is the issue). Of course, it's a naive desire > to have a single language that does everything, but it seems > that R currently has most of the functions necessary to do > the type of data cleaning described. In principle that is certainly true. A couple of comments, though. 1. R's rich data structures are likely to be superfluous. Mostly, at the sanitisation stage, one is working with "flat" files (row & column). This straightforward format is often easier to handle using simple programs for the kind of basic filtering needed, rather than getting into the heavier programming constructs of R. 2. As follow-on and contrast at the same time, very often what should be a nice flat file with no rough edges is not. If there are variable numbers of fields per line, R will not handle it straightforwardly (you can force it in, but it's more elaborate). There are related issues as well. a) If someone entering data into an Excel table lets their cursor wander outside the row/col range of the table, this can cause invisible entities to be planted in the extraneous cells. When saved as a CSV, this file then has variable numbers of fields per line, and possibly also extra lines with arbitrary blank fields. cat datafile.csv | awk 'BEGIN{FS=","}{n=NF;print n}' will give you the numbers of fields in each line. If you further pipe it into | sort -nu you will get the distinct field-numbers. If you know (by now) how many fields there should be (e.g. 10), then cat datafile.csv | awk 'BEGIN{FS=","} (NF != 10){print NR ", " NF}' will tell you which lines have the wrong number of fields, and how many fields they have. You can similarly count how many lines there are (e.g. pipe into wc -l). b) People sometimes randomly use a blank space or a "." in a cell to denote a missing value. 
Consistent use of either is OK: ",," in a CSV will be treated as "NA" by R. The use of "." can be more problematic. If for instance you try to read the following CSV into R as a dataframe: 1,2,.,4 2,.,4,5 3,4,.,6 the "." in cols 2 and 3 is treated as the character ".", with the result that something complicated happens to the typing of the items. typeof(D[i,j]) is always integer. sum(D[1,1]) = 1, but sum(D[1,2]) gives a type-error, even though the entry is in fact 2. And so on, in various combinations. And as.matrix(D) is of course a matrix of characters. In fact, columns 2 and 3 of D are treated as factors! for(i in (1:3)){ for(j in (1:4)){ print( (D[i,j]))}} [1] 1 [1] 2 Levels: . 2 4 [1] . Levels: . 4 [1] 4 [1] 2 [1] . Levels: . 2 4 [1] 4 Levels: . 4 [1] 5 [1] 3 [1] 4 Levels: . 2 4 [1] . Levels: . 4 [1] 6 This is getting altogether too complicated for the job one wants to do! And it gets worse when people mix ",," and ",.,"! On the other hand, a simple brush with awk (or sed in this case) can sort it once and for all, without waking the sleeping dogs in R. I could go on. R undoubtedly has the power, but it can very quickly get over-complicated for simple jobs. Best wishes to all, Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 10-Jun-07 Time: 22:14:35 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
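That said, for this particular trap R does have a one-step remedy, if one knows to use it: the na.strings argument of read.csv(), plus count.fields() as a rough equivalent of the awk one-liner above. The temporary file here just holds the three-line example:

```r
## Recreate the problematic CSV from the example above.
csv <- tempfile(fileext = ".csv")
writeLines(c("1,2,.,4",
             "2,.,4,5",
             "3,4,.,6"), csv)

## count.fields() plays the role of the awk field-count check.
table(count.fields(csv, sep = ","))   # every line has 4 fields

## na.strings turns both "" and "." into NA at read time,
## so no column silently becomes a factor.
D <- read.csv(csv, header = FALSE, na.strings = c("", "."))
sapply(D, is.numeric)   # all TRUE
D[1, 3]                 # NA, not the character "."
```

This handles the mixed ",," and ",.," case in one pass, though it does nothing for the variable-fields-per-line problem, where the awk approach keeps its edge.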
Re: [R] Tools For Preparing Data For Analysis
On 10-Jun-07 14:04:44, Sarah Goslee wrote: > On 6/10/07, Ted Harding <[EMAIL PROTECTED]> wrote: > >> ... a lot of the problems with data >> files arise at the data gathering and entry stages, where people >> can behave as if stuffing unpaired socks and unattributed underwear >> randomly into a drawer, and then banging it shut. > > Not specifically R-related, but this would make a great fortune. > > Sarah > -- > Sarah Goslee > http://www.functionaldiversity.org I'm not going to object to that! Ted. -------- E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 10-Jun-07 Time: 21:18:45 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Tools For Preparing Data For Analysis
On 10-Jun-07 02:16:46, Gabor Grothendieck wrote:
> That can be elegantly handled in R through R's object
> oriented programming by defining a class for the fancy input.
> See this post:
> https://stat.ethz.ch/pipermail/r-help/2007-April/130912.html
> for a simple example of that style.
>
> On 6/9/07, Robert Wilkins <[EMAIL PROTECTED]> wrote:
>> Here are some examples of the type of data crunching you might
>> have to do.
>>
>> In response to the requests by Christophe Pallier and Martin Stevens.
>>
>> Before I started developing Vilno, some six years ago, I had
>> been working in the pharmaceuticals for eight years (it's not
>> easy to show you actual data though, because it's all confidential
>> of course).

I hadn't heard of Vilno before (except as a variant of "Vilnius").
And it seems remarkably hard to find info about it from a Google
search. The best I've come up with, searching on "vilno data",
is at http://www.xanga.com/datahelper

This is a blog site, apparently with postings by Robert Wilkins.
At the end of the Sunday, September 17, 2006 posting "Tedious
coding at the Pharmas" is a link, "I have created a new data
crunching programming language":

  http://www.my.opera.com/datahelper

which appears to be totally empty. In another blog article: "go to
the www.my.opera.com/datahelper site, go to the August 31 blog
article, and there you will find a tarball-file to download, called
vilnoAUG2006package.tgz" -- so again inaccessible; and a google on
"vilnoAUG2006package.tgz" gives a single hit which is simply the
same article.

In the Xanga blog there are a few examples of tasks which are no
big deal in any programming language (and, relative to their
simplicity, appear a bit cumbersome in "Vilno"). I've not seen in
the blog any instance of data transformation which could not be
quite easily done in any straightforward language (even awk).

>> Lab data can be especially messy, especially if one clinical
>> trial allows the physicians to use different labs.
So let's
>> consider lab data.
>> [...]

That's a fairly daunting description, though indeed not at all
extreme for the sort of data that can arise in practice (and not
just in pharmaceutical investigations). But the complexity is in
the situation and, whatever language you use, the writing of the
program will involve the writer getting to grips with the
complexity, and the complexity will be present in the code simply
because of the need to accommodate all the special cases,
exceptions and faults that have to be anticipated in "feral" data.
Once these have been anticipated and incorporated in the code, the
actual transformations are again no big deal.

Frankly, I haven't yet seen anything in "Vilno" that couldn't be
accommodated in an 'awk' program. Not that I'm advocating awk for
universal use (I'm not that monolithic about it). But I'm using it
as my favourite example of a flexible, capable, transparent and
efficient data filtering language, as far as it goes.

SO: where can one find out more about Vilno, to see what it may
really be capable of that cannot be done so easily in other ways?

(As is implicit in many comments in Robert's blog, and indeed also
from many postings to this list over time, and undoubtedly well
known to many of us in practice, a lot of the problems with data
files arise at the data gathering and entry stages, where people
can behave as if stuffing unpaired socks and unattributed underwear
randomly into a drawer, and then banging it shut.)

Best wishes to all, Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 10-Jun-07  Time: 09:28:10
-- XFMail --
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Tools For Preparing Data For Analysis
On 08-Jun-07 08:27:21, Christophe Pallier wrote:
> Hi,
>
> Can you provide examples of data formats that are problematic
> to read and clean with R ?
>
> The only problematic cases I have encountered were cases with
> multiline and/or varying length records (optional information).
> Then, it is sometimes a good idea to preprocess the data to
> present in a tabular format (one record per line).
>
> For this purpose, I use awk (e.g.
> http://www.vectorsite.net/tsawk.html),
> which is very adept at processing ascii data files (awk is
> much simpler to learn than perl, spss, sas, ...).

I want to join in with an enthusiastic "Me too!!". For anything
which has to do with basic checking for the kind of messes that
people can get data into when they "put it on the computer",
I think awk is ideal. It is very flexible (far more so than many,
even long-time, awk users suspect), very transparent in its
programming language (as opposed to, say, perl), fast, and with
light impact on system resources (a rare delight in these days,
when upgrading your software may require upgrading your hardware).

Although it may seem on the surface that awk is "two-dimensional"
in its view of data (line by line, and per field in a line), it
has some flexible internal data structures and recursive function
capability, which allows a lot more to be done with the data that
have been read in.

For example, I've used awk to trace ancestry through a genealogy,
given a data file where each line includes the identifier of an
individual and the identifiers of its male and female parents
(where known). And that was for pedigree dogs, where what happens
in real life makes Oedipus look trivial.

> I have never encountered a data file in ascii format that I
> could not reformat with Awk. With binary formats, it is
> another story...

But then it is a good idea to process the binary file using an
instance of the creating software, to produce an ASCII file (say
in CSV format).
> But, again, this is my limited experience; I would like to > know if there are situations where using SAS/SPSS is really > a better approach. The main thing often useful for data cleaning that awk does not have is any associated graphics. It is -- by design -- a line-by-line text-file processor. While, for instance, you could use awk to accumulate numerical histogram counts, you would have to use something else to display the histogram. And for scatter-plots there's probably not much point in bringing awk into the picture at all (unless a preliminary filtration of mess is needed anyway). That being said, though, there can still be a use to extract data fields from a file for submission to other software. Another kind of area where awk would not have much to offer is where, as a part of your preliminary data inspection, you want to inspect the results of some standard statistical analyses. As a final comment, utilities like awk can be used far more fruitfully on operating systems (the unixoid family) which incorporate at ground level the infrastructure for "plumbing" together streams of data output from different programs. Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 08-Jun-07 Time: 10:43:05 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
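[An aside on the "accumulate numerical histogram counts" remark above:
a minimal sketch of what that looks like in awk. The file name and the
bin width of 10 are invented for the illustration.]

```shell
# One value per line; this stands in for a cleaned single-column extract.
printf '3\n12\n17\n25\n8\n14\n' > values.txt

# Accumulate counts per bin in an awk associative array; the display
# itself would then be handed to other software (e.g. R).
awk '{ bin = 10*int($1/10); n[bin]++ }
     END { for (b in n) print b, n[b] }' values.txt | sort -n
```

The END block illustrates the point made above: awk happily holds the
accumulated counts internally, but drawing the histogram is a job for
something else.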
Re: [R] Spectral analysis
On 06-Jun-07 20:55:09, David LEDU wrote: > Hi all, > > I am dealing with paleoceanographic data and I have a C14 time serie > and one other variable. I would like to perform a spectral analysis > (fft or wavelet) and plot it. Unfortunately I don't know the exact > script to do this. Does anybody could send me an example to perform my > spectral analysis ? > > I Thank you > > David There are a lot of possible ramifications to your query, but for a basic spectral analysis of a series you can use the function spectrum() in the "stats" package. What is the role of the "other variable"? Best wishes, Ted. -------- E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 06-Jun-07 Time: 22:34:07 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] random numbers selection - simple example
On 06-Jun-07 14:30:44, Jenny Barnes wrote:
> Dear R-help,
>
> Which random number generator function would you recommend for
> simply picking 15 random numbers from the sequence 0-42? I want
> to use replacement (so that the same number could potentially be
> picked more than once).

R has the function sample() which samples a given number of items
from a given set, without replacement by default, but with
replacement if you specify this. Enter ?sample for more information.
In the above case

  sample((0:42), 15, replace=TRUE)

will do what you seem to describe above. Example:

  > sample((0:42), 15, replace=TRUE)
   [1] 26 38  1 41 11 30 22 37 28  0  0 25 10 39 27

if you want them in random order (i.e. "as they come off the line"),
or

  > sort(sample((0:42), 15, replace=TRUE))
   [1]  1  3  5  8  8 10 16 17 21 25 30 30 33 34 40

if you want them sorted.

Best wishes, Ted.

> I have read the R-help archives and the statistics and computing
> book on modern Applied statistics with S but the advice seems to
> be for much more complicated examples; there must be a simpler
> way for what I am trying to do?
>
> If anybody can help me I would greatly appreciate your advice
> and time,
>
> Best Wishes,
>
> Jenny

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 06-Jun-07  Time: 16:24:15
-- XFMail --
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] spatial simulation
On 02-Jun-07 01:06:48, Kitty Lee wrote: > Dear R-users, > > I'm trying to do some spatial simulation. I have two covariates, > Z and C. I want to examine their relationship under different > spatial distribution. > > I have no problem simulating completely spatial random process > but I'm totally stuck on poisson (cluster) pattern. I already > have a dataset with Z and C (obs=575) and I know the relationship > between them. Using these 575 cases, how can I simulate a clustered > process and have a dataset that has four columns: > > x-coor y-coor z c > > I know I can use rpois or pcp.sim to generate points that clustered > and then use cbind to attach Z and C values. But the problem is my > observations will not be spatially clustered. How can I simulate so > that now Z is spatially clustered? > > Thanks! > > K. Have a look at the package spatstat: Description: Spatial Point Pattern data analysis, modelling and simulation including multitype/marked points and spatial covariates It has a flexible repertoire of spatial point processes that you can simulate from. Best wishes, Ted. ------------ E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 02-Jun-07 Time: 08:33:17 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] Make sign test show test statistics
On 14-May-07 10:07:53, Johan A. Stenberg wrote:
> When I perform a two-tailed sign test with the following simple
> syntax,
>
> binom.test(59,100)
>
> R returns a P-value (0.088) but nothing else. As I want the
> result for a one-tailed test I take P/2 (= 0.044).

1: If you want a 1-sided P-value, use the appropriate option to
binom.test(). For example (assuming your Null Hypothesis is that
p = 0.5, which is the default, and your Alternative Hypothesis is
that p > 0.5):

  binom.test(x=59, n=100, alternative="greater")

        Exact binomial test

  data:  59 and 100
  number of successes = 59, number of trials = 100,
  p-value = 0.04431
  alternative hypothesis: true probability of success is
  greater than 0.5
  95 percent confidence interval:
   0.5028918 1.0000000
  sample estimates:
  probability of success
                    0.59

(which confirms that your "divide by two" strategy was about right
in this case, though that will not always be true; and certainly
will not be appropriate for the opposite Alternative Hypothesis
that p < 0.5).

> However, the journal to which I've submitted my results requests
> the test statistics, not just the P-values. How can I make R
> return the test statistics?

2: Have a word (et borgerlig ord, as the Danes say -- sorry, I do
not have the same familiarity with Swedish, but maybe you say the
same) with the journal editor. The plain and simple fact is that
the data (x = 59 out of n = 100) constitute the test statistic.
In the conditional context that n = 100, the test statistic is
simply x = 59. However, one should strictly speaking always
include a statement about n, no matter how one expresses the test
statistic.

For instance, you might use t = x/n as test statistic (with
expectation p = 0.5 and SD sqrt(p*(1-p)/n) in your case). While
its expectation now depends only on the NH (p = 0.5), it cannot
be evaluated as a test without knowledge of n, since that is
needed to get the SD.

So R (see above) has indeed returned the "test statistic", and
the journal is not being reasonable!
Hoping this helps, Ted. ------------ E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 14-May-07 Time: 11:57:19 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
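[A cross-check on the one-sided p-value quoted above: the exact tail sum
P(X >= 59) for X ~ Binomial(100, 1/2) can be computed directly. This is a
sketch in Python rather than R, using only the standard library.]

```python
from math import comb

# P(X >= 59) when X ~ Binomial(n = 100, p = 0.5): the exact one-sided
# p-value that binom.test(59, 100, alternative="greater") reports.
n, x = 100, 59
p_value = sum(comb(n, k) for k in range(x, n + 1)) / 2**n
print(round(p_value, 5))
```

The sum over the upper tail is exactly the "test statistic plus n"
point made above: x and n together determine everything.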
Re: [R] power 2x3 exact test
On 10-May-07 15:07:10, Thomas Lumley wrote: > On Thu, 10 May 2007, [EMAIL PROTECTED] wrote: >> >> Given that you expect some cells to be small, it should not >> be a severe task to draw up a list of (a1,b1) values which >> correspond to rejection of the null hypothesis (that both >> ORs equal 1), and then the simulation using different values >> of the two odds-ratios will give you the power for each such >> pair of odds-ratios. >> >> The main technical difficulty will be simulation of random >> tables, conditional on the marginals, with the probabilities >> as given above. >> >> I don't know of a good suggestion for this. > > r2dtable(). Thanks for pointing this out! (And if I had astutely done help.search("marginal") I would have found it). > If this is a power calculation, though, you probably want to > fix only one margin, which is a much simpler problem, That is probably a fair point (real-life situations where both margins are objectively fixed are probably sparse, to the point where it may almost be worth collecting them as an "exhibition"). But then Bingshan Li is faced with the issue that the power (given the odds-ratio) is then dependent on one of the cell probabilities as a "nuisance parameter". This of course can be eliminated by conditioning on the total number of successes over the dimension whose marginals are fixed; but then we are in effect back to the Fisher Exact Test, whose power is a function of the odds-ratio alone. I'm still with Duncan Murdoch, in that what's meant by "power" in this case depends very much on what you mean by "alternative hypothesis". Best wishes to all, Ted. E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 10-May-07 Time: 18:01:13 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] Allocating shelf space
Hi Folks,

This is not an R question as such, though it may well have an R
answer. (And, in any case, this community probably knows more
about most things than most others ... indeed, has probably
pondered this very question.)

I: Given a "catalogue" of hundreds of books, where each "entry"
has author and title (or equivalent ID), and also

Ia) The dimensions (thickness, height, depth) of the book
Ib) A sort of classification of its subject/type/genre

II: Given also a specification of available and possibly potential
bookshelf space (numbers of book-cases, the width, height and
shelf-spacing of each, and the dimensions of any free wall-space
where further book-cases may be placed), where some book-cases
have fixed shelves and some have shelves with (discretely)
adjustable position, and additional book-cases can be designed to
measure (probably with adjustable shelves).

Question: Is there a resource to approach the solution of the
problem of optimising the placement of adjustable shelves, the
design of additional bookcases, and the placement of the books in
the resulting shelf-space, so as to

A: Make efficient use of space
B: Minimise the spatial dislocation of related books (it is
acceptable to separate large books from small books on the same
subject, for the sake of efficient packing).

Awaiting comments and suggestions with interest!

With thanks, Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 09-May-07  Time: 18:23:53
-- XFMail --
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] power 2x3 exact test
On 09-May-07 22:00:27, Duncan Murdoch wrote:
> On 09/05/2007 5:11 PM, Bingshan Li wrote:
> > Hi, all,
> >
> > I am wondering if there is an algorithm for calculating the
> > power of a 2x3 table using the Fisher exact test. There is one
> > (algorithm 280) for the 2x2 Fisher exact test but I couldn't
> > find one for the 2x3 table. If we are not lucky enough to have
> > one, is there any other way to calculate the exact power of a
> > 2x3 table? The reason why I want exact power is because some
> > cells are assumed to be very small and the chi square
> > approximation is not valid.
>
> I think there are lots of possible alternatives to the null in a
> 2x3 table, so you may have trouble finding a single answer to
> this question.
> But assuming you have one in mind, I'd suggest doing a Monte
> Carlo power calculation: simulate a few thousand tables from the
> alternative distribution, and see what the distribution of
> p-values looks like.
>
> Duncan Murdoch

I'd back Duncan on that point! More specifically, for the 2x2
table, the table, conditional on the marginals, is a function of
one element (say the top left-hand corner), and the probability of
any table depends on the single parameter which is the odds-ratio
of the 4 cell probabilities. So this case is relatively easy and
straightforward and, indeed, for the 2x2 table R's fisher.test()
allows you to specify the odds-ratio as a "null" parameter.

This is not the case with fisher.test() for a larger (say 2x3)
table, so to investigate that case you cannot use fisher.test().
In all cases, however (according to the FORTRAN code on which it
is based -- see the reference in "?fisher.test"), the rejection
region for the exact fisher.test() consists of those tables with
the smallest probabilities.
For the 2x3 table, say (cell counts with margins, and
probabilities):

  a1 b1 c1 | d1      p1 q1 r1
  a2 b2 c2 | d2      p2 q2 r2
  --------
   a  b  c | n

so that

  a1+b1+c1 = d1,  a2+b2+c2 = d2,
  a1+a2 = a,  b1+b2 = b,  c1+c2 = c

the table is a function of any two functionally independent cells
(say a1 and b1), and its probability is a function of some two
odds-ratios, say

  (p1*r2)/(r1*p2)  and  (q1*r2)/(r1*q2)

which, for the standard null hypothesis, are both equal to 1.
However, as Duncan says, alternatives are 2-dimensional and so
there is not a unique natural form for an alternative (as opposed
to the 2x2 case, where it boils down to (p1*q2)/(p2*q1) being not
equal to 1, therefore greater than 1, or less than 1, or 2-sidedly
either >1 or <1).

The probability of the 2x3 table is proportional to

  ((p1*r2)/(r1*p2))^a1 * ((q1*r2)/(r1*q2))^b1

(or equivalent), divided by the product of the factorials of
a1, b1, c1, a2, b2, c2, subject to summing to 1 over all
combinations of (a1,b1) giving rise to a table compatible with
the marginal constraints.

Given that you expect some cells to be small, it should not be a
severe task to draw up a list of (a1,b1) values which correspond
to rejection of the null hypothesis (that both ORs equal 1), and
then the simulation using different values of the two odds-ratios
will give you the power for each such pair of odds-ratios.

The main technical difficulty will be simulation of random tables,
conditional on the marginals, with the probabilities as given
above. I don't know of a good suggestion for this. The
fisher.test() function will not help (see above). In the case of
the 2x2 table, it is a straightforward hypergeometric
distribution, but 2x3 is not straightforward. Maybe someone has
written a function for this kind of application, and can point it
out to us??? A quick R-site search did not help!

Best wishes, Ted.
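[Since the conditional distribution above has only one free pair (a1, b1),
the exact test itself can be brute-forced by enumeration when counts are
small. A sketch in Python with the standard library only; the function
name is my own invention, and it implements the "smallest probabilities"
rejection rule described above, not any existing library API.]

```python
from math import comb

def fisher_exact_2x3(table):
    """Exact p-value for a 2x3 table: sum the conditional probabilities
    (given all margins) of every table no more probable than the observed."""
    (a1, b1, c1), (a2, b2, c2) = table
    a, b, c = a1 + a2, b1 + b2, c1 + c2   # column margins
    d1 = a1 + b1 + c1                     # first row margin
    n = a + b + c                         # grand total

    def pr(x, y):
        # probability of the table with first row (x, y, d1-x-y),
        # conditional on all margins (multivariate hypergeometric);
        # math.comb returns 0 when the top index is exceeded
        z = d1 - x - y
        if z < 0:
            return 0.0
        return comb(a, x) * comb(b, y) * comb(c, z) / comb(n, d1)

    p_obs = pr(a1, b1)
    return sum(p
               for x in range(a + 1) for y in range(b + 1)
               if (p := pr(x, y)) <= p_obs + 1e-12)
```

A degenerate third column reduces this to the 2x2 Fisher test, which
gives a handy check: the "lady tasting tea" table [[3,1,0],[1,3,0]]
yields the familiar 34/70.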
E-Mail: (Ted Harding) <[EMAIL PROTECTED]> Fax-to-email: +44 (0)870 094 0861 Date: 10-May-07 Time: 00:12:29 -- XFMail -- __ R-help@stat.math.ethz.ch mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] A function for raising a matrix to a power?
[Apologies: This will probably break the thread, but at the moment
I cannot receive mail since my remote mail-server is down, and so
am responding to the message as found on the R-help archives;
hence this is not a "Reply"]

From: Atte Tenkanen attenka at utu.fi Sun May 6 11:07:07 CEST 2007

> Is there a function for raising a matrix to a power? For example
> if you like to compute A%*%A%*%A, is there an abbreviation
> similar to A^3?
>
> Atte Tenkanen
>
> > A=rbind(c(1,1),c(-1,-2))
> > A
>      [,1] [,2]
> [1,]    1    1
> [2,]   -1   -2
> > A^3
>      [,1] [,2]
> [1,]    1    1
> [2,]   -1   -8
>
> But:
>
> > A%*%A%*%A
>      [,1] [,2]
> [1,]    1    2
> [2,]   -2   -5

Of course, "^" alone acts element-wise on the matrix A, so the
result of A^3 is the matrix B where B[i,j] = A[i,j]^3. However,
you can write your own, and call it for instance "%^%":

  "%^%" <- function(A,n){
    if(n==1) A else {B<-A; for(i in (2:n)){A<-A%*%B}}
    A
  }

Then:

  > A
       [,1] [,2]
  [1,]    1    1
  [2,]   -1   -2
  > A%^%1
       [,1] [,2]
  [1,]    1    1
  [2,]   -1   -2
  > A%^%2
       [,1] [,2]
  [1,]    0   -1
  [2,]    1    3
  > A%^%3
       [,1] [,2]
  [1,]    1    2
  [2,]   -2   -5
  > A%^%100
                [,1]          [,2]
  [1,] -1.353019e+20 -3.542248e+20
  [2,]  3.542248e+20  9.273727e+20

Hoping this helps! Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 06-May-07  Time: 11:35:31
-- XFMail --
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
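[An aside, not from the thread: the %^% loop above performs n-1
multiplications, whereas exponentiation by squaring needs only O(log n).
A sketch in Python with plain nested lists; all helper names are my own.]

```python
def mat_mul(A, B):
    # plain triple-loop matrix product on nested lists
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def mat_pow(A, n):
    # A to the n-th power (n >= 0) by binary exponentiation:
    # square-and-multiply, O(log n) matrix products in total
    size = len(A)
    result = [[int(i == j) for j in range(size)] for i in range(size)]
    while n > 0:
        if n & 1:
            result = mat_mul(result, A)
        A = mat_mul(A, A)
        n >>= 1
    return result

A = [[1, 1], [-1, -2]]
print(mat_pow(A, 3))   # same as A %*% A %*% A in the post
```

For n = 100 this takes about 7 squarings plus a few extra products,
against 99 products for the simple loop.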
Re: [R] Creating contingency table from mixed data
var1: 1
Var3: 3
Var4: 1
$Count
[1] 1

$Mean
Var2 Var5
  12  152

but this format is not very convenient for incorporating into a
contingency table such as the one shown above (obtained by hand).
Probably others can find a way to convert the above output from CT
into a contingency table. However, unless you have a lot of these
to do, it may be quicker to do one, or a few, by hand!

Hoping this helps, Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 06-May-07  Time: 02:57:12
-- XFMail --
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] how to reproduce the same sampled units?
On 02-May-07 23:25:18, Santanu Pramanik wrote:
> Hi all,
>
> Is it possible to generate the same sample a number of times
> in R? In SAS, using the option "seed" it is possible to
> reproduce exactly the same sample. Is there any such feature
> in R which I can use?
>
> For further clarity,
>
> for (i in 1:2)
> {
>   samp = sample(1:1000,100,replace = FALSE)
>   print(samp)
> }
>
> For the above simulation, is it possible to generate the same
> sampled units twice?
>
> Any suggestion would be gratefully acknowledged.
> Thanks,
> Santanu

Have a look at set.seed():

  ?set.seed

You will of course have to invoke it before you do your first run.
You cannot find out what the seed was afterwards! For example
(your example somewhat reduced):

  for (i in 1:2)
  {
    samp = sample(1:1000,10,replace = FALSE)
    print(samp)
  }
  # [1] 984 587  26 920 304 247 434 392 650 279
  # [1] 178 619 458 208 389 629 988 263 598 308

  for (i in 1:2)
  {
    set.seed(512534)
    samp = sample(1:1000,10,replace = FALSE)
    print(samp)
  }
  # [1] 271 106 570 621 257 663 399 454 983 805
  # [1] 271 106 570 621 257 663 399 454 983 805

Best wishes, Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 03-May-07  Time: 00:56:43
-- XFMail --
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] Simulation using parts of density function
On 02-May-07 07:45:48, Prof Brian Ripley wrote:
> Please do not send everything twice: you are using R-help in
> both the To: and Cc: fields.
>
> I disagree with Ted: it _is_ much easier to create a generator
> for this purpose.
>
> Consider
>
> rtgamma <- function(n, ..., tr = log(5000000))
> {
>     p <- pgamma(tr, ...)
>     qgamma(p*runif(n), ...)
> }
>
> as inversion (especially at C level) is plenty fast enough.

Of course ... !! Just to explain Brian's solution above: since
pgamma(rgamma(...), ...) is uniformly distributed on (0,1), if
rgamma is truncated to (0,tr) then pgamma(rgamma) will be
truncated to (0, pgamma(tr)), and hence uniformly distributed on
this range. So qgamma applied to pgamma(tr)*runif(n) yields draws
from the truncated distribution.

Best wishes, Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 02-May-07  Time: 09:16:08
-- XFMail --
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
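[The same inversion trick, sketched in Python with the standard
library's NormalDist standing in for pgamma/qgamma; the function name
and its parameters are illustrative, not an existing API. Here the
truncation is to (-inf, tr) rather than (0, tr), but the mechanism is
identical: map a uniform on (0, F(tr)) back through the quantile
function.]

```python
import random
from statistics import NormalDist

def r_truncated(n, tr, dist=NormalDist(0.0, 1.0), rng=random):
    # inversion: if U ~ Uniform(0, F(tr)), then F^{-1}(U) is
    # distributed as X conditioned on X <= tr
    p = dist.cdf(tr)                  # mass below the truncation point
    return [dist.inv_cdf(p * rng.random()) for _ in range(n)]

random.seed(1)
samples = r_truncated(1000, tr=0.5)
print(max(samples))                   # never exceeds tr
```

Every draw is accepted, which is why this beats rejection sampling
when the quantile function is cheap.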
Re: [R] Simulation using parts of density function
On 01-May-07 17:03:46, Thür Brigitte wrote:
>
> Hi
>
> My simulation with the following R code works perfectly:
>
> sim <- replicate(999, sum(exp(rgamma(rpois(1,2000),
>                  scale = 0.5, shape = 12))))
>
> But now I do not want to have values in object "sim" exceeding
> 5'000'000, that means that I am just using the beginning of
> the density function gamma, x < 15.4. Is there a possibility
> to modify my code in an easy way?
>
> Thanks for any help!
>
> Regards, Brigitte

A somewhat extreme problem! The easiest way to modify the code is
as below -- certainly easier than writing a special function to
draw random samples from the truncated gamma distribution. A bit
of experimentation shows that, from your code above, about 10% of
the results are <= 5000000. So:

  sim <- NULL
  remain <- 999
  while(remain > 0){
    sim0 <- replicate(10*remain,
              sum(exp(rgamma(rpois(1,2000), scale = 0.5, shape = 12))))
    sim <- c(sim, sim0[sim0 <= 5000000])
    remain <- (999 - length(sim))
  }
  sim <- sim[1:999]

Results of a run:

  sum(sim > 5000000)
  [1] 0
  max(sim)
  [1] 4999696
  length(sim)
  [1] 999

It may be on the slow side (though not hugely so -- on a quite
slow machine the above run was completed in 2min 5sec, while the
999-replicate in your original took 15sec, so about 8 times as
long). Most of this, of course, is taken up with the first round.

Hoping this helps, Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 01-May-07  Time: 19:18:01
-- XFMail --
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
Re: [R] to draw a smooth arc
This thread prompts me to ask about something I've been pondering
for a while, as to whether there's an implementation somewhere
tucked away in the R resources.

So far, people have been responding to the original query in terms
of increasing the number of points, and joining these by lines.
However, if you're using PostScript output, you can draw really
smooth curves by exploiting PS's "curveto" operator. This draws a
cubic-curve segment in the following way.

The two points you want to join with a curve will be denoted by
(X0,Y0) and (X3,Y3) in the following (for reasons which will
appear). The PS command is of the form

  x1 y1 x2 y2 X3 Y3 curveto

At (X0,Y0) the tangent to the curve (as it departs from (X0,Y0))
is in the direction of the directed line from (X0,Y0) to (x1,y1),
and at (X3,Y3) the tangent to the curve (as it arrives) is in the
direction of the directed line from (x2,y2) to (X3,Y3). The
location of (X0,Y0) is not part of the command, since it is
implicit in the PS "currentpoint", which is the starting point of
the curve.

The result is (in theory, and in practice to within the resolution
of the output device) a perfectly smooth curve, provided the
consecutive cubic segments have the same tangent at each of the
points being joined. This can be achieved by appropriate choice of
the "intermediate" points -- (x1,y1), (x2,y2) above.

So far, when I've done this myself (including when using the
output from R to give the points being joined), I've done the
computation of the "intermediate" points by hand. This basically
involves deciding, at each of the points being joined, what the
tangent to the smooth curve should be. Of course, there is an
element of arbitrariness in this, unless there is an analytic
representation of the curve on which the points lie (e.g. you're
plotting sin(x)/x every pi/8, and want to join the points
smoothly), when all you need is the derivatives at the points.
Crudely, you might evaluate the direction at a point in terms of a
weighted average of the directions to its two immediate neighbours
(the nearer neighbour gets the greater weight); less crudely, you
might fit a quadratic through the point and its 2 neighbours and
use the gradient at the middle point; and so on. Once you've
decided on the tangent at each point, it's then straightforward to
compute suitable "intermediate points" to serve as (x1,y1) and
(x2,y2).

(One application where this sort of approach is needed is in
joining computed points on iso-contours, where the individual
points have been determined by interpolation of spot-measurements
at nearby measuring stations.)

Anyway. The Question: is there a general function for the above
kind of smooth curve-drawing?

With thanks, Ted.

E-Mail: (Ted Harding) <[EMAIL PROTECTED]>
Fax-to-email: +44 (0)870 094 0861
Date: 01-May-07  Time: 14:50:38
-- XFMail --
__
R-help@stat.math.ethz.ch mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
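[The recipe above -- choose a tangent at each point, then place the two
"curveto" control points along those tangents -- can be sketched as
follows. This is Python rather than PostScript; the fraction k = 1/3 of
the chord length is a common heuristic, and all names here are my own
inventions, not an existing API.]

```python
from math import hypot

def control_points(p0, p3, t0, t3, k=1/3):
    # Given endpoints p0, p3 and tangent directions t0 (departing) and
    # t3 (arriving), return the (x1, y1) and (x2, y2) that would appear
    # in a PostScript "x1 y1 x2 y2 X3 Y3 curveto".
    chord = hypot(p3[0] - p0[0], p3[1] - p0[1])
    n0, n3 = hypot(*t0), hypot(*t3)
    p1 = (p0[0] + k * chord * t0[0] / n0, p0[1] + k * chord * t0[1] / n0)
    p2 = (p3[0] - k * chord * t3[0] / n3, p3[1] - k * chord * t3[1] / n3)
    return p1, p2

def bezier(p0, p1, p2, p3, t):
    # the cubic Bezier curve that "curveto" draws, evaluated at t in [0,1]
    u = 1 - t
    return tuple(u**3 * w + 3 * u**2 * t * x + 3 * u * t**2 * y + t**3 * z
                 for w, x, y, z in zip(p0, p1, p2, p3))

# join (0,0) to (1,0), departing at 45 degrees up, arriving at 45 down
p1, p2 = control_points((0, 0), (1, 0), t0=(1, 1), t3=(1, -1))
```

Because consecutive segments share the tangent direction at each join,
the pieces meet with matching first derivatives, which is exactly the
smoothness condition described above.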