Re: [R] How to read a file containing two types of rows - (for the Netflix challenge data format)
Hi All,

Thanks so much for your inputs, it's so nice to have such a helpful community. I wrote some kind of mix between the different replies; I copy my final code below.

All the best,
Emmanuel

mat = read.csv("~/format_test.csv", fill=TRUE, header=FALSE, as.is=TRUE)
first.col.idx = grep(":", mat[,1])
first.col.val = grep(":", mat[,1], value=TRUE)
first.col.val = gsub(pattern=":", replacement="", first.col.val)
mat.clean = mat[-first.col.idx,]
reps = diff(c(first.col.idx, nrow(mat) + 1))
mat.final = cbind(rep(first.col.val, reps - 1), mat.clean)

On Fri, 31 Jan 2020 at 20:31, Berry, Charles wrote:
> I'd be tempted to use pipe() to separately read the ID lines and the value
> lines, but in R you can do this:
>
> fc <- count.fields("yourfile.txt", sep = ",")
> rlines <- split(readLines("yourfile.txt"), fc)
> mat <- cbind(rlines[[1]],
>              do.call(rbind, strsplit(rlines[[2]], ",")))
>
> This assumes that there are exactly 1 or 3 fields in each row of
> "yourfile.txt"; if not, some incantation of grepl() applied to the text of
> readLines() should suffice.
>
> HTH,
> Chuck

[[alternative HTML version deleted]]

______________________________________________
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
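As an aside, Chuck's closing remark about grepl() can be made concrete. The following is a sketch, not from the thread, which keys on the fact that ID lines end in ":" ("yourfile.txt" is a hypothetical file in the Netflix format):

txt   <- readLines("yourfile.txt")
is.id <- grepl(":$", txt)                       # ID lines are those ending in ':'
ids   <- sub(":$", "", txt[is.id])
## number of value rows following each ID line
reps  <- diff(c(which(is.id), length(txt) + 1)) - 1
vals  <- do.call(rbind, strsplit(txt[!is.id], ","))
mat   <- cbind(ID = rep(ids, reps), vals)

Unlike the count.fields() approach, this does not assume the value rows have exactly 3 fields.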
[R] How to read a file containing two types of rows - (for the Netflix challenge data format)
Hi,

I'd like to use the Netflix challenge data and just can't figure out how to efficiently "scan" the files.
https://www.kaggle.com/netflix-inc/netflix-prize-data

The files have two types of row, either an *ID*, e.g., "1:", "2:", etc., or 3 values associated with each ID. The format is as follows:

*1:*
value1,value2,value3
value1,value2,value3
value1,value2,value3
value1,value2,value3
*2:*
value1,value2,value3
value1,value2,value3
*3:*
value1,value2,value3
value1,value2,value3
value1,value2,value3
*4:*
etc ...

And I want to create a matrix where each line is of the form:

ID value1,value2,value3

So "ID" needs to be duplicated. I could write a Perl script to convert this format to CSV, but I'm sure there's a simple R trick.

Thanks for suggestions!
Emmanuel
Re: [R] Adding a column to an **empty** data.frame
Dear Bill,

Thanks a lot, this is what I was looking for.

Emmanuel

On 2 November 2016 at 15:59, William Dunlap <wdun...@tibco.com> wrote:
> Give your new column a vector type -- NULL cannot be extended beyond length
> 0. Also, the syntax df$col <- NULL means to remove 'col' from 'df'.
>
> > df <- data.frame(id=integer(0), data=numeric(0))
> > df$new.col <- character(0)
> > str(df)
> 'data.frame': 0 obs. of 3 variables:
>  $ id     : int
>  $ data   : num
>  $ new.col: chr
>
> (Think of 'zero-row' or 'zero-column' data.frames, not 'empty'
> data.frames, which could be either.)
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
[R] Adding a column to an **empty** data.frame
Dear All,

This sounds simple but I can't figure out a good way to do it. Let's say that I have an empty data frame "df":

## creates the df
df = data.frame(id=1, data=2)

## empties the df -- perhaps there is a more elegant way to create an empty df?
df = df[-c(1),]

> df
[1] id   data
<0 rows> (or 0-length row.names)

Now, how can I add a third column name to that empty df? Normally I would use df$new.col = 1, but here I can't use this because the data frame is empty! I tried df$new.col = NULL -- this doesn't give an error but doesn't add the column.

Thanks for your help,

Emmanuel
Re: [R] combining columns into a combination index of the same length
Thanks Thierry, you made my day :)

On 21 July 2015 at 17:00, Thierry Onkelinx <thierry.onkel...@inbo.be> wrote:
> Please always keep the mailing list in cc.
>
> If mat is a data.frame, then you can use do.call. Then the number of columns doesn't matter.
>
> do.call(paste, mtcars[, c("mpg", "cyl")])
> do.call(paste, mtcars[, c("mpg", "cyl", "disp")])
> do.call(paste, mtcars)
>
> ir. Thierry Onkelinx
> Instituut voor natuur- en bosonderzoek / Research Institute for Nature and Forest
> team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
> Kliniekstraat 25, 1070 Anderlecht, Belgium
>
> To call in the statistician after the experiment is done may be no more than asking him to perform a post-mortem examination: he may be able to say what the experiment died of. ~ Sir Ronald Aylmer Fisher
> The plural of anecdote is not data. ~ Roger Brinner
> The combination of some data and an aching desire for an answer does not ensure that a reasonable answer can be extracted from a given body of data. ~ John Tukey
>
> 2015-07-21 15:43 GMT+02:00 Emmanuel Levy <emmanuel.l...@gmail.com>:
>> Thanks! -- this is indeed much faster (plus I made a mistake, one has to use paste with the option collapse=""). The thing is I'm looking for a solution *without paste*. The reason is that *there may be two or more columns*.
>>
>> On 21 July 2015 at 16:32, Thierry Onkelinx <thierry.onkel...@inbo.be> wrote:
>>> Yes, paste0() can work on vectors. So paste0(mat[, "col1"], mat[, "col2"])
[R] combining columns into a combination index of the same length
Hi,

The answer to this is probably straightforward. I have a dataframe and I'd like to build an index of column combinations, e.g.

col1 col2 --> col3 (the index I need)
A    1       1
A    1       1
A    2       2
B    1       3
B    2       4
B    2       4

At the moment I use:

col3 <- apply(mat[, sel.col], 1, paste0)

But I wonder if another approach could be faster?

Thanks,

Emmanuel
[R] Retrieve indexes of the first occurrence of numbers in an effective manner
Hi,

That sounds simple but I cannot think of a really fast way of getting the following: c(1,1,2,2,3,3,4,4) would give c(1,3,5,7), i.e., a function that returns the indexes of the first occurrences of numbers. Note that the numbers may come in any order, e.g., c(3,4,1,2,1,1,2,3,5), can be very large, and the vectors are also very large (which prohibits any loop).

The best I could think of is:

tmp = c(1,1,2,2,3,3,4,4)
u = unique(tmp)
sapply(u, function(x){ which(!is.na(match(tmp, x)))[1] })

But there might be a useful function I don't know ...

Thanks for any hint.

All the best,

Emmanuel
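For what it's worth, base R's duplicated() gives a vectorised, loop-free answer to this (a sketch, not from the thread):

x <- c(3, 4, 1, 2, 1, 1, 2, 3, 5)
## duplicated() flags repeated values, so its complement flags first occurrences
which(!duplicated(x))    # 1 2 3 4 9
## equivalently, in the order returned by unique():
match(unique(x), x)      # 1 2 3 4 9

Both run in roughly linear time, so they scale to very large vectors where the sapply/match approach would be quadratic.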
[R] Finding (swapped) repetitions of numbers pairs across two columns
Hi,

I've had this problem for a while and tackled it in a quite dirty way, so I'm wondering if a better solution exists. If we have two vectors:

v1 = c(0,1,2,3,4)
v2 = c(5,3,2,1,0)

how do we remove one instance of the 3,1 / 1,3 double? At the moment I'm using the following solution, which is quite horrible:

v1 = c(0,1,2,3,4)
v2 = c(5,3,2,1,0)
ft <- cbind(v1, v2)
direction = apply(ft, 1, function(x) return(x[1] > x[2]))
ft.tmp = ft
ft[which(direction), 1] = ft.tmp[which(direction), 2]
ft[which(direction), 2] = ft.tmp[which(direction), 1]
uniques = apply(ft, 1, function(x) paste(x, collapse="%"))
uniques = unique(uniques)
ft.unique = matrix(unlist(strsplit(uniques, "%")), ncol=2, byrow=TRUE)

Any better solution would be very welcome!

All the best,

Emmanuel
Re: [R] Finding (swapped) repetitions of numbers pairs across two columns
I did not know that unique() worked on entire rows! That is great, thank you very much!

Emmanuel

On 27 December 2012 22:39, Marc Schwartz <marc_schwa...@me.com> wrote:
> unique(t(apply(cbind(v1, v2), 1, sort)))
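A closely related trick (not from the thread): if you'd rather keep the rows in their original, unsorted form and only drop the later member of each swapped pair, duplicated() also works row-wise on the sorted matrix -- a sketch:

v1 <- c(0, 1, 2, 3, 4)
v2 <- c(5, 3, 2, 1, 0)
ft <- cbind(v1, v2)
## sort each pair so that (1,3) and (3,1) compare equal, then flag repeats
dup <- duplicated(t(apply(ft, 1, sort)))
ft[!dup, ]   # rows 1,2,3,5 survive; the swapped repeat (3,1) is dropped

Unlike unique(t(apply(...))), this returns the surviving rows with their original orientation and order.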
[R] How to re-order clusters of hclust output?
Hello,

The heatmap function conveniently has a reorder.dendrogram function so that clusters follow a certain logic. It seems that the hclust function doesn't have such a feature. I can use the reorder function on the dendrogram obtained from hclust, but this does not modify the hclust object itself.

I understand that the answer should be within the heatmap function, given that it uses hclust to work, but I find it very hard to follow actually. For example, I just don't get what 1L and 2L are.

Any help would be appreciated!

Thanks,

Emmanuel
Re: [R] How to re-order clusters of hclust output?
Hi Michael,

Thanks for the info! I think the key function I was missing is order.dendrogram, and not REorder.dendrogram. It returns the new order, so I think I should get going with that :)

Emmanuel

On 11/05/2012, R. Michael Weylandt <michael.weyla...@gmail.com> wrote:
> I don't have a general answer to your question, but 1L and 2L are just the
> integers 1 and 2 (the L makes them integers instead of doubles, which is
> useful for some things).
>
> Michael
[R] How to flatten a multidimensional array into a dataframe?
Hi,

I have a three-dimensional array, e.g.,

my.array = array(0, dim=c(2,3,4),
                 dimnames=list(d1=c("A1","A2"),
                               d2=c("B1","B2","B3"),
                               d3=c("C1","C2","C3","C4")))

What I would like to get is then a dataframe:

d1 d2 d3 value
A1 B1 C1 0
A2 B1 C1 0
.
.
.
A2 B3 C4 0

I'm sure there is one function to do this transformation, I just don't know which one.

Thanks for your help,

Emmanuel
Re: [R] How to flatten a multidimensional array into a dataframe?
OK, it seems that the array2df() function from the arrayhelpers package does the job :)
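For the record, this can also be done in base R without an extra package: a named array coerces to a table, and as.data.frame.table() melts it into exactly this long format (a sketch; responseName simply renames the default "Freq" column):

my.array <- array(0, dim = c(2, 3, 4),
                  dimnames = list(d1 = c("A1", "A2"),
                                  d2 = c("B1", "B2", "B3"),
                                  d3 = c("C1", "C2", "C3", "C4")))
df <- as.data.frame(as.table(my.array), responseName = "value")
head(df, 2)   # d1 d2 d3 value:  A1 B1 C1 0 / A2 B1 C1 0
dim(df)       # 24 rows, 4 columns -- one row per array cell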
Re: [R] Idea/package to linearize a curve along the diagonal?
Dear David and Jeff,

> Only if you were going to apply some sort of transformation that did not extend globally

Exactly -- this is why the LPCM package is great, as it assigns points to parts of a curve. I think I pretty much got what I need. It is not perfect yet, but it should be enough to give you an idea of what I was trying to achieve.

All the best,
Emmanuel

### Example ###
library(LPCM)

tmp = rnorm(2000)
X.1 = 5 + tmp
Y.1 = 5 + (5*tmp + rnorm(2000))
tmp = rnorm(1000)
X.2 = 9 + tmp
Y.2 = 40 + (1.5*tmp + rnorm(1000))
X.3 = 7 + 0.5*runif(500)
Y.3 = 15 + 20*runif(500)
Y = c(X.1, X.2, X.3)
X = c(Y.1, Y.2, Y.3)

lpc1 = lpc(cbind(X,Y), scaled=FALSE, h=c(1,1), control=lpc.control(boundary=1))
my.proj = lpc.spline(lpc1, project=TRUE, optimize=TRUE)

data = cbind(dist = my.proj$closest.pi,
             X1 = lpc1$data[,1], Y1 = lpc1$data[,2],
             Xo = my.proj$closest.coords[,1], Yo = my.proj$closest.coords[,2])

transfoData = matrix(apply(data, 1, function(x) {
    return(transfo((5 + x[1])/10, x[2], x[3], x[4], x[5]))
}), ncol=2, byrow=TRUE)

plot(transfoData) ## This shows the result I'm looking for -- not perfect yet, but it gives an idea.

###
### Moves a point from its position to the new normalized position
###
transfo = function(dist, X1, Y1, X0, Y0) {
    ## First, the point needs to be rotated
    trans = newCoord(X1, Y1, X0, Y0)
    Xnew = X1 + trans[1]
    Ynew = Y1 + trans[2]
    ## Second, it is taken onto the diagonal
    Xfinal = dist
    Yfinal = dist
    X.TransToDiag = Xfinal - X0
    Y.TransToDiag = Yfinal - Y0
    return(c(Xnew + X.TransToDiag, Ynew + Y.TransToDiag))
}

## Rotates a point X1,Y1 relative to Xo,Yo.
## The new point is either at 3pi/4 or 7pi/4, i.e., 90 degrees left or
## right of the diagonal.
newCoord = function(X1, Y1, Xo=0, Yo=0) {
    ## First calculate the coordinates of the point relative to Xo,Yo
    Xr = X1 - Xo
    Yr = Y1 - Yo
    ## Now calculate the new coordinates, i.e., if V is the vector defined
    ## from Xo,Yo to X1,Y1, the new coordinates are such that Xf,Yf are at
    ## angle TETA -- by default TETA = 3*pi/4, or 135 degrees
    To = atan2(Yr, Xr)
    ## XXX This is not perfect but will do the job for now
    if (Yr > Xr) { TETA = 3*pi/4 } else { TETA = 7*pi/4 }
    Xn = Xr * (cos(TETA)/cos(To))
    Yn = Yr * (sin(TETA)/sin(To))
    ## Xn, Yn are the new coordinates relative to Xo, Yo.
    ## However, for the translation I need absolute coordinates,
    ## given by Xo + Xn and Yo + Yn
    Xabs = Xo + Xn
    Yabs = Yo + Yn
    ## The translations that need to be applied to X1 and Y1 are thus:
    Xtrans = Xabs - X1
    Ytrans = Yabs - Y1
    return(c(Xtrans, Ytrans))
}

On 12 March 2012 20:58, David Winsemius <dwinsem...@comcast.net> wrote:
> The first part sort of makes sense to me... maybe. You are thinking of some
> sort of local transformation that converts a curve to a straight line by
> rotation or deformation. Seems like a problem of finding a transformation
> of a scalar field. But you then want it extended outward to affect the
> regions at some distance from the curve. That's where it might break down,
> or at least becomes non-trivial, because a region at a distance could be in
> the sights of the normal vector to the curve (not in the statistical sense
> but in the vector-field sense) of more than one segment of the curve. Only
> if you were going to apply some sort of transformation that did not extend
> globally would you be able to do anything other than a y|x (y conditional
> on x) shift or expansion/contraction.
Re: [R] Idea/package to linearize a curve along the diagonal?
Hi Jeff,

Thanks for your reply and the example. I'm not sure if it could be applied to the problem I'm facing though, for two reasons:

(i) my understanding is that the inverse will associate a new Y coordinate given an absolute X coordinate. However, in the case I'm working on, the transformation that has to be applied depends on X *and* on its position relative to the *normal* of the fitted curve. This means, for instance, that both X and Y will change after transformation.

(ii) the fitted curve can be described by a spline, but I'm not sure if inverses of such models can be inferred automatically (I don't know anything about that).

The procedure I envision is the following: treat the curve segment by segment, apply rotation+translation to each segment to bring it onto the diagonal, and apply the same transformation to all points corresponding to the same segment (i.e., these are the points that are close and within the normal area covered by the segment). Does this make sense?

All the best,

Emmanuel

On 12 March 2012 02:15, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
> It is possible that I do not see what you mean, but it seems like the
> following code does what you want. The general version of this is likely
> to be rather more difficult to do, but in theory the inverse function
> seems like what you are trying to accomplish.
>
> x <- 1:20
> y <- x^2 + rnorm(20)
> y.lm <- lm( y ~ I(x^2) + x )
> plot( x, y )
> lines( x, predict( y.lm ) )
> qa <- coef(y.lm)["I(x^2)"]
> qb <- coef(y.lm)["x"]
> qc <- coef(y.lm)["(Intercept)"]
> # define inverse of known model
> f1 <- function( y ) { ( sqrt( 4*qa*( y - qc ) + qb^2 ) - qb ) / ( 2*qa ) }
> f2 <- function( y ) { -( sqrt( 4*qa*( y - qc ) + qb^2 ) + qb ) / ( 2*qa ) }
> plot( x, f1( y ) )
>
> Jeff Newmiller
> Sent from my phone. Please excuse my brevity.
>
> Emmanuel Levy <emmanuel.l...@gmail.com> wrote:
>> Dear Jeff, I'm sorry but I do not see what you mean. It'd be great if you
>> could let me know in more detail what you mean whenever you can.
>> Thanks, Emmanuel
>>
>> On 12 March 2012 00:07, Jeff Newmiller <jdnew...@dcn.davis.ca.us> wrote:
>>> Aren't you just reinventing the inverse of a function?
Re: [R] Which non-parametric regression would allow fitting this type of data? (example given).
Hello Bert,

Thanks so much for these suggestions. They led me to the package LPCM, which I found worked best with minimum tuning:

lpc1 = lpc(cbind(X,Y), scaled=TRUE, h=c(0.05,0.05))
plot(lpc1)

... et voila!

All the best,

Emmanuel

On 11 March 2012 00:37, Bert Gunter <gunter.ber...@gene.com> wrote:
> Thanks for the example. Have you tried fitting a principal curve via
> either the princurve or pcurve packages? I think this might work for what
> you want, but no guarantees.
>
> Note that loess, splines, etc. are all fitting y|x, that is, a
> nonparametric regression of y on x. That is not what you say you want, so
> these approaches are unlikely to work.
>
> -- Bert
>
> On Sat, Mar 10, 2012 at 6:20 PM, Emmanuel Levy <emmanuel.l...@gmail.com> wrote:
>> Hi, I'm wondering which function would allow fitting this type of data:
>>
>> tmp = rnorm(2000)
>> X.1 = 5 + tmp
>> Y.1 = 5 + (5*tmp + rnorm(2000))
>> tmp = rnorm(100)
>> X.2 = 9 + tmp
>> Y.2 = 40 + (1.5*tmp + rnorm(100))
>> X.3 = 7 + 0.5*runif(500)
>> Y.3 = 15 + 20*runif(500)
>> X = c(X.1, X.2, X.3)
>> Y = c(Y.1, Y.2, Y.3)
>> plot(X, Y)
>>
>> The problem with loess is that distances for the goodness of fit are
>> calculated on the Y-axis. However, distances would need to be calculated
>> on the normals of the fitted curve. Is there a function that provides
>> this option? A simple trick in that case consists in swapping X and Y,
>> but I'm wondering if there is a more general solution?
>>
>> Thanks for your input,
>> Emmanuel
>
> -- Bert Gunter, Genentech Nonclinical Biostatistics
[R] Idea/package to linearize a curve along the diagonal?
Hi,

I am trying to normalize some data. First I fitted a principal curve (using the LPCM package), but now I would like to apply a transformation so that the curve becomes a straight diagonal line on the plot. The data used to fit the curve would then be normalized by applying the same transformation to it.

A simple solution could be to apply translations only (e.g., as done after a fit using loess), but here rotations would have to be applied as well. One could visualize this as the stretching of a curve, i.e., during such an unfolding process, both translations and rotations would need to be applied.

Before I embark on this (which would require me remembering long-forgotten geometry principles), I was wondering if you can think of packages or tricks that could make my life simpler?

Thanks for any input,

Emmanuel

Below I provide an example - the black curve is to be brought along the diagonal, and all data points normal to a small segment (of the black curve) would undergo the same transformation as it - what "small" is remains to be defined though.

tmp = rnorm(2000)
X.1 = 5 + tmp
Y.1 = 5 + (5*tmp + rnorm(2000))
tmp = rnorm(1000)
X.2 = 9 + tmp
Y.2 = 40 + (1.5*tmp + rnorm(1000))
X.3 = 7 + 0.5*runif(500)
Y.3 = 15 + 20*runif(500)
X = c(X.1, X.2, X.3)
Y = c(Y.1, Y.2, Y.3)

lpc1 = lpc(cbind(X,Y), scaled=FALSE, h=c(1,1))
plot(lpc1)
[R] How to fit a line through the Mountain crest, i.e., through the highest density of points - in a loess-like fashion.
Hi,

I'm trying to normalize data by fitting a line through the highest density of points (in a 2D plot). In other words, if you visualize the data as a density plot, the fit I'm trying to achieve is the line that goes through the crest of the mountain.

This is similar yet different to what LOESS does. I've been using loess before, but it does not do exactly that, as it takes into account all points. Although points farther from the fit have a smaller weight, they result in the fit being a bit off the crest.

Do you know a package or maybe even an option in loess that would allow me to achieve this? Any advice or idea appreciated.

Emmanuel
[R] How to improve the robustness of loess? - example included.
Hi,

I posted a message earlier entitled "How to fit a line through the Mountain crest ...". I figured loess is probably the best way, but it seems that the problem is the robustness of the fit. Below I paste an example to illustrate the problem:

tmp = rnorm(2000)
X.background = 5 + tmp
Y.background = 5 + (10*tmp + rnorm(2000))
X.specific = 3.5 + 3*runif(1000)
Y.specific = 5 + 120*runif(1000)
X = c(X.background, X.specific)
Y = c(Y.background, Y.specific)
MINx = range(X)[1]; MAXx = range(X)[2]

my.loess = loess(Y ~ X, data.frame(X=X, Y=Y), family="symmetric", degree=2, span=0.1)
lo.pred = predict(my.loess, data.frame(X=seq(MINx, MAXx, length=100)), se=TRUE)

plot(seq(MINx, MAXx, length=100), lo.pred$fit, lwd=2, col=2, type="l")
points(X, Y, col=grey(abs(my.loess$res)/max(abs(my.loess$res))))

As you will see, the red line does not follow the background signal; however, when decreasing the specific signal to 500 points it becomes perfect. I'm sure there is a way to tune the fitting so that it works, but I can't figure out how. Importantly, *I cannot increase the span* because in reality the relationship I'm looking at is more complex, so I need a small span value to allow for a close fit.

I foresee that changing the weighting is the way to go, but I do not really understand how the weights option is used (I tried to change it and nothing happened), and the embedded tricubic weighting does not seem changeable either. So any idea would be very welcome.

Emmanuel
Re: [R] How to fit a line through the Mountain crest, i.e., through the highest density of points - in a loess-like fashion.
Hi,

Thanks a lot for your reply - I posted a second message with a dummy example, entitled "How to improve the robustness of loess? - example included". I need to fit a curve, which makes it a bit difficult to work with kde2d alone. I'm actually trying to use kde2d in combination with loess - basically, I give the output density of kde2d as weights in the loess function. It seems to give nice results :)

In my second post I wrote that the weights option did not work, but that's because I was writing "weigth" - I'm not sure why I did not get an error message. I'll post the lines of code as a reply to the second post.

All the best,

Emmanuel

On 10 March 2012 19:46, David Winsemius dwinsem...@comcast.net wrote:

On Mar 10, 2012, at 3:55 PM, Emmanuel Levy wrote:

Hi, I'm trying to normalize data by fitting a line through the highest density of points (in a 2D plot). In other words, if you visualize the data as a density plot, the fit I'm trying to achieve is the line that goes through the crest of the mountain.

Are you familiar with the kde2d or bkde2D functions in various packages? If you then collected the max density for each X and Y, you might want to see whether that 2-d function would follow a sufficiently regular path that would represent the projection of the ridge on the z=0 plane.

This is similar yet different to what LOESS does.

Do you want a curve or a line?

I've been using loess before, but it does not do exactly that, as it takes all points into account. Although points farther from the fit have a smaller weight, they result in the fit being a bit off the crest. Do you know a package or maybe even an option in loess that would allow me to achieve this?

I don't. I happen to have a dataset where I could test it. But you are likely to get better responses if you provide a test case.

Any advice or idea appreciated. Emmanuel

[[alternative HTML version deleted]]

Plain text is preferred.
-- David Winsemius, MD West Hartford, CT
Re: [R] How to improve the robustness of loess? - example included.
Ok, so this seems to work :)

tmp = rnorm(2000)
X.background = 5 + tmp
Y.background = 5 + (10*tmp + rnorm(2000))
X.specific = 3.5 + 3*runif(3000)
Y.specific = 5 + 120*runif(3000)
X = c(X.background, X.specific)
Y = c(Y.background, Y.specific)
MINx = range(X)[1]
MAXx = range(X)[2]
MINy = range(Y)[1]
MAXy = range(Y)[2]

## estimate the density for each datapoint
nBins = 50
my.lims = c(range(X, na.rm=TRUE), range(Y, na.rm=TRUE))
z1 = kde2d(X, Y, n=nBins, lims=my.lims,
           h = c( (my.lims[2]-my.lims[1])/(nBins/4), (my.lims[4]-my.lims[3])/(nBins/4) ))
X.cut = cut(X, seq(z1$x[1], z1$x[nBins], len=(nBins+1)))
Y.cut = cut(Y, seq(z1$y[1], z1$y[nBins], len=(nBins+1)))
xy.cuts = data.frame(X.cut, Y.cut, ord=1:(length(X.cut)))
density = data.frame( X=rep(factor(levels(X.cut)), nBins),
                      Y=rep(factor(levels(Y.cut)), rep(nBins, nBins)),
                      Z=as.vector(z1$z))
xy.density = merge(xy.cuts, density, by=c(1,2), sort=FALSE, all.x=TRUE)
xy.density = xy.density[order(xy.density$ord),]

### now use the density as a weight
my.loess = loess(Y ~ X, data.frame(X = X, Y = Y), family="symmetric", degree=2, span=0.1, weights=xy.density$Z^3)
lo.pred = predict(my.loess, data.frame(X = seq(MINx, MAXx, length=100)), se=TRUE)
plot(seq(MINx, MAXx, length=100), lo.pred$fit, lwd=2, col=2, type="l")
points(X, Y, pch=".", col=grey(abs(my.loess$res)/max(abs(my.loess$res))))

On 10 March 2012 18:30, Emmanuel Levy emmanuel.l...@gmail.com wrote:

Hi, I posted a message earlier entitled "How to fit a line through the Mountain crest ...". I figured loess is probably the best way, but it seems that the problem is the robustness of the fit.
[R] Which non-parametric regression would allow fitting this type of data? (example given).
Hi,

I'm wondering which function would allow fitting this type of data:

tmp = rnorm(2000)
X.1 = 5 + tmp
Y.1 = 5 + (5*tmp + rnorm(2000))
tmp = rnorm(100)
X.2 = 9 + tmp
Y.2 = 40 + (1.5*tmp + rnorm(100))
X.3 = 7 + 0.5*runif(500)
Y.3 = 15 + 20*runif(500)
X = c(X.1, X.2, X.3)
Y = c(Y.1, Y.2, Y.3)
plot(X, Y)

The problem with loess is that distances for the goodness of fit are calculated along the Y-axis. However, the distances would need to be calculated along the normals of the fitted curve. Is there a function that provides this option? A simple trick in this case consists of swapping X and Y, but I'm wondering if there is a more general solution?

Thanks for your input,

Emmanuel
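For the linear case, measuring distances along the normals is total least squares (orthogonal regression), which the first principal component gives directly in base R. This is only a sketch for a straight line, not a curve (the curve analogue is a principal curve, e.g. in the princurve package, which follows the same orthogonal-distance idea):

```r
# Orthogonal regression via PCA: the first principal component minimizes
# the sum of squared perpendicular distances to the fitted line.
set.seed(1)
x <- rnorm(200)
y <- 2 * x + rnorm(200, sd = 0.5)          # true slope 2, noise in y
p <- prcomp(cbind(x, y))
slope     <- p$rotation[2, 1] / p$rotation[1, 1]
intercept <- p$center[2] - slope * p$center[1]
# the fitted line passes through the centroid of the data
```

Note that when the noise is only in Y (as simulated here), the orthogonal-distance slope is slightly steeper than the ordinary least-squares slope; which one is appropriate depends on the error model.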
[R] Best HMM package to generate random (protein) sequences?
Dear All,

I would like to generate random protein sequences using an HMM. Has anybody done that before, or would you have an idea which package is likely to be best for that? The important facts are that the HMM will be fitted on ~3 million sequential observations, with 20 different emitted symbols (one per amino acid). I guess that 2-5 hidden states should be enough, and I would say an order of 3 would also be enough.

Thanks a lot for any suggestion!

All the best,

Emmanuel
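Fitting aside (packages such as HMM or depmixS4 handle that), generating sequences from a given HMM needs only base R. A minimal sketch, with made-up transition and emission matrices standing in for fitted parameters:

```r
set.seed(1)
n.states <- 2
alphabet <- c("A","R","N","D","C","Q","E","G","H","I",
              "L","K","M","F","P","S","T","W","Y","V")   # 20 amino acids
# assumed (made-up) parameters -- in practice these come from the fitted model
A <- matrix(c(0.9, 0.1,
              0.2, 0.8), nrow = 2, byrow = TRUE)         # state transitions
B <- rbind(rep(1/20, 20),                                # emission probabilities
           c(rep(0.09, 10), rep(0.01, 10)))              # rows sum to 1
len <- 100
s <- integer(len); s[1] <- 1
for (t in 2:len) s[t] <- sample(n.states, 1, prob = A[s[t - 1], ])
protein <- paste(sapply(s, function(st) sample(alphabet, 1, prob = B[st, ])),
                 collapse = "")
```

A first-order chain is shown; a 3rd-order model would condition each draw on the previous three states instead of one.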
[R] How to do a probability density based filtering in 2D?
Hello,

This sounds like a problem to which many solutions should exist, but I did not manage to find one. Basically, given a list of datapoints, I'd like to keep those within the X% highest-density region. That would be equivalent to retaining only the points within a given line of a contour plot.

Thanks to anybody who could let me know which function I could use!

Best,

Emmanuel
Re: [R] How to do a probability density based filtering in 2D?
Hello David,

I thought about this at first as well, e.g.,

x1.lim = quantile(x1, prob=c(0.05,0.95))
y2.lim = quantile(y2, prob=c(0.05,0.95))
x1.sub = x1[ x1 > x1.lim[1] & x1 < x1.lim[2] & y2 > y2.lim[1] & y2 < y2.lim[2] ]
y2.sub = y2[ x1 > x1.lim[1] & x1 < x1.lim[2] & y2 > y2.lim[1] & y2 < y2.lim[2] ]

But this actually does not do what I want, because it is not based on the density of the data. What I would like is to keep only the points within an area where the density of points is over x. In other words, if you imagine a contour plot, I'd like to keep all the points within a given contour line. So the data has to be interpolated or smoothed at some point. I am using the kde2d function of the MASS package to plot contour lines, but I don't know how to retrieve subsets of what I plot. Also, I wouldn't be surprised if there is a trick with quantile that escapes my mind.

Thanks for your help,

Emmanuel

On 19 November 2010 21:25, David Winsemius dwinsem...@comcast.net wrote:

On Nov 19, 2010, at 8:44 PM, Emmanuel Levy wrote:

Hello, This sounds like a problem to which many solutions should exist, but I did not manage to find one. Basically, given a list of datapoints, I'd like to keep those within the X% highest-density region. That would be equivalent to retaining only points within a given line of a contour plot. Thanks to anybody who could let me know which function I could use!

What's wrong with quantile?

-- David Winsemius, MD West Hartford, CT
Re: [R] How to do a probability density based filtering in 2D?
Hello Roger,

Thanks for the suggestions. I finally managed to do it using the output of kde2d - the code is pasted below. Actually, this made me realize that the outcome of kde2d can be quite influenced by outliers if a boundary box is not given (try running the code without the boundary box, i.e., without lims, and you will see).

library(MASS)
x1 = rnorm(10000)
x2 = rnorm(10000)
nBins = 200
z1 = kde2d(x1, x2, n=nBins, lims=c(-4,4,-4,4))
x1.cut = cut(x1, seq(z1$x[1], z1$x[nBins], len=(nBins+1)))
x2.cut = cut(x2, seq(z1$y[1], z1$y[nBins], len=(nBins+1)))
xy.cuts = data.frame(x1.cut, x2.cut, ord=1:(length(x1.cut)))
density = data.frame( X=rep(factor(levels(x1.cut)), nBins),
                      Y=rep(factor(levels(x2.cut)), rep(nBins, nBins)),
                      Z=as.vector(z1$z))
xy.density = merge(xy.cuts, density, by=c(1,2), sort=FALSE, all.x=TRUE)
xy.density = xy.density[order(xy.density$ord),]
xy.density$Z[is.na(xy.density$Z)] = 0
Z.lim = quantile(xy.density$Z, prob=c(0.1))
to.plot = xy.density$Z > Z.lim[1]
par(mfrow=c(1,2))
plot(x1, x2, xlim=c(-3,3), ylim=c(-3,3))
plot(x1[to.plot], x2[to.plot], xlim=c(-3,3), ylim=c(-3,3))

On 19 November 2010 21:52, Roger Koenker rkoen...@illinois.edu wrote:

Also I wouldn't be surprised if there is a trick with quantile that escapes my mind.

I would be surprised... this is all very dependent on how you do the density estimation, but you might look at contourLines and then try to use in.convex.hull() from tripack, but beware that things get more complicated if the density is multi-modal. You might also look at one of the packages that do data depth sorts of things.

Roger Koenker rkoen...@illinois.edu
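The cut/merge bookkeeping above can also be done more directly by indexing the kde2d grid with findInterval. A compact sketch of the same idea (thresholding at the 10% density quantile, as above):

```r
library(MASS)   # for kde2d
set.seed(1)
x <- rnorm(5000); y <- rnorm(5000)
k <- kde2d(x, y, n = 100, lims = c(-4, 4, -4, 4))
# grid cell of each point, then the estimated density at that cell
ix <- findInterval(x, k$x, all.inside = TRUE)
iy <- findInterval(y, k$y, all.inside = TRUE)
dens <- k$z[cbind(ix, iy)]
keep <- dens >= quantile(dens, 0.10)   # drop the ~10% lowest-density points
```

The matrix-index subscript k$z[cbind(ix, iy)] avoids the factor-level merge entirely.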
[R] problem with PDF/postscript, cannot change paper size: ‘mode(width)’ and ‘mode(height)’ differ between new and previous
Hi,

The pdf function would not let me change the paper size and gives me the following warning:

pdf("figure.pdf", width=6, height=10)
Warning message:
‘mode(width)’ and ‘mode(height)’ differ between new and previous
 ==> NOT changing ‘width’ & ‘height’

If I use the option paper="a4r", it does not give a warning, but it still prints on a square region (it does not use the entire page). Two days ago I updated R and associated packages; I'm not sure if this could be the cause? Even if I start a new session I keep getting the same warning. I also tried X11.options(reset=TRUE) and it does not help.

Thanks for any hint you could give me - this is really annoying (the postscript function gives the same warning, so I cannot make any figure!).

all the best,

Emmanuel

sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics grDevices utils datasets methods base
Re: [R] problem with PDF/postscript, cannot change paper size: ‘mode(width)’ and ‘mode(height)’ differ between new and previous
Update - sorry for the stupid question, let's say it's pretty late. For those who may be as tired as I am and get the same warning: the width and height must be given as numbers, not as character strings!
[R] Random sampling while keeping distribution of nearest neighbor distances constant.
Dear All,

I cannot find a solution to the following problem, although I imagine that it is a classic, hence my email.

I have a vector V of X values, each between 1 and N. I would like to get random samples of X values, also between 1 and N, but the important point is: *I would like to keep the same distribution of distances between the X values.*

For example, let's say N=10 and I have V = c(3,4,5,6). Then the random values could be 1,2,3,4 or 2,3,4,5 or 3,4,5,6 or 4,5,6,7, etc., so that the distribution of distances (3 - 4, 3 - 5, 3 - 6, 4 - 5, 4 - 6, etc.) is kept constant.

I couldn't find a package that helps me with this, but it looks like a classic problem, so there should be something!

Many thanks in advance for any help or hint you could provide,

All the best,

Emmanuel
Re: [R] Random sampling while keeping distribution of nearest ne
Thanks for your suggestion Ted,

This would indeed work for the particular example I gave, but I am looking for a general solution. For example, if my values are V = c(2,4,5,6), then there would be two possibilities: 2,4,5,6 or 4,5,6,8. More generally, what I mean is that the matrix of distances between pairs of values in V should be preserved in the vector of random values. Note that in practice, N is around 7,000,000 and X = length(V) may vary between 20,000 and 500,000.

It'd be great if you could point me to the name of this class of problem, to a book, or to a package that could help me solve it. Many thanks!

Emmanuel

PS: I apologize for sending a second post. The first did not appear in my R-help label, so I assumed it wasn't sent for some reason.

2009/8/12 Ted Harding ted.hard...@manchester.ac.uk:

If I've understood you right, you are basically putting a sequence with given spacings in a random position amongst the available positions.
In your example, you would randomly choose between 1,2,3,4 / 2,3,4,5 / 3,4,5,6 / 4,5,6,7 / 5,6,7,8 / 6,7,8,9 / 7,8,9,10. Hence a result Y could be:

A <- min(V)
L <- max(V) - A + 1
M <- 0:(N - L)
Y <- 1 + (V - A) + sample(M, 1)

I think this does it!

E-Mail: (Ted Harding) ted.hard...@manchester.ac.uk Fax-to-email: +44 (0)870 094 0861 Date: 12-Aug-09 Time: 23:49:22 -- XFMail --
Re: [R] Random sampling while keeping distribution of nearest neighbor distances constant.
Dear Daniel,

Thanks a lot for your suggestion. It is helpful and got me thinking more about it, so let me rephrase: given a vector V containing X values between 1 and N, I'd like to sample values so that the *distribution* of distances between the X values is similar. There are several such distributions:
- the 1st-order distribution would be given by the function diff;
- the 2nd-order distribution would be given by diff(V[seq(1,length(V),by=2)]) and diff(V[seq(2,length(V),by=2)]);
- the 3rd-order distribution by diff(V[seq(1,length(V),by=3)]), diff(V[seq(2,length(V),by=3)]) and diff(V[seq(3,length(V),by=3)]);
- and so on for the 4th order.

I would like to produce different samples where the first, or first and second, or first and second and third, or up to say five orders of distance distributions are reproduced. Is anybody aware of a formalism explained in a book that could help me deal with this problem? Or, even better, of a package?

Thanks for your help,

Emmanuel

2009/8/12 Nordlund, Dan (DSHS/RDA) nord...@dshs.wa.gov:

-----Original Message----- From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On Behalf Of Emmanuel Levy Sent: Wednesday, August 12, 2009 3:05 PM To: r-h...@stat.math.ethz.ch Cc: dev djomson Subject: [R] Random sampling while keeping distribution of nearest neighbor distances constant.
Emmanuel,

I don't know if this is a classic problem or not. But given your description, you could write your own function, something like this:

sample.dist <- function(vec, Min=1, Max=10){
  diffs <- c(0, diff(vec))
  sum_d <- sum(diffs)
  sample(Min:(Max - sum_d), 1) + cumsum(diffs)
}

where Min and Max are the minimum and maximum values that you are sampling from (Min=1 and Max=10 in your example), and vec is passed the vector that you are sampling distances from. This assumes that your vector is sorted smallest to largest, as in your example. The function could be changed to accommodate a vector that isn't sorted.

V <- sort(sample(1:100, 4))
V
#[1] 46 78 82 95
sample.dist(V, Min=1, Max=100)
#[1] 36 68 72 85
sample.dist(V, Min=1, Max=100)
#[1] 12 44 48 61

This should get you started at least. Hope this is helpful,

Dan

Daniel J. Nordlund Washington State Department of Social and Health Services Planning, Performance, and Accountability Research and Data Analysis Division Olympia, WA 98504-5204
Re: [R] Random sampling while keeping distribution of nearest neighbor distances constant.
But if the 1st order differences are the same, then doesn't it follow that the 2nd, 3rd, ... order differences must be the same between the original and the new random vector? What am I missing?

You are missing nothing - sorry, I wrote something wrong. What I would like to be preserved is the distance to the *nearest* neighbor, so diff is not the way to go. If you only consider the nearest neighbor, then c(3,4,8,9) and c(4,5,6,7) are the same in terms of first order (all nearest neighbors are 1 unit away) but not in terms of second order.

Also, I don't know if there would be a simple way to maintain a *distribution* of distances (even if not of nearest neighbors). For example, c(2,4,5,6) could be c(1,3,4,5) or c(3,5,6,7), as proposed by your solution, but it could also be c(4,5,6,8). Or c(2,3,6,7,8) could be c(2,3,4,7,8).

Actually, that's really simple! I can simply resample the diff vector! OK, so the only remaining problem is the 1st, 2nd, 3rd order thing, but you made me realize that I can skip it for the moment.

Thank you! :-)

Emmanuel
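The "resample the diff vector" idea can be sketched as follows (assuming, as in the examples above, a sorted V with values between 1 and N): permute the multiset of neighbor spacings, then place the resulting sequence at a random offset so it stays within 1..N.

```r
set.seed(1)
V <- c(2, 3, 6, 7, 8)
N <- 10
d <- sample(diff(V))               # permute the spacings between neighbors
span <- sum(d)                     # total extent of the new vector
start <- sample(1:(N - span), 1)   # random placement so values stay in 1..N
V.new <- start + c(0, cumsum(d))
```

This preserves the multiset of first-order spacings exactly, while their order and the overall position are randomized.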
[R] Is it normal that normalize.loess does not tolerate a single NA value?
Dear all,

I have been using normalize.loess and I get the following error message when my matrix contains NA values:

my.mat = matrix(nrow=100, ncol=4, runif(400))
my.mat[1,1] = NA
my.mat.n = normalize.loess(my.mat, verbose=TRUE)
Done with 1 vs 2 in iteration 1
Done with 1 vs 3 in iteration 1
Done with 1 vs 4 in iteration 1
Done with 2 vs 3 in iteration 1
Done with 2 vs 4 in iteration 1
Done with 3 vs 4 in iteration 1
1 0.317319
Warning messages:
1: In means[, j] + aux : longer object length is not a multiple of shorter object length
2: In means[, k] - aux : longer object length is not a multiple of shorter object length
...

Is that normal? If not, do you have any suggestion to avoid it? I'm scared that this introduces abnormalities in the normalization.

Thanks for your help,

Emmanuel

sessionInfo()
R version 2.8.0 (2008-10-20)
x86_64-pc-linux-gnu

locale:
LC_CTYPE=en_CA.UTF-8;LC_NUMERIC=C;LC_TIME=en_CA.UTF-8;LC_COLLATE=en_CA.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_CA.UTF-8;LC_PAPER=en_CA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_CA.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] splines tools stats graphics grDevices utils datasets
[8] methods base

other attached packages:
 [1] limma_2.16.3        GOstats_2.4.0       Category_2.4.0
 [4] genefilter_1.22.0   survival_2.34-1     RBGL_1.16.0
 [7] annotate_1.20.1     xtable_1.5-4        GO.db_2.2.0
[10] AnnotationDbi_1.2.2 RSQLite_0.6-9       DBI_0.2-4
[13] graph_1.18.0        YEAST_2.0.1         affy_1.20.0
[16] Biobase_2.2.1

loaded via a namespace (and not attached):
[1] affyio_1.10.1        cluster_1.11.11      preprocessCore_1.4.0
[4] tcltk_2.8.0
Re: [R] Mathematica now working with Nvidia GPUs -- any plan for R?
Dear Brian, Mose, Peter and Stefan, Thanks a lot for your replies - the issues are now clearer to me. (and I apologize for not using the appropriate list). Best wishes, Emmanuel 2008/11/19 Peter Dalgaard [EMAIL PROTECTED]: Stefan Evert wrote: On 19 Nov 2008, at 07:56, Prof Brian Ripley wrote: I just read an announcement saying that Mathematica is launching a version working with Nvidia GPUs. It is claimed that it'd make it ~10-100x faster! http://www.physorg.com/news146247669.html Well, lots of things are 'claimed' in marketing (and Wolfram is not shy to claim). I think that you need lots of GPUs, as well as the right problem. Which makes me wonder whether R would be able to make use of this processing power, since the figures claimed by Wolfram are very probably for single-precision floats? (Am I right in thinking that R only works with double precision?) According to the nVidia Web site, the Tesla architecture is _ten times_ slower for double-precision operations than for single-precision, which makes it seem far less amazing than at first sight. It wouldn't be rocket science to add a single-precision data type to R, it is just a large amount of work to get all details right, and noone has found it worth the trouble. Probably more of a stumbling block is that GPUs traditionally have had a rather cavalier attitude to floating-point standards. We have had our share of problems with compilers that breaks the rules (or users setting flags like -fast-math that allow rule breakage), and ensuing breakage of user code and/or make check, so I don't think it is viable to try and get the GPU to take on all of R's computations. Special-purpose computation, e.g. an R package interfacing to libgpublas.so or whatever, is always a possibility, but that really goes without saying. -- O__ Peter Dalgaard Øster Farimagsgade 5, Entr.B c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. 
K (*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918 ~~ - ([EMAIL PROTECTED]) FAX: (+45) 35327907
[R] Mathematica now working with Nvidia GPUs -- any plan for R?
Dear All,

I just read an announcement saying that Mathematica is launching a version working with Nvidia GPUs. It is claimed that it'd make it ~10-100x faster! http://www.physorg.com/news146247669.html

I was wondering if you are aware of any development going in this direction with R?

Thanks for sharing your thoughts,

Best wishes,

Emmanuel
[R] gregexpr slow and increases exponentially with string length -- how to speed it up?
Dear All,

I have a long string and need to search for regular expressions in it. However, the search becomes horribly slow as the string length increases. Below is an example: each time i increases 5-fold, the time spent increases by more than 5-fold (my string is 11,000,000 letters long!).

I also noticed that:
- the search time increases dramatically with the number of matches found;
- the perl=TRUE option slows down the search.

Any idea to speed this up would be greatly appreciated!

Best,

Emmanuel

> for (i in c(1e4, 5e4, 1e5, 5e5)){
+   aa = as.character(sample(1:9, i, replace=TRUE))
+   aa = paste(aa, collapse='')
+   print(i)
+   print(system.time(gregexpr("[367]2[1-9][129]", aa)))
+ }
[1] 10000
   user  system elapsed
  0.004   0.000   0.003
[1] 50000
   user  system elapsed
  0.060   0.000   0.061
[1] 1e+05
   user  system elapsed
  0.240   0.000   0.238
[1] 5e+05
   user  system elapsed
  5.733   0.000   5.732
Re: [R] gregexpr slow and increases exponentially with string length -- how to speed it up?
Hi Chuck,

Thanks a lot for your suggestion.

You can find all such matches (not just the disjoint ones that gregexpr finds) using something like this:

twomatch <- function(x, y) intersect(x + 1, y)
match.list <- list( which( vec %in% c(3,6,7) ),
                    which( vec == 2 ),
                    which( vec %in% 1:9 ),
                    which( vec %in% c(1,2,9) ) )
res <- Reduce( twomatch, match.list ) - length(match.list) + 1

I should have made explicit that I have many of these motifs to match, and their structure varies quite a bit. This means that I'd need a function to translate each motif into the solution you proposed, which would be (although feasible) a bit painful. In the meantime, the best solution I found is to cut the big string into smaller strings. That actually speeds things up a lot.

Best,

E

If you want to precisely match the gregexpr results, you'll need to filter out the overlapping matches.

HTH,

Chuck

Charles C. Berry (858) 534-2098 Dept of Family/Preventive Medicine E mailto:[EMAIL PROTECTED] UC San Diego http://famprevmed.ucsd.edu/faculty/cberry/ La Jolla, San Diego 92093-0901
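A sketch of the chunking idea (this helper is hypothetical, not from the thread): split the string into pieces that overlap by at least the maximum match length minus one, so matches spanning a chunk boundary are not lost, then offset match positions back to absolute coordinates. Because gregexpr returns non-overlapping matches per piece, chunking can report boundary matches that a single whole-string call would treat differently, so results are de-duplicated and should be checked for your pattern.

```r
# hypothetical helper: run gregexpr over a long string, piecewise
chunked_gregexpr <- function(s, pattern, chunk = 1e6, max_match = 10) {
  n <- nchar(s)
  starts <- seq(1, n, by = chunk)
  hits <- integer(0)
  for (st in starts) {
    # extend each piece past the chunk boundary by max_match - 1 characters
    piece <- substr(s, st, min(n, st + chunk + max_match - 2))
    m <- gregexpr(pattern, piece)[[1]]
    if (m[1] != -1) hits <- c(hits, st - 1L + m)   # back to absolute positions
  }
  sort(unique(hits))   # overlap region can yield duplicates
}

x <- paste(rep("ab3219cd", 1000), collapse = "")
pos <- chunked_gregexpr(x, "[367]2[1-9][129]", chunk = 100)
```

Here the 4-character motif "3219" occurs once per 8-character block, so pos should contain 1000 evenly spaced positions.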
[R] If I known d1 (density1), and dmix is a mix between d1 and d2 (d2 is unknown), can one infer d2?
Dear All, I hope the title speaks for itself. I believe that there should be a solution when I see what Mclust is able to do. However, this problem is quite particular in that d2 is not known and does not necessarily correspond to a common distribution (e.g. normal, exponential ...). However it must resemble dmix and d1 to some extent. To illustrate my problem I worked out a simple example: Imagine that there are two classes of people: (i) fast walkers: they cover a unit distance in times following a given distribution (d1). (ii) slow walkers: they cover a unit distance in times following another distribution of longer times (d2), cf example below. If I have a mixed sample that contains X% of fast walkers and (100-X)% of slow walkers, *is it possible to use it to estimate d2, as well as X?* R example: walk.fast = sample(seq(1,5,length.out=1000), prob=dlnorm(seq(1,5,length.out=1000)), replace=T) walk.slow = sample(seq(1,5,length.out=1000), prob=dlnorm(seq(1,5,length.out=1000), meanlog=1.2), replace=T) percentage.fast = 0.8 walk.all = c(sample(walk.fast, percentage.fast*1000), sample(walk.slow, (1-percentage.fast)*1000 ) ) plot(density(walk.fast, from=1, to=5)) # d1 lines(density(walk.slow, from=1, to=5), col=2) # d2 lines(density(walk.all, from=1, to=5), col=3) # dmix, mix between d1 and d2 Is there a method that could allow me to estimate d2 given dmix? Any hint would be greatly appreciated. Best, Emmanuel __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
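A minimal sketch of one way to attack this, reusing walk.fast and walk.all from the example above. It assumes the mixture relation dmix(t) = X*d1(t) + (1-X)*d2(t) holds pointwise on a common grid; the non-negativity rule for picking X is a heuristic, not an established estimator:

```r
## Kernel estimates of the known component (d1) and the mixture (dmix)
## on a common grid; d2 is then recovered by subtraction:
##   d2(t) = (dmix(t) - X*d1(t)) / (1 - X)
grid <- seq(1, 5, length.out = 512)
d1   <- density(walk.fast, from = 1, to = 5, n = 512)$y
dmix <- density(walk.all,  from = 1, to = 5, n = 512)$y

## Estimate X as the largest mixing weight that keeps d2 (almost)
## non-negative everywhere; the tolerance absorbs kernel noise.
X.grid <- seq(0.01, 0.99, by = 0.01)
ok <- sapply(X.grid, function(X) all((dmix - X * d1) / (1 - X) > -0.02))
X.hat  <- max(X.grid[ok])
d2.hat <- (dmix - X.hat * d1) / (1 - X.hat)
plot(grid, d2.hat, type = "l")   # compare with density(walk.slow)
```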
Re: [R] Mclust problem with mclust1Dplot: Error in to - from : non-numeric argument to binary operator
After playing with the data, I figured out what the problem is: I've got many zeros in the dataset, which probably induces the algorithm to determine a gaussian with variance=0. If I remove the zeros it works, but then the decomposition is not as it should be. Any idea on how to solve this would be great; is it possible to somehow force the parameters (e.g. variance) to be greater than a particular threshold? Thanks, Emmanuel 2008/10/20 Emmanuel Levy [EMAIL PROTECTED]: Dear list members, I am using Mclust in order to deconvolute a distribution that I believe is a sum of two gaussians. First I can make a model: my.data.model = Mclust(my.data, modelNames=c("E"), warn=T, G=1:3) But then, when I try to plot the result, I get the following error: mclust1Dplot(my.data.model, parameters = my.data.model$parameters, what = "density") Error in to - from : non-numeric argument to binary operator Also, I'd like to allow for each gaussian to have a different variance (modelNames=c("V")), but then I get another error message: my.data.model = Mclust(my.data, modelNames=c("V"), warn=T, G=1:3) Warning messages: 1: In meV(data = data, z = z, prior = prior, control = control, warn = warn) : sigma-squared falls below threshold 2: In meV(data = data, z = z, prior = prior, control = control, warn = warn) : sigma-squared falls below threshold 3: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) : best model occurs at the min or max # of components considered 4: In Mclust(my.data, modelNames = c("V"), warn = T, G = 1:3) : optimal number of clusters occurs at min choice Many thanks in advance for your help, Best wishes, Emmanuel If you would like to reproduce the above, the dataset is: my.data=c( 0.,0.0052,0.,-0.2136,0.4625,0.6047,0.,0.7370,0.5059 ,-0.8060,-1.0790,0.,-1.5397,-0.0720,-3.2180,-1.6980,0.,2.2845 ,-1.0741,0.,0.1020,-0.6010,0.2210,-0.0120,1.0785,0.,-0.4536 ,-0.1127,-0.2032,-0.0421,-1.6818,-0.9935,-2.2105,-0.7963,-0.1820,-2.0468 
,0.6161,-1.7663,-0.6800,-2.1290,-0.0167,0.,0.,0.,0.5427 ,-0.0170,0.,0.,-0.6576,0.9055,0.1409,-0.1409,0.,0.3730 ,-0.1800,-1.3141,0.6786,-0.2480,-2.5110,-0.1340,0.3000,-1.7350,0. ,-0.5464,0.,-0.7513,-1.9056,-1.4823,-0.5376,-0.4516,-1.1391,0. ,-2.2560,1.3770,0.3390,-2.6023,-1.0880,-0.1444,0.,-0.1459,0.1740 ,0.,0.3310,0.0749,1.0360,-0.8345,-0.6843,-3.5171,-1.9482,-0.4972 ,-0.0130,-2.0290,-0.2812,0.,0.,-0.0164,0.,-1.9220,-1.5941 ,-1.0840,0.,0.0459,-2.2121,-1.1485,-1.1485,0.,-0.4449,-0.5001 ,0.3520,1.9980,-3.8385,1.7160,1.0020,-0.2250,-0.8265,-0.2032) __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
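Regarding forcing the variances away from zero: recent versions of the mclust package let you regularize the EM fit with a conjugate prior via priorControl(), which acts as a soft version of the threshold asked about above. A sketch, assuming a version of mclust whose Mclust() accepts the prior argument:

```r
library(mclust)

## The many exact zeros drive one component's variance towards 0 during EM.
## The default conjugate prior shrinks the variance estimates away from
## zero, which typically lets the unequal-variance ("V") model converge.
fit <- Mclust(my.data, G = 1:3, modelNames = "V", prior = priorControl())
summary(fit)
fit$parameters$variance$sigmasq   # per-component variances, now positive
```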
Re: [R] Mclust problem with mclust1Dplot: Error in to - from : non-numeric argument to binary operator
Dear All, I haven't found a solution to the variance problem. However, I could solve the plotting problem by plotting the data myself. I think that the problem is due to a change in the data structure returned by the function Mclust. The web-page: http://www.childrensmercy.org/stats/weblog2006/UnivariateClustering.asp has been extremely helpful. I paste the code below for people's future reference. par(mfcol=c(1,2), mar=c(2.1,0.1,0.1,0.1)) v <- Mclust(my.data) x0 <- seq(min(my.data), max(my.data), length.out=100) nt <- rep(0,100) plot(range(my.data), c(0,0.15), type="n", xlab="", ylab="", axes=F, ylim=c(0,0.4)) axis(side=1) for (i in 1:2) { ni <- v$parameters$pro[i]*dnorm(x0, mean=as.numeric(v$parameters$mean[i]), sd=1) lines(x0, ni, col=1) nt <- nt+ni } lines(x0, nt, lwd=3) segments(my.data, 0, my.data, 0.02) Best, Emmanuel 2008/10/21 Emmanuel Levy [EMAIL PROTECTED]: After playing with the data, I figured out what the problem is: I've got many zeros in the dataset, which probably induces the algorithm to determine a gaussian with variance=0. If I remove the zeros it works, but then the decomposition is not as it should be. Any idea on how to solve this would be great; is it possible to somehow force the parameters (e.g. variance) to be greater than a particular threshold? Thanks, Emmanuel 2008/10/20 Emmanuel Levy [EMAIL PROTECTED]: Dear list members, I am using Mclust in order to deconvolute a distribution that I believe is a sum of two gaussians. 
First I can make a model: my.data.model = Mclust(my.data, modelNames=c("E"), warn=T, G=1:3) But then, when I try to plot the result, I get the following error: mclust1Dplot(my.data.model, parameters = my.data.model$parameters, what = "density") Error in to - from : non-numeric argument to binary operator Also, I'd like to allow for each gaussian to have a different variance (modelNames=c("V")), but then I get another error message: my.data.model = Mclust(my.data, modelNames=c("V"), warn=T, G=1:3) Warning messages: 1: In meV(data = data, z = z, prior = prior, control = control, warn = warn) : sigma-squared falls below threshold 2: In meV(data = data, z = z, prior = prior, control = control, warn = warn) : sigma-squared falls below threshold 3: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) : best model occurs at the min or max # of components considered 4: In Mclust(my.data, modelNames = c("V"), warn = T, G = 1:3) : optimal number of clusters occurs at min choice Many thanks in advance for your help, Best wishes, Emmanuel If you would like to reproduce the above, the dataset is: my.data=c( 0.,0.0052,0.,-0.2136,0.4625,0.6047,0.,0.7370,0.5059 ,-0.8060,-1.0790,0.,-1.5397,-0.0720,-3.2180,-1.6980,0.,2.2845 ,-1.0741,0.,0.1020,-0.6010,0.2210,-0.0120,1.0785,0.,-0.4536 ,-0.1127,-0.2032,-0.0421,-1.6818,-0.9935,-2.2105,-0.7963,-0.1820,-2.0468 ,0.6161,-1.7663,-0.6800,-2.1290,-0.0167,0.,0.,0.,0.5427 ,-0.0170,0.,0.,-0.6576,0.9055,0.1409,-0.1409,0.,0.3730 ,-0.1800,-1.3141,0.6786,-0.2480,-2.5110,-0.1340,0.3000,-1.7350,0. ,-0.5464,0.,-0.7513,-1.9056,-1.4823,-0.5376,-0.4516,-1.1391,0. 
,-2.2560,1.3770,0.3390,-2.6023,-1.0880,-0.1444,0.,-0.1459,0.1740 ,0.,0.3310,0.0749,1.0360,-0.8345,-0.6843,-3.5171,-1.9482,-0.4972 ,-0.0130,-2.0290,-0.2812,0.,0.,-0.0164,0.,-1.9220,-1.5941 ,-1.0840,0.,0.0459,-2.2121,-1.1485,-1.1485,0.,-0.4449,-0.5001 ,0.3520,1.9980,-3.8385,1.7160,1.0020,-0.2250,-0.8265,-0.2032) __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
[R] unimodal VS bimodal normal distribution - how to get a pvalue?
Dear All, I have a distribution of values and I would like to assess the uni/bimodality of the distribution. I managed to decompose it into two normal distribs using Mclust, and the BIC criterion is best for two components. However, the problem is that the BIC criterion is not a P-value, which I would need ideally. I saw the diptest package but it is not for gaussian distributions. Any hint at a package or workaround would be greatly appreciated. Best, Emmanuel __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] unimodal VS bimodal normal distribution - how to get a pvalue?
Hi Duncan, I'm really stupid --- yes of course!! Thanks for pointing out the (now) obvious. All the best, E 2008/10/21 Duncan Murdoch [EMAIL PROTECTED]: On 10/21/2008 2:56 PM, Emmanuel Levy wrote: Dear All, I have a distribution of values and I would like to assess the uni/bimodality of the distribution. I managed to decompose it into two normal distribs using Mclust, and the BIC criterion is best for two components. However, the problem is that the BIC criterion is not a P-value, which I would need ideally. I saw the diptest package but it is not for gaussian distributions. Any hint at a package or workaround would be greatly appreciated. If your null is really a unimodal gaussian, then doing it by simulation should be easy: simulate lots of normal samples of the same size as your observed one, and calculate the same BIC statistic on each of them. Then your p-value is the true proportion of simulated statistics that give stronger evidence than the observed one, and is well approximated by the simulated proportion. Duncan Murdoch __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
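Duncan's recipe can be sketched in a few lines. This is a sketch only: it assumes the mclust package, 'obs' stands for the observed sample, and the BIC-gap statistic is one reasonable choice, not the only one:

```r
library(mclust)

## BIC preference for 2 equal-variance components over 1 (mclust's BIC
## is to be maximized, so a larger gap means stronger evidence for 2).
bic.gap <- function(x) {
  b <- mclustBIC(x, G = 1:2, modelNames = "E")
  b[2] - b[1]
}

obs.stat <- bic.gap(obs)
n <- length(obs); m <- mean(obs); s <- sd(obs)
sim.stat <- replicate(999, bic.gap(rnorm(n, m, s)))   # unimodal null
p.value  <- (1 + sum(sim.stat >= obs.stat)) / (999 + 1)
```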
[R] Mclust problem with mclust1Dplot: Error in to - from : non-numeric argument to binary operator
Dear list members, I am using Mclust in order to deconvolute a distribution that I believe is a sum of two gaussians. First I can make a model: my.data.model = Mclust(my.data, modelNames=c("E"), warn=T, G=1:3) But then, when I try to plot the result, I get the following error: mclust1Dplot(my.data.model, parameters = my.data.model$parameters, what = "density") Error in to - from : non-numeric argument to binary operator Also, I'd like to allow for each gaussian to have a different variance (modelNames=c("V")), but then I get another error message: my.data.model = Mclust(my.data, modelNames=c("V"), warn=T, G=1:3) Warning messages: 1: In meV(data = data, z = z, prior = prior, control = control, warn = warn) : sigma-squared falls below threshold 2: In meV(data = data, z = z, prior = prior, control = control, warn = warn) : sigma-squared falls below threshold 3: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) : best model occurs at the min or max # of components considered 4: In Mclust(my.data, modelNames = c("V"), warn = T, G = 1:3) : optimal number of clusters occurs at min choice Many thanks in advance for your help, Best wishes, Emmanuel If you would like to reproduce the above, the dataset is: my.data=c( 0.,0.0052,0.,-0.2136,0.4625,0.6047,0.,0.7370,0.5059 ,-0.8060,-1.0790,0.,-1.5397,-0.0720,-3.2180,-1.6980,0.,2.2845 ,-1.0741,0.,0.1020,-0.6010,0.2210,-0.0120,1.0785,0.,-0.4536 ,-0.1127,-0.2032,-0.0421,-1.6818,-0.9935,-2.2105,-0.7963,-0.1820,-2.0468 ,0.6161,-1.7663,-0.6800,-2.1290,-0.0167,0.,0.,0.,0.5427 ,-0.0170,0.,0.,-0.6576,0.9055,0.1409,-0.1409,0.,0.3730 ,-0.1800,-1.3141,0.6786,-0.2480,-2.5110,-0.1340,0.3000,-1.7350,0. ,-0.5464,0.,-0.7513,-1.9056,-1.4823,-0.5376,-0.4516,-1.1391,0. 
,-2.2560,1.3770,0.3390,-2.6023,-1.0880,-0.1444,0.,-0.1459,0.1740 ,0.,0.3310,0.0749,1.0360,-0.8345,-0.6843,-3.5171,-1.9482,-0.4972 ,-0.0130,-2.0290,-0.2812,0.,0.,-0.0164,0.,-1.9220,-1.5941 ,-1.0840,0.,0.0459,-2.2121,-1.1485,-1.1485,0.,-0.4449,-0.5001 ,0.3520,1.9980,-3.8385,1.7160,1.0020,-0.2250,-0.8265,-0.2032) __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] RCurl compilation error on ubuntu hardy
Dear Duncan and Vincent, Thanks for your help. It took me some time to upgrade R because apt-get install would not actually install the packages, but I used sudo aptitude install r-base and that managed to upgrade the r-base package. I'm also happy to say that RCurl as well as BiomaRt could now be installed with success! Thanks again, Best wishes, Emmanuel 2008/9/17 Duncan Temple Lang [EMAIL PROTECTED]: Hi Emmanuel. The best thing to do is upgrade to R-2.7.2 (or any 2.7.*) and the problem should disappear. It relates to encoding of strings. D. Emmanuel Levy wrote: Dear list members, I encountered this problem and the solution pointed out in a previous thread did not work for me. (e.g. install.packages("RCurl", repos = "http://www.omegahat.org/R")) I work with Ubuntu Hardy, and installed R 2.6.2 via apt-get. I really need RCurl in order to use biomaRt ... any help would be greatly appreciated. Best wishes, Emmanuel = sessionInfo() R version 2.6.2 (2008-02-08) x86_64-pc-linux-gnu locale: LC_CTYPE=en_CA.UTF-8;LC_NUMERIC=C;LC_TIME=en_CA.UTF-8;LC_COLLATE=en_CA.UTF-8;LC_MONETARY=en_CA.UTF-8;LC_MESSAGES=en_CA.UTF-8;LC_PAPER=en_CA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_CA.UTF-8;LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] rcompgen_0.1-17 tools_2.6.2 = install.packages("RCurl", repos = "http://www.omegahat.org/R") Warning in install.packages("RCurl", repos = "http://www.omegahat.org/R") : argument 'lib' is missing: using '/usr/local/lib/R/site-library' trying URL 'http://www.omegahat.org/R/src/contrib/RCurl_0.9-4.tar.gz' Content type 'application/x-gzip' length 150884 bytes (147 Kb) opened URL == downloaded 147 Kb * Installing *source* package 'RCurl' ... checking for curl-config... /usr/bin/curl-config checking for gcc... gcc checking for C compiler default output file name... a.out checking whether the C compiler works... 
yes checking whether we are cross compiling... no checking for suffix of executables... checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ANSI C... none needed checking how to run the C preprocessor... gcc -E Version has a libidn field configure: creating ./config.status config.status: creating src/Makevars ** libs gcc -std=gnu99 -I/usr/share/R/include -I/usr/share/R/include -DHAVE_LIBIDN_FIELD=1 -fpic -g -O2 -c base64.c -o base64.o In file included from base64.c:1: Rcurl.h:52: error: expected specifier-qualifier-list before 'cetype_t' make: *** [base64.o] Error 1 chmod: cannot access `/usr/local/lib/R/site-library/RCurl/libs/*': No such file or directory ERROR: compilation failed for package 'RCurl' ** Removing '/usr/local/lib/R/site-library/RCurl' The downloaded packages are in /tmp/RtmpQ8FMBZ/downloaded_packages Warning message: In install.packages("RCurl", repos = "http://www.omegahat.org/R") : installation of package 'RCurl' had non-zero exit status Hi Martin, are you working on a 64-bit linux distribution and which version of RCurl are you trying to install? There has been a problem with a recent version of RCurl and the R_base64_decode, search the archives of the Bioconductor mailing list for a thread called "RCurl loading problem with 64 bit linux distribution". Please try using the newest versions of R (R-2.7.0 has been released a few weeks ago) and RCurl, which you can obtain from within R by typing: install.packages("RCurl", repos = "http://www.omegahat.org/R") The new version of RCurl (>= 0.9.2) worked fine for me, while 0.9.0 and 0.9.1 did not. Hope this helps. Joern martin sikora wrote: dear list members, i'm having a problem installing the biomaRt package on my linux machine, due to the fact of a compilation error with RCurl. 
i am using R 2.6.2 on fedora 7, and this is the output i get: gcc -m32 -std=gnu99 -I/usr/include/R -I/usr/include/R -DHAVE_LIBIDN_FIELD=1 -I/usr/local/include -fpic -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic -fasynchronous-unwind-tables -c base64.c -o base64.o In file included from base64.c:1: Rcurl.h:52: error: expected specifier-qualifier-list before 'cetype_t' base64.c: In function 'R_base64_decode': base64.c:25: warning: pointer targets in assignment differ in signedness base64.c:39: warning: pointer targets in passing argument 1 of 'Rf_mkString' differ in signedness base64.c
[R] RCurl compilation error on ubuntu hardy
Dear list members, I encountered this problem and the solution pointed out in a previous thread did not work for me. (e.g. install.packages("RCurl", repos = "http://www.omegahat.org/R")) I work with Ubuntu Hardy, and installed R 2.6.2 via apt-get. I really need RCurl in order to use biomaRt ... any help would be greatly appreciated. Best wishes, Emmanuel = sessionInfo() R version 2.6.2 (2008-02-08) x86_64-pc-linux-gnu locale: LC_CTYPE=en_CA.UTF-8;LC_NUMERIC=C;LC_TIME=en_CA.UTF-8;LC_COLLATE=en_CA.UTF-8;LC_MONETARY=en_CA.UTF-8;LC_MESSAGES=en_CA.UTF-8;LC_PAPER=en_CA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_CA.UTF-8;LC_IDENTIFICATION=C attached base packages: [1] stats graphics grDevices utils datasets methods base loaded via a namespace (and not attached): [1] rcompgen_0.1-17 tools_2.6.2 = install.packages("RCurl", repos = "http://www.omegahat.org/R") Warning in install.packages("RCurl", repos = "http://www.omegahat.org/R") : argument 'lib' is missing: using '/usr/local/lib/R/site-library' trying URL 'http://www.omegahat.org/R/src/contrib/RCurl_0.9-4.tar.gz' Content type 'application/x-gzip' length 150884 bytes (147 Kb) opened URL == downloaded 147 Kb * Installing *source* package 'RCurl' ... checking for curl-config... /usr/bin/curl-config checking for gcc... gcc checking for C compiler default output file name... a.out checking whether the C compiler works... yes checking whether we are cross compiling... no checking for suffix of executables... checking for suffix of object files... o checking whether we are using the GNU C compiler... yes checking whether gcc accepts -g... yes checking for gcc option to accept ANSI C... none needed checking how to run the C preprocessor... 
gcc -E Version has a libidn field configure: creating ./config.status config.status: creating src/Makevars ** libs gcc -std=gnu99 -I/usr/share/R/include -I/usr/share/R/include -DHAVE_LIBIDN_FIELD=1 -fpic -g -O2 -c base64.c -o base64.o In file included from base64.c:1: Rcurl.h:52: error: expected specifier-qualifier-list before 'cetype_t' make: *** [base64.o] Error 1 chmod: cannot access `/usr/local/lib/R/site-library/RCurl/libs/*': No such file or directory ERROR: compilation failed for package 'RCurl' ** Removing '/usr/local/lib/R/site-library/RCurl' The downloaded packages are in /tmp/RtmpQ8FMBZ/downloaded_packages Warning message: In install.packages("RCurl", repos = "http://www.omegahat.org/R") : installation of package 'RCurl' had non-zero exit status Hi Martin, are you working on a 64-bit linux distribution and which version of RCurl are you trying to install? There has been a problem with a recent version of RCurl and the R_base64_decode, search the archives of the Bioconductor mailing list for a thread called "RCurl loading problem with 64 bit linux distribution". Please try using the newest versions of R (R-2.7.0 has been released a few weeks ago) and RCurl, which you can obtain from within R by typing: install.packages("RCurl", repos = "http://www.omegahat.org/R") The new version of RCurl (>= 0.9.2) worked fine for me, while 0.9.0 and 0.9.1 did not. Hope this helps. Joern martin sikora wrote: dear list members, i'm having a problem installing the biomaRt package on my linux machine, due to the fact of a compilation error with RCurl. 
i am using R 2.6.2 on fedora 7, and this is the output i get: gcc -m32 -std=gnu99 -I/usr/include/R -I/usr/include/R -DHAVE_LIBIDN_FIELD=1 -I/usr/local/include -fpic -O2 -g -pipe -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic -fasynchronous-unwind-tables -c base64.c -o base64.o In file included from base64.c:1: Rcurl.h:52: error: expected specifier-qualifier-list before 'cetype_t' base64.c: In function 'R_base64_decode': base64.c:25: warning: pointer targets in assignment differ in signedness base64.c:39: warning: pointer targets in passing argument 1 of 'Rf_mkString' differ in signedness base64.c: In function 'R_base64_encode': base64.c:60: warning: pointer targets in assignment differ in signedness make: *** [base64.o] Error 1 as far as i know i have all the necessary libraries installed: $ yum list installed | grep libxml libxml2.i386 2.6.31-1.fc7 installed libxml2-devel.i386 2.6.31-1.fc7 installed libxml2-python.i386 2.6.31-1.fc7 installed perl-libxml-perl.noarch 0.08-1.2.1 installed $ yum list installed | grep curl curl.i386 7.16.4-1.fc7 installed curl-devel.i386 7.16.4-1.fc7 installed python-pycurl.i386
Re: [R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?
Dear Peter and Henrik, Thanks for your replies - this helps speed things up a bit, but I thought there would be something much faster. What I mean is that I thought that a particular value of a level could be accessed instantly, similarly to a hash key. Since I've got about 6000 levels in that data frame, it means that making a list L of the form L[[1]] = values of name 1 L[[2]] = values of name 2 L[[3]] = values of name 3 ... would take ~1 hour. Best, Emmanuel 2008/8/12 Henrik Bengtsson [EMAIL PROTECTED]: To simplify: n <- 2.7e6; x <- factor(c(rep("A", n/2), rep("B", n/2))); # Identify 'A':s t1 <- system.time(res <- which(x == "A")); # To compare a factor to a string, the factor is in practice # coerced to a character vector. t2 <- system.time(res <- which(as.character(x) == "A")); # Interestingly enough, this seems to be faster (repeated many times) # Don't know why. print(t2/t1); user system elapsed 0.632653 1.60 0.754717 # Avoid coercing the factor, but instead coerce the level compared to t3 <- system.time(res <- which(x == match("A", levels(x)))); # ...but gives no speed up print(t3/t1); user system elapsed 1.041667 1.00 1.018182 # But coercing the factor to integers does t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x)))); print(t4/t1); user system elapsed 0.417 0.000 0.3636364 So, the latter seems to be the fastest way to identify those elements. 
My $.02 /Henrik On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan [EMAIL PROTECTED] wrote: Emmanuel, On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy [EMAIL PROTECTED] wrote: Dear All, I have a large data frame (~2,700,000 lines and 14 columns), and I would like to extract the information in a particular way illustrated below: Given a data frame df: col1=sample(c(0,1),10, rep=T) names = factor(c(rep("A",5),rep("B",5))) df = data.frame(names,col1) df names col1 1 A 1 2 A 0 3 A 1 4 A 0 5 A 1 6 B 0 7 B 0 8 B 1 9 B 0 10 B 0 I would like to transform it in the form: index = c("A","B") col1[[1]]=df$col1[which(df$name=="A")] col1[[2]]=df$col1[which(df$name=="B")] I'm not sure I fully understand your problem, your example would not run for me. You could get a small speedup by omitting which(); you can also subset by a logical vector, which gives a small speedup. n <- 2700000 foo <- data.frame( + one = sample(c(0,1), n, rep = T), + two = factor(c(rep("A", n/2 ), rep("B", n/2 ))) + ) system.time(out <- which(foo$two=="A")) user system elapsed 0.566 0.146 0.761 system.time(out <- foo$two=="A") user system elapsed 0.429 0.075 0.588 You might also find use for unstack(), though I didn't see a speedup. system.time(out <- unstack(foo)) user system elapsed 1.068 0.697 2.004 HTH Peter My problem is that the command: *** which(df$name=="A") *** takes about 1 second because df is so big. I was thinking that a level could maybe be accessed instantly but I am not sure about how to do it. I would be very grateful for any advice that would allow me to speed this up. Best wishes, Emmanuel __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code.
Re: [R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?
Sorry for being unclear, I thought the example above would be clear enough. I have a data frame of the form: name info 1 YAL001C 1 2 YAL001C 1 3 YAL001C 1 4 YAL001C 1 5 YAL001C 0 6 YAL001C 1 7 YAL001C 1 8 YAL001C 1 9 YAL001C 1 10 YAL001C 1 ... ... ~2,700,000 lines, and ~6000 different names, which correspond to yeast proteins + some info. So there are about 6000 names like YAL001C. I would like to transform this data frame into the following form: 1/ a list, where each protein corresponds to an index, and the info is the vector L[[1]] [1] 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 L[[2]] [1] 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 etc. 2/ an index, which gives me the position of each protein in the list: index [1] "YAL001C" "YAL002W" "YAL003W" "YAL005C" "YAL007C" ... I hope this will be clearer! I'll have a look right now at the split and hash.mat functions. Thanks for your help, Emmanuel 2008/8/13 Erik Iverson [EMAIL PROTECTED]: I still don't understand what you are doing. Can you make a small example that shows what you have and what you want? Is ?split what you are after? Emmanuel Levy wrote: Dear Peter and Henrik, Thanks for your replies - this helps speed things up a bit, but I thought there would be something much faster. What I mean is that I thought that a particular value of a level could be accessed instantly, similarly to a hash key. Since I've got about 6000 levels in that data frame, it means that making a list L of the form L[[1]] = values of name 1 L[[2]] = values of name 2 L[[3]] = values of name 3 ... would take ~1 hour. Best, Emmanuel 2008/8/12 Henrik Bengtsson [EMAIL PROTECTED]: To simplify: n <- 2.7e6; x <- factor(c(rep("A", n/2), rep("B", n/2))); # Identify 'A':s t1 <- system.time(res <- which(x == "A")); # To compare a factor to a string, the factor is in practice # coerced to a character vector. t2 <- system.time(res <- which(as.character(x) == "A")); # Interestingly enough, this seems to be faster (repeated many times) # Don't know why. 
print(t2/t1); user system elapsed 0.632653 1.60 0.754717 # Avoid coercing the factor, but instead coerce the level compared to t3 <- system.time(res <- which(x == match("A", levels(x)))); # ...but gives no speed up print(t3/t1); user system elapsed 1.041667 1.00 1.018182 # But coercing the factor to integers does t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x)))); print(t4/t1); user system elapsed 0.417 0.000 0.3636364 So, the latter seems to be the fastest way to identify those elements. My $.02 /Henrik On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan [EMAIL PROTECTED] wrote: Emmanuel, On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy [EMAIL PROTECTED] wrote: Dear All, I have a large data frame (~2,700,000 lines and 14 columns), and I would like to extract the information in a particular way illustrated below: Given a data frame df: col1=sample(c(0,1),10, rep=T) names = factor(c(rep("A",5),rep("B",5))) df = data.frame(names,col1) df names col1 1 A 1 2 A 0 3 A 1 4 A 0 5 A 1 6 B 0 7 B 0 8 B 1 9 B 0 10 B 0 I would like to transform it in the form: index = c("A","B") col1[[1]]=df$col1[which(df$name=="A")] col1[[2]]=df$col1[which(df$name=="B")] I'm not sure I fully understand your problem, your example would not run for me. You could get a small speedup by omitting which(); you can also subset by a logical vector, which gives a small speedup. n <- 2700000 foo <- data.frame( + one = sample(c(0,1), n, rep = T), + two = factor(c(rep("A", n/2 ), rep("B", n/2 ))) + ) system.time(out <- which(foo$two=="A")) user system elapsed 0.566 0.146 0.761 system.time(out <- foo$two=="A") user system elapsed 0.429 0.075 0.588 You might also find use for unstack(), though I didn't see a speedup. system.time(out <- unstack(foo)) user system elapsed 1.068 0.697 2.004 HTH Peter My problem is that the command: *** which(df$name=="A") *** takes about 1 second because df is so big. I was thinking that a level could maybe be accessed instantly but I am not sure about how to do it. 
I would be very grateful for any advice that would allow me to speed this up. Best wishes, Emmanuel __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code
Re: [R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?
Wow great! split() was exactly what was needed. It takes about 1 second for the whole operation :D Thanks again - I can't believe I never used this function in the past. All the best, Emmanuel 2008/8/13 Erik Iverson [EMAIL PROTECTED]: I still don't understand what you are doing. Can you make a small example that shows what you have and what you want? Is ?split what you are after? Emmanuel Levy wrote: Dear Peter and Henrik, Thanks for your replies - this helps speed things up a bit, but I thought there would be something much faster. What I mean is that I thought that a particular value of a level could be accessed instantly, similarly to a hash key. Since I've got about 6000 levels in that data frame, it means that making a list L of the form L[[1]] = values of name 1 L[[2]] = values of name 2 L[[3]] = values of name 3 ... would take ~1 hour. Best, Emmanuel 2008/8/12 Henrik Bengtsson [EMAIL PROTECTED]: To simplify: n <- 2.7e6; x <- factor(c(rep("A", n/2), rep("B", n/2))); # Identify 'A':s t1 <- system.time(res <- which(x == "A")); # To compare a factor to a string, the factor is in practice # coerced to a character vector. t2 <- system.time(res <- which(as.character(x) == "A")); # Interestingly enough, this seems to be faster (repeated many times) # Don't know why. print(t2/t1); user system elapsed 0.632653 1.60 0.754717 # Avoid coercing the factor, but instead coerce the level compared to t3 <- system.time(res <- which(x == match("A", levels(x)))); # ...but gives no speed up print(t3/t1); user system elapsed 1.041667 1.00 1.018182 # But coercing the factor to integers does t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x)))); print(t4/t1); user system elapsed 0.417 0.000 0.3636364 So, the latter seems to be the fastest way to identify those elements. 
My $.02 /Henrik On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan [EMAIL PROTECTED] wrote: Emmanuel, On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy [EMAIL PROTECTED] wrote: Dear All, I have a large data frame (~2,700,000 lines and 14 columns), and I would like to extract the information in a particular way illustrated below: Given a data frame df: col1=sample(c(0,1),10, rep=T) names = factor(c(rep("A",5),rep("B",5))) df = data.frame(names,col1) df names col1 1 A 1 2 A 0 3 A 1 4 A 0 5 A 1 6 B 0 7 B 0 8 B 1 9 B 0 10 B 0 I would like to transform it in the form: index = c("A","B") col1[[1]]=df$col1[which(df$name=="A")] col1[[2]]=df$col1[which(df$name=="B")] I'm not sure I fully understand your problem, your example would not run for me. You could get a small speedup by omitting which(); you can also subset by a logical vector, which gives a small speedup. n <- 2700000 foo <- data.frame( + one = sample(c(0,1), n, rep = T), + two = factor(c(rep("A", n/2 ), rep("B", n/2 ))) + ) system.time(out <- which(foo$two=="A")) user system elapsed 0.566 0.146 0.761 system.time(out <- foo$two=="A") user system elapsed 0.429 0.075 0.588 You might also find use for unstack(), though I didn't see a speedup. system.time(out <- unstack(foo)) user system elapsed 1.068 0.697 2.004 HTH Peter My problem is that the command: *** which(df$name=="A") *** takes about 1 second because df is so big. I was thinking that a level could maybe be accessed instantly but I am not sure about how to do it. I would be very grateful for any advice that would allow me to speed this up. Best wishes, Emmanuel __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. __ R-help@r-project.org mailing list https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the posting guide http://www.R-project.org/posting-guide.html and provide commented, minimal, self-contained, reproducible code. 
[R] which(df$name=="A") takes ~1 second! (df is very large), but can it be speeded up?
Dear All,

I have a large data frame (~2 700 000 lines and 14 columns), and I would like to extract the information in a particular way, illustrated below. Given a data frame df:

col1 = sample(c(0,1), 10, rep=T)
names = factor(c(rep("A",5), rep("B",5)))
df = data.frame(names, col1)
df
   names col1
1      A    1
2      A    0
3      A    1
4      A    0
5      A    1
6      B    0
7      B    0
8      B    1
9      B    0
10     B    0

I would like to transform it into the form:

index = c("A","B")
col1[[1]] = df$col1[which(df$name=="A")]
col1[[2]] = df$col1[which(df$name=="B")]

My problem is that the command

which(df$name=="A")

takes about 1 second because df is so big. I was thinking that a level could maybe be accessed instantly, but I am not sure how to do it. I would be very grateful for any advice that would allow me to speed this up.

Best wishes,

Emmanuel
[R] Smoothing z-values according to their x, y positions
Dear All,

I'm sure this is not the first time this question comes up, but I couldn't find the keywords that would point me to it - so apologies if this is a re-post.

Basically I've got thousands of points, each depending on three variables: x, y, and z. If I do plot(x, y, col=z), I get something very messy. So I would like to smooth the values of z according to the values of their neighbors (given by x and y).

Could you please point me to the function(s) I should look into for the smoothing, and possibly for the plotting?

Many thanks for your help,

Emmanuel
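One simple way to do what the post asks for, sketched here with synthetic data (the variables and bin count are made up), is to bin x and y and average z within each cell of the grid; the matrix of cell means can then be displayed with image():

```r
## Synthetic points whose "height" z depends smoothly on (x, y)
set.seed(1)
x <- runif(1000); y <- runif(1000)
z <- x + y + rnorm(1000, sd = 0.1)

## Cut x and y into equal-width bins
nbins <- 10
ix <- cut(x, seq(0, 1, length = nbins + 1), include.lowest = TRUE)
iy <- cut(y, seq(0, 1, length = nbins + 1), include.lowest = TRUE)

## tapply() with two grouping factors returns an nbins x nbins matrix
## of per-cell means of z - a crude but fast neighborhood smoother
m <- tapply(z, list(ix, iy), mean)

## image(m) would then display the smoothed surface
```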
Re: [R] Smoothing z-values according to their x, y positions
Dear Bert,

Thanks for your reply - I indeed saw a lot of functions using help.search("smooth"). The problem is that most seem not very appropriate to what I'd like, or they seem extremely complicated (e.g. gam). I am probably missing something, as I don't see how to use loess: from my poor understanding it seems to be for 2D data, whereas here I want to smooth the third, z, component.

In the meantime I adapted a function from Greg Warnes, which sort of does the job (although the smoothing is not very nice). So I'd be very happy to learn how to do something similar with loess!

x = runif(1000)
y = runif(1000)
z = x + y
hist2d.el(x, y, z, nbins=10)

## -- This function was adapted from Greg Warnes
hist2d.el <- function(x, y=NULL, z=NULL, nbins=200, my.cuts=NULL,
                      same.scale=FALSE, na.rm=TRUE, show=TRUE,
                      col=c("black", heat.colors(12)), ...)
{
    if (is.null(y)) {
        if (ncol(x) != 2) stop("If y is omitted, x must be a 2 column matrix")
        y <- x[,2]
        x <- x[,1]
    }
    if (length(nbins) == 1) nbins <- rep(nbins, 2)

    nas <- is.na(x) | is.na(y)
    if (na.rm) {
        x <- x[!nas]
        y <- y[!nas]
    }
    else stop("missing values not permitted if na.rm=FALSE")

    if (!is.null(my.cuts)) {
        nbins <- rep(length(my.cuts)-1, 2)
        x.cuts <- my.cuts
        y.cuts <- my.cuts
    } else {
        if (same.scale) {
            x.cuts <- seq(from=min(x,y), to=max(x,y), length=nbins[1]+1)
            y.cuts <- seq(from=min(x,y), to=max(x,y), length=nbins[2]+1)
        } else {
            x.cuts <- seq(from=min(x), to=max(x), length=nbins[1]+1)
            y.cuts <- seq(from=min(y), to=max(y), length=nbins[2]+1)
        }
    }
    print(nbins)

    index.x <- cut(x, x.cuts, include.lowest=TRUE)
    index.y <- cut(y, y.cuts, include.lowest=TRUE)

    m <- matrix(0, nrow=nbins[1], ncol=nbins[2],
                dimnames=list(levels(index.x), levels(index.y)))
    m.sum <- matrix(0, nrow=nbins[1], ncol=nbins[2],
                    dimnames=list(levels(index.x), levels(index.y)))

    for (i in 1:length(index.x))
        m[index.x[i], index.y[i]] <- m[index.x[i], index.y[i]] + 1
    for (i in 1:length(index.x))
        m.sum[index.x[i], index.y[i]] <- m.sum[index.x[i], index.y[i]] + z[i]

    m[which(m == 0)] <- 1
    m <- m.sum/m

    xvals <- x.cuts[1:nbins[1]]
    yvals <- y.cuts[1:nbins[2]]

    if (show) image(xvals, yvals, m, col=col, ...)
    # invisible(list(counts=m, x=xvals, y=yvals))
    invisible(m)
}

On 19/03/2008, Bert Gunter [EMAIL PROTECTED] wrote:

There are dozens of functions within R and contributed packages that do this. The spatial statistics packages (see the spatial task view on CRAN) are certainly where most are concentrated, but thin plate splines (in the splines package, I believe), tensor product splines (in the mgcv package), local likelihood (in the locfit package), etc., are also available for spatial smoothing. Perhaps the most basic place to look is the ?loess function in the base distribution, which will do exactly what you requested.

Bert Gunter
Genentech Nonclinical Statistics
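Following up on Bert's pointer, loess() does accept two predictors, so the z component can be smoothed directly. A minimal sketch with synthetic data (all names and tuning values here are made up for illustration):

```r
## Synthetic noisy surface
set.seed(1)
x <- runif(500); y <- runif(500)
z <- x + y + rnorm(500, sd = 0.1)

## Fit a local-regression surface z ~ f(x, y)
fit <- loess(z ~ x + y, span = 0.3)

## Evaluate the smoothed surface on a regular grid inside the data range,
## suitable for image() or contour()
grid <- expand.grid(x = seq(0.1, 0.9, length = 20),
                    y = seq(0.1, 0.9, length = 20))
zhat <- predict(fit, newdata = grid)

## contour(matrix(zhat, 20, 20)) would then show the de-noised surface
```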
Re: [R] Smoothing z-values according to their x, y positions
Dear David,

Thanks a lot for pointing out kde2d. I just tried it out, but the problem is that it indeed takes the density of points into account, which I don't want. For example, if in one region of surface S I've got 10,000 points whose average height is 0.5, and in another region I've got only ten points whose average value is also 0.5, I'd like all these points to be transformed to the same 0.5 value. At the moment that is not the case: the range of the values I give is 0.2 - 3.7, but the range of the values output is 0 - 0.17.

I haven't looked yet at the locfit package, as it is not installed, but I will check it out! Thanks for helping!

Emmanuel

On 20/03/2008, David Winsemius [EMAIL PROTECTED] wrote:

You could consider combining the kde2d() function in the MASS package (within the VR bundle) with contour(). The example I followed came from MASS 2ed, in chapter 5's section on density estimation. It is probably easiest to just copy the help example:

library(MASS)
attach(geyser)
f1 <- kde2d(duration[-272], duration[-1], h = rep(1.5, 2),
            n = 50, lims = c(0.5, 6, 0.5, 6))
contour(f1, xlab = "previous duration", ylab = "duration",
        levels = c(0.05, 0.1, 0.2, 0.4))

I have also had success with Loader's locfit package.
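The behavior Emmanuel asks for, equal average height regardless of how many points fall in a region, amounts to dividing a z-weighted 2-D histogram by the count histogram rather than estimating a density. A sketch with made-up data, where every point has height 0.5 so dense and sparse cells should both report 0.5:

```r
set.seed(1)
x <- runif(2000); y <- runif(2000)
z <- rep(0.5, 2000)          # every point has height 0.5

brk <- seq(0, 1, length = 11)
ix <- cut(x, brk, include.lowest = TRUE)
iy <- cut(y, brk, include.lowest = TRUE)

counts <- tapply(z, list(ix, iy), length)   # points per cell (NA if empty)
sums   <- tapply(z, list(ix, iy), sum)      # z summed per cell
means  <- sums / counts                     # average z per cell

## Every occupied cell reports 0.5, whether it holds 2 points or 200 -
## unlike a kernel density estimate, which scales with point count
```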
--
David Winsemius