Re: [R] How to read a file containing two types of rows - (for the Netflix challenge data format)

2020-01-31 Thread Emmanuel Levy
Hi All,

Thanks so much for your inputs, it's so nice to have such a helpful
community -- I wrote some kind of mix between the different replies, I copy
my final code below.

All the best,

Emmanuel

mat = read.csv("~/format_test.csv", fill=TRUE, header=FALSE, as.is=TRUE)

first.col.idx = grep(":",mat[,1])

first.col.val = grep(":",mat[,1],value=TRUE)

first.col.val = gsub(pattern=":", replacement="", first.col.val)

mat.clean = mat[-first.col.idx,]

reps = diff(c(first.col.idx, nrow(mat) + 1))  ## note: "mat1" was a typo for "mat";
## the "+ 1" ensures the last ID block is counted in full

mat.final = cbind( ID = rep(first.col.val, reps - 1), mat.clean)
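
A compact restatement of the same approach, runnable on an inline hypothetical sample (the sample values and resulting column names are illustrative, not from the actual Netflix files):

```r
## Hypothetical sample in the two-row-type format described in the thread
lines <- c("1:", "10,20,30", "11,21,31", "2:", "12,22,32")
is.id <- grepl(":$", lines)                        # ID rows end with ":"
ids   <- sub(":$", "", lines[is.id])
## number of value rows covered by each ID = distance to the next ID row
reps  <- diff(c(which(is.id), length(lines) + 1L)) - 1L
vals  <- do.call(rbind, strsplit(lines[!is.id], ","))
mat.final <- data.frame(ID = rep(ids, reps), vals, stringsAsFactors = FALSE)
```

With the sample above, `mat.final` has three rows with ID "1", "1", "2".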


On Fri, 31 Jan 2020 at 20:31, Berry, Charles wrote:

>
>
> > On Jan 31, 2020, at 1:04 AM, Emmanuel Levy wrote:
> >
> > Hi,
> >
> > I'd like to use the Netflix challenge data and just can't figure out how
> to
> > efficiently "scan" the files.
> > https://www.kaggle.com/netflix-inc/netflix-prize-data
> >
> > The files have two types of row, either an *ID* e.g., "1:" , "2:", etc.
> or
> > 3 values associated to each ID:
> >
> > The format is as follows:
> > *1:*
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > *2:*
> > value1,value2, value3
> > value1,value2, value3
> > *3:*
> > value1,value2, value3
> > value1,value2, value3
> > value1,value2, value3
> > *4:*
> > etc ...
> >
> > And I want to create a matrix where each line is of the form:
> >
> > ID value1, value2, value3
> >
> > Si "ID" needs to be duplicated - I could write a Perl script to convert
> > this format to CSV, but I'm sure there's a simple R trick.
> >
>
> I'd be tempted to use pipe() to separately read the ID lines and the value
> lines, but in R you can do this:
>
> fc <- count.fields( "yourfile.txt", sep = ",")
> rlines <- split( readLines( "yourfile.txt" ), fc)
> mat <-
>   cbind( rlines[[1]],
> do.call( rbind, strsplit( rlines[[2]], ",")))
>
>
> This assumes that there are exactly 1 or 3 fields in each row of
> "yourfile.txt", if not, some incantation of grepl() applied to the text of
> readLines() should suffice.
>
> HTH,
>
> Chuck
>
>
>
>


__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] How to read a file containing two types of rows - (for the Netflix challenge data format)

2020-01-31 Thread Emmanuel Levy
Hi,

I'd like to use the Netflix challenge data and just can't figure out how to
efficiently "scan" the files.
https://www.kaggle.com/netflix-inc/netflix-prize-data

The files have two types of row, either an *ID* e.g., "1:" , "2:", etc. or
3 values associated to each ID:

The format is as follows:
*1:*
value1,value2, value3
value1,value2, value3
value1,value2, value3
value1,value2, value3
*2:*
value1,value2, value3
value1,value2, value3
*3:*
value1,value2, value3
value1,value2, value3
value1,value2, value3
*4:*
etc ...

And I want to create a matrix where each line is of the form:

ID value1, value2, value3

Si "ID" needs to be duplicated - I could write a Perl script to convert
this format to CSV, but I'm sure there's a simple R trick.

Thanks for suggestions!

Emmanuel



Re: [R] Adding a column to an **empty** data.frame

2016-11-02 Thread Emmanuel Levy
Dear Bill,

Thanks a lot this is what I was looking for.

Emmanuel

On 2 November 2016 at 15:59, William Dunlap <wdun...@tibco.com> wrote:

> Give your new column a vector type - NULL cannot be extended beyond length
> 0.  Also, the syntax df$col<-NULL means to remove 'col' from 'df'.
>
> > df <- data.frame(id=integer(0), data=numeric(0))
> > df$new.col <- character(0)
> > str(df)
> 'data.frame':   0 obs. of  3 variables:
>  $ id : int
>  $ data   : num
>  $ new.col: chr
>
> (Think of 'zero-row' or 'zero-column' data.frames, not 'empty'
> data.frames, which could be either.)
>
>
>
> Bill Dunlap
> TIBCO Software
> wdunlap tibco.com
>
> On Wed, Nov 2, 2016 at 6:48 AM, Emmanuel Levy <emmanuel.l...@gmail.com>
> wrote:
>
>> Dear All,
>>
>> This sounds simple but can't figure out a good way to do it.
>>
>> Let's say that I have an empty data frame "df":
>>
>> ## creates the df
>> df = data.frame( id=1, data=2)
>>
>> ## empties the df, perhaps there is a more elegant way to create an empty
>> df?
>> df = df[-c(1),]
>>
>> > df
>> [1] id   data
>> <0 rows> (or 0-length row.names)
>>
>> Now, how can I add a third column name to that empty df?
>>
>> Normally I would use df$new.col = 1
>>
>> But here I can't use this because it's empty!
>>
>> I tried df$new.col=NULL -- this doesn't give an error but doesn't add the
>> column.
>>
>> Thanks for your help,
>>
>> Emmanuel
>>
>>
>
>
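
As an aside on the "more elegant way to create an empty df" question above: a zero-row frame can also be declared directly, new column included from the start (a sketch, not part of the original exchange):

```r
## Zero-row data.frame built in one step; each column is a zero-length
## vector of the intended type
df <- data.frame(id = integer(0), data = numeric(0), new.col = character(0))
nrow(df)    # 0
names(df)   # "id" "data" "new.col"
```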



[R] Adding a column to an **empty** data.frame

2016-11-02 Thread Emmanuel Levy
Dear All,

This sounds simple but can't figure out a good way to do it.

Let's say that I have an empty data frame "df":

## creates the df
df = data.frame( id=1, data=2)

## empties the df, perhaps there is a more elegant way to create an empty
df?
df = df[-c(1),]

> df
[1] id   data
<0 rows> (or 0-length row.names)

Now, how can I add a third column name to that empty df?

Normally I would use df$new.col = 1

But here I can't use this because it's empty!

I tried df$new.col=NULL -- this doesn't give an error but doesn't add the
column.

Thanks for your help,

Emmanuel



Re: [R] combining columns into a combination index of the same length

2015-07-21 Thread Emmanuel Levy
Thanks Thierry, you made my day :)


On 21 July 2015 at 17:00, Thierry Onkelinx thierry.onkel...@inbo.be wrote:

 Please always keep the mailing list in cc.

 If mat is a data.frame, then you can use do.call. Then the number of
 columns doesn't matter.

 do.call(paste, mtcars[, c("mpg", "cyl")])
 do.call(paste, mtcars[, c("mpg", "cyl", "disp")])
 do.call(paste, mtcars)


 ir. Thierry Onkelinx
 Instituut voor natuur- en bosonderzoek / Research Institute for Nature and
 Forest
 team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
 Kliniekstraat 25
 1070 Anderlecht
 Belgium

 To call in the statistician after the experiment is done may be no more
 than asking him to perform a post-mortem examination: he may be able to say
 what the experiment died of. ~ Sir Ronald Aylmer Fisher
 The plural of anecdote is not data. ~ Roger Brinner
 The combination of some data and an aching desire for an answer does not
 ensure that a reasonable answer can be extracted from a given body of data.
 ~ John Tukey

 2015-07-21 15:43 GMT+02:00 Emmanuel Levy emmanuel.l...@gmail.com:

 Thanks! -- this is indeed much faster (plus I made a mistake, one has to
 use paste with the option collapse="").

 The thing is I'm looking for a solution *without paste*. The reason is
 that *there may be two or more columns*.



 On 21 July 2015 at 16:32, Thierry Onkelinx thierry.onkel...@inbo.be
 wrote:

 Yes. paste0() can work on vectors. So paste0(mat[, "col1"], mat[, "col2"])

 ir. Thierry Onkelinx
 Instituut voor natuur- en bosonderzoek / Research Institute for Nature
 and Forest
 team Biometrie & Kwaliteitszorg / team Biometrics & Quality Assurance
 Kliniekstraat 25
 1070 Anderlecht
 Belgium

 To call in the statistician after the experiment is done may be no more
 than asking him to perform a post-mortem examination: he may be able to say
 what the experiment died of. ~ Sir Ronald Aylmer Fisher
 The plural of anecdote is not data. ~ Roger Brinner
 The combination of some data and an aching desire for an answer does not
 ensure that a reasonable answer can be extracted from a given body of data.
 ~ John Tukey

 2015-07-21 15:21 GMT+02:00 Emmanuel Levy emmanuel.l...@gmail.com:

 Hi,

 The answer to this is probably straightforward, I have a dataframe and
 I'd
 like to build an index of column combinations, e.g.

 col1 col2  --> col3 (the index I need)
 A 1   1
 A 1   1
 A 2   2
 B 1   3
 B 2   4
 B 2   4


 At the moment I use:
 col3 <- apply(mat[,sel.col], 1, paste0)

 But I wonder if another approach could be faster?

 Thanks,

 Emmanuel









[R] combining columns into a combination index of the same length

2015-07-21 Thread Emmanuel Levy
Hi,

The answer to this is probably straightforward, I have a dataframe and I'd
like to build an index of column combinations, e.g.

col1 col2  --> col3 (the index I need)
A 1   1
A 1   1
A 2   2
B 1   3
B 2   4
B 2   4


At the moment I use:
col3 <- apply(mat[,sel.col], 1, paste0)

But I wonder if another approach could be faster?

Thanks,

Emmanuel
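
A paste-free way to build such an index, whatever the number of columns, is base R's interaction() (a sketch added for reference; the example data mirrors the table above):

```r
mat <- data.frame(col1 = c("A", "A", "A", "B", "B", "B"),
                  col2 = c(1, 1, 2, 1, 2, 2))
## one factor level per observed column combination, numbered in
## lexical order (first column varies slowest)
key  <- interaction(mat, drop = TRUE, lex.order = TRUE)
col3 <- as.integer(key)
col3  # 1 1 2 3 4 4
```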



[R] Retrieve indexes of the first occurrence of numbers in an effective manner

2012-12-27 Thread Emmanuel Levy
Hi,

That sounds simple but I cannot think of a really fast way of getting
the following:

 c(1,1,2,2,3,3,4,4) would give c(1,3,5,7)

i.e., a function that returns the indexes of the first occurrences of numbers.

Note that numbers may have any order e.g., c(3,4,1,2,1,1,2,3,5), can
be very large, and the vectors are also very large (which prohibits
any loop).

The best I could think of is:

tmp = c(1,1,2,2,3,3,4,4)
u = unique(tmp)
sapply(u, function(x){ which(!is.na(match(tmp,x)))[1]} )

But there might be a useful function I don't know ..

Thanks for any hint.
All the best,

Emmanuel
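
For the record, base R covers this directly with duplicated() or match() (a sketch, not part of the original post):

```r
tmp <- c(1, 1, 2, 2, 3, 3, 4, 4)
which(!duplicated(tmp))   # 1 3 5 7

## equivalently, one index per value in order of first appearance;
## works for arbitrary orderings too:
v <- c(3, 4, 1, 2, 1, 1, 2, 3, 5)
match(unique(v), v)       # 1 2 3 4 9
```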



[R] Finding (swapped) repetitions of numbers pairs across two columns

2012-12-27 Thread Emmanuel Levy
Hi,

I've had this problem for a while and tackled it in a quite dirty way,
so I'm wondering if a better solution exists:

If we have two vectors:

v1 = c(0,1,2,3,4)
v2 = c(5,3,2,1,0)

How to remove one instance of the 3,1 / 1,3 double?

At the moment I'm using the following solution, which is quite horrible:

v1 = c(0,1,2,3,4)
v2 = c(5,3,2,1,0)
ft - cbind(v1, v2)
direction = apply( ft, 1, function(x) return(x[1] > x[2]))
ft.tmp = ft
ft[which(direction),1] = ft.tmp[which(direction),2]
ft[which(direction),2] = ft.tmp[which(direction),1]
uniques = apply( ft, 1, function(x) paste(x, collapse="%") )
uniques = unique(uniques)
ft.unique   = matrix(unlist(strsplit(uniques, "%")), ncol=2, byrow=TRUE)


Any better solution would be very welcome!

All the best,

Emmanuel



Re: [R] Finding (swapped) repetitions of numbers pairs across two columns

2012-12-27 Thread Emmanuel Levy
I did not know that unique worked on entire rows!

That is great, thank you very much!

Emmanuel


On 27 December 2012 22:39, Marc Schwartz marc_schwa...@me.com wrote:
 unique(t(apply(cbind(v1, v2), 1, sort)))
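
Applied to the vectors from the question, Marc's one-liner behaves as follows (a quick check added for reference):

```r
v1 <- c(0, 1, 2, 3, 4)
v2 <- c(5, 3, 2, 1, 0)
## sort each pair so (1,3) and (3,1) become identical rows, then drop repeats
res <- unique(t(apply(cbind(v1, v2), 1, sort)))
nrow(res)  # 4: the swapped (1,3)/(3,1) pair collapsed to one row
```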



[R] How to re-order clusters of hclust output?

2012-05-11 Thread Emmanuel Levy
Hello,

The heatmap function conveniently has a reorder.dendrogram function
so that clusters follow a certain logic.

It seems that the hclust function doesn't have such a feature. I can use
the reorder function on the dendrogram obtained from hclust, but
this does not modify the hclust object itself.

I understand that the answer should be within the heatmap function
given that it uses hclust to work, but I find it very hard to follow
actually. For example, I just don't get what 1L and 2L are.

Any help would be appreciated!

Thanks,

Emmanuel



Re: [R] How to re-order clusters of hclust output?

2012-05-11 Thread Emmanuel Levy
Hi Michael,

Thanks for the info!

I think the key function I was missing is: order.dendrogram and not
REorder.dendrogram. It returns me the new order, so I think I should
get going with that :)

Emmanuel
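
A sketch of that idea, reordering an hclust object via its dendrogram (the data set and weights are arbitrary illustrations; reorder() only flips branches, so the resulting leaf order stays compatible with the merge tree):

```r
hc  <- hclust(dist(USArrests[1:10, ]))
## flip branches so leaves approximate the given weights
dnd <- reorder(as.dendrogram(hc), wts = 10:1)
## write the new leaf order back into the hclust object
hc$order <- order.dendrogram(dnd)
sort(hc$order)   # still a permutation of 1:10
```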



On 11/05/2012, R. Michael Weylandt michael.weyla...@gmail.com wrote:
 I don't have a general answer to your question, but 1L and 2L are just the
 integers 1 and 2 (the L makes them integers instead of doubles which is
 useful for some things)

 Michael

 On May 11, 2012, at 2:15 PM, Emmanuel Levy emmanuel.l...@gmail.com wrote:

 Hello,

 The heatmap function conveniently has a reorder.dendrogram function
 so that clusters follow a certain logic.

 It seems that the hclust function doesn't have such a feature. I can use
 the reorder function on the dendrogram obtained from hclust, but
 this does not modify the hclust object itself.

 I understand that the answer should be within the heatmap function
 given that it uses hclust to work, but I find it very hard to follow
 actually. For example, I just don't get what 1L and 2L are.

 Any help would be appreciated!

 Thanks,

 Emmanuel





[R] How to flatten a multidimensional array into a dataframe?

2012-04-19 Thread Emmanuel Levy
Hi,

I have a three dimensional array, e.g.,

my.array = array(0, dim=c(2,3,4), dimnames=list( d1=c("A1","A2"),
d2=c("B1","B2","B3"), d3=c("C1","C2","C3","C4")) )

what I would like to get is then a dataframe:

d1 d2 d3  value
A1 B1 C1 0
A2 B1 C1 0
.
.
.
A2 B3 C4 0

I'm sure there is one function to do this transformation, I just don't
know which one.

Thanks for your help,

Emmanuel



Re: [R] How to flatten a multidimensional array into a dataframe?

2012-04-19 Thread Emmanuel Levy
OK, it seems that the array2df function from arrayhelpers package does
the job :)
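
In base R, the same flattening is available without extra packages via as.table()/as.data.frame() (a sketch added for reference, using the array from the question):

```r
my.array <- array(0, dim = c(2, 3, 4),
                  dimnames = list(d1 = c("A1", "A2"),
                                  d2 = c("B1", "B2", "B3"),
                                  d3 = c("C1", "C2", "C3", "C4")))
## one row per cell; the first dimension varies fastest
df <- as.data.frame(as.table(my.array), responseName = "value")
head(df, 2)
##   d1 d2 d3 value
## 1 A1 B1 C1     0
## 2 A2 B1 C1     0
```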


On 19 April 2012 16:46, Emmanuel Levy emmanuel.l...@gmail.com wrote:
 Hi,

 I have a three dimensional array, e.g.,

 my.array = array(0, dim=c(2,3,4), dimnames=list( d1=c("A1","A2"),
 d2=c("B1","B2","B3"), d3=c("C1","C2","C3","C4")) )

 what I would like to get is then a dataframe:

 d1 d2 d3  value
 A1 B1 C1 0
 A2 B1 C1 0
 .
 .
 .
 A2 B3 C4 0

 I'm sure there is one function to do this transformation, I just don't
 know which one.

 Thanks for your help,

 Emmanuel



Re: [R] Idea/package to linearize a curve along the diagonal?

2012-03-13 Thread Emmanuel Levy
Dear David and Jeff,

 Only if you were going apply some sort of transformation that did not extend 
 globally

Exactly, this is why the LPCM package is great, as it assigns points
to parts of a curve.

I think I pretty much got what I need - it is not perfect yet but it
should be enough to give you an idea of what I was trying to achieve.

All the best,

Emmanuel


### Example ###
library(LPCM)
tmp=rnorm(2000)
X.1 = 5+tmp
Y.1 = 5+ (5*tmp+rnorm(2000))
tmp=rnorm(1000)
X.2 = 9+tmp
Y.2 = 40+ (1.5*tmp+rnorm(1000))
X.3 = 7+ 0.5*runif(500)
Y.3 = 15+20*runif(500)
Y = c(X.1,X.2,X.3)
X = c(Y.1,Y.2,Y.3)

lpc1 = lpc(cbind(X,Y), scaled=FALSE, h=c(1,1) , control=lpc.control(
boundary=1))
my.proj = lpc.spline(lpc1, project=TRUE, optimize=TRUE)

data = cbind( dist= my.proj$closest.pi, X1=lpc1$data[,1],
Y1=lpc1$data[,2], Xo=my.proj$closest.coords[,1],
Yo=my.proj$closest.coord[,2])
transfoData = matrix(apply(data, 1, function(x) { return( transfo(
(5+x[1])/10,x[2],x[3],x[4],x[5]))}), ncol=2, byrow=TRUE)

plot(transfoData)  ## This shows the result I'm looking for, not
perfect yet but it gives an idea.

###
### Moves a point from its position to the new normalized position
###
transfo = function(dist, X1, Y1, X0, Y0) {
   # First, the point needs to be rotated
   trans=newCoord(X1, Y1, X0, Y0) ;
   Xnew=X1+trans[1]
   Ynew=Y1+trans[2]

   # second it is taken on the diagonal.
   Xfinal=dist
   Yfinal=dist
   X.TransToDiag = Xfinal-X0
   Y.TransToDiag = Yfinal-Y0
   return( c(Xnew+X.TransToDiag, Ynew+Y.TransToDiag))
}

## Rotates a point X1,Y1 relative to Xo,Yo
## The new point is either at 3pi/4 or 7pi/4 i.e., 90 degrees left or
## right of the diagonal.
##
newCoord = function(X1,Y1, Xo=0, Yo=0){

   # First calculates the coordinates of the point relative to Xo,Yo
   Xr = X1-Xo
   Yr = Y1-Yo

   # Now calculates the new coordinates,
   # i.e.,
   # if V is the vector defined from Xo,Yo to X1,Y1,
   # the new coordinates are such that Xf, Yf are at angle TETA
   # by default TETA=3*pi/4 or 135 degrees
   To = atan2(Yr,Xr)

   # XXX This is not perfect but will do the job for now
   if(Yr > Xr){
   TETA=3*pi/4
   } else {
   TETA=7*pi/4
   }
   Xn = Xr * (cos(TETA)/cos(To))
   Yn = Yr * (sin(TETA)/sin(To))

   # Xn, Yn are the new coordinates relative to Xo, Yo
   # However for the translation I need absolute coordinates
   # These are given by Xo + Xn and Y0 + Yn
   Xabs = Xo+Xn
   Yabs = Yo+Yn

   ## the translation that need to be applied to X1 and Y1 are thus:
   Xtrans = Xabs-X1
   Ytrans = Yabs-Y1
   return(c(Xtrans,Ytrans))
}







On 12 March 2012 20:58, David Winsemius dwinsem...@comcast.net wrote:

 On Mar 12, 2012, at 3:07 PM, Emmanuel Levy wrote:

 Hi Jeff,

 Thanks for your reply and the example.

 I'm not sure if it could be applied to the problem I'm facing though,
 for two reasons:

 (i) my understanding is that the inverse will associate a new Y
 coordinate given an absolute X coordinate. However, in the case I'm
 working on, the transformation that has to be applied depends on X
 *and* on its position relative to the *normal* of the fitted curve.
 This means, for instance, that both X and Y will change after
 transformation.

 (ii) the fitted curve can be described by a spline, but I'm not sure
 if inverse of such models can be inferred automatically (I don't know
 anything about that).

 The procedure I envision is the following: treat the curve segment by
 segment, apply rotation+translation to each segment to bring it on
 the
 diagonal,


 That makes sense. Although the way I am imagining it would be to do a
 condition (on x) shift.


 and apply the same transformation to all points
 corresponding to the same segment (i.e., these are the points that are
 close and within the normal area covered by the segment).

 Does this make sense?


 The first part sort of makes sense to me... maybe. You are thinking of some
 sort of local transformation that converts a curve to a straight line by
 rotation or deformation. Seems like a problem of finding a transformation of
 a scalar field. But you then want it extended outward to affect the regions
 at some distance from the curve. That's where might break down or at least
 becomes non-trivial. Because a region at a distance could be in the sights
 of the normal vector to the curve (not in the statistical sense but in the
 vector-field sense) of more than one segment of the curve. Only if you were
 going apply some sort of transformation that did not extend globally would
 you be able to do anything other than a y|x (y conditional on x) shift or
 expansion contraction

 All the best,

 Emmanuel


 On 12 March 2012 02:15, Jeff Newmiller jdnew...@dcn.davis.ca.us wrote:

 It is possible that I do not see what you mean, but it seems like the
 following code does what you want. The general version

Re: [R] Idea/package to linearize a curve along the diagonal?

2012-03-12 Thread Emmanuel Levy
Hi Jeff,

Thanks for your reply and the example.

I'm not sure if it could be applied to the problem I'm facing though,
for two reasons:

(i) my understanding is that the inverse will associate a new Y
coordinate given an absolute X coordinate. However, in the case I'm
working on, the transformation that has to be applied depends on X
*and* on its position relative to the *normal* of the fitted curve.
This means, for instance, that both X and Y will change after
transformation.

(ii) the fitted curve can be described by a spline, but I'm not sure
if inverse of such models can be inferred automatically (I don't know
anything about that).

The procedure I envision is the following: treat the curve segment by
segment, apply rotation+translation to each segment to bring it on
the
diagonal, and apply the same transformation to all points
corresponding to the same segment (i.e., these are the points that are
close and within the normal area covered by the segment).

Does this make sense?
All the best,

Emmanuel


On 12 March 2012 02:15, Jeff Newmiller jdnew...@dcn.davis.ca.us wrote:
 It is possible that I do not see what you mean, but it seems like the 
 following code does what you want. The general version of this is likely to 
 be rather more difficult to do, but in theory the inverse function seems like 
 what you are trying to accomplish.

 x <- 1:20
 y <- x^2 + rnorm(20)

 y.lm <- lm( y ~ I(x^2) + x )
 plot( x, y )
 lines( x, predict( y.lm ) )

 qa <- coef(y.lm)["I(x^2)"]
 qb <- coef(y.lm)["x"]
 qc <- coef(y.lm)["(Intercept)"]

 # define inverse of known model
 f1 <- function( y ) { ( sqrt( 4*qa*( y - qc ) + qb^2 ) - qb ) / ( 2*qa ) }
 f2 <- function( y ) { -( sqrt( 4*qa*( y - qc ) + qb^2 ) + qb ) / ( 2*qa ) }

 plot( x, f1( y ) )


 ---
 Jeff Newmiller                        The     .       .  Go Live...
 DCN:jdnew...@dcn.davis.ca.us        Basics: ##.#.       ##.#.  Live Go...
                                      Live:   OO#.. Dead: OO#..  Playing
 Research Engineer (Solar/Batteries            O.O#.       #.O#.  with
 /Software/Embedded Controllers)               .OO#.       .OO#.  rocks...1k
 ---
 Sent from my phone. Please excuse my brevity.



 Emmanuel Levy emmanuel.l...@gmail.com wrote:

Dear Jeff,

I'm sorry but I do not see what you mean. It'd be great if you could
let me know in more details what you mean whenever you can.

Thanks,

Emmanuel


On 12 March 2012 00:07, Jeff Newmiller jdnew...@dcn.davis.ca.us
wrote:
 Aren't you just reinventing the inverse of a function?


 Emmanuel Levy emmanuel.l...@gmail.com wrote:

Hi,

I am trying to normalize some data. First I fitted a principal curve
(using the LCPM package), but now I would like to apply a
transformation so that the curve becomes a straight diagonal line
on
the plot.  The data used to fit the curve would then be normalized by
applying the same transformation to it.

A simple solution could be to apply translations only (e.g., as done
after a fit using loess), but here rotations would have to be applied
as well. One could visualize this as the stretching of a curve,
i.e., during such an unfolding process, both translations and
rotations would need to be applied.

Before I embark on this (which would require me remembering long
forgotten geometry principles) I was wondering if you can think of
packages or tricks that could make my life simpler?

Thanks for any input,

Emmanuel


Below I provide an example - the black curve is to be brought along
the diagonal, and all data points normal to a small segment (of the
black curve) would undergo the same transformation as it - what
small is remains to be defined though.

    tmp=rnorm(2000)
    X.1 = 5+tmp
    Y.1 = 5+ (5*tmp+rnorm(2000))
    tmp=rnorm(1000)
    X.2 = 9+tmp
    Y.2 = 40+ (1.5*tmp+rnorm(1000))
    X.3 = 7+ 0.5*runif(500)
    Y.3 = 15+20*runif(500)
    X = c(X.1,X.2,X.3)
    Y = c(Y.1,Y.2,Y.3)

    lpc1 = lpc(cbind(X,Y), scaled=FALSE, h=c(1,1) )
    plot(lpc1)


Re: [R] Which non-parametric regression would allow fitting this type of data? (example given).

2012-03-11 Thread Emmanuel Levy
Hello Bert,

Thanks so much for these suggestions. They led me to the package
LPCM, which I found worked best with minimum tuning.

lpc1 = lpc(cbind(X,Y), scaled=TRUE, h=c(0.05,0.05))
plot(lpc1)

... et voila!

All the best,

Emmanuel



On 11 March 2012 00:37, Bert Gunter gunter.ber...@gene.com wrote:
 Thanks for the example.

 Have you tried fitting a principal curve via either the princurve or
 pcurve packages?  I think this might work for what you want, but no
 guarantees.

 Note that loess, splines, etc. are all fitting y|x, that is, a
 nonparametric regression of y on x. That is not what you say you want,
 so these approaches are unlikely to work.


 -- Bert

 On Sat, Mar 10, 2012 at 6:20 PM, Emmanuel Levy emmanuel.l...@gmail.com 
 wrote:
 Hi,

 I'm wondering which function would allow fitting this type of data:

tmp=rnorm(2000)
X.1 = 5+tmp
Y.1 = 5+ (5*tmp+rnorm(2000))
tmp=rnorm(100)
X.2 = 9+tmp
Y.2 = 40+ (1.5*tmp+rnorm(100))
X.3 = 7+ 0.5*runif(500)
Y.3 = 15+20*runif(500)
X = c(X.1,X.2,X.3)
Y = c(Y.1,Y.2,Y.3)
   plot(X,Y)

 The problem with loess is that distances for the goodness of fit are
 calculated on the Y-axis. However, distances would need to be
 calculated on the normals of the fitted curve. Is there a function
 that provide this option?

 A simple trick in that case consists in swapping X and Y, but I'm
 wondering if there is a more general solution?

 Thanks for your input,

 Emmanuel




 --

 Bert Gunter
 Genentech Nonclinical Biostatistics

 Internal Contact Info:
 Phone: 467-7374
 Website:
 http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm



[R] Idea/package to linearize a curve along the diagonal?

2012-03-11 Thread Emmanuel Levy
Hi,

I am trying to normalize some data. First I fitted a principal curve
(using the LCPM package), but now I would like to apply a
transformation so that the curve becomes a straight diagonal line on
the plot.  The data used to fit the curve would then be normalized by
applying the same transformation to it.

A simple solution could be to apply translations only (e.g., as done
after a fit using loess), but here rotations would have to be applied
as well. One could visualize this as the stretching of a curve,
i.e., during such an unfolding process, both translations and
rotations would need to be applied.

Before I embark on this (which would require me remembering long
forgotten geometry principles) I was wondering if you can think of
packages or tricks that could make my life simpler?

Thanks for any input,

Emmanuel


Below I provide an example - the black curve is to be brought along
the diagonal, and all data points normal to a small segment (of the
black curve) would undergo the same transformation as it - what
small is remains to be defined though.

tmp=rnorm(2000)
X.1 = 5+tmp
Y.1 = 5+ (5*tmp+rnorm(2000))
tmp=rnorm(1000)
X.2 = 9+tmp
Y.2 = 40+ (1.5*tmp+rnorm(1000))
X.3 = 7+ 0.5*runif(500)
Y.3 = 15+20*runif(500)
X = c(X.1,X.2,X.3)
Y = c(Y.1,Y.2,Y.3)

lpc1 = lpc(cbind(X,Y), scaled=FALSE, h=c(1,1) )
plot(lpc1)



[R] How to fit a line through the Mountain crest, i.e., through the highest density of points - in a loess-like fashion.

2012-03-10 Thread Emmanuel Levy
Hi,

I'm trying to normalize data by fitting a line through the highest density
of points (in a 2D plot).
In other words, if you visualize the data as a density plot, the fit I'm
trying to achieve is the line that goes through the crest of the mountain.

This is similar yet different to what LOESS does. I've been using loess
before, but it does not exactly that as it takes into account all points.
Although points farther from the fit have a smaller weight, they result in
the fit being a bit off the crest.

Do you know a package or maybe even an option in loess that would allow me
achieve this?
Any advice or idea appreciated.

Emmanuel



[R] How to improve the robustness of loess? - example included.

2012-03-10 Thread Emmanuel Levy
Hi,

I posted a message earlier entitled "How to fit a line through the
Mountain crest ..."

I figured loess is probably the best way, but it seems that the
problem is the robustness of the fit. Below I paste an example to
illustrate the problem:

tmp=rnorm(2000)
X.background = 5+tmp; Y.background = 5+ (10*tmp+rnorm(2000))
X.specific = 3.5+3*runif(1000); Y.specific = 5+120*runif(1000)
X = c(X.background, X.specific);Y = c(Y.background, Y.specific)
MINx=range(X)[1];MAXx=range(X)[2]

my.loess = loess(Y ~ X, data.frame( X = X, Y = Y),
family="symmetric", degree=2, span=0.1)
lo.pred = predict(my.loess, data.frame(X = seq(MINx, MAXx,
length=100)), se=TRUE)
plot( seq(MINx, MAXx, length=100), lo.pred$fit, lwd=2, col=2, type="l")
points(X,Y, col= grey(abs(my.loess$res)/max(abs(my.loess$res))) )

As you will see, the red line does not follow the background signal.
However, when the "specific" signal is decreased to 500 points, the
fit becomes perfect.

I'm sure there is a way to tune the fitting so that it works, but I
can't figure out how. Importantly, *I cannot increase the span*,
because in reality the relationship I'm looking at is more complex, so
I need a small span value to allow a close fit.

I suspect that changing the weighting is the way to go, but I do not
really understand how the "weights" argument is used (I tried to
change it and nothing happened), and the built-in tricube weighting
does not seem changeable.

So any idea would be very welcome.

Emmanuel



Re: [R] How to fit a line through the Mountain crest, i.e., through the highest density of points - in a loess-like fashion.

2012-03-10 Thread Emmanuel Levy
Hi,

Thanks a lot for your reply - I posted a second message where I
provide a dummy example, entitled
"How to improve the robustness of loess? - example included."

I need to fit a curve, which makes it a bit difficult to work with kde2d alone.

I'm actually trying to use kde2d in combination with loess - basically
I give the output density of kde2d as weights to the loess function.
It seems to give nice results :)

In my second post I wrote that the weights option did not work, but
that's because I was writing "weigth" - loess() presumably swallowed
the misspelled argument through its "..." parameter, which would
explain why I did not get an error message.

I'll post the lines of code as a reply to the second post.

All the best,

Emmanuel




On 10 March 2012 19:46, David Winsemius dwinsem...@comcast.net wrote:

 On Mar 10, 2012, at 3:55 PM, Emmanuel Levy wrote:

 Hi,

 I'm trying to normalize data by fitting a line through the highest density
 of points (in a 2D plot).
 In other words, if you visualize the data as a density plot, the fit I'm
 trying to achieve is the line that goes through the crest of the
 mountain.


 Are you familiar with the kde2d or bkde2D functions in various packages? If
 you then collected the max density for each X and Y you might want to see
 whether that 2-d function would follow a sufficiently regular path that
 would represent the projection of the ridge on the z=0 plane.
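A sketch of that suggestion - estimate the 2-D density with MASS::kde2d and, for each x bin, take the y at which the density is highest (the data and bin count below are illustrative):

```r
library(MASS)   # kde2d
set.seed(1)
# illustrative data: a noisy ridge along y = x
x <- rnorm(2000)
y <- x + rnorm(2000, sd = 0.3)
k <- kde2d(x, y, n = 100)                  # k$z is a 100 x 100 density grid
ridge.y <- k$y[apply(k$z, 1, which.max)]   # y of max density in each x bin
# plot(x, y, pch = "."); lines(k$x, ridge.y, col = 2, lwd = 2)
```

With a unimodal conditional density this traces the crest; the resulting rough ridge can then be smoothed, e.g. with loess.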



 This is similar yet different to what LOESS does.


 Do you want a curve or a line?


 I've been using loess
 before, but it does not exactly that as it takes into account all points.
 Although points farther from the fit have a smaller weight, they result in
 the fit being a bit off the crest.

 Do you know a package or maybe even an option in loess that would allow me
 achieve this?


 I don't. I happen to have a dataset where I could test it. But you are
 likely to get better responses if you provide a test case.

 Any advice or idea appreciated.

 Emmanuel

[[alternative HTML version deleted]]


 Plain text is preferred.

 --

 David Winsemius, MD
 West Hartford, CT




Re: [R] How to improve the robustness of loess? - example included.

2012-03-10 Thread Emmanuel Levy
Ok so this seems to work :)


tmp=rnorm(2000)
X.background = 5+tmp
Y.background = 5+ (10*tmp+rnorm(2000))
X.specific = 3.5+3*runif(3000)
Y.specific = 5+120*runif(3000)

X = c(X.background, X.specific)
Y = c(Y.background, Y.specific)

MINx=range(X)[1]
MAXx=range(X)[2]
MINy=range(Y)[1]
MAXy=range(Y)[2]

  ## estimates the density for each datapoint
nBins=50
my.lims= c(range(X,na.rm=TRUE),range(Y,na.rm=TRUE))

z1 = kde2d(X,Y,n=nBins, lims=my.lims, h= c(
(my.lims[2]-my.lims[1])/(nBins/4) ,  (my.lims[4]-my.lims[3])/(nBins/4)
) )
X.cut = cut(X, seq(z1$x[1], z1$x[nBins],len=(nBins+1) ))
Y.cut = cut(Y, seq(z1$y[1], z1$y[nBins],len=(nBins+1) ))
xy.cuts = data.frame(X.cut,Y.cut, ord=1:(length(X.cut)) )
density = data.frame( X=rep(factor(levels(X.cut)),rep(nBins) ),
Y=rep(factor(levels(Y.cut)), rep(nBins,nBins) ) , Z= as.vector(z1$z))

xy.density = merge( xy.cuts, density, by=c(1,2), sort=FALSE, all.x=TRUE)
xy.density = xy.density[order(x=xy.density$ord),]

### Now uses the density as a weight
my.loess = loess(Y ~ X, data.frame( X = X, Y = Y),
family="symmetric", degree=2, span=0.1, weights= xy.density$Z^3)
lo.pred = predict(my.loess, data.frame(X = seq(MINx, MAXx,
length=100)), se=TRUE)
plot( seq(MINx, MAXx, length=100), lo.pred$fit, lwd=2, col=2, type="l")
#, ylim=c(0, max(tmp$fit, na.rm=TRUE) ), col="dark grey")
points(X,Y, pch=".", col= grey(abs(my.loess$res)/max(abs(my.loess$res))) )


On 10 March 2012 18:30, Emmanuel Levy emmanuel.l...@gmail.com wrote:
 Hi,

 I posted a message earlier entitled How to fit a line through the
 Mountain crest ...

 I figured loess is probably the best way, but it seems that the
 problem is the robustness of the fit. Below I paste an example to
 illustrate the problem:

tmp=rnorm(2000)
X.background = 5+tmp; Y.background = 5+ (10*tmp+rnorm(2000))
X.specific = 3.5+3*runif(1000); Y.specific = 5+120*runif(1000)
X = c(X.background, X.specific);Y = c(Y.background, Y.specific)
MINx=range(X)[1];MAXx=range(X)[2]

my.loess = loess(Y ~ X, data.frame( X = X, Y = Y),
 family="symmetric", degree=2, span=0.1)
lo.pred = predict(my.loess, data.frame(X = seq(MINx, MAXx,
 length=100)), se=TRUE)
plot( seq(MINx, MAXx, length=100), lo.pred$fit, lwd=2, col=2, type="l")
points(X,Y, col= grey(abs(my.loess$res)/max(abs(my.loess$res))) )

 As you will see, the red line does not follow the background signal.
 However, when decreasing the specific signal to 500 points it
 becomes perfect.

 I'm sure there is a way to tune the fitting so that it works but I
 can't figure out how. Importantly, *I cannot increase the span*
 because in reality the relationship I'm looking at is more complex so
 I need a small  span value to allow for a close fit.

 I foresee that changing the weigthing is the way to go but I do not
 really understand how the weight option is used (I tried to change
 it and nothing happened), and also the embedded tricubic weighting
 does not seem changeable.

 So any idea would be very welcome.

 Emmanuel



[R] Which non-parametric regression would allow fitting this type of data? (example given).

2012-03-10 Thread Emmanuel Levy
Hi,

I'm wondering which function would allow fitting this type of data:

tmp=rnorm(2000)
X.1 = 5+tmp
Y.1 = 5+ (5*tmp+rnorm(2000))
tmp=rnorm(100)
X.2 = 9+tmp
Y.2 = 40+ (1.5*tmp+rnorm(100))
X.3 = 7+ 0.5*runif(500)
Y.3 = 15+20*runif(500)
X = c(X.1,X.2,X.3)
Y = c(Y.1,Y.2,Y.3)
   plot(X,Y)

The problem with loess is that distances for the goodness of fit are
measured along the Y-axis, whereas here they would need to be
measured along the normals to the fitted curve. Is there a function
that provides this option?

A simple trick in this case consists in swapping X and Y, but I'm
wondering if there is a more general solution?
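The swap trick can be sketched like this (data are illustrative; the point is only that loess(X ~ Y) measures misfit along the X-axis instead of the Y-axis):

```r
set.seed(1)
# illustrative data: a relationship that is steep when viewed as Y vs X
t <- runif(500)
X <- 7 + 0.5 * t + rnorm(500, sd = 0.02)
Y <- 15 + 20 * t + rnorm(500, sd = 0.5)
fit.swapped <- loess(X ~ Y, span = 0.5)   # residuals are now horizontal
X.hat <- predict(fit.swapped, sort(Y))
# plot(X, Y); lines(X.hat, sort(Y), col = 2, lwd = 2)
```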

Thanks for your input,

Emmanuel



[R] Best HMM package to generate random (protein) sequences?

2011-03-22 Thread Emmanuel Levy
Dear All,

I would like to generate random protein sequences using an HMM.
Has anybody done that before, or would you have any idea which package
is likely to be best for that?

The important facts are that the HMM will be fitted on ~3 million
sequential observations, with 20 different symbols (one for each amino
acid). I guess that 2-5 hidden states should be enough, and an order
of 3 would also be enough I would say.
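For the simulation side (not the fitting), a minimal sketch with the CRAN package HMM is below; the two hidden states, the reduced 4-letter alphabet, and all probabilities are toy values, not fitted to real sequences:

```r
library(HMM)   # initHMM() / simHMM() from the HMM package
# toy model: 2 hidden states emitting a reduced 4-letter alphabet
hmm <- initHMM(States  = c("H1", "H2"),
               Symbols = c("A", "C", "D", "E"),
               transProbs    = matrix(c(0.9, 0.1,
                                        0.2, 0.8), 2, byrow = TRUE),
               emissionProbs = matrix(c(0.4, 0.3, 0.2, 0.1,
                                        0.1, 0.2, 0.3, 0.4), 2, byrow = TRUE))
sim <- simHMM(hmm, length = 50)   # sim$observation is the random sequence
```

For fitting on millions of observations, a package such as depmixS4 may be worth a look; note that a plain HMM is first-order, so an "order of 3" would need an expanded state space.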

Thanks a lot for any suggestion!

All the best,

Emmanuel



[R] How to do a probability density based filtering in 2D?

2010-11-19 Thread Emmanuel Levy
Hello,

This sounds like a problem to which many solutions should exist, but I
did not manage to find one.

Basically, given a list of datapoints, I'd like to keep those within
the X% highest-density region.
That would be equivalent to retaining only the points inside a given
contour line of a contour plot.

Thanks to anybody who could let me know which function I could use!

Best,

Emmanuel



Re: [R] How to do a probability density based filtering in 2D?

2010-11-19 Thread Emmanuel Levy
Hello David,

I thought about this at first as well, e.g.,

  x1.lim = quantile(x1, prob=c(0.05,0.95))
  y2.lim = quantile(y2, prob=c(0.05,0.95))

  x1.sub = x1[ x1 > x1.lim[1] & x1 < x1.lim[2] & y2 > y2.lim[1] & y2 < y2.lim[2] ]
  y2.sub = y2[ x1 > x1.lim[1] & x1 < x1.lim[2] & y2 > y2.lim[1] & y2 < y2.lim[2] ]

But this actually does not do what I want, because it is not based
on the density of the data. What I would like is to keep only the
points within an area where the density of points is above some threshold x.

In other words, if you imagine a contour plot, I'd like to keep all
the points within a given contour line. So the data has to be
interpolated or smoothed at some point. I am using the kde2d function
of the MASS package to plot contour lines but I don't know how to
retrieve subsets of what I plot.

Also I wouldn't be surprised if there is a trick with quantile that
escapes my mind.

Thanks for your help,

Emmanuel







On 19 November 2010 21:25, David Winsemius dwinsem...@comcast.net wrote:

 On Nov 19, 2010, at 8:44 PM, Emmanuel Levy wrote:

 Hello,

 This sounds like a problem to which many solutions should exist, but I
 did not manage to find one.

 Basically, given a list of datapoints, I'd like to keep those within
 the X% percentile highest density.
 That would be equivalent to retain only points within a given line of
 a contour plot.

 Thanks to anybody who could let me know which function I could use!

 What's wrong with quantile?

 --
 David Winsemius, MD
 West Hartford, CT





Re: [R] How to do a probability density based filtering in 2D?

2010-11-19 Thread Emmanuel Levy
Hello Roger,

Thanks for the suggestions.

I finally managed to do it using the output of kde2d - the code is
pasted below. Actually, this made me realize that the outcome of kde2d
can be quite influenced by outliers if a bounding box is not given
(try running the code without the bounding box, i.e. the lims
argument, and you will see).

library(MASS)
x1 = rnorm(10000)
x2 = rnorm(10000)
nBins=200

z1 = kde2d(x1,x2,n=nBins, lims=c(-4,4,-4,4))

x1.cut = cut(x1, seq(z1$x[1], z1$x[nBins],len=(nBins+1) ))
x2.cut = cut(x2, seq(z1$y[1], z1$y[nBins],len=(nBins+1) ))
xy.cuts = data.frame(x1.cut,x2.cut, ord=1:(length(x1.cut)) )

density = data.frame( X=rep(factor(levels(x1.cut)),rep(nBins,nBins) ),
Y=rep(factor(levels(x2.cut)),nBins) , Z= as.vector(z1$z))

xy.density = merge( xy.cuts, density, by=c(1,2), sort=FALSE, all.x=TRUE)
xy.density = xy.density[order(x=xy.density$ord),]
xy.density$Z[is.na(xy.density$Z)]=0

Z.lim = quantile(xy.density$Z, prob=c(0.1))
to.plot = xy.density$Z > Z.lim[1]

par(mfrow=c(1,2))
plot(x1,x2, xlim=c(-3,3) ,ylim=c(-3,3))
plot(x1[to.plot], x2[to.plot], xlim=c(-3,3),ylim=c(-3,3))






On 19 November 2010 21:52, Roger Koenker rkoen...@illinois.edu wrote:

 Also I wouldn't be surprised if there is a trick with quantile that
 escapes my mind.

 I would be surprised...  this is all very dependent on how you do the
 density estimation, but you might look at contourLines and then try
 to use in.convex.hull() from tripack, but beware that things get more
 complicated if the density is multi-modal.

 You might also look at one of the packages that do data depth
 sorts of things.
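The contourLines idea above can be sketched as follows (the density level is illustrative; with a multi-modal density the contour may split into several polygons):

```r
library(MASS)   # kde2d; contourLines() is in base grDevices
set.seed(1)
x <- rnorm(1000)
y <- rnorm(1000)
k   <- kde2d(x, y, n = 100)
lev <- quantile(k$z, probs = 0.90)                # pick a density level
cl  <- contourLines(k$x, k$y, k$z, levels = lev)
# each element of cl is a list(level, x, y) giving one contour polygon;
# points can then be tested against it with a point-in-polygon routine
```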

 Roger Koenker
 rkoen...@illinois.edu









[R] problem with PDF/postscript, cannot change paper size: ‘mode(width)’ and ‘mode(height)’ differ between new and previous

2010-11-16 Thread Emmanuel Levy
Hi,

The pdf function would not let me change the paper size and gives me
the following warning:

pdf("figure.pdf", width=6, height=10)
Warning message:
‘mode(width)’ and ‘mode(height)’ differ between new and previous
 ==> NOT changing ‘width’ & ‘height’

If I use the option   paper = "a4r",   it does not give me a warning
but still prints on a square region (it does not use the entire page).

Two days ago I updated R and associated packages. I'm not sure if this
could be the cause? Even if I restart a new session I keep getting the
same error.  I also tried X11.options(reset=TRUE) and it does not
help.

Thanks for any hint you could give me - this is really annoying (the
postscript function gives the same error, so I cannot make any
figures!).
all the best,

Emmanuel

 sessionInfo()
R version 2.12.0 (2010-10-15)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_CA.UTF-8   LC_NUMERIC=C
 [3] LC_TIME=en_CA.UTF-8LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=C  LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=en_CA.UTF-8   LC_NAME=C
 [9] LC_ADDRESS=C   LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base



Re: [R] problem with PDF/postscript, cannot change paper size: ‘mode(width)’ and ‘mode(height)’ differ between new and previous

2010-11-16 Thread Emmanuel Levy
Update - sorry for the stupid question, let's say it's pretty late.

For those who may be as tired as I am and get the same warning: the
width and height should be given as numbers, not as quoted strings!


On 16 November 2010 04:17, Emmanuel Levy emmanuel.l...@gmail.com wrote:
 Hi,

 The pdf function would not let me change the paper size and gives me
 the following warning:

 pdf("figure.pdf", width=6, height=10)
 Warning message:
 ‘mode(width)’ and ‘mode(height)’ differ between new and previous
          ==> NOT changing ‘width’ & ‘height’

 If I use the option   paper = "a4r",   it does not give me a warning
 but still prints on a square region (it does not use the entire page).

 Two days ago I updated R and associated packages. I'm not sure if this
 could be the cause? Even if I restart a new session I keep getting the
 same error.  I also tried X11.options(reset=TRUE) and it does not
 help.

 Thanks for any hint you could give me, this is really annoying (the
 function postscript gives the same error so I cannot make any
 figure!).
 all the best,

 Emmanuel

 sessionInfo()
 R version 2.12.0 (2010-10-15)
 Platform: x86_64-pc-linux-gnu (64-bit)

 locale:
  [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
  [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
  [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8
  [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
  [9] LC_ADDRESS=C               LC_TELEPHONE=C
 [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C

 attached base packages:
 [1] stats     graphics  grDevices utils     datasets  methods   base




[R] Random sampling while keeping distribution of nearest neighbor distances constant.

2009-08-12 Thread Emmanuel Levy
Dear All,

I cannot find a solution to the following problem although I imagine
that it is a classic, hence my email.

I have a vector V of X values, each between 1 and N.

I would like to get random samples of X values, also between
1 and N, but the important point is:
* I would like to keep the same distribution of distances between the X values *

For example, let's say N=10 and I have V = c(3,4,5,6);
then the random values could be 1,2,3,4 or 2,3,4,5 or 3,4,5,6 or 4,5,6,7 etc.,
so that the distribution of distances (3 - 4, 3 - 5, 3 - 6, 4 - 5,
4 - 6, etc.) is kept constant.

I couldn't find a package that helps me with this, but it looks like
it should be a classic problem, so there should be something!

Many thanks in advance for any help or hint you could provide,

All the best,

Emmanuel



[R] Random sampling while keeping distribution of nearest neighbor distances constant.

2009-08-12 Thread Emmanuel Levy
Dear All,(my apologies if it got posted twice, it seems it didn't
get through)

I cannot find a solution to the following problem although I suppose
this is a classic.

I have a vector V of X=length(V) values, each between 1 and N.

I would like to get random samples of X values also comprised between
1 and N, but the important point is:
* I would like to keep the same distribution of distances between the
original X values *

For example, let's say N=10 and I have V = c(3,4,5,6);
then the random values could be 1,2,3,4 or 2,3,4,5 or 3,4,5,6 or 4,5,6,7 etc.,
so that the distribution of distances (3 - 4, 3 - 5, 3 - 6, 4 - 5,
4 - 6, etc.) is kept constant.

I couldn't find a package that helps me with this, but it looks like
it should be a classic problem, so there should be something!

Many thanks in advance for any help or hint you could provide,

All the best,

Emmanuel



Re: [R] Random sampling while keeping distribution of nearest ne

2009-08-12 Thread Emmanuel Levy
Thanks for your suggestion Ted,

This would indeed work for the particular example I gave, but I am
looking for a general solution.

For example, if my values are: V=c(2,4,5,6)
Then there would be two possibilities: 2,4,5,6 or 4,5,6,8
more generally, what I mean is that the matrix of distances between
pairs of values in V should be similar in the vector of random values.

Note that in practice, N is around 7,000,000 and X=length(V) may vary
between 20,000 and 500,000.

It'd be great if you could point me to the name of this class of
problems, to a book, or to a package that could help me solve it.

Many thanks!

Emmanuel


PS: I apologize that I sent a second post. This one did not appear in
my R-help label so I assumed it wasn't sent for some reason.





2009/8/12 Ted Harding ted.hard...@manchester.ac.uk:
 On 12-Aug-09 22:05:24, Emmanuel Levy wrote:
 Dear All,
 I cannot find a solution to the following problem although I imagine
 that it is a classic, hence my email.

 I have a vector V of X values comprised between 1 and N.

 I would like to get random samples of X values also comprised between
 1 and N, but the important point is:
 * I would like to keep the same distribution of distances between the X
 values *

 For example let's say N=10 and I have V = c(3,4,5,6)
 then the random values could be 1,2,3,4 or 2,3,4,5 or 3,4,5,6, or
 4,5,6,7 etc..
 so that the distribution of distances (3 - 4, 3 -5, 3 - 6, 4 -
 5, 4 - 6 etc ...) is kept constant.

 I couldn't find a package that help me with this, but it looks like it
 should be a classic problem so there should be something!

 Many thanks in advance for any help or hint you could provide,
 All the best,
 Emmanuel

 If I've understood you right, you are basically putting a sequence
 with given spacings in a random position amongst the available
 positions. In your example, you would randomly choose between
 1,2,3,4/2,3,4,5/3,4,5,6/4,5,6,7/5,6,7,8/6,7,8,9/7,8,9,10/

 Hence a result Y could be:

  A <- min(V)
  L <- max(V) - A + 1
  M <- (0:(N-L))
  Y <- 1 + (V-A) + sample(M,1)

 I think this does it!

 
 E-Mail: (Ted Harding) ted.hard...@manchester.ac.uk
 Fax-to-email: +44 (0)870 094 0861
 Date: 12-Aug-09                                       Time: 23:49:22
 -- XFMail --




Re: [R] Random sampling while keeping distribution of nearest neighbor distances constant.

2009-08-12 Thread Emmanuel Levy
Dear Daniel,

Thank a lot for your suggestion. It is helpful and got me thinking
more about it so that I can rephrase it:

Given a vector V containing X values, comprised within 1 and N. I'd
like to sample values so that the *distribution* of distances between
the X values is similar.

There are several distributions: the 1st order would be given by the
function diff.
The 2nd order distribution would be given by
diff(V[seq(1,length(V),by=2)]) and diff(V[seq(2,length(V),by=2)]).
The 3rd order distribution by diff(V[seq(1,length(V),by=3)]),
diff(V[seq(2,length(V),by=3)]) and diff(V[seq(3,length(V),by=3)]).
The 4th order ...

I would like to produce different samples, where the first, or first
and second, or first and second and third, or up to say five orders
distance distributions are reproduced.

Is anybody aware of a formalism that is explained in a book and that
could help me deal with this problem? Or even better of a package?

Thanks for your help,

Emmanuel




2009/8/12 Nordlund, Dan (DSHS/RDA) nord...@dshs.wa.gov:
 -Original Message-
 From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On
 Behalf Of Emmanuel Levy
 Sent: Wednesday, August 12, 2009 3:05 PM
 To: r-h...@stat.math.ethz.ch
 Cc: dev djomson
 Subject: [R] Random sampling while keeping distribution of nearest neighbor
 distances constant.

 Dear All,

 I cannot find a solution to the following problem although I imagine
 that it is a classic, hence my email.

 I have a vector V of X values comprised between 1 and N.

 I would like to get random samples of X values also comprised between
 1 and N, but the important point is:
 * I would like to keep the same distribution of distances between the X 
 values *

 For example let's say N=10 and I have V = c(3,4,5,6)
 then the random values could be 1,2,3,4 or 2,3,4,5 or 3,4,5,6, or 4,5,6,7 
 etc..
 so that the distribution of distances (3 - 4, 3 -5, 3 - 6, 4 -
 5, 4 - 6 etc ...) is kept constant.

 I couldn't find a package that help me with this, but it looks like it
 should be a classic problem so there should be something!

 Many thanks in advance for any help or hint you could provide,

 All the best,

 Emmanuel


 Emmanuel,

 I don't know if this is a classic problem or not.  But given your
 description, you could write your own function, something like this:

 sample.dist <- function(vec, Min=1, Max=10){
  diffs <- c(0, diff(vec))
  sum_d <- sum(diffs)
  sample(Min:(Max-sum_d), 1) + cumsum(diffs)
  }

 Where Min and Max are the minimum and maximum values that you are sampling 
 from (Min=1 and Max=10 in your example), and vec is passed the vector that 
 you are sampling distances from.  This assumes that your vector is sorted 
 smallest to largest as in your example.   The function could be changed to 
 accommodate a vector that isn't sorted.

 V <- sort(sample(1:100,4))
 V
 #[1] 46 78 82 95
 sample.dist(V, Min=1, Max=100)
 #[1] 36 68 72 85
 sample.dist(V, Min=1, Max=100)
 #[1] 12 44 48 61

 This should get you started at least.  Hope this is helpful,

 Dan

 Daniel J. Nordlund
 Washington State Department of Social and Health Services
 Planning, Performance, and Accountability
 Research and Data Analysis Division
 Olympia, WA  98504-5204


 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




Re: [R] Random sampling while keeping distribution of nearest neighbor distances constant.

2009-08-12 Thread Emmanuel Levy
 But if the 1st order differences are the same, then doesn't it follow that 
 the 2nd, 3rd, ... order differences must be the same between the original and 
 the new random vector.  What am I missing?

You are missing nothing - sorry, I wrote something wrong. What I would
like to have preserved is the distance to the *nearest* neighbor, so
diff is not the way to go. If you only consider the nearest neighbor,
then

c(3,4,8,9) and c(4,5,6,7) are the same in terms of first order (all
nearest neighbors are 1 unit away) but not in terms of second order.

Also, I don't know if there would be a simple way to maintain a
*distribution* of distances (even if not of nearest neighbor).
For example, c(2,4,5,6) could be c(1,3,4,5), c(3,5,6,7) as proposed by
your solution, but it could also be: c(4,5,6,8)
Or, c(2,3,6,7,8) could be c(2,3,4,7,8)

Actually, that's really simple! I can simply resample the diff vector!
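That resampling idea can be sketched as follows (N and V are illustrative): permute the gaps between the sorted values, then place the resulting pattern at a random offset within 1..N:

```r
set.seed(1)
N <- 10
V <- c(2, 4, 5, 6)
gaps  <- sample(diff(sort(V)))        # permuted consecutive gaps
start <- sample(1:(N - sum(gaps)), 1) # random placement of the pattern
V.new <- cumsum(c(start, gaps))       # same multiset of gaps, within 1..N
```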

OK so the only problem becomes the 1st, 2d, 3rd order thing now, but
you made me realize that I can skip it for the moment.

Thank you! :-)

Emmanuel




 Dan

 Daniel J. Nordlund
 Washington State Department of Social and Health Services
 Planning, Performance, and Accountability
 Research and Data Analysis Division
 Olympia, WA  98504-5204


 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




[R] Is it normal that normalize.loess does not tolerate a single NA value?

2009-03-13 Thread Emmanuel Levy
Dear all,

I have been using normalize.loess and I get the following error
message when my matrix contains NA values:

 my.mat = matrix(nrow=100, ncol=4, runif(400) )
 my.mat[1,1]=NA
 my.mat.n = normalize.loess(my.mat, verbose=TRUE)
Done with 1 vs 2 in iteration 1
Done with 1 vs 3 in iteration 1
Done with 1 vs 4 in iteration 1
Done with 2 vs 3 in iteration 1
Done with 2 vs 4 in iteration 1
Done with 3 vs 4 in iteration 1
1 0.317319
Warning messages:
1: In means[, j] + aux :
  longer object length is not a multiple of shorter object length
2: In means[, k] - aux :
  longer object length is not a multiple of shorter object length
...

Is that normal? If not, do you have any suggestion for avoiding it? I'm
worried that this introduces artifacts into the normalization.
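One hedged workaround (an assumption, not a fix of normalize.loess itself) is to normalize only the complete rows and leave the NA rows untouched:

```r
my.mat <- matrix(runif(400), nrow = 100, ncol = 4)
my.mat[1, 1] <- NA
ok <- complete.cases(my.mat)       # rows with no NA
my.mat.n <- my.mat
# normalize only the complete rows (normalize.loess is from affy):
# my.mat.n[ok, ] <- normalize.loess(my.mat[ok, ])
```

Whether dropping rows is acceptable depends on how many probes carry NAs and on the normalization you need.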

Thanks for your help,

Emmanuel


 sessionInfo()
R version 2.8.0 (2008-10-20)
x86_64-pc-linux-gnu

locale:
LC_CTYPE=en_CA.UTF-8;LC_NUMERIC=C;LC_TIME=en_CA.UTF-8;LC_COLLATE=en_CA.UTF-8;LC_MONETARY=C;LC_MESSAGES=en_CA.UTF-8;LC_PAPER=en_CA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_CA.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] splines   tools stats graphics  grDevices utils datasets
[8] methods   base

other attached packages:
 [1] limma_2.16.3GOstats_2.4.0   Category_2.4.0
 [4] genefilter_1.22.0   survival_2.34-1 RBGL_1.16.0
 [7] annotate_1.20.1 xtable_1.5-4GO.db_2.2.0
[10] AnnotationDbi_1.2.2 RSQLite_0.6-9   DBI_0.2-4
[13] graph_1.18.0YEAST_2.0.1 affy_1.20.0
[16] Biobase_2.2.1

loaded via a namespace (and not attached):
[1] affyio_1.10.1cluster_1.11.11  preprocessCore_1.4.0
[4] tcltk_2.8.0



Re: [R] Mathematica now working with Nvidia GPUs -- any plan for R?

2008-11-20 Thread Emmanuel Levy
Dear Brian, Mose, Peter and Stefan,

Thanks a lot for your replies - the issues are now clearer to me. (and
I apologize for not using the appropriate list).

Best wishes,

Emmanuel




2008/11/19 Peter Dalgaard [EMAIL PROTECTED]:
 Stefan Evert wrote:

 On 19 Nov 2008, at 07:56, Prof Brian Ripley wrote:

 I just read an announcement saying that Mathematica is launching a
 version working with Nvidia GPUs. It is claimed that it'd make it
 ~10-100x faster!
 http://www.physorg.com/news146247669.html

 Well, lots of things are 'claimed' in marketing (and Wolfram is not
 shy to claim).  I think that you need lots of GPUs, as well as the
 right problem.

 Which makes me wonder whether R would be able to make use of this
 processing power, since the figures claimed by Wolfram are very probably
 for single-precision floats? (Am I right in thinking that R only works
 with double precision?)

 According to the nVidia Web site, the Tesla architecture is _ten times_
 slower for double-precision operations than for single-precision, which
 makes it seem far less amazing than at first sight.


 It wouldn't be rocket science to add a single-precision data type to R,
 it is just a large amount of work to get all details right, and noone
 has found it worth the trouble.

 Probably more of a stumbling block is that GPUs traditionally have had a
 rather cavalier attitude to floating-point standards. We have had our
 share of problems with compilers that breaks the rules (or users setting
 flags like -fast-math that allow rule breakage), and ensuing breakage of
 user code and/or make check, so I don't think it is viable to try and
 get the GPU to take on all of  R's computations. Special-purpose
 computation, e.g. an R package interfacing to libgpublas.so or
 whatever,  is always a possibility, but that really goes without saying.

 --
   O__   Peter Dalgaard Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
  (*) \(*) -- University of Copenhagen   Denmark  Ph:  (+45) 35327918
 ~~ - ([EMAIL PROTECTED])  FAX: (+45) 35327907

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Mathematica now working with Nvidia GPUs -- any plan for R?

2008-11-18 Thread Emmanuel Levy
Dear All,

I just read an announcement saying that Mathematica is launching a
version working with Nvidia GPUs. It is claimed that it'd make it
~10-100x faster!
http://www.physorg.com/news146247669.html

I was wondering if you are aware of any development going into this
direction with R?

Thanks for sharing your thoughts,

Best wishes,

Emmanuel

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] gregexpr slow and increases exponentially with string length -- how to speed it up?

2008-10-30 Thread Emmanuel Levy
Dear All,

I have a long string and need to search for regular expressions in
there. However it becomes horribly slow as the string length
increases.

Below is an example: when i increases 5-fold, the time spent increases
by much more than 5-fold! (my actual string is 11,000,000 letters long!)

I also noticed that
- the search time increases dramatically with the number of matches found.
- the perl=T option slows down the search

Any idea to speed this up would be greatly appreciated!

Best,

Emmanuel


 for (i in c(10000, 50000, 100000, 500000)){
+   aa = as.character(sample(1:9, i, replace=T))
+   aa = paste(aa, collapse='')
+   print(i)
+   print(system.time(gregexpr("[367]2[1-9][129]", aa)))
+ }
[1] 10000
   user  system elapsed
  0.004   0.000   0.003
[1] 50000
   user  system elapsed
  0.060   0.000   0.061
[1] 1e+05
   user  system elapsed
  0.240   0.000   0.238
[1] 5e+05
   user  system elapsed
  5.733   0.000   5.732


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] gregexpr slow and increases exponentially with string length -- how to speed it up?

2008-10-30 Thread Emmanuel Levy
Hi Chuck,

Thanks a lot for your suggestion.

 You can find all such matches (not just the disjoint ones that gregexpr
 finds) using something like this:

twomatch <- function(x,y) intersect(x+1, y)
match.list <-
list(
which( vec %in% c(3,6,7) ),
which( vec == 2 ),
which( vec %in% 1:9 ),
which( vec %in% c(1,2,9) ) )
res <- Reduce( twomatch, match.list ) - length(match.list) + 1


I should have made explicit that I have many of these motifs to
match, and their structures vary quite a bit. This means that I'd need
a function to translate each motif into the solution you proposed,
which, although feasible, would be a bit painful.
In the meantime, the best solution I found is to cut the big string
into smaller strings. That actually speeds things up a lot.
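The chunking idea can be sketched as below. This is only an illustration, not the code actually used in the thread: `chunked.gregexpr`, the chunk size, and the 3-character overlap (one less than the 4-character motif length) are all assumptions.

```r
# Hypothetical helper: run gregexpr() chunk by chunk and map match
# positions back to global coordinates. The overlap must be at least
# one less than the longest possible match (3 here, for the
# 4-character motif [367]2[1-9][129]).
chunked.gregexpr <- function(s, pattern, chunk.size = 1e5, overlap = 3) {
  n <- nchar(s)
  hits <- integer(0)
  for (st in seq(1, n, by = chunk.size)) {
    piece <- substr(s, st, min(n, st + chunk.size - 1 + overlap))
    m <- gregexpr(pattern, piece)[[1]]
    # Keep only matches starting inside the chunk proper; matches
    # starting in the overlap are found again by the next chunk.
    if (m[1] != -1) hits <- c(hits, st - 1 + m[m <= chunk.size])
  }
  hits
}
```

Note that, like gregexpr itself, this reports disjoint matches within each chunk, so results near chunk boundaries may differ slightly from a single gregexpr call on the whole string.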

Best,

E

 If you want to precisely match the gregexpr results, you'll need to filter
 out the overlapping matches.

 HTH,

 Chuck


 Best,

 Emmanuel


 for (i in c(10000, 50000, 100000, 500000)){

 +   aa = as.character(sample(1:9, i, replace=T))
 +   aa = paste(aa, collapse='')
 +   print(i)
 +   print(system.time(gregexpr("[367]2[1-9][129]", aa)))
 + }
 [1] 10000
  user  system elapsed
  0.004   0.000   0.003
 [1] 50000
  user  system elapsed
  0.060   0.000   0.061
 [1] 1e+05
  user  system elapsed
  0.240   0.000   0.238
 [1] 5e+05
  user  system elapsed
  5.733   0.000   5.732


 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide
 http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.


 Charles C. Berry(858) 534-2098
Dept of Family/Preventive
 Medicine
 E mailto:[EMAIL PROTECTED]   UC San Diego
 http://famprevmed.ucsd.edu/faculty/cberry/  La Jolla, San Diego 92093-0901




__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] If I known d1 (density1), and dmix is a mix between d1 and d2 (d2 is unknown), can one infer d2?

2008-10-22 Thread Emmanuel Levy
Dear All,

I hope the title speaks by itself.

I believe that there should be a solution when I see what Mclust is
able to do. However, this problem is quite
particular in that d2 is not known and does not necessarily correspond
to a common distribution (e.g. normal, exponential ...).
However it must look like dmix and d1 to some extent.


To illustrate my problem I worked out a simple example:

Imagine that there are two classes of people:
(i) fast walkers: they achieve a unit distance in a given distribution
of unit time (d1).
(ii) slow walkers: they achieve a unit distance in another
distribution of longer times (d2), cf example below.

If I have a mixed sample that contains X% of fast walkers and (100-X)%
of slow walkers, *is it possible to use it to estimate d2, as well as
X?*

R example:
walk.fast = sample(seq(1,5,length.out=1000),
prob=dlnorm(seq(1,5,length.out=1000)), replace=T)
walk.slow = sample(seq(1,5,length.out=1000),
prob=dlnorm(seq(1,5,length.out=1000), meanlog=1.2), replace=T)

percentage.fast =0.8
walk.all = c(sample(walk.fast, percentage.fast*1000),
sample(walk.slow, (1-percentage.fast)*1000 ) )

plot(density(walk.fast, from=1, to=5))  # d1
lines(density(walk.slow, from=1, to=5), col=2) # d2
lines(density(walk.all, from=1, to=5), col=3)# dmix,
mix between d1 and d2

Is there a method that could allow me to estimate d2 given dmix?
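For what it's worth, if dmix(t) = X*d1(t) + (1-X)*d2(t) pointwise, then d2 follows by subtraction once X is known, and X is bounded by the requirement that the remainder stay non-negative. A rough sketch on the walker example above; the grid size and the min-ratio estimate of X are assumptions, not a validated procedure:

```r
# Density estimates of d1 and dmix on a common grid
g    <- seq(1, 5, length.out = 512)
d1   <- density(walk.fast, from = 1, to = 5, n = 512)$y
dmix <- density(walk.all,  from = 1, to = 5, n = 512)$y

# X cannot exceed min(dmix/d1): the remainder must stay non-negative
X.hat  <- min(dmix[d1 > 1e-6] / d1[d1 > 1e-6])
d2.hat <- pmax(dmix - X.hat * d1, 0) / (1 - X.hat)

lines(g, d2.hat, col = 4, lty = 2)  # compare against the true d2 curve
```

In practice the ratio estimate of X is noisy in the tails, so restricting the minimum to the region where d1 is well estimated would be advisable.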

Any hint would be greatly appreciated.

Best,

Emmanuel

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Mclust problem with mclust1Dplot: Error in to - from : non-numeric argument to binary operator

2008-10-21 Thread Emmanuel Levy
After playing with the data, I figured out what the problem is:
I've got many zeros in the dataset, which probably leads the
algorithm to fit a gaussian with variance=0.

If I remove the zeros it works, but then the decomposition is not as
it should be 

Any idea on how to solve this would be great; is it possible to
somehow force the parameters (e.g variance) to be
greater than a particular threshold?
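One possibility (an assumption on my part, not something checked against this dataset) is mclust's conjugate prior, which regularizes the variance estimates and typically prevents a component from collapsing onto the tied zeros:

```r
library(mclust)
# The prior shrinks variance estimates away from zero, so the many
# exact zeros no longer produce a degenerate variance=0 component.
my.data.model <- Mclust(my.data, G = 1:3, modelNames = "V",
                        prior = priorControl())
```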

Thanks,

Emmanuel



2008/10/20 Emmanuel Levy [EMAIL PROTECTED]:
 Dear list members,

 I am using Mclust in order to deconvolute a distribution that I
 believe is a sum of two gaussians.

 First I can make a model:
 my.data.model = Mclust(my.data, modelNames=c("E"), warn=T, G=1:3)

 But then, when I try to plot the result, I get the following error:

 mclust1Dplot(my.data.model, parameters = my.data.model$parameters,
 what = "density")
 Error in to - from : non-numeric argument to binary operator

 Also, I'd like to allow each gaussian to have a different variance
 (modelNames=c("V")), but then I get another error message:

 my.data.model = Mclust(my.data, modelNames=c("V"), warn=T, G=1:3)
 Warning messages:
 1: In meV(data = data, z = z, prior = prior, control = control, warn = warn) :
  sigma-squared falls below threshold
 2: In meV(data = data, z = z, prior = prior, control = control, warn = warn) :
  sigma-squared falls below threshold
 3: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
  best model occurs at the min or max # of components considered
 4: In Mclust(my.data, modelNames = c("V"), warn = T, G = 1:3) :
  optimal number of clusters occurs at min choice

 Many thanks in advance for your help,

 Best wishes,

 Emmanuel


 If you would like to reproduce the above, the dataset is:

 my.data=c( 0.,0.0052,0.,-0.2136,0.4625,0.6047,0.,0.7370,0.5059
 ,-0.8060,-1.0790,0.,-1.5397,-0.0720,-3.2180,-1.6980,0.,2.2845
 ,-1.0741,0.,0.1020,-0.6010,0.2210,-0.0120,1.0785,0.,-0.4536
 ,-0.1127,-0.2032,-0.0421,-1.6818,-0.9935,-2.2105,-0.7963,-0.1820,-2.0468
 ,0.6161,-1.7663,-0.6800,-2.1290,-0.0167,0.,0.,0.,0.5427
 ,-0.0170,0.,0.,-0.6576,0.9055,0.1409,-0.1409,0.,0.3730
 ,-0.1800,-1.3141,0.6786,-0.2480,-2.5110,-0.1340,0.3000,-1.7350,0.
 ,-0.5464,0.,-0.7513,-1.9056,-1.4823,-0.5376,-0.4516,-1.1391,0.
 ,-2.2560,1.3770,0.3390,-2.6023,-1.0880,-0.1444,0.,-0.1459,0.1740
 ,0.,0.3310,0.0749,1.0360,-0.8345,-0.6843,-3.5171,-1.9482,-0.4972
 ,-0.0130,-2.0290,-0.2812,0.,0.,-0.0164,0.,-1.9220,-1.5941
 ,-1.0840,0.,0.0459,-2.2121,-1.1485,-1.1485,0.,-0.4449,-0.5001
 ,0.3520,1.9980,-3.8385,1.7160,1.0020,-0.2250,-0.8265,-0.2032)


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Mclust problem with mclust1Dplot: Error in to - from : non-numeric argument to binary operator

2008-10-21 Thread Emmanuel Levy
Dear All,

I haven't found a solution to the variance problem. However, I could
solve the plotting problem by plotting the data myself.
I think that the problem is due to a change in the data structure
returned by the function Mclust.

The web-page:
http://www.childrensmercy.org/stats/weblog2006/UnivariateClustering.asp
has been extremely helpful.

I paste the code below for people's future reference.

par(mfcol=c(1,2), mar=c(2.1,0.1,0.1,0.1))
v <- Mclust(my.data)
x0 <- seq(min(my.data), max(my.data), length.out=100)
nt <- rep(0,100)
plot(range(my.data), c(0,0.15), type="n", xlab="", ylab="", axes=F, ylim=c(0,0.4))
axis(side=1)
for (i in 1:2) {
  ni <- v$parameters$pro[i]*dnorm(x0,
    mean=as.numeric(v$parameters$mean[i]), sd=1)
  lines(x0, ni, col=1)
  nt <- nt + ni
}
lines(x0, nt, lwd=3)
segments(my.data, 0, my.data, 0.02)

Best,

Emmanuel



2008/10/21 Emmanuel Levy [EMAIL PROTECTED]:
 After playing with the data, I figured out what the problem is:
 I've got many zeros in the dataset, which probably leads the
 algorithm to fit a gaussian with variance=0.

 If I remove the zeros it works, but then the decomposition is not as
 it should be 

 Any idea on how to solve this would be great; is it possible to
 somehow force the parameters (e.g variance) to be
 greater than a particular threshold?

 Thanks,

 Emmanuel



 2008/10/20 Emmanuel Levy [EMAIL PROTECTED]:
 Dear list members,

 I am using Mclust in order to deconvolute a distribution that I
 believe is a sum of two gaussians.

 First I can make a model:
 my.data.model = Mclust(my.data, modelNames=c("E"), warn=T, G=1:3)

 But then, when I try to plot the result, I get the following error:

 mclust1Dplot(my.data.model, parameters = my.data.model$parameters,
 what = "density")
 Error in to - from : non-numeric argument to binary operator

 Also, I'd like to allow each gaussian to have a different variance
 (modelNames=c("V")), but then I get another error message:

 my.data.model = Mclust(my.data, modelNames=c("V"), warn=T, G=1:3)
 Warning messages:
 1: In meV(data = data, z = z, prior = prior, control = control, warn = warn) 
 :
  sigma-squared falls below threshold
 2: In meV(data = data, z = z, prior = prior, control = control, warn = warn) 
 :
  sigma-squared falls below threshold
 3: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
  best model occurs at the min or max # of components considered
 4: In Mclust(my.data, modelNames = c("V"), warn = T, G = 1:3) :
  optimal number of clusters occurs at min choice

 Many thanks in advance for your help,

 Best wishes,

 Emmanuel


 If you would like to reproduce the above, the dataset is:

 my.data=c( 0.,0.0052,0.,-0.2136,0.4625,0.6047,0.,0.7370,0.5059
 ,-0.8060,-1.0790,0.,-1.5397,-0.0720,-3.2180,-1.6980,0.,2.2845
 ,-1.0741,0.,0.1020,-0.6010,0.2210,-0.0120,1.0785,0.,-0.4536
 ,-0.1127,-0.2032,-0.0421,-1.6818,-0.9935,-2.2105,-0.7963,-0.1820,-2.0468
 ,0.6161,-1.7663,-0.6800,-2.1290,-0.0167,0.,0.,0.,0.5427
 ,-0.0170,0.,0.,-0.6576,0.9055,0.1409,-0.1409,0.,0.3730
 ,-0.1800,-1.3141,0.6786,-0.2480,-2.5110,-0.1340,0.3000,-1.7350,0.
 ,-0.5464,0.,-0.7513,-1.9056,-1.4823,-0.5376,-0.4516,-1.1391,0.
 ,-2.2560,1.3770,0.3390,-2.6023,-1.0880,-0.1444,0.,-0.1459,0.1740
 ,0.,0.3310,0.0749,1.0360,-0.8345,-0.6843,-3.5171,-1.9482,-0.4972
 ,-0.0130,-2.0290,-0.2812,0.,0.,-0.0164,0.,-1.9220,-1.5941
 ,-1.0840,0.,0.0459,-2.2121,-1.1485,-1.1485,0.,-0.4449,-0.5001
 ,0.3520,1.9980,-3.8385,1.7160,1.0020,-0.2250,-0.8265,-0.2032)



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] unimodal VS bimodal normal distribution - how to get a pvalue?

2008-10-21 Thread Emmanuel Levy
Dear All,

I have a distribution of values and I would like to assess the
uni/bimodality of the distribution.

I managed to decompose it into two normal distribs using Mclust, and
the BIC criteria is best for two parameters.
However, the problem is that the BIC criteria is not a P-value, which
I would need ideally.

I saw the diptest package but it is not for gaussian distributions.

Any hint at a package or way-around this would be greatly appreciated.

Best,

Emmanuel

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] unimodal VS bimodal normal distribution - how to get a pvalue?

2008-10-21 Thread Emmanuel Levy
Hi Duncan,

I'm really stupid --- yes of course!!

Thanks for pointing me out the (now) obvious.

All the best,

E

2008/10/21 Duncan Murdoch [EMAIL PROTECTED]:
 On 10/21/2008 2:56 PM, Emmanuel Levy wrote:

 Dear All,

 I have a distribution of values and I would like to assess the
 uni/bimodality of the distribution.

 I managed to decompose it into two normal distribs using Mclust, and
 the BIC criteria is best for two parameters.
 However, the problem is that the BIC criteria is not a P-value, which
 I would need ideally.

 I saw the diptest package but it is not for gaussian distributions.

 Any hint at a package or way-around this would be greatly appreciated.

 If your null is really a unimodal gaussian, then doing it by simulation
 should be easy:  simulate lots of normal samples of the same size as your
 observed one, and calculate the same BIC statistic on each of them.  Then
 your p-value is the true proportion of simulated statistics that give
 stronger evidence than the observed one, and is well approximated by the
 simulated proportion.
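That simulation could be sketched roughly as follows; the choice of test statistic (the BIC gain of G=2 over G=1) and of 999 replicates are assumptions:

```r
library(mclust)

# BIC gain of a 2-component over a 1-component equal-variance model
bic.gain <- function(x) {
  b <- mclustBIC(x, G = 1:2, modelNames = "E")
  b[2, "E"] - b[1, "E"]
}

t.obs <- bic.gain(my.data)
# Simulate unimodal-normal samples of the same size under the null
t.sim <- replicate(999, bic.gain(rnorm(length(my.data),
                                       mean(my.data), sd(my.data))))
p.value <- (1 + sum(t.sim >= t.obs)) / (1 + length(t.sim))
```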

 Duncan Murdoch


__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


[R] Mclust problem with mclust1Dplot: Error in to - from : non-numeric argument to binary operator

2008-10-20 Thread Emmanuel Levy
Dear list members,

I am using Mclust in order to deconvolute a distribution that I
believe is a sum of two gaussians.

First I can make a model:
 my.data.model = Mclust(my.data, modelNames=c(E), warn=T, G=1:3)

But then, when I try to plot the result, I get the following error:

 mclust1Dplot(my.data.model, parameters = my.data.model$parameters, what = 
 density)
Error in to - from : non-numeric argument to binary operator

Also, I'd like to allow for each gaussian to have a different variance
(modelNmaes=c(V)) , but then I get another error message:

 my.data.model = Mclust(my.data, modelNames=c(V), warn=T, G=1:3)
Warning messages:
1: In meV(data = data, z = z, prior = prior, control = control, warn = warn) :
  sigma-squared falls below threshold
2: In meV(data = data, z = z, prior = prior, control = control, warn = warn) :
  sigma-squared falls below threshold
3: In summary.mclustBIC(Bic, data, G = G, modelNames = modelNames) :
  best model occurs at the min or max # of components considered
4: In Mclust(my.data, modelNames = c(V), warn = T, G = 1:3) :
  optimal number of clusters occurs at min choice

Many thanks in advance for your help,

Best wishes,

Emmanuel


If you would like to reproduce the above, the dataset is:

my.data=c( 0.,0.0052,0.,-0.2136,0.4625,0.6047,0.,0.7370,0.5059
,-0.8060,-1.0790,0.,-1.5397,-0.0720,-3.2180,-1.6980,0.,2.2845
,-1.0741,0.,0.1020,-0.6010,0.2210,-0.0120,1.0785,0.,-0.4536
,-0.1127,-0.2032,-0.0421,-1.6818,-0.9935,-2.2105,-0.7963,-0.1820,-2.0468
,0.6161,-1.7663,-0.6800,-2.1290,-0.0167,0.,0.,0.,0.5427
,-0.0170,0.,0.,-0.6576,0.9055,0.1409,-0.1409,0.,0.3730
,-0.1800,-1.3141,0.6786,-0.2480,-2.5110,-0.1340,0.3000,-1.7350,0.
,-0.5464,0.,-0.7513,-1.9056,-1.4823,-0.5376,-0.4516,-1.1391,0.
,-2.2560,1.3770,0.3390,-2.6023,-1.0880,-0.1444,0.,-0.1459,0.1740
,0.,0.3310,0.0749,1.0360,-0.8345,-0.6843,-3.5171,-1.9482,-0.4972
,-0.0130,-2.0290,-0.2812,0.,0.,-0.0164,0.,-1.9220,-1.5941
,-1.0840,0.,0.0459,-2.2121,-1.1485,-1.1485,0.,-0.4449,-0.5001
,0.3520,1.9980,-3.8385,1.7160,1.0020,-0.2250,-0.8265,-0.2032)

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] RCurl compilation error on ubuntu hardy

2008-09-17 Thread Emmanuel Levy
Dear Duncan and Vincent,

Thanks for your help. It took me some time to upgrade R because
apt-get install would not actually install the packages,
but I used

sudo aptitude install r-base

and that managed to upgrade the r-base package.

I'm also happy to say that RCurl as well as BiomaRt could now be
installed with success!

Thanks again,

Best wishes,

Emmanuel




2008/9/17 Duncan Temple Lang [EMAIL PROTECTED]:
 -BEGIN PGP SIGNED MESSAGE-
 Hash: SHA1


 Hi Emanuel.

  The best thing to do is upgrade to R-2.7.2 (or any 2.7.*) and the
 problem should disappear.  It relates to encoding of strings.

   D.

 Emmanuel Levy wrote:
 Dear list members,

 I encountered this problem and the solution pointed out in a previous
 thread did not work for me.
(e.g. install.packages("RCurl", repos = "http://www.omegahat.org/R"))

 I work with Ubuntu Hardy, and installed R 2.6.2 via apt-get.

 I really need RCurl in order to use biomaRt ... any help would be
 greatly appreciated.

 Best wishes,

 Emmanuel

 =

 sessionInfo()
 R version 2.6.2 (2008-02-08)
 x86_64-pc-linux-gnu

 locale:
 LC_CTYPE=en_CA.UTF-8;LC_NUMERIC=C;LC_TIME=en_CA.UTF-8;LC_COLLATE=en_CA.UTF-8;LC_MONETARY=en_CA.UTF-8;LC_MESSAGES=en_CA.UTF-8;LC_PAPER=en_CA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_CA.UTF-8;LC_IDENTIFICATION=C

 attached base packages:
 [1] stats graphics  grDevices utils datasets  methods   base

 loaded via a namespace (and not attached):
 [1] rcompgen_0.1-17 tools_2.6.2

 =


 install.packages("RCurl", repos = "http://www.omegahat.org/R")
 Warning in install.packages("RCurl", repos = "http://www.omegahat.org/R") :
   argument 'lib' is missing: using '/usr/local/lib/R/site-library'
 trying URL 'http://www.omegahat.org/R/src/contrib/RCurl_0.9-4.tar.gz'
 Content type 'application/x-gzip' length 150884 bytes (147 Kb)
 opened URL
 ==
 downloaded 147 Kb

 * Installing *source* package 'RCurl' ...
 checking for curl-config... /usr/bin/curl-config
 checking for gcc... gcc
 checking for C compiler default output file name... a.out
 checking whether the C compiler works... yes
 checking whether we are cross compiling... no
 checking for suffix of executables...
 checking for suffix of object files... o
 checking whether we are using the GNU C compiler... yes
 checking whether gcc accepts -g... yes
 checking for gcc option to accept ANSI C... none needed
 checking how to run the C preprocessor... gcc -E
 Version has a libidn field
 configure: creating ./config.status
 config.status: creating src/Makevars
 ** libs
 gcc -std=gnu99 -I/usr/share/R/include -I/usr/share/R/include
 -DHAVE_LIBIDN_FIELD=1 -fpic  -g -O2 -c base64.c -o base64.o
 In file included from base64.c:1:
 Rcurl.h:52: error: expected specifier-qualifier-list before 'cetype_t'
 make: *** [base64.o] Error 1
 chmod: cannot access `/usr/local/lib/R/site-library/RCurl/libs/*': No
 such file or directory
 ERROR: compilation failed for package 'RCurl'
 ** Removing '/usr/local/lib/R/site-library/RCurl'

 The downloaded packages are in
   /tmp/RtmpQ8FMBZ/downloaded_packages
 Warning message:
 In install.packages("RCurl", repos = "http://www.omegahat.org/R") :
   installation of package 'RCurl' had non-zero exit status

 

 Hi Martin,
 are you working on a 64-bit linux distribution and which version of
 RCurl are you trying to install? There has been a problem with a
 recent version of RCurl and the R_base64_decode, search the archives
 of the Bioconductor mailing list for a thread called
 "RCurl loading problem with 64 bit linux distribution".
 Please try using the newest versions of R (R-2.7.0 was released a
 few weeks ago) and RCurl, which you can obtain from within R by
 typing:
 install.packages("RCurl", repos = "http://www.omegahat.org/R")
 The new version of RCurl (>= 0.9.2) worked fine for me, while 0.9.0
 and 0.9.1 did not.
 Hope this helps.
 Joern


 martin sikora wrote:

 dear list members,

 i'm having a problem installing the biomaRt package on my linux
 machine, due to the fact of a compilation error with RCurl. i am using
 R 2.6.2 on fedora 7, and this is the output i get:

 gcc -m32 -std=gnu99 -I/usr/include/R -I/usr/include/R
 -DHAVE_LIBIDN_FIELD=1 -I/usr/local/include-fpic  -O2 -g -pipe
 -Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
 --param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic
 -fasynchronous-unwind-tables -c base64.c -o base64.o
 In file included from base64.c:1:
 Rcurl.h:52: error: expected specifier-qualifier-list before 'cetype_t'
 base64.c: In function 'R_base64_decode':
 base64.c:25: warning: pointer targets in assignment differ in signedness
 base64.c:39: warning: pointer targets in passing argument 1 of
 'Rf_mkString' differ in signedness
 base64.c

[R] RCurl compilation error on ubuntu hardy

2008-09-16 Thread Emmanuel Levy
Dear list members,

I encountered this problem and the solution pointed out in a previous
thread did not work for me.
(e.g. install.packages("RCurl", repos = "http://www.omegahat.org/R"))

I work with Ubuntu Hardy, and installed R 2.6.2 via apt-get.

I really need RCurl in order to use biomaRt ... any help would be
greatly appreciated.

Best wishes,

Emmanuel

=

 sessionInfo()
R version 2.6.2 (2008-02-08)
x86_64-pc-linux-gnu

locale:
LC_CTYPE=en_CA.UTF-8;LC_NUMERIC=C;LC_TIME=en_CA.UTF-8;LC_COLLATE=en_CA.UTF-8;LC_MONETARY=en_CA.UTF-8;LC_MESSAGES=en_CA.UTF-8;LC_PAPER=en_CA.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_CA.UTF-8;LC_IDENTIFICATION=C

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

loaded via a namespace (and not attached):
[1] rcompgen_0.1-17 tools_2.6.2


=


 install.packages("RCurl", repos = "http://www.omegahat.org/R")
Warning in install.packages("RCurl", repos = "http://www.omegahat.org/R") :
  argument 'lib' is missing: using '/usr/local/lib/R/site-library'
trying URL 'http://www.omegahat.org/R/src/contrib/RCurl_0.9-4.tar.gz'
Content type 'application/x-gzip' length 150884 bytes (147 Kb)
opened URL
==
downloaded 147 Kb

* Installing *source* package 'RCurl' ...
checking for curl-config... /usr/bin/curl-config
checking for gcc... gcc
checking for C compiler default output file name... a.out
checking whether the C compiler works... yes
checking whether we are cross compiling... no
checking for suffix of executables...
checking for suffix of object files... o
checking whether we are using the GNU C compiler... yes
checking whether gcc accepts -g... yes
checking for gcc option to accept ANSI C... none needed
checking how to run the C preprocessor... gcc -E
Version has a libidn field
configure: creating ./config.status
config.status: creating src/Makevars
** libs
gcc -std=gnu99 -I/usr/share/R/include -I/usr/share/R/include
-DHAVE_LIBIDN_FIELD=1 -fpic  -g -O2 -c base64.c -o base64.o
In file included from base64.c:1:
Rcurl.h:52: error: expected specifier-qualifier-list before 'cetype_t'
make: *** [base64.o] Error 1
chmod: cannot access `/usr/local/lib/R/site-library/RCurl/libs/*': No
such file or directory
ERROR: compilation failed for package 'RCurl'
** Removing '/usr/local/lib/R/site-library/RCurl'

The downloaded packages are in
/tmp/RtmpQ8FMBZ/downloaded_packages
Warning message:
In install.packages("RCurl", repos = "http://www.omegahat.org/R") :
  installation of package 'RCurl' had non-zero exit status



Hi Martin,
are you working on a 64-bit linux distribution and which version of
RCurl are you trying to install? There has been a problem with a
recent version of RCurl and the R_base64_decode, search the archives
of the Bioconductor mailing list for a thread called
"RCurl loading problem with 64 bit linux distribution".
Please try using the newest versions of R (R-2.7.0 was released a
few weeks ago) and RCurl, which you can obtain from within R by
typing:
install.packages("RCurl", repos = "http://www.omegahat.org/R")
The new version of RCurl (>= 0.9.2) worked fine for me, while 0.9.0
and 0.9.1 did not.
Hope this helps.
Joern


martin sikora wrote:

dear list members,

i'm having a problem installing the biomaRt package on my linux
machine, due to the fact of a compilation error with RCurl. i am using
R 2.6.2 on fedora 7, and this is the output i get:

gcc -m32 -std=gnu99 -I/usr/include/R -I/usr/include/R
-DHAVE_LIBIDN_FIELD=1 -I/usr/local/include-fpic  -O2 -g -pipe
-Wall -Wp,-D_FORTIFY_SOURCE=2 -fexceptions -fstack-protector
--param=ssp-buffer-size=4 -m32 -march=i386 -mtune=generic
-fasynchronous-unwind-tables -c base64.c -o base64.o
In file included from base64.c:1:
Rcurl.h:52: error: expected specifier-qualifier-list before 'cetype_t'
base64.c: In function 'R_base64_decode':
base64.c:25: warning: pointer targets in assignment differ in signedness
base64.c:39: warning: pointer targets in passing argument 1 of
'Rf_mkString' differ in signedness
base64.c: In function 'R_base64_encode':
base64.c:60: warning: pointer targets in assignment differ in signedness
make: *** [base64.o] Error 1

as far as i know i have all the necessary libraries installed:

$ yum list installed | grep libxml
libxml2.i386 2.6.31-1.fc7
installed  libxml2-devel.i386   2.6.31-1.fc7
installed  libxml2-python.i386
2.6.31-1.fc7   installed  perl-libxml-perl.noarch
0.08-1.2.1 installed
$ yum list installed | grep curl
curl.i3867.16.4-1.fc7
installed  curl-devel.i386  7.16.4-1.fc7
installed  python-pycurl.i386

Re: [R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread Emmanuel Levy
Dear Peter and Henrik,

Thanks for your replies - this helps speed up a bit, but I thought
there would be something much faster.

What I mean is that I thought that a particular value of a level
could be accessed instantly, similarly to a hash key.

Since I've got about 6000 levels in that data frame, it means that
making a list L of the form
L[[1]] = values of name 1
L[[2]] = values of name 2
L[[3]] = values of name 3
...
would take ~1 hour.

Best,

Emmanuel




2008/8/12 Henrik Bengtsson [EMAIL PROTECTED]:
 To simplify:

 n <- 2.7e6;
 x <- factor(c(rep("A", n/2), rep("B", n/2)));

 # Identify 'A':s
 t1 <- system.time(res <- which(x == "A"));

 # To compare a factor to a string, the factor is in practice
 # coerced to a character vector.
 t2 <- system.time(res <- which(as.character(x) == "A"));

 # Interestingly enough, this seems to be faster (repeated many times)
 # Don't know why.
 print(t2/t1);
    user   system  elapsed
 0.632653 1.60 0.754717

 # Avoid coercing the factor, but instead coerce the level compared to
 t3 <- system.time(res <- which(x == match("A", levels(x))));

 # ...but gives no speed up
 print(t3/t1);
    user   system  elapsed
 1.041667 1.00 1.018182

 # But coercing the factor to integers does
 t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))));
 print(t4/t1);
    user    system   elapsed
 0.417 0.000 0.3636364

 So, the latter seems to be the fastest way to identify those elements.

 My $.02

 /Henrik


 On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan [EMAIL PROTECTED] wrote:
 Emmanuel,

 On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy [EMAIL PROTECTED] wrote:
 Dear All,

 I have a large data frame (~2,700,000 lines and 14 columns), and I would
 like to extract the information in a particular way, illustrated below:


 Given a data frame df:

 col1=sample(c(0,1),10, rep=T)
 names = factor(c(rep("A",5),rep("B",5)))
 df = data.frame(names,col1)
 df
    names col1
 1      A    1
 2      A    0
 3      A    1
 4      A    0
 5      A    1
 6      B    0
 7      B    0
 8      B    1
 9      B    0
 10     B    0

 I would like to transform it in the form:

 index = c("A","B")
 col1[[1]]=df$col1[which(df$name=="A")]
 col1[[2]]=df$col1[which(df$name=="B")]

 I'm not sure I fully understand your problem; your example would not run
 for me.

 You can get a small speedup by omitting which(); subsetting by a
 logical vector directly is slightly faster.

 n <- 2700000
 foo <- data.frame(
+   one = sample(c(0,1), n, rep = T),
+   two = factor(c(rep("A", n/2 ), rep("B", n/2 )))
+   )
 system.time(out <- which(foo$two=="A"))
   user  system elapsed
  0.566   0.146   0.761
 system.time(out <- foo$two=="A")
   user  system elapsed
  0.429   0.075   0.588

 You might also find use for unstack(), though I didn't see a speedup.
 system.time(out <- unstack(foo))
   user  system elapsed
  1.068   0.697   2.004

 HTH

 Peter

 My problem is that the command:  *** which(df$name=="A") ***
 takes about 1 second because df is so big.

 I was thinking that a level could maybe be accessed instantly but I am not
 sure about how to do it.

 I would be very grateful for any advice that would allow me to speed this 
 up.

 Best wishes,

 Emmanuel

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.



__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread Emmanuel Levy
Sorry for being unclear, I thought the example above was clear enough.

I have a data frame of the form:

  name   info
1  YAL001C 1
2  YAL001C 1
3  YAL001C 1
4  YAL001C 1
5  YAL001C 0
6  YAL001C 1
7  YAL001C 1
8  YAL001C 1
9  YAL001C 1
10 YAL001C 1
...
...
~2,700,000 lines, and ~6000 different names.

which corresponds to yeast proteins + some info.
So there are about 6000 names like YAL001C

I would like to transform this data frame into the following form:

1/ a list, where each protein corresponds to an index, and the info is
the vector
 L[[1]]
[1] 1 1 1 1 0 1 1 1 1 1 0 0 0 0 0 0 0 1 1 1 
 L[[2]]
[1] 0 0 0 0 0 0 1 0 1 1 0 0 0 0 0 0 0 0 0 0 
etc.

2/ an index, which gives me the position of each protein in the list:
 index
[1] YAL001C YAL002W YAL003W YAL005C YAL007C ...

I hope this will be clearer!

I'll have a look right now at the split and hash.mat functions.
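For the record, split() gives exactly the structure described above in one pass, without any per-name which() calls (a sketch, assuming the data frame columns are called name and info):

```r
L <- split(df$info, df$name)  # one info vector per protein, in one pass
index <- names(L)             # protein name for each list position
L[["YAL001C"]]                # hash-like lookup of a single protein
```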

Thanks for your help,

Emmanuel




2008/8/13 Erik Iverson [EMAIL PROTECTED]:
 I still don't understand what you are doing.  Can you make a small example
 that shows what you have and what you want?

 Is ?split what you are after?

 Emmanuel Levy wrote:

 Dear Peter and Henrik,

 Thanks for your replies - this helps speed up a bit, but I thought
 there would be something much faster.

 What I mean is that I thought that a particular value of a level
 could be accessed instantly, similarly to a hash key.

 Since I've got about 6000 levels in that data frame, it means that
 making a list L of the form
 L[[1]] = values of name 1
 L[[2]] = values of name 2
 L[[3]] = values of name 3
 ...
 would take ~1 hour.

 Best,

 Emmanuel




 2008/8/12 Henrik Bengtsson [EMAIL PROTECTED]:

 To simplify:

 n <- 2.7e6;
 x <- factor(c(rep("A", n/2), rep("B", n/2)));

 # Identify 'A':s
 t1 <- system.time(res <- which(x == "A"));

 # To compare a factor to a string, the factor is in practice
 # coerced to a character vector.
 t2 <- system.time(res <- which(as.character(x) == "A"));

 # Interestingly enough, this seems to be faster (repeated many times)
 # Don't know why.
 print(t2/t1);
      user   system  elapsed
  0.632653 1.600000 0.754717

 # Avoid coercing the factor, but instead coerce the level compared to
 t3 <- system.time(res <- which(x == match("A", levels(x))));

 # ...but gives no speed up
 print(t3/t1);
      user   system  elapsed
  1.041667 1.000000 1.018182

 # But coercing the factor to integers does
 t4 <- system.time(res <- which(as.integer(x) == match("A", levels(x))))
 print(t4/t1);
       user    system   elapsed
  0.4166667 0.0000000 0.3636364

 So, the latter seems to be the fastest way to identify those elements.

 My $.02

 /Henrik


 On Tue, Aug 12, 2008 at 7:31 PM, Peter Cowan [EMAIL PROTECTED] wrote:

 Emmanuel,

 On Tue, Aug 12, 2008 at 4:35 PM, Emmanuel Levy [EMAIL PROTECTED]
 wrote:

 Dear All,

 I have a large data frame (~2,700,000 lines and 14 columns), and I would
 like to extract the information in a particular way, illustrated below:


 Given a data frame df:

 col1 = sample(c(0,1), 10, rep=T)
 names = factor(c(rep("A",5), rep("B",5)))
 df = data.frame(names, col1)
 df

    names col1
 1      A    1
 2      A    0
 3      A    1
 4      A    0
 5      A    1
 6      B    0
 7      B    0
 8      B    1
 9      B    0
 10     B    0

 I would like to transform it into the form:

 index = c("A","B")
 col1[[1]] = df$col1[which(df$name=="A")]
 col1[[2]] = df$col1[which(df$name=="B")]

 I'm not sure I fully understand your problem; your example would not run
 for me.

 You could get a small speedup by omitting which(): subsetting by a
 logical vector directly also gives a small speedup.

 n <- 2700000
 foo <- data.frame(
   one = sample(c(0,1), n, rep = T),
   two = factor(c(rep("A", n/2), rep("B", n/2)))
   )

 system.time(out <- which(foo$two=="A"))
   user  system elapsed
  0.566   0.146   0.761

 system.time(out <- foo$two=="A")
   user  system elapsed
  0.429   0.075   0.588

 You might also find use for unstack(), though I didn't see a speedup.

 system.time(out <- unstack(foo))
   user  system elapsed
  1.068   0.697   2.004

 HTH

 Peter

 My problem is that the command:  *** which(df$name=="A") ***
 takes about 1 second because df is so big.

 I was thinking that a level could maybe be accessed instantly but I
 am not
 sure about how to do it.

 I would be very grateful for any advice that would allow me to speed
 this up.

 Best wishes,

 Emmanuel


Re: [R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?

2008-08-13 Thread Emmanuel Levy
Wow great! Split was exactly what was needed. It takes about 1 second
for the whole operation :D

Thanks again - I can't believe I never used this function in the past.

All the best,

Emmanuel
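
For the record, the whole transformation reduces to a couple of lines with
split() (a sketch, assuming a data frame df with columns name and info as in
the thread above; the example data are illustrative):

```r
# Illustrative data: ~6,000 names, each repeated many times (~2.7M rows total)
df <- data.frame(name = rep(paste("P", 1:6000, sep = ""), each = 450),
                 info = sample(c(0, 1), 6000 * 450, replace = TRUE))

# One vector of info values per name, keyed by the factor levels
L <- split(df$info, df$name)

# Position of each protein in the list
index <- names(L)

L[[1]]       # info values of the first protein
index[1:3]   # corresponding names
```

split() groups in a single pass over the data frame, which is why it is so
much faster than calling which() once per level.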


2008/8/13 Erik Iverson [EMAIL PROTECTED]:
 I still don't understand what you are doing.  Can you make a small example
 that shows what you have and what you want?

 Is ?split what you are after?





[R] which(df$name==A) takes ~1 second! (df is very large), but can it be speeded up?

2008-08-12 Thread Emmanuel Levy
Dear All,

I have a large data frame (~2,700,000 lines and 14 columns), and I would like to
extract the information in a particular way illustrated below:


Given a data frame df:

> col1 = sample(c(0,1), 10, rep=T)
> names = factor(c(rep("A",5), rep("B",5)))
> df = data.frame(names, col1)
> df
   names col1
1      A    1
2      A    0
3      A    1
4      A    0
5      A    1
6      B    0
7      B    0
8      B    1
9      B    0
10     B    0

I would like to transform it into the form:

> index = c("A","B")
> col1[[1]] = df$col1[which(df$name=="A")]
> col1[[2]] = df$col1[which(df$name=="B")]

My problem is that the command:  *** which(df$name=="A") ***
takes about 1 second because df is so big.

I was thinking that a level could maybe be accessed instantly but I am not
sure about how to do it.

I would be very grateful for any advice that would allow me to speed this up.

Best wishes,

Emmanuel



[R] Smoothing z-values according to their x, y positions

2008-03-19 Thread Emmanuel Levy
Dear All,

I'm sure this is not the first time this question comes up, but I
couldn't find the keywords that would point me to it - so
apologies if this is a re-post.

Basically I've got thousands of points, each depending on three variables:
x, y, and z.

if I do a plot(x,y, col=z), I get something very messy.

So I would like to smooth the values of z according to the values of
their neighbors (given by x and y).

Could you please point me to the function(s) I should look into
for the smoothing, and possibly for the plotting?

Many thanks for your help,

Emmanuel



Re: [R] Smoothing z-values according to their x, y positions

2008-03-19 Thread Emmanuel Levy
Dear Bert,

Thanks for your reply - I indeed saw a lot of functions using:
help.search("smooth")

The problem is that most don't seem very appropriate for what I'd like, or
they seem extremely complicated (e.g. gam). I am probably missing
something, as I don't see how to use loess: from my limited understanding,
it seems to be for 2D data.
Here I want to smooth the third, z, component.

In the meantime I adapted a function from Greg Warnes, which sort of
does the job (although the smoothing is not very nice).
So I'd be very happy to learn how to do something similar with loess!

x = runif(1000)
y = runif(1000)
z = x+y
hist2d.el(x,y,z, nbins=10)

## -- This function was adapted from Greg Warnes
hist2d.el <- function( x, y=NULL, z=NULL, nbins=200, my.cuts=NULL,
                       same.scale=FALSE, na.rm=TRUE, show=TRUE,
                       col=c("black", heat.colors(12)), ... )
  {
    if(is.null(y))
      {
        if(ncol(x) != 2) stop("If y is omitted, x must be a 2 column matrix")
        y <- x[,2]
        x <- x[,1]
      }

    if(length(nbins)==1)
      nbins <- rep(nbins,2)

    nas <- is.na(x) | is.na(y)

    if(na.rm)
      {
        x <- x[!nas]
        y <- y[!nas]
      }
    else
      stop("missing values not permitted if na.rm=FALSE")

    if(!is.null(my.cuts)){

      nbins <- rep(length(my.cuts)-1, 2)
      x.cuts <- my.cuts
      y.cuts <- my.cuts

    } else {

      if(same.scale)
        {
          x.cuts <- seq( from=min(x,y), to=max(x,y), length=nbins[1]+1)
          y.cuts <- seq( from=min(x,y), to=max(x,y), length=nbins[2]+1)
        }
      else
        {
          x.cuts <- seq( from=min(x), to=max(x), length=nbins[1]+1)
          y.cuts <- seq( from=min(y), to=max(y), length=nbins[2]+1)
        }
    }

    print(nbins)

    index.x <- cut( x, x.cuts, include.lowest=TRUE)
    index.y <- cut( y, y.cuts, include.lowest=TRUE)

    # counts per 2-D bin
    m <- matrix( 0, nrow=nbins[1], ncol=nbins[2],
                 dimnames=list( levels(index.x), levels(index.y) ) )

    # sum of z values per 2-D bin
    m.sum <- matrix( 0, nrow=nbins[1], ncol=nbins[2],
                     dimnames=list( levels(index.x), levels(index.y) ) )

    for( i in 1:length(index.x) )
      m[ index.x[i], index.y[i] ] <- m[ index.x[i], index.y[i] ] + 1

    for( i in 1:length(index.x) )
      m.sum[ index.x[i], index.y[i] ] <- m.sum[ index.x[i], index.y[i] ] + z[i]

    # average z per bin; avoid division by zero in empty bins
    m[which(m==0)] <- 1
    m <- m.sum/m

    xvals <- x.cuts[1:nbins[1]]
    yvals <- y.cuts[1:nbins[2]]

    if(show)
      image( xvals, yvals, m, col=col, ...)

    invisible(list(counts=m, x=xvals, y=yvals))
  }
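
For comparison, the density-independent smoothing asked for above can also be
done with loess itself, treating z as a surface over the two predictors x and
y (a sketch; the span value is an assumption to tune, and the data are
illustrative):

```r
# Smooth z as a function of its (x, y) position with a local regression surface
set.seed(1)
x <- runif(1000)
y <- runif(1000)
z <- x + y + rnorm(1000, sd = 0.1)   # noisy height values

fit <- loess(z ~ x + y, span = 0.3, degree = 2)  # span = neighborhood size
z.smooth <- predict(fit)                          # smoothed z at the data points

# Plot, coloring points by their smoothed z value
plot(x, y, col = heat.colors(12)[cut(z.smooth, 12)], pch = 16)
```

Unlike a density estimate, the fitted surface stays on the scale of z, so
regions with many points and regions with few points are treated alike.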




On 19/03/2008, Bert Gunter [EMAIL PROTECTED] wrote:
 There are dozens of functions within R and contributed packages that do
  this. The spatial statistics packages (see the spatial task view on CRAN)
  are certainly where most are concentrated, but thin plate splines (in the
  splines package, I believe), tensor product splines (in the mgcv package),
  local likelihood (in the locfit package), etc. are also available for
  spatial smoothing.  Perhaps the most basic place to look is the ?loess
  function in the base distribution, which will do exactly what you requested.


  Bert Gunter
  Genentech Nonclinical Statistics



  -Original Message-
  From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED] On
  Behalf Of Emmanuel Levy
  Sent: Wednesday, March 19, 2008 12:42 PM
  To: r-help@r-project.org
  Subject: [R] Smoothing z-values according to their x, y positions

  Dear All,

  I'm sure this is not the first time this question comes up but I
  couldn't find the keywords that would point me out to it - so
  apologies if this is a re-post.

  Basically I've got thousands of points, each depending on three variables:
  x, y, and z.

  if I do a plot(x,y, col=z), I get something very messy.

  So I would like to smooth the values of z according to the values of
  their neighbors (given by x and y).

  Could you please point me out to the function(s) I should look into
  for the smothing, and
  possibly for the plotting?

  Many thanks for your help,

  Emmanuel







Re: [R] Smoothing z-values according to their x, y positions

2008-03-19 Thread Emmanuel Levy
Dear David,

Thanks a lot for pointing out kde2d; I just tried it, but the problem is that
it indeed takes the density of points into account, which I don't want.

For example, if in a region of surface S I've got 10,000 points whose
average height is 0.5, and in another region I've got only ten points whose
average value is also 0.5, I'd like all these points to be transformed to the
same 0.5 value. At the moment, that does not seem to be the case.

For example, the range of the values I give is 0.2 - 3.7, but the
range of the values that are output is 0 - 0.17.

I haven't looked yet at the locfit package as it is not installed, but
I will check it out!
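
The density-independent binned average described above can also be sketched
directly in base R with cut() and tapply(), with no custom function (a sketch;
the data and bin count are illustrative):

```r
# Average z per (x, y) bin, independent of how many points fall in each bin
set.seed(1)
x <- runif(1000); y <- runif(1000); z <- x + y

nbins <- 10
ix <- cut(x, nbins)   # bin membership along x
iy <- cut(y, nbins)   # bin membership along y

# Mean of z in each 2-D bin; bins containing no points become NA
m <- tapply(z, list(ix, iy), mean)

range(m, na.rm = TRUE)  # stays on the scale of z, unlike a density estimate
image(m)
```

A dense region and a sparse region with the same average z end up with the
same bin value, which is exactly the behavior kde2d does not give.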

Thanks for helping!

Emmanuel


On 20/03/2008, David Winsemius [EMAIL PROTECTED] wrote:
 Emmanuel Levy [EMAIL PROTECTED] wrote in
  news:[EMAIL PROTECTED]:




 You could consider combining the kde2d() function in the MASS package
   (within the VR bundle) with contour(). The example I
   followed came from MASS 2ed, in chapter 5's section on density estimation.
   It is probably easiest to just copy the help example:

  library(MASS)
  attach(geyser)
  f1 <- kde2d(duration[-272], duration[-1],
              h = rep(1.5, 2), n = 50, lims = c(0.5, 6, 0.5, 6))
  contour(f1, xlab = "previous duration",
          ylab = "duration", levels = c(0.05, 0.1, 0.2, 0.4))

  I have also had success with Loader's locfit package.

  --

 David Winsemius



