Re: [R] Trouble getting rms::survplot(..., n.risk=TRUE) to behave properly

2016-06-02 Thread Steve Lianoglou
Ah!

Sorry ... should have dug deeper into the examples section to notice that.

Thank you for the quick reply,
-steve
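For anyone finding this thread later: the fix Frank describes is to model the grouping variable with strat() rather than as a plain covariate, so survplot() can compute per-stratum at-risk counts. A minimal sketch reusing the simulated data from the quoted example below (same objects S, age, sex):

```r
## sex as a stratification factor instead of a covariate;
## survplot() then reports numbers at risk per stratum
f <- cph(S ~ rcs(age, 4) + strat(sex), x=TRUE, y=TRUE)
survplot(f, sex, n.risk=TRUE)
```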


On Thu, Jun 2, 2016 at 8:59 AM, Frank Harrell  wrote:
> This happens when you have no strat variables in the model.
>
>
> --
> Frank E Harrell Jr  Professor and Chairman  School of Medicine
>
> Department of *Biostatistics*  *Vanderbilt University*
>
> On Thu, Jun 2, 2016 at 10:55 AM, Steve Lianoglou 
> wrote:
>
>> Hello folks,
>>
>> I'm trying to plot the number of patients at-risk by setting the
>> `n.risk` parameter to `TRUE` in the rms::survplot function, however it
>> looks as if the numbers presented in the rows for each category are
>> just summing up the total number of patients at risk in all groups for
>> each timepoint -- which is to say that the numbers are equal in each
>> category down the rows, and they don't seem to be the numbers specific
>> to each group.
>>
>> You can reproduce the observed behavior by simply running the code in
>> the Examples section of ?survplot, which I'll paste below for
>> convenience.
>>
>> Is the error between the chair and the keyboard, here, or is this perhaps
>> a bug?
>>
>> === code ===
>> library(rms)
>> n <- 1000
>> set.seed(731)
>> age <- 50 + 12*rnorm(n)
>> label(age) <- "Age"
>> sex <- factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4)))
>> cens <- 15*runif(n)
>> h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
>> dt <- -log(runif(n))/h
>> label(dt) <- 'Follow-up Time'
>> e <- ifelse(dt <= cens,1,0)
>> dt <- pmin(dt, cens)
>> units(dt) <- "Year"
>> dd <- datadist(age, sex)
>> options(datadist='dd')
>> S <- Surv(dt,e)
>>
>> f <- cph(S ~ rcs(age,4) + sex, x=TRUE, y=TRUE)
>> survplot(f, sex, n.risk=TRUE)
>> ===
>>
>> I'm using the latest version of rms (4.5-0) running on R 3.3.0-patched.
>>
>> === Output of sessionInfo() ===
>> R version 3.3.0 Patched (2016-05-26 r70671)
>> Platform: x86_64-apple-darwin13.4.0 (64-bit)
>> Running under: OS X 10.11.4 (El Capitan)
>>
>> locale:
>> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
>>
>> attached base packages:
>> [1] stats graphics  grDevices utils datasets  methods   base
>>
>> other attached packages:
>> [1] rms_4.5-0   SparseM_1.7 Hmisc_3.17-4ggplot2_2.1.0
>> [5] Formula_1.2-1   survival_2.39-4 lattice_0.20-33
>>
>> loaded via a namespace (and not attached):
>>  [1] Rcpp_0.12.5 cluster_2.0.4   MASS_7.3-45
>>  [4] splines_3.3.0   munsell_0.4.3   colorspace_1.2-6
>>  [7] multcomp_1.4-5  plyr_1.8.3  nnet_7.3-12
>> [10] grid_3.3.0  data.table_1.9.6gtable_0.2.0
>> [13] nlme_3.1-128quantreg_5.24   TH.data_1.0-7
>> [16] latticeExtra_0.6-28 MatrixModels_0.4-1  polspline_1.1.12
>> [19] Matrix_1.2-6gridExtra_2.2.1 RColorBrewer_1.1-2
>> [22] codetools_0.2-14acepack_1.3-3.3 rpart_4.1-10
>> [25] sandwich_2.3-4  scales_0.4.0mvtnorm_1.0-5
>> [28] foreign_0.8-66  chron_2.3-47zoo_1.7-13
>> ===
>>
>>
>> Thanks,
>> -steve
>>
>>
>> --
>> Steve Lianoglou
>> Computational Biologist
>> Genentech
>>
>
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Steve Lianoglou
Computational Biologist
Genentech



[R] Trouble getting rms::survplot(..., n.risk=TRUE) to behave properly

2016-06-02 Thread Steve Lianoglou
Hello folks,

I'm trying to plot the number of patients at-risk by setting the
`n.risk` parameter to `TRUE` in the rms::survplot function, however it
looks as if the numbers presented in the rows for each category are
just summing up the total number of patients at risk in all groups for
each timepoint -- which is to say that the numbers are equal in each
category down the rows, and they don't seem to be the numbers specific
to each group.

You can reproduce the observed behavior by simply running the code in
the Examples section of ?survplot, which I'll paste below for
convenience.

Is the error between the chair and the keyboard, here, or is this perhaps a bug?

=== code ===
library(rms)
n <- 1000
set.seed(731)
age <- 50 + 12*rnorm(n)
label(age) <- "Age"
sex <- factor(sample(c('Male','Female'), n, rep=TRUE, prob=c(.6, .4)))
cens <- 15*runif(n)
h <- .02*exp(.04*(age-50)+.8*(sex=='Female'))
dt <- -log(runif(n))/h
label(dt) <- 'Follow-up Time'
e <- ifelse(dt <= cens,1,0)
dt <- pmin(dt, cens)
units(dt) <- "Year"
dd <- datadist(age, sex)
options(datadist='dd')
S <- Surv(dt,e)

f <- cph(S ~ rcs(age,4) + sex, x=TRUE, y=TRUE)
survplot(f, sex, n.risk=TRUE)
===

I'm using the latest version of rms (4.5-0) running on R 3.3.0-patched.

=== Output of sessionInfo() ===
R version 3.3.0 Patched (2016-05-26 r70671)
Platform: x86_64-apple-darwin13.4.0 (64-bit)
Running under: OS X 10.11.4 (El Capitan)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats graphics  grDevices utils datasets  methods   base

other attached packages:
[1] rms_4.5-0   SparseM_1.7 Hmisc_3.17-4ggplot2_2.1.0
[5] Formula_1.2-1   survival_2.39-4 lattice_0.20-33

loaded via a namespace (and not attached):
 [1] Rcpp_0.12.5 cluster_2.0.4   MASS_7.3-45
 [4] splines_3.3.0   munsell_0.4.3   colorspace_1.2-6
 [7] multcomp_1.4-5  plyr_1.8.3  nnet_7.3-12
[10] grid_3.3.0  data.table_1.9.6gtable_0.2.0
[13] nlme_3.1-128quantreg_5.24   TH.data_1.0-7
[16] latticeExtra_0.6-28 MatrixModels_0.4-1  polspline_1.1.12
[19] Matrix_1.2-6gridExtra_2.2.1 RColorBrewer_1.1-2
[22] codetools_0.2-14acepack_1.3-3.3 rpart_4.1-10
[25] sandwich_2.3-4  scales_0.4.0mvtnorm_1.0-5
[28] foreign_0.8-66  chron_2.3-47zoo_1.7-13
===


Thanks,
-steve


-- 
Steve Lianoglou
Computational Biologist
Genentech



Re: [R] Scraping HTML using R

2015-02-05 Thread Steve Lianoglou
You want to take a look at rvest:

https://github.com/hadley/rvest
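A minimal sketch of the rvest pattern, using the review URL from the question. The CSS selector here is a placeholder, not the real one -- inspect the page with your browser's developer tools to find the element that wraps each review comment:

```r
library(rvest)

url <- paste0("http://www.webmd.com/drugs/drugreview-20536-Nexium+oral.aspx",
              "?drugid=20536&drugname=Nexium+oral")
page <- read_html(url)

## ".userPost" is hypothetical -- substitute the selector that actually
## matches one review comment on the page
comments <- html_text(html_nodes(page, ".userPost"))
head(comments)
```

Paging through the "next" links is the same pattern in a loop: extract the next-page href with html_nodes()/html_attr() and read_html() it until none remains.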

On Thu, Feb 5, 2015 at 2:36 PM, Madhuri Maddipatla
 wrote:
> Dear R experts,
>
> My requirement for web scraping in R goes like this.
>
> *Step 1* - All the medical condition from from A-Z are listed in the link
> below.
>
> http://www.webmd.com/drugs/index-drugs.aspx?show=conditions
>
> Choose the first condition say Acid Reflux(GERD-...)
>
> *Step 2 *- It lands on the this page
>
> http://www.webmd.com/drugs/condition-1999-Acid%20Reflux%20%20GERD-Gastroesophageal%20Reflux%20Disease%20.aspx?diseaseid=1999&diseasename=Acid+Reflux+(GERD-Gastroesophageal+Reflux+Disease)&source=3
>
> with a list of drugs.
>
> Choose the column user reviews of the first drug say "Nexium Oral"
>
> *Step 3*: Now it lands on the webpage
>
> http://www.webmd.com/drugs/drugreview-20536-Nexium+oral.aspx?drugid=20536&drugname=Nexium+oral
>
> with a list of reviews.
> I would like to scrape review information into a tabular format by scraping
> the html.
> For instance, i would like to fetch the full comment of each review as a
> column in a table.
> Also it should automatically go to next page and fetch the full comments of
> all reviewers.
>
>
> Please help me in this endeavor and thanks a lot in advance for reading my
> mail and expecting response with your experience and expertise.
>
> Also please suggest me the possibility around my stepwise plan and any
> advice you would like to give me along with the solution.
>
> High Regards,
> *-*
> *Madhuri Maddipatla*
> *-*
>
>



-- 
Steve Lianoglou
Computational Biologist
Genentech



Re: [R] Fastest way to calculate quantile in large data.table

2015-02-05 Thread Steve Lianoglou
Not sure if there is a question in here somewhere?

But if I can point out an observation: if you are doing summary
calculations across the rows like this, my guess is that using a
data.table (data.frame) structure for that will really bite you,
because this operation on a data.table/data.frame is expensive:

  x <- dt[i,]

However it's much faster with a matrix. It doesn't seem like you're
doing anything with this dataset that takes advantage of data.table's
quick grouping/indexing mojo, so why store it in a data.table at all?

Witness:

R> library(data.table)
R> m <- matrix(rnorm(1e6), nrow=10)
R> d <- as.data.table(m)
R> idxs <- sample(1:nrow(m), 500, replace=TRUE)

R> system.time(for (i in idxs) x <- m[i,])
   user  system elapsed
  0.497   0.169   0.670

R> system.time(for (i in idxs) x <- d[i,])
## I killed it after waiting for 14 seconds

-steve

On Thu, Feb 5, 2015 at 11:48 AM, Camilo Mora  wrote:
> In total I found 8 different ways to calculate quantiles in a very large
> data.table. I share their performance below for future reference. Tests 1, 7
> and 8 were the fastest I found.
>
> Best,
>
> Camilo
>
> library(data.table)
> v <- data.table(x=runif(1), x2=runif(1), x3=runif(1), x4=runif(1))
>
> #fastest
> Sys.time()->StartTEST1
> t(v[, apply(v,1,quantile,probs =c(.1,.9,.5),na.rm=TRUE)] )
> Sys.time()->EndTEST1
>
> Sys.time()->StartTEST2
> v[, quantile(.SD,probs =c(.1,.9,.5)), by = 1:nrow(v)]
> Sys.time()->EndTEST2
>
> Sys.time()->StartTEST3
> v[, c("L","H","M"):=quantile(.SD,probs =c(.1,.9,.5)), by = 1:nrow(v)]
> Sys.time()->EndTEST3
> v
> v[, c("L","H","M"):=NULL]
>
> v[,Names:=rownames(v)]
> setkey(v,Names)
>
> Sys.time()->StartTEST4
> v[, c("L","H","M"):=quantile(.SD,probs =c(.1,.9,.5)), by = Names]
> Sys.time()->EndTEST4
> v
> v[, c("L","H","M"):=NULL]
>
>
> Sys.time()->StartTEST5
> v[,  as.list(quantile(.SD,c(.1,.90,.5),na.rm=TRUE)), by=Names]
> Sys.time()->EndTEST5
>
>
> Sys.time()->StartTEST6
> v[,  as.list(quantile(.SD,c(.1,.90,.5),na.rm=TRUE)), by=Names,.SDcols=1:4]
> Sys.time()->EndTEST6
>
>
> Sys.time()->StartTEST7
> v[, as.list(quantile(c(x, x2, x3, x4), c(.1,.90,.5), na.rm=TRUE)), by=Names]
> Sys.time()->EndTEST7
>
>
> # melting the database and doing quantile by group. This is the second
> # fastest, which is ironic given that the database has to be melted first
> library(reshape2)
> Sys.time()->StartTEST8
> vs<-melt(v)
> vs[,  as.list(quantile(value,c(.1,.90,.5),na.rm=TRUE)), by=Names]
> Sys.time()->EndTEST8
>
>
> EndTEST1-StartTEST1
> EndTEST2-StartTEST2
> EndTEST3-StartTEST3
> EndTEST4-StartTEST4
> EndTEST5-StartTEST5
> EndTEST6-StartTEST6
> EndTEST7-StartTEST7
> EndTEST8-StartTEST8
>
>



-- 
Steve Lianoglou
Computational Biologist
Genentech



Re: [R] loops in R

2014-11-05 Thread Steve Lianoglou
While you should definitely read the tutorial that Don is referring
to, I'd recommend you take a different approach and use more R
idiomatic code here.

In base R, this could be addressed with few approaches. Look for help
on the following functions:

  * tapply
  * by
  * aggregate

I'd rather recommend you also learn about some of the packages that
are better suited to deal with computing over data.frames,
particularly:

  * dplyr
  * data.table

You can certainly achieve what you want with for loops, but you'll
likely find that going this route will be more rewarding in the long
run.
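For instance, assuming the data frame is called `d` with the `Population` and `R` columns shown in the question, the base R route looks like:

```r
## mean and variance of R for each of the 20 populations, no explicit loop
means <- tapply(d$R, d$Population, mean)
vars  <- tapply(d$R, d$Population, var)

## or both at once, returned as a single object
aggregate(R ~ Population, data = d,
          FUN = function(x) c(mean = mean(x), var = var(x)))
```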

HTH,
-steve


On Wed, Nov 5, 2014 at 10:02 AM, Don McKenzie  wrote:
> Have you read the tutorial that comes with the R distribution?  This is a 
> very basic database calculation that you will
> encounter (or some slight variation of it) over and over.  The solution is a 
> few lines of code, and someone may write it
> out for you, but if no one does
>
> You have 20 populations, so you will have 20 iterations in your for loop. For 
> each one, you will need a unique identifier that points to
> the rows of "R" associated with that population. You'll calculate a mean and 
> variance 20 times, and will need a data object to store
> those calculations.
>
> Look in the tutorial for syntax for identifying subsets of your data frame.
>
>> On Nov 5, 2014, at 5:41 AM, Noha Osman  wrote:
>>
>> Hi Folks
>>
>> I am a new user of R and I have a question. Hopefully someone can help me
>> with this issue
>>
>>
>> I have the dataset as follows
>>
>> Sample  Population  Species  Tissue  R  G  B
>> 1 Bari1_062-1  Bari1 ret   seed  94.52303  80.70346 67.91760
>> 2 Bari1_062-2  Bari1 ret   seed  98.27683  82.68690 68.55485
>> 3 Bari1_062-3  Bari1 ret   seed 100.53170  86.56411 73.27528
>> 4 Bari1_062-4  Bari1 ret   seed  96.65940  84.09197 72.05974
>> 5 Bari1_062-5  Bari1 ret   seed 117.62474  98.49354 84.65656
>> 6 Bari1_063-1  Bari1 ret   seed 144.39547 113.76170 99.95633
>>
>> and I have 20 populations as follows
>>
>> [1] Bari1  Bari2  Bari3  Besev  Cermik Cudi   Derici 
>> Destek Egil
>> [10] GunasanKalkan Karabace   Kayatepe   Kesentas   OrtancaOyali 
>>  Cultivated Sarikaya
>> [19] Savur  Sirnak
>>
>> I need to calculate the mean and variance of each population, using column [R]
>> and a for-loop
>>
>>
>> Thanks
>>
>>
>
> Don McKenzie
> Research Ecologist
> Pacific Wildland Fire Sciences Lab
> US Forest Service
>
> Affiliate Faculty
> School of Environmental and Forest Sciences
> University of Washington
> d...@uw.edu
>



-- 
Steve Lianoglou
Computational Biologist
Genentech



Re: [R] NA's introduced by coercion

2014-08-26 Thread Steve Lianoglou

Hi Madhvi,

First, please use "reply-all" when responding to emails from this list 
so that others can help (and benefit from) the discussion.


Comment down below:

On 26 Aug 2014, at 22:15, madhvi.gupta wrote:


On 08/27/2014 10:42 AM, Steve Lianoglou wrote:

Hi,

On Tue, Aug 26, 2014 at 9:56 PM, madhvi.gupta 
 wrote:

Hi,

I am applying the function as.numeric to a vector having many values as 
NA and it

is giving :
Warning message:
NAs introduced by coercion

Can anyone help me to know how to remove this warning and sort it 
out?

Let's say that the vector you are calling `as.numeric` over is called
`x`. If you could show us the output of the following command:

R> head(x[is.na(as.numeric(x))])

You'll see why you are getting the warning.

How you choose to sort it out probably depends on what you are trying
to do with your data after you convert it to a "numeric"

-steve

Hi,
I am having this error because the vector contains NA values, but I want 
to convert that vector to numeric


I don't quite follow what the problem is, then ... what is the end 
result that you want to happen?


When you convert the vector to a numeric, the NA's that were in it 
originally, will remain as NAs (but they will be of a 'numeric' type).


What would you like to do with the NA values? Do you just want to keep 
them, but want to silence the warning?


If so, you can do:

R> suppressWarnings(y <- as.numeric(x))
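To make that concrete (a toy vector invented for illustration):

```r
x <- c("1", "2.5", "oops", NA)
as.numeric(x)   # warns: NAs introduced by coercion ("oops" fails to parse);
                # the original NA converts silently to a numeric NA
y <- suppressWarnings(as.numeric(x))  # same values, warning silenced
```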

-steve

--
Steve Lianoglou
Computational Biologist
Genentech



Re: [R] NA's introduced by coercion

2014-08-26 Thread Steve Lianoglou
Hi,

On Tue, Aug 26, 2014 at 9:56 PM, madhvi.gupta  wrote:
> Hi,
>
> I am applying the function as.numeric to a vector having many values as NA and it
> is giving:
> Warning message:
> NAs introduced by coercion
>
> Can anyone help me to know how to remove this warning and sort it out?

Let's say that the vector you are calling `as.numeric` over is called
`x`. If you could show us the output of the following command:

R> head(x[is.na(as.numeric(x))])

You'll see why you are getting the warning.

How you choose to sort it out probably depends on what you are trying
to do with your data after you convert it to a "numeric"

-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech



Re: [R] R, RStudio, and a server for my iPad.

2014-04-11 Thread Steve Lianoglou
I see. You might indeed be over thinking it ... since everything is
installable by the package manager for the flavor of linux you choose,
it should actually be relatively straightforward (although if this is
your first time setting up a linux box, it can take a few tries).

I'll assume you are using an ubuntu server. There are two things you
can do to get one

(1) Get yourself a "spare" CPU. Blow out the OS on it, and setup
"vanilla" ubuntu; or
(2) Rent an ubuntu server from somewhere like linode.com

Then:

(a) Install R on it, by following the instructions here:
http://cran.rstudio.com/bin/linux/ubuntu/README.html

Which boils down to basically:
(i) adding the appropriate line to your `/etc/apt/sources.list` given
the version of ubuntu you installed (saucy, quantal, etc) ; and
(ii) Running the following from the terminal:

   sudo apt-get update
   sudo apt-get install r-base
   sudo apt-get install r-base-dev

(b) Install RStudio Server. Start from the "Download and Install"
section here (since you already installed R):
https://www.rstudio.com/ide/download/server

Then go here:
https://www.rstudio.com/ide/docs/server/getting_started

If you've set this up on your own machine (option (1) from above) then
you have to figure out how to connect it to the internet if you want
to use it from anywhere outside of your LAN.

It might seem like a lot of work, but as mentioned above -- since
these packages are available by the system's package manager, all of
the dependencies (like apache (if you need it)) will be installed for
you if you follow the (presumably correct) installation instructions
from RStudio.

=== An even easier setup ===

Just realized that there is a 3rd route. You can save yourself a lot of
time by using an already built amazon machine instance that the
bioconductor folks have setup for you. Look at the instructions here:

http://www.bioconductor.org/help/bioconductor-cloud-ami/

If you follow along (you'll need an amazon account), you can rent a
machine and have it loaded w/ RStudio Server as well as some
bioconductor-specific libraries to get you started (by using the
bioconductor AMI). Once you have that up, you can connect to the
server running on Amazon's cloud via your web browser, first on your CPU
(to make sure it all worked) and then on an iOS device.

This would be a very easy way for you to see if taking one of the 2
avenues of setting up RStudio Server yourself would be worth the pain.

HTH,
-steve

On Fri, Apr 11, 2014 at 2:34 PM, John Sorkin
 wrote:
> Steve,
> Thank you for your help.
> I have seen the material you have sent me to, but do not fully
> understand it. Do I have to build a linux server first? If so does it
> have to run Apache or some other web server? Is RStudio server run under
> Apache, if so how? On the other hand, do I simply need a box running a
> flavor of Linux (without Apache)., and then simply download RStudio
> server? Perhaps I am over thinking this . . .
> Thanks,
> John
>
>
> John David Sorkin M.D., Ph.D.
> Professor of Medicine
> Chief, Biostatistics and Informatics
> University of Maryland School of Medicine Division of Gerontology and
> Geriatric Medicine
> Baltimore VA Medical Center
> 10 North Greene Street
> GRECC (BT/18/GR)
> Baltimore, MD 21201-1524
> (Phone) 410-605-7119
> (Fax) 410-605-7913 (Please call phone number above prior to faxing)
>>>> Steve Lianoglou  4/11/2014 5:23 PM >>>
> Hi,
>
> On Fri, Apr 11, 2014 at 2:00 PM, John Sorkin
>  wrote:
>> I bemoan the fact that I can not run R or Rstudio on my iPad.
>
> I feel like this is something I'm not missing much, at all, but
> different strokes ... :-)
>
>> A possible work around would be to set up a server (probably under
> Linux), and get the server to present a web page that to would allow me
> to run R on the server. I have searched the web for a clear, simple
> answer on how to do this but can not find one. There are answers, but
> not for someone who has not built a Linux server. Can someone provide
> either a reference to, or a short explanation of how I can build the
> server, get R or RStudio to run on it, and get the server and its R or
> RStudio program available to me on my iPad? I can probably find guidance
> on how to build an Apache server under Linux.
>
> You might try to follow the instructions to get RStudio Server
> running:
>
> https://www.rstudio.com/ide/docs/server/getting_started
>
> It seems like they have packages built for many popular linux distros:
>
> https://www.rstudio.com/ide/download/server
>
> Once your linux server is set up, you should be able to access the
> server with any client via a web browser, although I have no idea if
> any of the iOS browsers are up to the task of dealing with the RStudio
> web interface, but you can try and find out.

Re: [R] R, RStudio, and a server for my iPad.

2014-04-11 Thread Steve Lianoglou
Hi,

On Fri, Apr 11, 2014 at 2:00 PM, John Sorkin
 wrote:
> I bemoan the fact that I can not run R or Rstudio on my iPad.

I feel like this is something I'm not missing much, at all, but
different strokes ... :-)

> A possible work around would be to set up a server (probably under Linux), 
> and get the server to present a web page that to would allow me to run R on 
> the server. I have searched the web for a clear, simple answer on how to do 
> this but can not find one. There are answers, but not for someone who has not 
> built a Linux server. Can someone provide either a reference to, or a short 
> explanation of how I can build the server, get R or RStudio to run on it, and 
> get the server and its R or RStudio program available to me on my iPad? I can 
> probably find guidance on how to build an Apache server under Linux.

You might try to follow the instructions to get RStudio Server running:

https://www.rstudio.com/ide/docs/server/getting_started

It seems like they have packages built for many popular linux distros:

https://www.rstudio.com/ide/download/server

Once your linux server is set up, you should be able to access the
server with any client via a web browser, although I have no idea if
any of the iOS browsers are up to the task of dealing with the RStudio
web interface, but you can try and find out. RStudio's support forum
is more likely to be helpful here, for instance:

https://support.rstudio.com/hc/en-us/search?utf8=✓&query=ios&commit=Search

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech



Re: [R] How to make a proper use of blocking in limma using voom

2014-04-07 Thread Steve Lianoglou
Hi,

This is a bioconductor-related question so please post any follow up
questions on that mailing list. You can sign up to that list here:

https://stat.ethz.ch/mailman/listinfo/bioconductor

Comments in line:


On Sun, Apr 6, 2014 at 10:41 PM, Catalina Aguilar Hurtado
 wrote:
> Hi all,
>
> I have RNAseq data to analyse where I have a control and one treatment
> for different individuals. I need to block the effects of the individual,
> but I am having trouble getting the data that I need. I am using
> voom because my data is very heterogeneous and voom seems to do a good job
> normalising my reads.
>
> I am having the following issues:
>
>1.
>
>I want to get the differentially expressed genes (DEGs) of my treatment
>not of my control. I don't understand after the eBayes analysis why I get
>the coefficients for both. I have tried makeContrasts(TreatvsCont =
>c2 - co, levels = design) to subtract the control effect, but then I get 0
>DEGs.
>2.
>
>I am not sure when to include the 0 (null model) in the model formula, I
>have read examples for both types of models.
>
> These are my targets, with the column names of my counts, individual and
> condition
>
> >targets
>    Individual condition
> A1          1        co
> A2          3        co
> A4          4        co
> A5          5        co
> E1          1        c2
> E2          2        c2
> E3          3        c2
> E4          4        c2
> E5          5        c2
>
> This is the code I have been trying:
>
>>co2=as.matrix(read.table("2014_04_02_1h_PB.csv",header=T, sep=",",
> row.names=1))
>
>>nf = calcNormFactors (co2)
>
>>targets= read.table ("targets.csv", header = T, sep=",",row.names=1)
>
>>treat <- factor (targets$condition, levels= c("co", "c2"))
>
>>design <- model.matrix(~0+treat)
>
>>colnames (design) <- levels (treat)
>
>>y <- voom(co2,design,lib.size=colSums(co2)*nf)
>
>>corfit <- duplicateCorrelation(y,design,block=targets$Individual)
>
>>fit <-
> lmFit(y,design,block=targets$Individual,correlation=corfit$consensus)
>
>>fit2<- eBayes (fit)
>
>>results_trt <- topTable (fit2, coef="c2", n=nrow (y), sort.by="none")
>
> From this I get 18,000 genes with adj.P.Val < 0.01 out of the 22,000 genes
> that I have in total, which makes no sense..

This is because you defined your model matrix to have no intercept
term (that's what the 0 in `~ 0 + treat` is doing).

You will have to test for a particular contrast (not just a coefficient
in your design) in order to get what you are after. Sections 9.7
(Multi-level Experiments) and 16.3 (Comparing Mammary Progenitor Cell
Populations with Illumina BeadChips) in the limma user's guide may be
the most useful for you to follow along at this point:

http://www.bioconductor.org/packages/release/bioc/vignettes/limma/inst/doc/usersguide.pdf

(look for the `makeContrasts` call)

After you call `fit <- lmFit( ... )` in your code above you might do
something like:

R> cm <- makeContrasts(co2Vsco=treatco2 - treatco, levels=design)
R> fit2 <- eBayes(contrasts.fit(fit, cm))
R> res <- topTable(fit2, coef='co2Vsco')

Note that the `treatco2` and `treatco` are only correct if these are
the column names of your design matrix -- substitute with the
appropriate names for your example, if necessary.

> Thanks in advance for the help.

Noodle on that a bit and if you still have questions, please do post a
follow up question on the bioconductor list.

Btw, to help make your question more interpretable, since we don't
have your targets file, I think it would be easier for us if you
copy/paste the output of `dput(targets)` and `dput(design)` after you
create those objects in a follow-up email if it's necessary to write
one.

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech



Re: [R] Trying to install package for LMER, getting a ton of errors

2014-02-03 Thread Steve Lianoglou
> installation of package 'Rcpp' had non-zero exit status
> 5: running command '"C:/PROGRA~1/R/R-30~1.2/bin/i386/R" CMD INSTALL -l
> "C:\Program Files\R\R-3.0.2\library"
> C:\Users\Master\AppData\Local\Temp\RtmpyChM99/downloaded_packages/minqa_1.2.2.tar.gz'
> had status 1
> 6: In install.packages(type = "source") :
>   installation of package 'minqa' had non-zero exit status
> 7: running command '"C:/PROGRA~1/R/R-30~1.2/bin/i386/R" CMD INSTALL -l
> "C:\Program Files\R\R-3.0.2\library"
> C:\Users\Master\AppData\Local\Temp\RtmpyChM99/downloaded_packages/RcppEigen_0.3.2.0.2.tar.gz'
> had status 1
> 8: In install.packages(type = "source") :
>   installation of package 'RcppEigen' had non-zero exit status
> 9: running command '"C:/PROGRA~1/R/R-30~1.2/bin/i386/R" CMD INSTALL -l
> "C:\Program Files\R\R-3.0.2\library"
> C:\Users\Master\AppData\Local\Temp\RtmpyChM99/downloaded_packages/lme4_1.0-6.tar.gz'
> had status 1
> 10: In install.packages(type = "source") :
>   installation of package 'lme4' had non-zero exit status
>
>

-- 
Steve Lianoglou
Computational Biologist
Genentech



Re: [R] a better method than a long expression with many OR clauses

2013-12-17 Thread Steve Lianoglou
Hi Chris,

(extra compelled to answer a Q from my undergrad alma mater :-)

see below:

On Tue, Dec 17, 2013 at 11:13 AM, Christopher W Ryan
 wrote:
> dd <- data.frame(longVariableName1=sample(1:4, 10, replace=TRUE),
> longVariableName2=sample(1:4, 10, replace=TRUE))
> dd
> # define who is a case and who is not
> transform(dd, case=(longVariableName1==3 | longVariableName2==3))
>
> But in reality I have 9 of those longVariableName variables,
> all of this pattern: alphaCauseX, where X is an integer 1:9.
> For any given observation, if any of them == 3, then case=TRUE
> Is there a shorter or more elegant way of doing this than
> typing out that long string of 9 OR clauses?
>
> I read about any(), but couldn't quite make that do what I want. Maybe
> I was using it wrong.

There are many ways to approach this; here is but one. The general idea is to:

(1) Create a logical matrix from the appropriate columns in `dd`
(2) Check to see which rows have any vals == 3

Let's say columns 3:11 have the variables you want to check. The code
below will return a vector as long as the number of rows in `dd`, which
is TRUE wherever any value in the row == 3:

R> is.case <- rowSums(as.matrix(dd[, 3:11]) == 3) > 0

Unwind that one-liner into its individual parts to see who is doing
what there.
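Since you mentioned any(): the same test can be written with apply(), which reads closer to the English description (check each row for any 3), at the cost of some speed on large data:

```r
## TRUE for each row of dd in which any of columns 3:11 equals 3
is.case <- apply(dd[, 3:11] == 3, 1, any)
```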

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech



Re: [R] Thoughts for faster indexing

2013-11-26 Thread Steve Lianoglou
Hi,

On Tue, Nov 26, 2013 at 11:41 AM, Noah Silverman  wrote:
> All interesting suggestions.
>
> I guess a better example of the code would have been a good idea.  So,
> I'll put a relevant snippet here.
>
> Rows are cases.  There are multiple cases for each ID, marked with a
> date.  I'm trying to calculate a time recency weighted score for a
> covariate, added as a new column in the data.frame.
>
> So, for each row, I need to see which ID it belongs to, then get all the
> scores prior to this row's date, then compute the recency weighted summary.
>
> Right now, I do this in an obvious, but very very slow way.
>
> Here is my slow code:
> ==
> for(i in 1:nrow(d)){
> for(j in which( d$id == d$id[i] & d$date[j] < d$date[i]) ){
> days_since = as.numeric( d$date[i] - d$date[j] )
> w <- exp( -days_since/decay )
> temp <- temp + w * as.numeric(d[j,'score'])
> wTemp <- wTemp + w
> }
>
> temp <- temp / wTemp
> d$newScore[i,] <- temp
> }
> ==
>
> One immediate thought was to turn the "date" into an integer.  That
> should save a few cycles of date math.
>
> I need to do this process for a bunch of scores.  A grid search over
> different time decay levels might be nice.  So any speedup to this
> routine will save me a ton of time.
>
> Ideas?

A few quick ones.

You had said you tried data.table and found it to be slow still -- my
guess is that you might not have used it correctly, so here is a rough
sketch of what to do.

Let's assume that your date is converted to some integer -- I will
leave that exercise to you :-) -- but it seems like you just want to
calculate number of (whole) days since an event that you have a record
for, so this should be (in principle) easy to do (if you really need
full power of "date math", data.table supports that as well).

Also you never "reset" your `temp` variable, so it looks like you are
carrying over `temp` from one `id` group to the next (and, while I
have no knowledge of your problem, I would imagine this is not what
you want to do).

Anyway some rough ideas to get you started:

R> d <- as.data.table(d)
R> setkeyv(d, c('id', 'date'))

Now records within each id are ordered by date, from first to last.

The specifics of your decay score escape me a bit, eg. what is the
value of "days_since" for the first record of each id? I'll let you
figure that out, but in the non-edge cases, it looks like you can just
calculate "days since" by subtracting the current date from the date
recorded in the record before it. (Note that `.I` is special
data.table variable for the row number of a given record in the
original data.table):

d[, newScore := {
  ## handle edge case for first record w/in each `id` group
  days_since <- date - d$date[.I -1]
  w <- exp(-days_since / decay)
  ## ...
  ## Some other stuff you are doing here with temp which I can't
  ## quite follow ... then multiply the 'score' column
  ## for the given row by your correctly calculated weight `w`
  ## for that row (whatever it might be).
  w * score
}, by='id']
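For completeness, one way to fill in the elided parts is below -- a runnable toy version with invented data, a made-up `decay`, and dates already converted to integer day numbers. It keeps the weighted-mean logic of the original loop but works within each `id` group:

```r
library(data.table)

set.seed(1)
decay <- 30                             # made-up decay constant
d <- data.table(                        # invented example data; 'date' is
  id    = rep(c("a", "b"), each = 5),   # already an integer day number
  date  = rep(c(1, 5, 12, 20, 33), 2),
  score = runif(10)
)
setkeyv(d, c("id", "date"))

# For each row: exp-decay-weighted mean of all *prior* scores in the same id.
d[, newScore := sapply(seq_len(.N), function(i) {
  if (i == 1L) return(NA_real_)         # edge case: no prior records
  prior <- seq_len(i - 1L)
  w <- exp(-(date[i] - date[prior]) / decay)
  sum(w * score[prior]) / sum(w)
}), by = id]
```

The first row of each group gets NA, which is one reasonable answer to the edge case; the weighted mean resets per id, avoiding the carry-over-`temp` bug.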

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech



Re: [R] Should there be an R-beginners list?

2013-11-24 Thread Steve Lianoglou
Hi,

On Sun, Nov 24, 2013 at 1:36 PM, Duncan Murdoch
 wrote:
> On 13-11-24 4:13 PM, Yihui Xie wrote:
>>
>> I do not see how it can be illegal to download and duplicate the
>> posts, since all the content is licensed under CC BY-SA. I might have
>> missed something there: http://stackexchange.com/legal If that is
>> really the case, I think I will have to reconsider if I should use it
>> any more.
>
>
> I'm not a lawyer, but I see claims restricting users to "personal use".

I guess one would have to clarify what is and isn't possible with the
data. I'm guessing they are trying to scare people/entities away from
trawling SO and repackaging it into another info/knowledgebase
offering.

That having been said, there is an open-source StackOverflow clone,
developed by the folks at biostars.org:

https://github.com/ialbert/biostar-central

Someone would just need to host it, though.

Given SO's critical mass, though, I think it's hard to argue against
simply using that.

-steve

-- 
Steve Lianoglou
Computational Biologist
Genentech



Re: [R] FW: Library update from version

2013-09-30 Thread Steve Lianoglou
Hi,

On Mon, Sep 30, 2013 at 12:44 PM, Ista Zahn  wrote:
> On Mon, Sep 30, 2013 at 2:49 PM, Steve Lianoglou
>  wrote:
>> Hi,
>>
>> On Mon, Sep 30, 2013 at 11:12 AM, Cem Girit  wrote:
>>> Hello,
>>>
>>>
>>>
>>> I recently installed version 3.0.1 of R on to a computer. I
>>> have a working installation for a Statconn application using R version
>>> 2.15.0 on another computer. I have many libraries under this old
>>> installation. Can I just copy them into the new library from the old, or do
>>> I install each one of them under the new R?
>>
>> No, you shouldn't do that.
>
> Really? Why not?

Because it has been my experience that people *just* do that (i.e., they
*just* copy the libraries, as the OP asked).

But as you correctly point out:

> Note that the Windows upgrade FAQ
> (http://cran.r-project.org/bin/windows/base/rw-FAQ.html#What_0027s-the-best-way-to-upgrade_003f)
> says to do exactly that.

There is a way to copy the old packages and then ensure that they are
updated to versions that build correctly on the newest version of
R.
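That recipe essentially boils down to one call; the library path below is hypothetical -- adjust it to wherever you copied the old packages:

```r
# Hypothetical path: wherever you copied the old 2.15 library folder.
# .libPaths(c("C:/Users/me/R/old-library", .libPaths()))

# Rebuild/update everything that was built under the old R version:
update.packages(checkBuilt = TRUE, ask = FALSE)
```

`checkBuilt = TRUE` is what forces packages built under the older R to be refreshed even if their version numbers haven't changed.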


-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] FW: Library update from version

2013-09-30 Thread Steve Lianoglou
Hi,

On Mon, Sep 30, 2013 at 11:12 AM, Cem Girit  wrote:
> Hello,
>
>
>
> I recently installed version 3.0.1 of R on to a computer. I
> have a working installation for a Statconn application using R version
> 2.15.0 on another computer. I have many libraries under this old
> installation. Can I just copy them into the new library from the old, or do
> I install each one of them under the new R?

No, you shouldn't do that.

> Also how can I get a list of
> differences in two libraries so that I can use this list to update the new
> one?

A bit of googling could have provided several answers.

This post in particular has a few answers from some people who know
what they're doing w/ R, so probably a good place to start:

http://stackoverflow.com/questions/1401904/painless-way-to-install-a-new-version-of-r-on-windows

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] why is this a factor?

2013-08-29 Thread Steve Lianoglou
Hi,

On Thu, Aug 29, 2013 at 3:03 PM, Rolf Turner  wrote:
> On 29/08/13 12:10, Ista Zahn wrote:
>>
>> On Wed, Aug 28, 2013 at 7:44 PM, Steve Lianoglou
>>  wrote:
>>>
>>> Hi,
>>>
>>> On Wed, Aug 28, 2013 at 3:58 PM, Ista Zahn  wrote:
>>>>
>>>> Or go all the way and put
>>>>
>>>> options(stringsAsFactors = FALSE)
>>>>
>>>> at the top of your script or in your .Rprofile. This will prevent this
>>>> kind of annoyance in the future without having to say stringsAsFactors
>>>> = FALSE all the time.
>>>
>>> I go back and forth about doing this too (setting a global hammer to
>>> stringsAsFactors), but then other things might mess up -- imagine a
>>> scenario where a package is written with the assumption that the
>>> default `stringsAsFactors=TRUE` setting hasn't been changed, which
>>> could then break when you go the nuclear-global-override route.
>>
>> Yes, possibly, but I've yet to have that problem, whereas before I
>> started changing it globally things used to break fairly regularly.
>
>
> Like Ista I have never had a problem arising from a package's assuming that
> `stringsAsFactors=TRUE` --- and I would opine that any package making such
> an assumption is badly written.  (Of course there is a lot of bad code out
> there )

It never happened to me either, except when code that *I* wrote
depended on the global option being set to stringsAsFactors=FALSE.

I had to hand over a codebase to a colleague in my lab when I left.
Her options(stringsAsFactors) was at the default (TRUE), and things
mysteriously broke until we (eventually) sorted out what was the what
-- it took a while to find because I *totally* forgot I had set
`options(stringsAsFactors=FALSE)` in my ~/.Rprofile several years prior
(a testament to how little it breaks things I guess).

Of course, I can't argue with your premise that code written that
depends on the defaults (or changed defaults) is, in the end, poorly
written code ... sometimes we have to own up to being the ones who
write poorly written code ;-)

I only posted my original warning here to echo, more or less, the
sentiment put forth in this poster, since a decent amount of time was
lost chasing our tails:

http://www.despair.com/mistakes.html

;-)

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] why is this a factor?

2013-08-28 Thread Steve Lianoglou
Hi,

On Wed, Aug 28, 2013 at 3:58 PM, Ista Zahn  wrote:
> Or go all the way and put
>
> options(stringsAsFactors = FALSE)
>
> at the top of your script or in your .Rprofile. This will prevent this
> kind of annoyance in the future without having to say stringsAsFactors
> = FALSE all the time.

I go back and forth about doing this too (setting a global hammer to
stringsAsFactors), but then other things might mess up -- imagine a
scenario where a package is written with the assumption that the
default `stringsAsFactors=TRUE` setting hasn't been changed, which
could then break when you go the nuclear-global-override route.
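For readers finding this thread later: R 4.0.0 changed the factory default to `stringsAsFactors = FALSE`. In the meantime, passing the argument explicitly per call avoids depending on anyone's global options:

```r
# Explicit per-call arguments -- no reliance on options(stringsAsFactors=...)
df1 <- data.frame(x = c("a", "b"), stringsAsFactors = TRUE)
class(df1$x)   # "factor"

df2 <- data.frame(x = c("a", "b"), stringsAsFactors = FALSE)
class(df2$x)   # "character"
```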

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] how to use a column name from the data frame in the function

2013-08-22 Thread Steve Lianoglou
Hi,

On Thu, Aug 22, 2013 at 9:49 PM, Jeff Newmiller
 wrote:
> Please don't post in HTML format... it messes with code examples.
>
> Use character indexing (please read the Introduction to R... again if 
> necessary).
>
> myf <- function(df, colname){
>   df[ ,colname ]
> }

Or df[[colname]] for data.frames

> colname  <- "a"
> myf(m,colname)
>
> Until you learn simple R syntax, I strongly recommend avoiding writing tricky 
> code that plays with names of variables.

And even after you learn simple R syntax, if you think the right thing
to do is to use some combo of substitute/eval/quote and friends, I'd
strongly encourage you to think again ...
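A quick illustration of the difference between the indexing forms (toy data):

```r
df <- data.frame(a = 1:3, b = letters[1:3])
colname <- "a"

df[, colname]   # 1 2 3 -- matrix-style indexing, drops to a vector
df[[colname]]   # 1 2 3 -- list-style extraction, always a single column
df[colname]     # a one-column data.frame, by contrast
```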

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] First time r user

2013-08-18 Thread Steve Lianoglou
Yes, please do some reading and take a crack at your data first.

This will only be a fruitful endeavor for you after you get some
working knowledge of R.

Hadley is compiling a nice book online that I think is very helpful to
read through:
https://github.com/hadley/devtools/wiki/Introduction

The section on "functional looping patterns" will be immediately
useful (once you have a bit more background working with R):
http://github.com/hadley/devtools/wiki/functionals#looping-patterns

It's really a great resource and you should spend the time to read
through it. Once you read and understand the looping-patterns section,
you'll be able to handle your data like a pro and you can move on to
asking more interesting questions ;-)

If something is unclear there, though, please do raise that issue.

HTH,
-steve


On Sun, Aug 18, 2013 at 7:22 AM, Bert Gunter  wrote:
> This is ridiculous!
>
> Please read "An Introduction to R" (ships with R) or other online R
> tutorial. There are many good ones. There are also probably online
> courses. Please make an effort to learn the basics before posting
> further here.
>
> -- Bert
>
>
>
> On Sun, Aug 18, 2013 at 7:13 AM, Dylan Doyle  wrote:
>> Hello all thank-you for your speedy replies ,
>>
>> Here is the first few lines from the head function
>>
>>   brewery_id            brewery_name review_time review_overall review_aroma review_appearance review_profilename
>> 1      10325         Vecchio Birraio  1234817823            1.5          2.0               2.5            stcules
>> 2      10325         Vecchio Birraio  1235915097            3.0          2.5               3.0            stcules
>> 3      10325         Vecchio Birraio  1235916604            3.0          2.5               3.0            stcules
>> 4      10325         Vecchio Birraio  1234725145            3.0          3.0               3.5            stcules
>> 5       1075 Caldera Brewing Company  1293735206            4.0          4.5               4.0     johnmichaelsen
>> 6       1075 Caldera Brewing Company  1325524659            3.0          3.5               3.5            oline73
>>
>>                       beer_style review_palate review_taste              beer_name beer_abv beer_beerid
>> 1                     Hefeweizen           1.5          1.5           Sausa Weizen      5.0       47986
>> 2             English Strong Ale           3.0          3.0               Red Moon      6.2       48213
>> 3         Foreign / Export Stout           3.0          3.0 Black Horse Black Beer      6.5       48215
>> 4                German Pilsener           2.5          3.0             Sausa Pils      5.0       47969
>> 5 American Double / Imperial IPA           4.0          4.5          Cauldron DIPA      7.7       64883
>> 6           Herbed / Spiced Beer           3.0          3.5    Caldera Ginger Beer      4.7       52159
>> I have only discovered how to import the data set and run some basic R
>> functions on it. My goal is to be able to answer questions like: what are the
>> top 10 pilsners, or the brewer with the highest ABV average? Also, using
>> two factors such as best beer aroma and appearance, which beer style should
>> I try? Let me know if I can give you any more information you might need to
>>
>> Thanks again ,
>>
>> Dylan
>>
>>>
>>
>>
>>
>> On Sun, Aug 18, 2013 at 4:16 AM, Paul Bernal  wrote:
>>
>>> Thank you so much Steve.
>>>
>>> The computer I'm currently working with is a 32 bit windows 7 OS. And RAM
>>> is only 4GB so I guess thats a big limitation.
>>> El 18/08/2013 03:11, "Steve Lianoglou" 
>>> escribió:
>>>
>>> > Hi Paul,
>>> >
>>> > On Sun, Aug 18, 2013 at 12:56 AM, Paul Bernal 
>>> > wrote:
>>> > > Thanks a lot for the valuable information.
>>> > >
>>> > > Now my question would necessarily be, how many columns can R handle,
>>> > > provided that I have millions of rows and, in general, whats the
>>> maximum
>>> > > amount of rows and columns that R can effortlessly handle?
>>> >
>>> > This is all determined by your RAM.
>>> >
>>> > Prior to R-3.0, R could only handle vectors of length 2^31 - 1. If you
>>> > were working with a matrix, that meant that you could only have that
>>> > many elements in the entire matrix.
>>> >
>>> > If you were working with a data.frame, you could have data.frames with
>>> > 2^31-1 r

Re: [R] First time r user

2013-08-18 Thread Steve Lianoglou
Hi Paul,

On Sun, Aug 18, 2013 at 12:56 AM, Paul Bernal  wrote:
> Thanks a lot for the valuable information.
>
> Now my question would necessarily be, how many columns can R handle,
> provided that I have millions of rows and, in general, what's the maximum
> amount of rows and columns that R can effortlessly handle?

This is all determined by your RAM.

Prior to R-3.0, R could only handle vectors of length 2^31 - 1. If you
were working with a matrix, that meant that you could only have that
many elements in the entire matrix.

If you were working with a data.frame, you could have data.frames with
2^31-1 rows, and I guess as many columns, since data.frames are really
a list of vectors, the entire thing doesn't have to be in one
contiguous block (and addressable that way)

R-3.0 introduced "Long Vectors" (search for that section in the release notes):

https://stat.ethz.ch/pipermail/r-announce/2013/000561.html

It vastly increases the maximum length of a vector that R can handle
(to on the order of 2^52 elements, assuming you are running 64-bit).
So, if you've got the RAM, you can have a
data.frame/data.table w/ billion(s) of rows, in theory.

To figure out how much data you can handle on your machine, you need
to know the size of real/integer/whatever and the number of elements
of those you will have so you can calculate the amount of RAM you need
to load it all up.
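A back-of-the-envelope sketch of that calculation (the row and column counts below are hypothetical):

```r
# A double takes 8 bytes, an integer 4. For a hypothetical table of
# 1e9 rows and 5 numeric columns:
n_rows <- 1e9
n_cols <- 5
bytes  <- n_rows * n_cols * 8
bytes / 2^30          # ~37 GiB for the raw data alone

# Or measure a smaller object directly and extrapolate:
x <- numeric(1e6)
print(object.size(x), units = "MB")   # ~7.6 MB
```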

Lastly, I should mention there are packages that let you work with
"out of memory" data, like bigmemory, biglm, ff. Look at the HPC Task
view for more info along those lines:

http://cran.r-project.org/web/views/HighPerformanceComputing.html


>
> Best regards and again thank you for the help,
>
> Paul
> El 18/08/2013 02:35, "Steve Lianoglou"  escribió:
>
>> Hi Paul,
>>
>> First: please keep your replies on list (use reply-all when replying
>> to R-help lists) so that others can help but also the lists can be
>> used as a resource for others.
>>
>> Now:
>>
>> On Aug 18, 2013, at 12:20 AM, Paul Bernal  wrote:
>>
>> > Can R really handle millions of rows of data?
>>
>> Yup.
>>
>> > I thought it was not possible.
>>
>> Surprise :-)
>>
>> As I type, I'm working with a ~5.5 million row data.table pretty
>> effortlessly.
>>
>> Columns matter too, of course -- RAM is RAM, after all and you've got
>> to be able to fit the whole thing into it if you want to use
>> data.table. Once loaded, though, data.table enables one to do
>> split/apply/combine calculations over these data quite efficiently.
>> The first time I used it, I was honestly blown away.
>>
>> If you find yourself wanting to work with such data, you could do
>> worse than read through data.table's vignette and FAQ and give it a
>> spin.
>>
>> HTH,
>>
>> -steve
>>
>> --
>> Steve Lianoglou
>> Computational Biologist
>> Bioinformatics and Computational Biology
>> Genentech
>>
>



-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] First time r user

2013-08-18 Thread Steve Lianoglou
Hi Paul,

First: please keep your replies on list (use reply-all when replying
to R-help lists) so that others can help but also the lists can be
used as a resource for others.

Now:

On Aug 18, 2013, at 12:20 AM, Paul Bernal  wrote:

> Can R really handle millions of rows of data?

Yup.

> I thought it was not possible.

Surprise :-)

As I type, I'm working with a ~5.5 million row data.table pretty effortlessly.

Columns matter too, of course -- RAM is RAM, after all and you've got
to be able to fit the whole thing into it if you want to use
data.table. Once loaded, though, data.table enables one to do
split/apply/combine calculations over these data quite efficiently.
The first time I used it, I was honestly blown away.

If you find yourself wanting to work with such data, you could do
worse than read through data.table's vignette and FAQ and give it a
spin.

HTH,

-steve

--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] First time r user

2013-08-17 Thread Steve Lianoglou
Hi,

In addition to Rainer's suggestion (which are to give an small example
of what your input data look like and an example of what you want to
output), given the size of your input data, you might want to try to
use the data.table package instead of plyr::ddply -- especially while
you are exploring different combinations/calculations over your data.

Usually, the equivalent data.table approach (to the ddply one) tends to
be orders of magnitude faster and more memory efficient.

When the size of my data is small, I often use both (I think the
plyr/ddply "language" is rather beautiful), but when my data gets into
the 1000++ rows, I'll universally switch to data.table.
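As a rough sketch of what that switch looks like (invented data; the plyr call is shown commented for comparison):

```r
library(data.table)

# Invented stand-in for the OP's ~1.5M-row data: group by one column,
# summarize another.
set.seed(7)
df <- data.frame(matrix(rnorm(1e5 * 8), ncol = 8))
df$grp <- sample(letters, 1e5, replace = TRUE)

# plyr-style (fine while the data are small):
# library(plyr)
# res1 <- ddply(df, .(grp), summarise, m = mean(X1))

# data.table equivalent -- typically much faster on large inputs:
dt <- as.data.table(df)
res2 <- dt[, .(m = mean(X1)), by = grp]
```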

HTH,
-steve


On Sat, Aug 17, 2013 at 4:33 PM, Dylan Doyle  wrote:
>
> Hello R users,
>
>
> I have recently begun a project to analyze a large data set of approximately 
> 1.5 million rows it also has 9 columns. My objective consists of locating 
> particular subsets within this data ie. take all rows with the same column 9 
> and perform a function on that subset. It was suggested to me that i use the 
> ddply() function from the Pylr package. Any advice would be greatly 
> appreciated
>
>
> Thanks much,
>
> Dylan
>
>



-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] How to extract last value in each group

2013-08-15 Thread Steve Lianoglou
Hi,

On Thu, Aug 15, 2013 at 4:03 PM, arun  wrote:
> HI Steve,
>
> Thanks for testing.
>
> When I run a slightly bigger dataset:
> set.seed(1254)
> name<- sample(letters,1e7,replace=TRUE)
> number<- sample(1:10,1e7,replace=TRUE)
>
> datTest<- data.frame(name,number,stringsAsFactors=FALSE)
> library(data.table)
>
> dtTest<- data.table(datTest)
>
> system.time(res3<- dtTest[,list(Sum_Number=sum(number)),by=name])
>  #user  system elapsed
>  # 0.592   0.028   0.623
>
> #Then I tried this:
>
> dtTest1<- data.table(datTest,key=name)
> #Error: C stack usage is too close to the limit
>
> Cstack_info()
> #  sizecurrent  direction eval_depth
>  #  8388608   7320  1  2

Do you get this stack problem if you quote `name`, eg:

R> dtTest1 <- data.table(datTest, key="name")

?

Perhaps we should move this to data.table-help if you want to debug
further, though:

https://lists.r-forge.r-project.org/cgi-bin/mailman/listinfo/datatable-help

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] How to extract last value in each group

2013-08-15 Thread Steve Lianoglou
Hi,

On Thu, Aug 15, 2013 at 1:38 PM, arun  wrote:
> I tried it again on a fresh start using the data.table alone:
> Now.
>
>  dt1 <- data.table(dat2, key=c('Date', 'Time'))
>  system.time(ans <- dt1[, .SD[.N], by='Date'])
> #   user  system elapsed
> # 40.908   0.000  40.981
> #Then tried:
> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
>  #  user  system elapsed
>  # 0.148   0.000   0.151  #same time as before

Amazing. This is what I get on my MacBook Pro, i7 @ 3GHz (very close
specs to your machine):

R> dt1 <- data.table(dat2, key=c('Date', 'Time'))
R> system.time(ans <- dt1[, .SD[.N], by='Date'])
   user  system elapsed
  0.064   0.009   0.073

R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
   user  system elapsed
  0.148   0.016   0.165

On one of our compute server running who knows what processor on some
version of linux, but shouldn't really matter as we're talking
relative time to each other here:

R> system.time(ans <- dt1[, .SD[.N], by='Date'])
   user  system elapsed
  0.160   0.012   0.170

R> system.time(res7<- dat2[cumsum(rle(dat2[,1])$lengths),])
   user  system elapsed
  0.292   0.004   0.294

There's got to be some other explanation for the heavily degraded
performance you're observing... our R & data.table versions also
match.

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] How to extract last value in each group

2013-08-15 Thread Steve Lianoglou
Hi,

Looks like you have some free time on your hands :-)

Something looks a bit off here, though, I was surprised to see the
time you reported for the data.table option:

> #separate the data.table creation step:
>  dt1 <- data.table(dat2, key=c('Date', 'Time'))
> system.time(ans <- dt1[, .SD[.N], by='Date'])
> # user  system elapsed
> # 38.500   0.000  38.566

When I do the same, this is what I get:

   user  system elapsed
  0.064   0.009   0.074

I know this is very much dependent on what type of cpu you are running
on, but unless you're running your tests on a Commodore 64, it looks like
something went wonky.

Lastly, neither here or there: some of the solutions assumed the data
were already grouped and sorted for you, so there are clever ways to
pick off the last one (cumsum and the like), but I've found it prudent
to always assume that the data has been handed to me by a rather
clever and insidious adversary and taking steps to ensure you are
getting what you want (whether using an index on a data.table, or some
combo of split + max/which.max) probably is a good way to go.
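In that spirit, here is a small base-R version that enforces the ordering itself instead of trusting the adversary (toy data modeled on the OP's columns):

```r
x <- data.frame(
  Date = c("06/02/2010", "06/01/2010", "06/02/2010", "06/01/2010"),
  Time = c(335, 1400, 338, 1700),
  C    = c(136.80, 136.40, 136.80, 136.55),
  stringsAsFactors = FALSE
)

x <- x[order(x$Date, x$Time), ]                    # enforce the ordering
last <- x[!duplicated(x$Date, fromLast = TRUE), ]  # last row per Date
last   # one row per Date: the Time == 1700 and Time == 338 rows
```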

My 2 cents,

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] How to extract last value in each group

2013-08-14 Thread Steve Lianoglou
Or with plyr:

R> library(plyr)
R> ans <- ddply(x, .(Date), function(df) df[which.max(df$Time),])

-steve

On Wed, Aug 14, 2013 at 2:18 PM, Steve Lianoglou
 wrote:
> While we're playing code golf, likely faster still could be to use
> data.table. Assume your data is in a data.frame named "x":
>
> R> library(data.table)
> R> x <- data.table(x, key=c('Date', 'Time'))
> R> ans <- x[, .SD[.N], by='Date']
>
> -steve
>
> On Wed, Aug 14, 2013 at 2:01 PM, William Dunlap  wrote:
>> A somewhat faster version (for datasets with lots of dates, assuming it is 
>> sorted by date and time) is
>>   isLastInRun <- function(x) c(x[-1] != x[-length(x)], TRUE)
>>   f3 <- function(dataFrame) {
>>   dataFrame[ isLastInRun(dataFrame$Date), ]
>>   }
>> where your two suggestions, as functions, are
>>   f1 <- function (dataFrame) {
>>   dataFrame[unlist(with(dataFrame, tapply(Time, list(Date), FUN = function(x) x == max(x)))), ]
>>   }
>>   f2 <- function (dataFrame) {
>>   dataFrame[cumsum(with(dataFrame, tapply(Time, list(Date), FUN = which.max))), ]
>>   }
>>
>> Bill Dunlap
>> Spotfire, TIBCO Software
>> wdunlap tibco.com
>>
>>
>>> -Original Message-
>>> From: r-help-boun...@r-project.org [mailto:r-help-boun...@r-project.org] On 
>>> Behalf
>>> Of arun
>>> Sent: Wednesday, August 14, 2013 1:08 PM
>>> To: Noah Silverman
>>> Cc: R help
>>> Subject: Re: [R] How to extract last value in each group
>>>
>>> Hi,
>>> Try:
>>> dat1<- read.table(text="
>>> Date Time  O  H  L  C  U  D
>>> 06/01/2010 1358 136.40 136.40 136.35 136.35  2  12
>>> 06/01/2010 1359 136.40 136.50 136.35 136.50  9  6
>>> 06/01/2010 1400 136.45 136.55 136.35 136.40  8  7
>>> 06/01/2010 1700 136.55 136.55 136.55 136.55  1  0
>>> 06/02/2010  331 136.55 136.70 136.50 136.70  36  6
>>> 06/02/2010  332 136.70 136.70 136.65 136.65  3  1
>>> 06/02/2010  334 136.75 136.75 136.75 136.75  1  0
>>> 06/02/2010  335 136.80 136.80 136.80 136.80  4  0
>>> 06/02/2010  336 136.80 136.80 136.80 136.80  8  0
>>> 06/02/2010  337 136.75 136.80 136.75 136.80  1  2
>>> 06/02/2010  338 136.80 136.80 136.80 136.80  3  0
>>> ",sep="",header=TRUE,stringsAsFactors=FALSE)
>>>
>>>  dat1[unlist(with(dat1,tapply(Time,list(Date),FUN=function(x) x==max(x)))),]
>>> # Date Time  O  H  L  C U D
>>> #4  06/01/2010 1700 136.55 136.55 136.55 136.55 1 0
>>> #11 06/02/2010  338 136.80 136.80 136.80 136.80 3 0
>>> #or
>>>  dat1[cumsum(with(dat1,tapply(Time,list(Date),FUN=which.max))),]
>>>  Date Time  O  H  L  C U D
>>> 4  06/01/2010 1700 136.55 136.55 136.55 136.55 1 0
>>> 11 06/02/2010  338 136.80 136.80 136.80 136.80 3 0
>>>
>>> #or
dat1[as.logical(with(dat1,ave(Time,Date,FUN=function(x) x==max(x)))),]
>>>  #Date Time  O  H  L  C U D
>>> #4  06/01/2010 1700 136.55 136.55 136.55 136.55 1 0
>>> #11 06/02/2010  338 136.80 136.80 136.80 136.80 3 0
>>> A.K.
>>>
>>>
>>>
>>>
>>> - Original Message -
>>> From: Noah Silverman 
>>> To: "R-help@r-project.org" 
>>> Cc:
>>> Sent: Wednesday, August 14, 2013 3:56 PM
>>> Subject: [R] How to extract last value in each group
>>>
>>> Hello,
>>>
>>> I have some stock pricing data for one minute intervals.
>>>
>>> The delivery format is a bit odd.  The date column is easily parsed and 
>>> used as an index
>>> for an its object.  However, the time column is just an integer (1:1807)
>>>
>>> I just need to extract the *last* entry for each day.  Don't actually care 
>>> what time it was,
>>> as long as it was the last one.
>>>
>>> Sure, writing a big nasty loop would work, but I was hoping that someone 
>>> would be able
>>> to suggest a faster way.
>>>
>>> Small snippet of data below my sig.
>>>
>>> Thanks!
>>>
>>>
>>> --
>>> Noah Silverman, M.S., C.Phil
>>> UCLA Department of Statistics
>>> 8117 Math Sciences Building
>>> Los Angeles, CA 90095
>>>
>>> --
>>>

Re: [R] How to extract last value in each group

2013-08-14 Thread Steve Lianoglou
6.80 136.75 136.80   1   2
>> 06/02/2010  338 136.80 136.80 136.80 136.80   3   0



-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] How to "vectorize" subsetting

2013-08-14 Thread Steve Lianoglou
Howdy,

On Wed, Aug 14, 2013 at 9:40 AM, Bert Gunter  wrote:
> mod Jeff Newmiller's comments...
>
> 1. Have you read "An Introduction to R"? (or other basic tutorial --
> there are many on the web). If no, stop posting and do so. This will
> help you to understand R's basic data manipulation capabilities and
> structures (list, "apply" type functions,...).
>
> 2. mod 1), perhaps
>
> ?tapply (and friends like ?ave, ?aggregate, ?by)
> ?split

... and after you've discovered the existence and explored the
functionality of these functions, run (don't walk) over to check out
the plyr package:

http://plyr.had.co.nz

And read through this relevant chapter in Hadley's book:
https://github.com/hadley/devtools/wiki/functionals#data-structure-functionals

It will take you through looping, to *apply-ing, to plyr-ing.
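For anyone landing here from a search, the split/apply pattern those help pages describe boils down to the following (base R only; plyr wraps the same idea):

```r
x <- c(1, 2, 3, 10, 20)
g <- c("a", "a", "a", "b", "b")
by_group <- tapply(x, g, sum)         # named result: a = 6, b = 30
same     <- sapply(split(x, g), sum)  # split() + an *apply is equivalent
```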

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] Off-topic? Linux laptop for R

2013-08-11 Thread Steve Lianoglou
Hi,

I have no real input here from personal experience, but the author of
the coderspiel blog has these two "recent" posts about his experience
with Ubuntu on (what seem to be) two very nice machines:

The latest is a Vaio Pro of some sort. Ubuntu is a bit difficult to
install, but doable:

http://code.technically.us/post/55425026899/vaio-pro-for-programming

An earlier post talks about the ThinkPad Carbon X1:

http://code.technically.us/post/50837506478/senistive-touchpads-and-ubuntu

Which apparently supports ubuntu quite easily (out of the box, I think).

From quickly skimming, it seems like his only gripe with the X1 is
that it has a large monitor.

I can't really imagine why any of these laptops would have a problem
running R. I agree with what (I think) Rolf is saying in that your
biggest issue will be to find a laptop that runs your favorite flavor
of Linux well. Once you satisfy that constraint, I'm relatively sure
that the chances of running R "well" are quite high. Whether or not the
machine can run R well doesn't say much about how easily linux will be
installed (and fully functional).

HTH,
-steve


On Sun, Aug 11, 2013 at 1:19 PM, Rolf Turner  wrote:
>
>
> I think that Hasan Diwan's assertion is a bit of an over-simplification.  I
> have a Toshiba Satellite
> L850 that has no problems of any sort running R.  However it *does* have
> problems with WiFi.
> The WiFi drivers for my laptop won't work under (any?) Linux system.
> Apparently (I don't completely
> grok the concepts here) this is because the drivers are proprietary and so
> Linux developers can't
> get at the code.
>
> My previous laptop (an elderly IBM ThinkPad) had no problems with WiFi, at
> least not after I
> upgraded to the then most recent versions of Fedora and later Ubuntu.
>
> I have managed to work around the WiFi problem by using a USB WiFi device.
> Be careful,
> but.  The first one I got, an ASUS USB-N10, was advertised to have "Linux
> support" but
> after much travail (and after having got a great deal of expert advice) I
> decided it was
> no go.  I am currently using an EnGenius EUB9801 which seems to work
> smoothly.  I have
> also ordered a "Penguin Wireless G USB Adapter for GNU / Linux" from
> ThinkPenguin.com,
> but it hasn't arrived yet.  (Being shipped from the USA to New Zealand.)
> The ThinkPenguin
> people seem to have their heads screwed on right, and answered my inquiry
> promptly,
> thoroughly and comprehensibly.
>
> I hope this is of some relevance to someone!
>
> cheers,
>
> Rolf Turner
>
>
>
> On 12/08/13 06:47, Hasan Diwan wrote:
>>
>> Any laptop that performs well with Linux will perform acceptably with R
>> and
>> vice versa. -- H
>>
>>
>> On 11 August 2013 11:03, Mitchell Maltenfort  wrote:
>>
>>> Can anyone recommend a laptop that performs well running R under Linux?
>>> Thanks.
>
>



-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] Advice on use of R for Generalised Linear Modelling

2013-08-11 Thread Steve Lianoglou
Hi,

On Sun, Aug 11, 2013 at 4:47 AM, Alan Sausse  wrote:
> Hi,
>
> Not an expert R user, something of a novice - please be gentle with me!
>
> I have a particular interest in generalised linear models (GLMs) and I'm
> experienced in fitting them using other bits of software.
>
> R can fit GLMs of course, using the glm() command.  I have some large
> multivariate data sets I'd like to fit GLMs to, ideally using R.  Two
> concerns though:
>
> Firstly, I'm told that R isn't especially fast at fitting GLMs, especially
> if the data files are too large to fit into RAM.  Can anyone advise if
> there are alternatives to glm() around which might cope better.  For
> example, I've heard that RevolutionR is available, and claims to fit GLMs
> faster in these cases.  Might it be possible, alternatively, to write some
> very quick code using C (for example) and to get R to invoke this instead?
>  Has anyone tried to do this?

Likely not -- you'd need to have RevolutionR around for that, and if
you have RevoR, then just use RevoR -- not sure what the point would
be in calling RevoR-specific functionality from R.

Perhaps the biglm package can help you from R, though, as it provides
a bigglm function that can do GLMs with out-of-memory data -- no idea
how well/fast it works, though.

You should also consider that your data may not require that, though
-- glmnet, for instance, works incredibly fast on large data. If your
data can actually be loaded (perhaps via a sparse matrix), then you
can try that.
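A minimal sketch of the in-memory baseline with glm(); the biglm-package call is shown only as a comment, since its exact chunked-data interface is an assumption to verify in ?bigglm:

```r
set.seed(1)
d <- data.frame(x1 = rnorm(500), x2 = rnorm(500))
d$y <- rbinom(500, 1, plogis(0.5 + 1.2 * d$x1 - 0.8 * d$x2))
fit <- glm(y ~ x1 + x2, data = d, family = binomial())
coef(fit)  # x1 estimated positive, x2 negative, as simulated
# library(biglm)                                    # assumption: installed
# fit_big <- bigglm(y ~ x1 + x2, data = d, family = binomial())
```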

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] ifelse() applied on function

2013-08-11 Thread Steve Lianoglou
Hi,

On Sun, Aug 11, 2013 at 1:18 PM, Ron Michael  wrote:
> Hi,
>
> How can I apply ifelse function to chose appropriate function?
>
> My goal is that the user will choose the lapply() function if "ChooseFn = T", 
> and otherwise choose sfLapply() from the snowfall package.
>
> I am basically trying to avoid repeating all the internal steps, whether the 
> user chooses lapply or sfLapply. For example, I am wondering if there is any 
> possibility to write something like:
>
> ifelse(ChooseFn, lapply, sfLapply)(MyList, function(x) {
>
> ...
> ..
> return())
>
> Really appreciate if someone helps me out.

How about something like:

loop <- if (ChooseFn) lapply else sfLapply

result <- loop(MyList, function(x) {
  ## ...
})

This should work as long as `sfLapply` has the same function signature as `lapply`.
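A runnable version of the same idea -- sapply stands in for sfLapply here only so the sketch runs without snowfall:

```r
ChooseFn <- TRUE
loop <- if (ChooseFn) lapply else sapply  # functions are ordinary values in R
result <- loop(1:3, function(x) x^2)
str(result)  # a list here, because lapply was chosen
```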

HTH,

-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] glmnet inclusion / exclusion of categorical variables

2013-08-09 Thread Steve Lianoglou
Hi,

On Fri, Aug 9, 2013 at 6:44 AM, Kevin Shaney  wrote:
>
> Hello -
>
> I have been using GLMNET of the following form to predict multinomial 
> logistic / class dependent variables:
>
> mglmnet=glmnet(xxb,yb ,alpha=ty,dfmax=dfm,
> family="multinomial",standardize=FALSE)
>
> I am using both continuous and categorical variables as predictors, and am 
> using sparse.model.matrix to code my x's into a matrix.  This is changing an 
> example categorical variable whose original name / values is {V1 = "1" or "2" 
> or "3"} into two recoded variables {V12= "1" or "0" and V13 = "1" or "0"}.
>
> As I am cycling through different penalties, I would like to have both 
> recoded variables either included or excluded, but not just one included - and 
> I can't figure out how to make that work.  I tried changing the 
> "type.multinomial" option, as it looks like this option should do what I 
> want, but can't get it to work (maybe the difference in recoded variable 
> names is driving this).
>
> To summarize, for categorical variables, I would like to hierarchically 
> constrain inclusion / exclusion of recoded variables in the model - either 
> all of the recoded variables from the same original categorical variable are 
> in, or all are out.

Pretty sure that you'll need the "grouped lasso" for that. Quick
googling over CRAN suggests:

grplasso: http://cran.r-project.org/web/packages/grplasso/index.html
standGL: http://cran.r-project.org/web/packages/standGL/index.html
gglasso: http://code.google.com/p/gglasso/

Unfortunately it doesn't look like any of them support the equivalent
of family="multinomial", only 2-class classification.
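Whichever package is used, the index that ties dummy columns back to their source factor can be read off the "assign" attribute of the model matrix; how each package wants that index encoded is an assumption to check in its own docs:

```r
d  <- data.frame(V1 = factor(c("1", "2", "3", "2")), x = c(0.1, 0.2, 0.3, 0.4))
mm <- model.matrix(~ V1 + x, d)
attr(mm, "assign")  # c(0, 1, 1, 2): intercept, both V1 dummies, then x
```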

HTH,
-steve

-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] Why is mclapply slower than apply in this case?

2013-08-08 Thread Steve Lianoglou
>> E <- diag(m);
>> Dmat <- diff(E);
>> X <- rbind(E, lambda * Dmat)
>> u <- c(y, rep(0, m - 1))
>>
>> # Call quantile regression
>> q <- rq.fit.fnb(X, u, tau = p)
>> q
>> }
>>
>> Function rq.fit.fnb (quantreg library):
>>
>> rq.fit.fnb <- function (x, y, tau = 0.5, beta = 0.5, eps = 1e-06)
>> {
>>     n <- length(y)
>>     p <- ncol(x)
>>     if (n != nrow(x))
>>         stop("x and y don't match n")
>>     if (tau < eps || tau > 1 - eps)
>>         stop("No parametric Frisch-Newton method.  Set tau in (0,1)")
>>     rhs <- (1 - tau) * apply(x, 2, sum)
>>     d <- rep(1, n)
>>     u <- rep(1, n)
>>     wn <- rep(0, 10 * n)
>>     wn[1:n] <- (1 - tau)
>>     z <- .Fortran("rqfnb", as.integer(n), as.integer(p),
>>         a = as.double(t(as.matrix(x))), c = as.double(-y),
>>         rhs = as.double(rhs), d = as.double(d), as.double(u),
>>         beta = as.double(beta), eps = as.double(eps),
>>         wn = as.double(wn), wp = double((p + 3) * p),
>>         it.count = integer(3), info = integer(1), PACKAGE = "quantreg")
>>     coefficients <- -z$wp[1:p]
>>     names(coefficients) <- dimnames(x)[[2]]
>>     residuals <- y - x %*% coefficients
>>     list(coefficients = coefficients, tau = tau, residuals = residuals)
>> }
>>
>> For data vector of length 2000 i get:
>>
>> (value = elapsed time in sec; columns = different number of columns of
>> smoothed matrix/list)
>>
>>            2cols   4cols   6cols   8cols
>> apply      0.178   0.096   0.069   0.056
>> lapply    16.555   4.299   1.785   0.972
>> mc2lapply 11.192   2.089   0.927   0.545
>> mc4lapply 10.649   1.326   0.694   0.396
>> mc6lapply 11.271   1.384   0.528   0.320
>> mc8lapply 10.133   1.390   0.560   0.260
>>
>> For data of length 4000 i get:
>>
>>             2cols    4cols    6cols   8cols
>> apply       0.351    0.187    0.137   0.110
>> lapply    189.339   32.654   14.544   8.674
>> mc2lapply 186.047   20.791    7.261   4.231
>> mc4lapply 185.382   30.286    5.767   2.397
>> mc6lapply 184.048   30.170    8.059   2.865
>> mc8lapply 182.611   37.617    7.408   2.842
>>
>> Why is apply so much more efficient than mclapply? Maybe I'm just making
>> a common beginner mistake.
>>
>> Thank you for your reactions.
>>
>> [[alternative HTML version deleted]]
>>
>
>
>
> --
>
> Bert Gunter
> Genentech Nonclinical Biostatistics
>
> Internal Contact Info:
> Phone: 467-7374
> Website:
> http://pharmadevelopment.roche.com/index/pdb/pdb-functional-groups/pdb-biostatistics/pdb-ncb-home.htm
>



-- 
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] tree in tree package - Error: cannot allocate vector of size 2.0 Gb

2013-05-31 Thread Steve Lianoglou
Hi,

On Fri, May 31, 2013 at 5:38 PM, Stephen Sefick  wrote:
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-redhat-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.utf8   LC_NUMERIC=C
>  [3] LC_TIME=en_US.utf8LC_COLLATE=en_US.utf8
>  [5] LC_MONETARY=en_US.utf8LC_MESSAGES=en_US.utf8
>  [7] LC_PAPER=CLC_NAME=C
>  [9] LC_ADDRESS=C  LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics  grDevices utils datasets  methods   base
>
>
> 6GB RAM, intel core2 quad, Scientific Linux 6.4
>
> I am using tree in the tree package.  and I get the following error:
>
> Error: cannot allocate vector of size 2.0 Gb
>
> shouldn't I be able to allocate more memory than 2GB?  I am sure that I am
> missing something.  Any help would be greatly appreciated.

I don't remember who on the R list had previously put it this way, but
it puts it best, I think:

The 2gb is the straw that broke the camel's back.

R has likely been allocating more and more memory for whatever it's
trying to do, and at some point it asked the OS for another 2gb,
and *bam* ... toasted.

If you look at top (or htop) and monitor the R process as it's
running, I reckon that's what you'll see, too.

HTH,

-steve

--
Steve Lianoglou
Computational Biologist
Bioinformatics and Computational Biology
Genentech



Re: [R] ecdf --- title suggestion and question

2013-05-08 Thread Steve Lianoglou
Hi Ivo,

On Wed, May 8, 2013 at 1:37 PM, ivo welch  wrote:
> dear R-experts---first, a suggestion to martin: the ecdf() function
> could have an optional parameter to set the title.  by looking at
> str(), I see the plot title is set in an attr named "call".  i.e., I
> can reset it as
>
> ee <- ecdf( rnorm(25 ) )
> attr(ee,"call") <- "my own title"
> plot(ee)

Why not just:

R> plot(ee, main="my own title")

> alas, I cannot figure out how to get rid of the title altogether.
> attr(ee,"call") <- NULL gives me two quotation marks ("") .  is it
> possible to remove the title altogether?

R> plot(ee, main="")
R> plot(ee, main=NULL)
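Both are runnable end to end (drawing to a scratch PDF device so nothing pops up):

```r
set.seed(1)
ee <- ecdf(rnorm(25))
pdf(tempfile(fileext = ".pdf"))   # throwaway device for the demo
plot(ee, main = "my own title")   # custom title
plot(ee, main = "")               # no title at all
dev.off()
```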

-steve

--
Steve Lianoglou
Computational Biologist
Department of Bioinformatics and Computational Biology
Genentech



Re: [R] SVD on very large data matrix

2013-04-08 Thread Steve Lianoglou
Hi Andy,

On Mon, Apr 8, 2013 at 7:44 AM, Andy Cooper 
wrote:
>
>
>
> Dear All,
>
> I need to perform a SVD on a very large data matrix, of dimension
> ~500,000 x 1,000, and I am looking for an efficient algorithm that can
> perform an approximate (partial) SVD to extract on the order of the top
> 50 right and left singular vectors.


Scanning through the results after googling for "cran big svd" suggests
that the irlba package might be useful for you:

http://cran.r-project.org/web/packages/irlba/

The first sentence of its vignette looks quite promising:

"""The irlba package provides a fast way to compute partial singular value
decompositions (SVD) of large matrices ..."
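The call would be roughly `irlba::irlba(A, nv = 50)` -- an assumption, so check ?irlba for the current arguments. For comparison, base svd() can truncate what it *returns* via nu/nv, though it still computes the full decomposition and so saves no time at 500,000 rows:

```r
set.seed(1)
A <- matrix(rnorm(200 * 20), 200, 20)
s <- svd(A, nu = 5, nv = 5)  # only 5 left/right singular vectors returned
dim(s$u)                     # 200 x 5
```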

HTH,
-steve

--
Steve Lianoglou
Defender of The Thesis
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact




Re: [R] Elasticnet - Cross validation problem

2013-03-14 Thread Steve Lianoglou
Hi,

On Thu, Mar 14, 2013 at 2:36 PM, Noah Silverman  wrote:
> Hello,
>
> I am attempting to use elasticnet to classify a number of documents.
>
> The features are words.  The data is coded into a matrix with each document 
> as a row and each word as a column.  The data is binary, with {0,1} 
> indicating the presence of a word.
>
> I want to use the cross validation function of elasticnet (cv.enet).  
> However, when the code selects a random subset of the data for a given run, 
> some of the word columns may be all 0.  (A given word simply isn't present in 
> the subset of data sampled.)  This causes the the function to return an error 
> about variance of 0.
>
> Any suggestions on how to mitigate this issue?  Given that I want a 5-fold 
> cross validation to determine optimal tuning?

It looks like you can jimmy-up your own splits for cross validation by
using the `foldid` parameter to `cv.glmnet`, so one option is to
construct your own splits so that the scenario that's tripping you up
doesn't happen.

Or, you can create a modified version of the cv function that still
picks samples randomly, but handles situations where you have all 0
columns as a special case -- I guess you would reduce your feature
matrix for that fold, run the goods, then drop the coefs back into the
original "columns" they'd belong to as if you ran the training on the
full feature matrix.

Know what I mean?
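A sketch of the first option: draw balanced folds and redraw until no training split has an all-zero column. Toy data; the names here are illustrative, not from the elasticnet API:

```r
set.seed(42)
X <- matrix(rbinom(100 * 20, 1, 0.3), 100, 20)  # binary document-term toy
k <- 5
repeat {
  foldid <- sample(rep(seq_len(k), length.out = nrow(X)))
  bad <- any(sapply(seq_len(k), function(f)
    any(colSums(X[foldid != f, , drop = FALSE]) == 0)))
  if (!bad) break  # every training split now sees every word at least once
}
table(foldid)      # balanced folds of 20 rows each
```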

HTH,
-steve

-- 
Steve Lianoglou
Defender of The Thesis
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] loop in a data.table

2013-03-13 Thread Steve Lianoglou
Hi,

On Wed, Mar 13, 2013 at 7:25 PM, Camilo Mora  wrote:
> Hi everyone,
>
> I have a data.table called "data" with many columns which I want to group by
> column1 using data.table, given how fast it is.
>
> The problem with looping a data.table is that data.table does not like
> quotations  to define the column names (e.g. "col2" instead of col2). I
> found a way around which is to use get("col2"), which works fine but the
> processing time multiples by 20.
>
> So if I use:
>
> data[,sum(col2),by=(key)]
>
> entering the column names by hand, the operation is done in 1 sec. but if in
> the contrary I use:
>
> data[,sum(get("col2")),by=(key)]
>
> using a loop to put the column names, the same operation takes 20 sec. I
> cannot use the former code because I have 10 files to process but the
> later will simply take months to complete. Is there any alternative to the
> function "get" or any other way in which data.table con recognize the names
> of the columns?.

I'm still not sure what you're trying to do. Could you maybe create an
example that's a bit closer to your real data and the stuff you want to
do on it?

Are all the columns of the same type?
Are you just summing columns?

If you post code in an email that reconstructs a small version of
your data.table (maybe 5-10 columns and one or two groups) it'd be
more clear for me.
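Aside: the usual data.table idiom for summing many columns by name without get() is, I believe, `.SDcols` -- e.g. `data[, lapply(.SD, sum), by = key, .SDcols = cols]` (an assumption; check ?data.table). The same select-columns-by-name pattern in base R, runnable here:

```r
df   <- data.frame(key = c("a", "a", "b"), col2 = c(1, 2, 10), col3 = c(5, 6, 7))
cols <- c("col2", "col3")                    # column names held as strings
sums <- aggregate(df[cols], by = list(key = df$key), FUN = sum)
sums  # key a: col2 = 3, col3 = 11; key b: col2 = 10, col3 = 7
```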

Thanks,
-steve
-- 
Steve Lianoglou
Defender of The Thesis
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] scan not working

2013-01-27 Thread Steve Lianoglou
Hi,

It sounds like you just want to write a command line script using R
and you would pass the suffix/prefix as command line args, no?

Why not just go with what Peter has already suggested with
`commandArgs`, or if you want a more feature-rich command line arg
parser, you can try:

http://cran.r-project.org/web/packages/optparse/index.html

If the command line argument thing won't work for you, perhaps you can
elaborate more? For instance, you mention that you know how you might
do "this" in Perl ... perhaps you can clarify "this" a bit more.

-steve

On Sun, Jan 27, 2013 at 12:27 PM, Emily Sessa  wrote:
> Hello all (again),
>
> I received a very helpful answer to this question, and would like to pose one 
> more:
>
> Right now I have this script, which is being called from the command line, 
> writing output to two generically named files ("pvalues" and "qvalues") that 
> are named in the script using the line:
>
> write(pvalues, file="pvalues", ncol=1)
> write(adjusted, file="qvalues", ncol=1)
>
> However, ideally I would like those two files to have something appended to 
> their names that make them separate from one another, so I can identify which 
> input they went with and so they won't write over each other when I script 
> this into a Perl pipeline that will process many input files, which is my 
> ultimate goal. I know how to do this in Perl, but not R... is there some way 
> I can add another argument on the command line that will get passed to the 
> R.script, like a simple letter code (e.g. "cro"), and then have it append 
> that to the output file names, so they are, for example: "qvalues_cro" and 
> "pvalues_cro"?
>
> Thank you very much,
> Emily
>
> On Jan 27, 2013, at 4:34 AM, peter dalgaard  wrote:
>
>>
>> On Jan 27, 2013, at 08:33 , Emily Sessa wrote:
>>
>>> Hi all,
>>>
>>> I am trying to use the scan function in an R script that I am calling from 
>>> the command line on a Mac; at the shell prompt I type:
>>>
>>> $ Rscript get_q_values.R LRT_codeml_output
>>>
>>> in the hope that LRT_codeml_output will get passed to the get_q_values R 
>>> script. The first line of that script is:
>>>
>>> chidata <- scan(file="")
>>>
>>> which, as I understand how scan works, will read the contents of the file 
>>> from the command line into the object chidata. I did this a few times and 
>>> it worked like a charm. And then, it stopped working. Now, every time I try 
>>> to do this, I get "Read 0 items" as the next line in the terminal window, 
>>> and the output produced by the script is empty, because it's apparently no 
>>> longer reading anything in. I don't think I changed anything in the script; 
>>> it just stopped being able to execute the scan function. Does anyone have 
>>> any idea how to fix this?? I did not have anything else in that scan line 
>>> when it was working before. I've updated R and restarted my computer in the 
>>> hope that it would help, but it hasn't. Any help would be much appreciated.
>>
>> I don't see how that would ever work. The 2nd and further args to Rscript 
>> are passed to R and accesible via commandArgs(). There's no way that scan() 
>> can know what the arguments are. It might work with
>>
>> Rscript get_q_values.R < LRT_codeml_output
>>
>> though. Or you need to arrange explicitly for 
>> scan(file=commandArgs(TRUE)[1]).
>>
>>>
>>> -ES
>
>
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] foreach takes foreever?

2013-01-21 Thread Steve Lianoglou
Hi,

On Mon, Jan 21, 2013 at 10:59 AM, Andre Zege  wrote:
> I started to look at ways to improve times of certain very parallel tasks and 
> thought that foreach should be a valid candidate to do the job.
> So, i opened foreach tutorial by Steve Weston and started timing examples 
> from it. First example from tutorial is
>
>
>> system.time(for(i in 1:10) sqrt(i))
>
>    user  system elapsed
>    0.06    0.00    0.06
>
>> system.time(foreach(i=1:10) %do% sqrt(i))
>    user  system elapsed
>  102.37    0.21  103.38
>
> Hmm, 1700 time slower?
>
> second example is
>> system.time(x <- exp(1:100))
>    user  system elapsed
>    0.34    0.03    0.42
>
>> system.time(x <- foreach(i=1:100, .combine='c') %do% exp(i))
>
>
> I stopped it at 958 seconds, didn't have enough patience -- it basically 
> seems that foreach slows down this naive one by more than 2000 times. 
> I must be doing something very wrong. Am I supposed to set some environment 
> variables before it works properly? I am running 64bit R on win7 dual core 
> 2.27GHZ CPUs and 4GB memory laptop.

You should keep reading that vignette you are working from :-)

From Section 5, "Parallel Execution":

"""
... But for the kinds of quick running operations that we’ve been
doing, there wouldn’t be much point to executing them in parallel.
Running many tiny tasks in parallel will usually take more time to
execute than running them sequentially, and if it already runs fast,
there’s no motivation to make it run faster anyway. But if the
operation that we’re executing in parallel takes a minute or longer,
there starts to be some motivation.
"""

The task you are parallelizing is too trivial. The time to coordinate
the data splitting + forking + etc. is more than just running sqrt.

When the specific task you are running within each iteration is more
involved, the benefit of parallelization will become more clear.
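The same point in base R's parallel package: fork-based workers need a unix-alike, and mc.cores = 1 degenerates to plain lapply (used here only so the sketch runs anywhere). Parallel dispatch pays a fixed coordination cost per task, so it only wins when each task does substantial work:

```r
library(parallel)
res_serial   <- lapply(1:8, sqrt)
res_parallel <- mclapply(1:8, sqrt, mc.cores = 1)  # same answers, no forks
identical(res_serial, res_parallel)                # TRUE
```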

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] (no subject)

2012-12-31 Thread Steve Lianoglou
Hi,

Firstly -- please use an informative (non-empty!) subject line in your emails.

On Mon, Dec 31, 2012 at 11:54 AM, eliza botto  wrote:
>
> Dear useRs,
> I am getting following error while using my R java machine.
>>Error: OutOfMemoryError (Java): Java heap space
> to get rid of it i used
>>options( java.parameters = "-Xmx1200m")
> but unfortunatly its not working
> Does anyone ever encountered this error??

Have you tried increasing your value for Xmx?
Do you have enough RAM for the value of Xmx you are setting?
What is the output of your `sessionInfo()`?
Can you give a brief explanation (code would be helpful) of what you
are trying to do?
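One detail worth stressing: the option has to be set before rJava (or anything that loads it) starts the JVM, since a running JVM keeps its original heap ceiling:

```r
options(java.parameters = "-Xmx4g")  # 4 GB here; pick a size your RAM allows
getOption("java.parameters")
# library(rJava)  # only now attach rJava (assumes the package is installed)
```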

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] pROC and ROCR give different values for AUC

2012-12-19 Thread Steve Lianoglou
On Wed, Dec 19, 2012 at 12:36 PM, Ivana Cace  wrote:
> I considered that but on the list more people are likely to see it. So if 
> they ran into the same thing they may already have figured out what is going 
> on and have answers.
> Or if not, I'll make a reproducible example, ask the maintainers and post 
> back here where others may find it.

Even a reproducible example posted here, along w/ cc-ing the
maintainers, would probably go a long way toward getting you more help.

People are too busy to rig up toy examples to test, but given a toy
example, it might be easy for someone to see where things are going
south.
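As an illustration of how small such an example can be: AUC computed directly from the Wilcoxon rank-sum identity gives a fixed reference number that any package's output (pROC, ROCR, ...) can be checked against:

```r
set.seed(7)
scores <- c(rnorm(50, mean = 1), rnorm(50))  # positives tend to score higher
labels <- rep(c(1, 0), each = 50)
r   <- rank(scores)
# AUC = (sum of positive ranks - n1*(n1+1)/2) / (n1*n2)
auc <- (sum(r[labels == 1]) - 50 * (50 + 1) / 2) / (50 * 50)
auc  # the fraction of positive/negative pairs ranked correctly
```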

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] Line numbers with errors and warnings?

2012-12-06 Thread Steve Lianoglou
On Thursday, December 6, 2012, Richard M. Heiberger wrote:

> Worik,
>
> Please look at the developer and debug packages that have been in ESS since
> summer 2012.  They are designed to help you maneuver through complicated
> code.
> Among other things, it allows you to insert breakpoints at the source code
> level.
>
> I suggest you start with the detailed discussion with examples
> http://code.google.com/p/ess-tracebug/
>
> and then move to the within-ESS documentation
> C-h i C-s ESS mouse-2 C-s Developing mouse-2
>



This is ... wow ... even better than the Konami Code:
 http://en.m.wikipedia.org/wiki/Konami_Code

:-)



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact




Re: [R] Line numbers with errors and warnings?

2012-12-05 Thread Steve Lianoglou
Hi,

On Thu, Dec 6, 2012 at 12:01 AM, Worik R  wrote:
>
>
>> If you `source("test.R", keep.source=FALSE)`, you will see that the
>> line number is not reported.
>>
>
> Not always.
>
> I have code that uses sapply to call another function and all I get back is
> the line of the sapply.

The function that is being called inside the sapply that throws the
error -- is it in a different package?

If you reinstall *that* package w/ `options(keep.source.pkg=TRUE)` (or
R_KEEP_PKG_SOURCE=yes in your environment if installing from the
command line; see ?options), does that help?
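
Concretely, the reinstall might look like this (with "somePkg" as a
stand-in for the actual package that throws the error):

```r
# Keep source references when installing, so tracebacks can report
# file/line info for functions inside the package. "somePkg" is a stand-in.
options(keep.source.pkg = TRUE)
install.packages("somePkg")

# Equivalent route via the environment, e.g. before R CMD INSTALL:
Sys.setenv(R_KEEP_PKG_SOURCE = "yes")
```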

If not -- could you provide, in a similar fashion as I did w/ the
BadPackage on github from an earlier message in this thread, an
example that recapitulates this "no-line-number-on-error" problem and
point out where/how it happens so we can also trigger it and see? ("I
have a function that does xxx" is hard for anybody else to help you
with).

Also, emacs/ess also has tracebug:

http://code.google.com/p/ess-tracebug/

which may be useful.

anyway ... if you are leaving R for greener pastures, do us a favor
and send us an email w/ an update if you find Nirvana in another
language ;-)

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] Using multicores in R

2012-12-03 Thread Steve Lianoglou
And also:

On Monday, December 3, 2012, Uwe Ligges wrote:

>
>
> On 03.12.2012 11:14, moriah wrote:
>
>> Hi,
>>
>> I have an R script which is time consuming because it has two nested loops
>> in it of at least 5000 iterations each, I have tried to use the multicore
>> package but it doesn't seem to improve the elapsed time of the script (a
>> shorter script for example) and I can't use the mcapply because of
>> technical reasons.
>>
>
> Errr, but otherwise multicore does not have an effect ...
>
> See package "parallel" that offers various functions for parallel
> computations. We cannot help much more if you do not tell us what the
> technical reasons are why mcapply() does not work.


If the work you are doing within each iteration of the loop is trivial, you
will likely even see a decrease in performance if you try to parallelize it.
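
As a toy illustration of that point (timings will vary by machine, but
the shape of the result usually won't):

```r
library(parallel)
x <- runif(1e5)

# Per-element work this trivial is dominated by fork/scheduling overhead,
# so the "parallel" version is often no faster -- and can be slower.
# (mclapply forks; on Windows use parLapply with a cluster instead.)
system.time(lapply(x, sqrt))
system.time(mclapply(x, sqrt, mc.cores = 2))

# Chunking the work coarsely amortizes that overhead much better:
chunks <- split(x, cut(seq_along(x), 2))
system.time(mclapply(chunks, function(ch) sqrt(ch), mc.cores = 2))
```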

Without more info from you regarding your problem, there's little we can do
to help, though.

 -Steve



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact




Re: [R] Line numbers with errors and warnings?

2012-12-02 Thread Steve Lianoglou
Similar to Duncan's example, if you have a script "test.R" which looks like so:

 start script =
a1 <- 1:10
a2 <- 101:122
plot(a1, a1)
plot(a1, a2)
 end script ==

You can source it one way:

R> source('test.R', keep.source=TRUE)
Error in xy.coords(x, y, xlabel, ylabel, log) :
  'x' and 'y' lengths differ
R> traceback()
8: stop("'x' and 'y' lengths differ")
7: xy.coords(x, y, xlabel, ylabel, log)
6: plot.default(a1, a2)
5: plot(a1, a2) at test.R#4  ### <- Error is on line 4
4: eval(expr, envir, enclos)
3: eval(ei, envir)
2: withVisible(eval(ei, envir))
1: source("test.R", keep.source = TRUE)

If you `source("test.R", keep.source=FALSE)`, you will see that the
line number is not reported.

Also:

R> library(devtools)
R> options(keep.source=TRUE)
R> install_github("BadPackage", "lianos")
R> plot.me(1:10)
Error in xy.coords(x, y, xlabel, ylabel, log) :
  'x' and 'y' lengths differ
R> traceback()
5: stop("'x' and 'y' lengths differ")
4: xy.coords(x, y, xlabel, ylabel, log)
3: plot.default(x, c(x, 0), pch = 16)
2: plot(x, c(x, 0), pch = 16) at test.R#4   <- BAM
1: plot.me(1:10)

HTH,
-steve

On Sun, Dec 2, 2012 at 5:08 PM, Duncan Murdoch  wrote:
> On 12-12-02 5:02 PM, John Sorkin wrote:
>>
>> Gentleman,
>> This thread has been of great interest. Perhaps I missed part of it, but
>> do far I have not seen an example of code that has line numbers that
>> demonstrates how one can (in some instances) recover the line number of
>> an error. Can I impose upon the people who contributed to this thread to
>> post example code? The question if very important, and the discussion
>> about solutions has been somewhat abstract to this point.
>
>
> From my post this morning:
>
>
>> For example, in Windows, if I put this code into the clipboard:
>>
>> f <- function() {
>>stop("this is the error")
>> }
>>
>> g <- function() {
>>f()
>> }
>>
>> g()
>>
>> then run source("clipboard") followed by traceback(), this is what I see:
>>
>>  > source("clipboard")
>> Error in f() (from clipboard#2) : this is the error
>>  > traceback()
>> 7: stop("this is the error") at clipboard#2
>> 6: f() at clipboard#6
>> 5: g() at clipboard#9
>> 4: eval(expr, envir, enclos)
>> 3: eval(ei, envir)
>> 2: withVisible(eval(ei, envir))
>> 1: source("clipboard")
>>
>> You can ignore entries 1 to 4; they are part of source().  Entries 5, 6,
>> and 7 each tell the line of the script where they were parsed.
>>
>> Duncan Murdoch
>
>
>
>> Thank you,
>> John
>>
>> John David Sorkin M.D., Ph.D.
>> Chief, Biostatistics and Informatics
>> University of Maryland School of Medicine Division of Gerontology
>> Baltimore VA Medical Center
>> 10 North Greene Street
>> GRECC (BT/18/GR)
>> Baltimore, MD 21201-1524
>> (Phone) 410-605-7119
>> (Fax) 410-605-7913 (Please call phone number above prior to faxing)
>> >>> Milan Bouchet-Valat  12/2/2012 4:00 PM >>>
>> On Sunday, 2 December 2012 at 14:21 -0500, Duncan Murdoch wrote:
>>  > On 12-12-02 9:52 AM, Milan Bouchet-Valat wrote:
>>  > > On Sunday, 2 December 2012 at 09:02 -0500, Duncan Murdoch wrote:
>>  > >> On 12-12-02 8:33 AM, Milan Bouchet-Valat wrote:
>>  > >>> On Sunday, 2 December 2012 at 06:02 -0500, Steve Lianoglou wrote:
>>  > >>>> Hi,
>>  > >>>>
>>  > >>>> On Sun, Dec 2, 2012 at 12:31 AM, Worik R  wrote:
>>  > >>>>> What I mean is how do I get the R compilation or execution
>> process to spit
>>  > >>>>> out a line number with errors and warnings?
>>  > >>> Indeed, I often suffer from the same problem when debugging R
>> code too.
>>  > >>> This is a real issue for me.
>>  > >>>
>>  > >>>> As Duncan mentioned already, you can't *always* get a line
>> number. You
>>  > >>>> can, however, usually get enough context around the failing call
>> for
>>  > >>>> you to be able to smoke the problem out.
>>  > >>> What are the cases where you cannot get line numbers? Duncan said
>>  > >>> source()ed code comes with line numbers, but what's the more
>> general
>>  > >>> rule?
>>  > >>
>

Re: [R] Line numbers with errors and warnings?

2012-12-02 Thread Steve Lianoglou
Hi,

On Sun, Dec 2, 2012 at 12:31 AM, Worik R  wrote:
> What I mean is how do I get the R compilation or execution process to spit
> out a line number with errors and warnings?

As Duncan mentioned already, you can't *always* get a line number. You
can, however, usually get enough context around the failing call for
you to be able to smoke the problem out.

> option(error=browser) is a help.  But it still does not say what piece of
> code caused the error.

I typically run with a slightly different setting:

R> options(error = utils::dump.frames)

Whenever my script throws an error, after I'm done cursing at it I
then wonder where this error happened, so I call:

R> traceback()

And you'll see the details of the stack that just blew up, starting
(or ending, can't remember) with the call itself, then the parent
call, and its parent, etc. all the way up to the top most call (likely
the line in your script itself).

If that's not enough information for me to figure out how to fix the
code in my script, I'll then call:

R> debugger()

and this will then give me (more or less) the same information that
`traceback` showed (but in reverse order (which is why I never
remember the order of traceback)) and you are asked at what point
you'd like to enter the exploded wreckage to explore (via picking a
number) ... this way you can poke at the local variables until you see
what went wrong.
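
To make that workflow concrete, here's a self-contained toy you can
paste into a session (not from the original thread):

```r
# Post-mortem debugging sketch: dump the frames on error, then inspect.
options(error = utils::dump.frames)

f <- function(v) v[[20]]           # will fail: subscript out of bounds
g <- function() f(as.list(1:3))
g()

traceback()   # the stack that blew up, innermost call first
debugger()    # same frames; pick a number to browse that frame's locals
```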

Your error:

Error in `[.xts`(x, xsubset) : subscript out of bounds

Is suggesting that you are trying to index an `xts` object with an
illegal value -- can you find the part in your code that's trying to
do this in your own script? You can put a call to `browser()` before
that part and explore the value of the subscript vs. the length of
your xts object to see what the problem is.

If you can't find this point, then take the traceback/debugger route.

> This is costing me a lot of time chasing down errors in mine and others
> code...

... which is typical when you're wading in uncharted territory. As you
get a better feel for how to resolve these issues, your time-to-fix
these things will get better, so ... stay strong.

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] GLM Coding Issue

2012-11-27 Thread Steve Lianoglou
Hi Peter,

On Tue, Nov 27, 2012 at 8:05 PM, Peter Ehlers  wrote:
>
> On 2012-11-27 14:34, Steve Lianoglou wrote:
[snip]
> Steve:
> re a matrix response: see MASS (the book, 4ed) page 191; also found
> in the ch07.R file in the /library/MASS/scripts folder. I seem to
> recall that this is mentioned somewhere in the docs, but can't put my
> finger on it now.

Indeed -- thanks for the pointer.

Well then ... let me try to recover and at least offer some decent
advice to Craig's original post.

I'd still recommend avoiding the use of `attach` and favor passing in
your data.frame using the `data` param of the call to `glm`.

I'd also call the `avoid` data.frame something else, to avoid
potential confusion (if only in your mind) with the column `avoid`
inside the `avoid` data.frame.

Anyhow, when you pass in your data as param, things work ok:

R> model1 <- glm(cbind(avoid, noavoid) ~ treatment, binomial, avoid)
R> summary(model1)

Call:
glm(formula = cbind(avoid, noavoid) ~ treatment, family = binomial,
data = avoid)

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-3.6480  -1.3350   0.4297   1.5706   2.6200

...

OK ... well, hope that helped somewhat.

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] GLM Coding Issue

2012-11-27 Thread Steve Lianoglou
Hi,

On Tuesday, November 27, 2012, David Winsemius wrote:
[snip]

> `cbind`-ing doesn't make much sense here. What is your target (y)
>> variable here? are you trying to predict `avoid` or `noavoid` status?
>>
>
> Sorry, Steve. It does make sense. See :
>
> ?glm  # First paragraph of Details.


Indeed ... I've tried to send a follow-up email salvaging my bad call with
some hopefully useful tidbits, but it "matched some headers" and is stuck
in the mailman queue. It might come through eventually.

Don't be sorry, though ... I learned something new :-)

Still, I do apologize for the flawed advice re: the cbind-ing thing.

-Steve



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact




Re: [R] GLM Coding Issue

2012-11-27 Thread Steve Lianoglou
Hi,

Comments inline:

On Tue, Nov 27, 2012 at 1:00 PM, Craig P O'Connell
 wrote:
>
>
> Dear all,
>
>I am having a recurring problem when I attempt to conduct a GLM.  Here is 
> what I am attempting (with fake data):
> First, I created a txt file, changed the directory in R (to the proper folder 
> containing the file) and loaded the file:
>
> #avoid<-read.table("avoid.txt",header=TRUE);avoid
> #  treatment feeding avoid noavoid
> #1   control  nofeed     1     357
> #2   control    feed     2     292
> #3   control     sat     4     186
> #4      proc  nofeed    15     291
> #5      proc    feed    25     288
> #6      proc     sat    17     140
> #7       mag  nofeed    87     224
> #8       mag    feed    34     229
> #9       mag     sat    46     151
>
> I then try to "attach(avoid)" the data, but continue to get an error message 
> ( The following object(s) are masked _by_ .GlobalEnv :), so to fix this, I do 
> the following:
>
> #newavoid<-avoid
> #newavoid(does this do anything?)

It essentially makes a copy of `avoid` to `newavoid` -- what did you
want it to do?

That having been said, a good rule of thumb is to never use `attach`,
so let's avoid it for now.

> Lastly, I have several GLM's I wanted to conduct.  Please see the following:
>
> #model1<-glm(cbind(avoid, noavoid)~treatment,data=,family=binomial)
>
> #model2=glm(cbind(avoid, noavoid)~feeding, familiy=binomial)
>
> #model3=glm(cbind(avoid, noavoid)~treatment+feeding, familiy=binomial)

`cbind`-ing doesn't make much sense here. What is your target (y)
variable here? are you trying to predict `avoid` or `noavoid` status?

Let's assume you were "predicting" `noavoid` from just `treatment` and
`feeding` (I guess you have more data (rows) than you show), you would
build a model like so:

R> model <- glm(noavoid ~ treatment + feeding, binomial, avoid)

Or to be explicit about the parameters:

R> model <- glm(noavoid ~ treatment + feeding, family=binomial, data=avoid)


> It would be greatly appreciated if somebody can help me with my coding, as 
> you can see I am a novice but doing my best to learn.  I figured if I can get 
> model1 to run, I should be able to figure out the rest of my models.

Since you're just getting started, maybe it would be helpful for
people writing documentation/tutorials/whatever what needs to be
explained better.

For instance, I'm curious why you thought to `cbind` in your first glm
call, which was:

model1<-glm(cbind(avoid, noavoid)~treatment,data=,family=binomial)

What did you think `cbind`-ing was accomplishing for you? Is there an
example somewhere that's doing that as the first parameter to a `glm`
call?

Also, why just have `data=`?

I'm not criticizing, just trying to better understand.

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] aggregate() runs out of memory

2012-11-27 Thread Steve Lianoglou
Hi,

On Tue, Nov 27, 2012 at 11:29 AM, Sam Steingold  wrote:
>> * Steve Lianoglou  [2012-11-26 19:47:25 
>> -0500]:
[snip]
>> It just occurred to me that this is even better:
>>
>> R> setkeyv(f, c("share.id", "delay"))
>> R> result <- f[,  list(min=delay[1L], max=delay[.N], count=.N,
>> country=country[1L]), by="share.id"]
>>
>
> this assumes that delays are sorted (like in my example)
> which, in reality, they are not.
> thanks for your help!

When you include "delay" in the call to `setkeyv` as I did above, it
sorts low to high w/in each "share.id" group.

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] aggregate() runs out of memory

2012-11-26 Thread Steve Lianoglou
On Monday, November 26, 2012, Sam Steingold wrote:
[snip]

>
> there is precisely one country for each id.
> i.e., unique(country) is the same as country[1].
> thanks a lot for the suggestion!
>
> > R> result <- f[, list(min=min(delay), max=max(delay),
> > count=.N,country=country[1L]), by="share.id"]


And is it performant?

It just occurred to me that this is even better:

R> setkeyv(f, c("share.id", "delay"))
R> result <- f[,  list(min=delay[1L], max=delay[.N], count=.N,
country=country[1L]), by="share.id"]



> --
> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X
> 11.0.11103000
> http://www.childpsy.net/ http://thereligionofpeace.com http://pmw.org.il
> http://honestreporting.com http://americancensorship.org
> Why do you never call me back after I scream that I will never talk to you
> again?!
>


-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact




Re: [R] aggregate() runs out of memory

2012-11-26 Thread Steve Lianoglou
Hi,

On Mon, Nov 26, 2012 at 4:57 PM, Sam Steingold  wrote:
[snip]
>> Could you please copy paste the output of `(head(infl, 20))` as
>> well as an approximation of what the result is that you want.

Don't know how "dput" got clipped in your reply from the quoted text I
wrote, but I actually asked for `dput(head(infl, 20))`

The dput makes a world of difference because I can easily copy/paste
the output into R and get a working table.

> this prints all the levels for all the factor columns and takes
> megabytes.

Try using droplevels, eg:

R> dput(droplevels(head(infl, 20)))


> --8<---cut here---start->8---
>> f <- data.frame(id=rep(1:3,4),country=rep(6:8,4),delay=1:12)
>> f
>    id country delay
> 1   1       6     1
> 2   2       7     2
> 3   3       8     3
> 4   1       6     4
> 5   2       7     5
> 6   3       8     6
> 7   1       6     7
> 8   2       7     8
> 9   3       8     9
> 10  1       6    10
> 11  2       7    11
> 12  3       8    12
>> f <- as.data.table(f)
>> setkey(f,id)
>> delays <- 
>> f[,list(min=min(delay),max=max(delay),count=.N,country=unique(country)),by="id"]
>> delays
>    id min max count country
> 1:  1   1  10     4       6
> 2:  2   2  11     4       7
> 3:  3   3  12     4       8
> --8<---cut here---end--->8---
>
> this is still too slow, apparently because of unique.
> how do I speed it up?

I think I'm missing something.

Your call to `min(delay)` and `max(delay)` will return the minimum and
maximum delays within the particular "id" you are grouping by. I guess
there must be several values for "country" within each "id" group --
do you really want the same min and max values to be replicated as
many times as there are unique "country"s?

Do you perhaps want to iterate over a combo of id and country?

Anyway: if you don't use `unique` inside your calculation, I guess it
goes significantly faster, like so:

R> result <- f[, list(min=min(delay), max=max(delay),
count=.N,country=country[1L]), by="share.id"]

If that's bearable, and you really want the way you suggest (or, at
least, what I'm interpreting), I wonder if this two-step would be
faster?

R> setkeyv(f, c('share.id', 'country'))
R> r1 <- f[, list(min=min(delay), max=max(delay), count=.N), by='share.id']
R> result <- unique(f)[r1]  ## I think

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] aggregate() runs out of memory

2012-11-26 Thread Steve Lianoglou
Hi Sam,

On Mon, Nov 26, 2012 at 3:13 PM, Sam Steingold  wrote:
> Hi,
>
>> * Steve Lianoglou  [2012-11-19 13:30:03 
>> -0800]:
>>
>> For instance, if you want the min and max of `delay` within each group
>> defined by `share.id`, and let's assume `infl` is a data.frame, you
>> can do something like so:
>>
>> R> infl <- as.data.table(infl)
>> R> setkey(infl, share.id)
>> R> result <- infl[, list(min=min(delay), max=max(delay)), by="share.id"]
>
> perfect, thanks.
> alas, the resulting table does not contain the share.id column.
> do I need to add something like "id=unique(share.id)" to the list?
> also, if there is a field in the original table infl which only depends
> on share.id, how do I add this unique value to the summary?
> it appears that "count=unique(country)" in list() does what I need, but
> it slows down the process.

Hmm ... I think it should be there, but I'm having a hard time
remembering what you want.

Could you please copy paste the output of `dput(head(infl, 20))` as
well as an approximation of what the result is that you want.

It will make it easier for us to talk more concretely about how to get
what you want.

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] github

2012-11-19 Thread Steve Lianoglou
Why don't you try clicking on the "Help" link at the top of their site?

You can even google "github for dummies" to great success as well ...

-steve

On Mon, Nov 19, 2012 at 6:07 PM, Muhuri, Pradip (SAMHSA/CBHSQ)
 wrote:
>
> Hello,
>
> I would like to learn how to set up Github/repository and upload/update files 
> and am looking for "Github for Dummies".  Any help will be appreciated.
>
> Thanks,
>
> Pradip
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] aggregate() runs out of memory

2012-11-19 Thread Steve Lianoglou
Hi,

On Mon, Nov 19, 2012 at 1:25 PM, Sam Steingold  wrote:
> Thanks Steve,
> what is the analogue of .N for min and max?
> i.e., what is the data.table's version of
> aggregate(infl$delay,by=list(infl$share.id),FUN=min)
> aggregate(infl$delay,by=list(infl$share.id),FUN=max)
> thanks!

It would be helpful if I could see a bit of your table (like
`head(infl)`, if it's not too big), but anyway: there is no real
analogue of min/max -- you just use them.

For instance, if you want the min and max of `delay` within each group
defined by `share.id`, and let's assume `infl` is a data.frame, you
can do something like so:

R> infl <- as.data.table(infl)
R> setkey(infl, share.id)
R> result <- infl[, list(min=min(delay), max=max(delay)), by="share.id"]

HTH,

-steve


-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] Stepwise regression scope: all interacting terms (.^2)

2012-11-16 Thread Steve Lianoglou
Hi Mark,

To put some context to David's response below, you can search the list
archives for times when people ask about stepwise regression. You can
get started here:

http://search.gmane.org/search.php?group=gmane.comp.lang.r.general&query=stepwise+penalized

The long and short of it is that you are almost always encouraged to
use some regularization/penalized model instead of this stepwise
approach. Frank Harrell, in particular, is generally quite vocal
against stepwise regression -- I'm actually surprised he hasn't chimed
in by now, but maybe he's getting a bit tired of fighting the good
fight -- or, it's close to the holiday and he's taking a break ;-)

Anyway ... HTH,

-steve

On Fri, Nov 16, 2012 at 4:13 PM, David Winsemius  wrote:
>
> On Nov 16, 2012, at 12:16 PM, Mark Ebbert wrote:
>
>> I haven't heard anything on this question. Is there something fundamentally 
>> wrong with my question? Any feedback is appreciated.
>>
>
> Perhaps failure to read this sig at the bottom of every posted message to 
> rhelp?
>
> "PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code."
>
>
>> Mark
>> On Nov 15, 2012, at 8:13 AM, Mark T. W. Ebbert wrote:
>>
>>> Dear Gurus,
>>>
>>> Thank you in advance for your assistance. I'm trying to understand scope 
>>> better when performing stepwise regression using "step."
>
> From the help page of step:
> "If scope is a single formula, it specifies the upper component, and the 
> lower model is empty. "
>
>>> I have a model with a binary response variable and 10 predictor variables. 
>>> When I perform stepwise regression I define scope=.^2 to allow interactions 
>>> between all terms.
>
> I generally avoid answering questions about stepwise regression, because most 
> of them do not include sufficient background material to justify that 
> strategy. Yours certainly did not.
>
>
>>> But I am missing something. When I perform stepwise regression (both 
>>> directions) on the main model (y~x1+x2+…+x10) the method returns quickly 
>>> with an answer; however, when I define all interactions in the main model 
>>> (y~x1+x2+…+x10+x1:x2+x1:x3+…) and then perform stepwise regression 
>>> (backward only) it runs so long I have to kill it.
>>>
>>> So here's my question: what is the difference between scope=.^2 on the 
>>> additive (proper term?) model and defining all interactions and doing 
>>> backward regression? My understanding is that .^2 is supposed to allow all 
>>> interactions!
>
> Well, I would have guessed all two-way interactions (all 45 of them in your 
> case) would be included and then successively reduced until you got to your 
> specified (arbitrary and most likely incorrectly set) endpoint. I think the 
> help page Details section is unclear on this point. I do not think that the 
> 120 potential three-way interactions are part of the scope in that instance, 
> but it should be easy enough for you to test that possibility.
>
> --
> David Winsemius, MD
> Alameda, CA, USA
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] history() does not work?

2012-10-10 Thread Steve Lianoglou
On Wed, Oct 10, 2012 at 12:52 PM, Christian Hoffmann
 wrote:
>
> Am 10.10.12 18:17, schrieb Steve Lianoglou:
>
>> Hi,
>>
>> On Wed, Oct 10, 2012 at 12:03 PM, Christian Hoffmann
>>  wrote:
>>>
>>> Hi,
>>>
>>>> history()
>>>
>>> gives Error in savehistory(file) : no history available to save
>>>
>>> although I can scroll through history with C^uparrow and C^downarrow.
>>>
>>> How can I make history() work and/or show the current history in a file,
>>> so
>>> that I can choose from previous commands?
>>>
>>> The web did not throw up anything useful.
>>
>> Out of curiosity, when you call:
>>
>> R> capabilities()['cledit']
>>
>> Do you get `FALSE`?
>>
> Yes, I do. So what do you suggest?

This suggests that R is running w/o readline support enabled -- which
is necessary for `history()` to work.

I don't really know much about R running on windows, so I'm not sure
if what you are seeing is normal or strange.

I don't see any mention of missing `history()` support in the R for Windows FAQ:
http://cran.r-project.org/bin/windows/base/rw-FAQ.html

So my *guess* is that you should be able to get this to work rather easily.

How are you running R? I mean, is it some R GUI? Rstudio? Emacs/ESS
(in which case, it is run w/o readline)? Rcmdr?

I'm just stabbing in the dark right now -- I'm sure a Windows useR
will swoop in soon w/ the right stuff if we don't sort it out this
way.

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] history() does not work?

2012-10-10 Thread Steve Lianoglou
Hi,

On Wed, Oct 10, 2012 at 12:03 PM, Christian Hoffmann
 wrote:
> Hi,
>
>> history()
>
> gives Error in savehistory(file) : no history available to save
>
> although I can scroll through history with C^up-arrow and C^down-arrow.
>
> How can I make history() work and/or show the current history in a file, so
> that I can choose from previous commands?
>
> The web did not throw up anything useful.

Out of curiosity, when you call:

R> capabilities()['cledit']

Do you get `FALSE`?

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] Simple - Finding vector in a vector

2012-10-08 Thread Steve Lianoglou
Ugh, typo:

On Mon, Oct 8, 2012 at 10:04 AM, Steve Lianoglou
 wrote:

> R> x <- c(NA,  1, NA,  1,  1,  1,  1,  1,  1, NA,  1)
> R> e <- embed(x, e) ## Take a look at this matrix
> R> r <- apply(e, 1, rle)
> R> sapply(r, function(rr) rr$lengths[1])
> ## [1] 1 1 2 3 3 3 3 1 1

The 2nd param to embed should be 3, so:

R> x <- c(NA,  1, NA,  1,  1,  1,  1,  1,  1, NA,  1)
R> e <- embed(x, 3) ## Take a look at this matrix
R> r <- apply(e, 1, rle)
R> sapply(r, function(rr) rr$lengths[1])
## [1] 1 1 2 3 3 3 3 1 1

Sorry for the confusion ... e and 3 are so close ;-)

-st3v3

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] Simple - Finding vector in a vector

2012-10-08 Thread Steve Lianoglou
Hi Mike,

On Mon, Oct 8, 2012 at 9:38 AM, Mike Spam  wrote:
> Sorry, I just realized that it outputs the sum of all vectors. I can
> work with this function, but it would be much faster and easier if it
> were possible to get the positions of every match.
>
> example:
>
> NA  1 NA  1  1  1  1  1  1 NA  1
>
> rle returns
> lengths: int [1:6] 1 1 1 6 1 1
>
> what i need would be something like,
> 1 1 1 3 3 3 3 1 1

Somewhat peculiar ;-)

This gets you somewhat close -- but I think this must be what you mean,
so ... let's see:

R> x <- c(NA,  1, NA,  1,  1,  1,  1,  1,  1, NA,  1)
R> e <- embed(x, e) ## Take a look at this matrix
R> r <- apply(e, 1, rle)
R> sapply(r, function(rr) rr$lengths[1])
## [1] 1 1 2 3 3 3 3 1 1

If your input vector (`x` here) is large, the call to `embed` may be painful.
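One way around that (a base-R sketch; `k` below is the window width, an assumption on my part) is to take each reversed window directly, the same window `embed()` builds row by row, without materializing the whole matrix:

```r
x <- c(NA, 1, NA, 1, 1, 1, 1, 1, 1, NA, 1)
k <- 3

# Each embed(x, k) row is the reversed window x[(i+k-1):i]; take the
# first run length of each window without building the full matrix
out <- vapply(seq_len(length(x) - k + 1), function(i) {
  rle(x[(i + k - 1):i])$lengths[1]
}, integer(1))
out  # 1 1 2 3 3 3 3 1 1
```

Memory use stays O(k) per step instead of O(n*k) for the `embed` matrix, at the cost of an R-level loop.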

HTH,

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] R: machine for moderately large data

2012-10-05 Thread Steve Lianoglou
Hi,

On Fri, Oct 5, 2012 at 1:41 PM, Ista Zahn  wrote:
> On Fri, Oct 5, 2012 at 12:09 PM, PIKAL Petr  wrote:
[snip]
>> If I compute correctly, such a big matrix (20e6*1000) needs about 160 GB 
>> just to be in memory. Are you prepared for this?
>
> This is not as outrageous as one might think -- you can get a mac pro
> with 32 gigs of memory for around $3,500


And even so, I suspect the matrices that will be worked with are
sparse, so you can get more savings there (although I'm not sure which
of the packages the OP had listed work with sparse input).

That having been said, if you don't want to sample from your data,
sometimes R isn't the best solution.

There are projects being developed to specifically deal with such big data.

For one, you might consider looking at the graphlab/graphchi stuff:

http://graphlab.org

(Graphchi is meant to process big data on a "modest" machine). If you
go to the "Toolkits" menu, you'll see they have an implementation of
kmeans++ clustering that might be suitable for your clustering
analysis (perhaps some matrix factorizations are useful here, too --
perhaps your "market basket" data can be viewed as some type of
collaborative filtering problem, in which case their collaborative
filtering toolkit is right up your alley ;-)

The OP also mentioned classification trees. Perhaps rf-ace might be useful:
http://code.google.com/p/rf-ace/

From their website:

"""
RF-ACE implements both Random Forest (RF) and Gradient Boosting Tree
(GBT) algorithms, and is strongly related to ACE, originally outlined
in http://jmlr.csail.mit.edu/papers/volume10/tuv09a/tuv09a.pdf
"""

If you scroll down to the "case study" section of their main page, you
can see there is some talk about how they used this in a distributed
manner ... perhaps it is applicable in your case as well (in which
case you might be able to rig up AWS to help you).

HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] How to write R package

2012-09-27 Thread Steve Lianoglou
On Thu, Sep 27, 2012 at 5:15 PM, Dr. Alireza Zolfaghari
 wrote:
> Hi List,
> Would you please send me a good link to talk me through how to write an R
> package?

There are many, many, many resources:
http://lmgtfy.com/?q=writing+r+packages+tutorial

Take the first hit.

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] aggregate() runs out of memory

2012-09-14 Thread Steve Lianoglou
Hi,

On Fri, Sep 14, 2012 at 4:26 PM, Dennis Murphy  wrote:
> Hi:
>
> This should give you some idea of what Steve is talking about:
>
> library(data.table)
> dt <- data.table(x = sample(10, 1000, replace = TRUE),
>   y = rnorm(1000), key = "x")
> dt[, .N, by = x]
> system.time(dt[, .N, by = x])
>
> ...on my system, dual core 8Gb RAM running Win7 64-bit,
>> system.time(dt[, .N, by = x])
>    user  system elapsed
>    0.12    0.02    0.14
>
> .N is an optimized function to find the number of rows of each data subset.
> Much faster than aggregate(). It might take a little longer because you
> have more columns that suck up space, but you get the idea. It's also about
> 5-6 times faster if you set a key variable in the data table than if you
> don't.

Well done, sir! (slight critique in that .N isn't a function, it's
just a variable that is constantly reset within each by-subset/group)

Also, don't forget to use the .SDcols parameter in [.data.table if you
plan on using only a subset of the columns inside your "by" stuff.

There's lots of documentation in the package (`?data.table`) and the
vignettes/FAQ to help you tweak your usage, if you decide to take the
data.table route.
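For instance (a small sketch, assuming data.table is installed; the column names are made up):

```r
library(data.table)

dt <- data.table(grp = rep(c("a", "b"), 5), y = 1:10, z = 10:1)

dt[, .N, by = grp]                                # .N: row count per group
dt[, lapply(.SD, sum), by = grp, .SDcols = "y"]   # restrict .SD to column y
```

Restricting `.SDcols` avoids copying untouched columns into each group's `.SD`.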

HTH,
-steve

>
> Dennis
>
> On Fri, Sep 14, 2012 at 12:26 PM, Sam Steingold  wrote:
>
>> I have a large data.frame Z (2,424,185,944 bytes, 10,256,441 rows, 17
>> columns).
>> I want to get the result of
>> table(aggregate(Z$V1, FUN = length, by = list(id=Z$V2))$x)
>> alas, aggregate has been running for ~30 minutes, RSS is 14G, VIRT is
>> 24.3G, and no end in sight.
>> both V1 and V2 are characters (not factors).
>> Is there anything I could do to speed this up?
>> Thanks.
>>
>> --
>> Sam Steingold (http://sds.podval.org/) on Ubuntu 12.04 (precise) X
>> 11.0.11103000
>> http://www.childpsy.net/ http://www.PetitionOnline.com/tap12009/
>> http://dhimmi.com http://think-israel.org http://iris.org.il
>> WinWord 6.0 UNinstall: Not enough disk space to uninstall WinWord
>>
>>
>
> [[alternative HTML version deleted]]
>
> ______
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] $ operator is invalid for atomic vectors

2012-09-14 Thread Steve Lianoglou
Hi,

On Fri, Sep 14, 2012 at 2:33 PM, agrins  wrote:
> HI all-
>
> I have used this .fun in S+ without a problem however, in R when I run this
> code to generate multiple graphs:
>
> trendplot<-function(datafr,dataf2, abbrev="", titlestr="",
> devname="s",filen="",styr=1990,endyr=2012) {
> if (!is.null(dev.list())) {dev.off()}
>
> dataf<-datafr[datafr$abbrev==abbrev,]   #subset entire 
> dataset with one
> species at a time
> dataf2sp<-dataf2[dataf2$abbrev==abbrev,]  # etc...
>
> It returns  "Error in dataf2$abbrev : $ operator is invalid for atomic
> vectors"
>
> Is there an easy fix for this error?

I suspect you just have to ensure that the thing you are passing in to
the `dataf2` parameter is in fact a data.frame -- the error you are
getting suggests that it is currently not.
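A tiny illustration of the difference (the names here are made up):

```r
df <- data.frame(abbrev = c("x", "y"))
df$abbrev            # fine: $ works on data.frames (and lists)

v <- c(abbrev = 1)   # a *named atomic vector*, not a data.frame
# v$abbrev           # would raise: $ operator is invalid for atomic vectors
v[["abbrev"]]        # use [[ ]] for atomic vectors instead
```

So the error means `dataf2` arrived as something like a vector (perhaps a single column was extracted somewhere upstream) rather than a data.frame.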

Also -- have no fear for the space bar, it is your friend ;-)

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] aggregate() runs out of memory

2012-09-14 Thread Steve Lianoglou
Hi,

On Fri, Sep 14, 2012 at 3:26 PM, Sam Steingold  wrote:
> I have a large data.frame Z (2,424,185,944 bytes, 10,256,441 rows, 17 
> columns).
> I want to get the result of
> table(aggregate(Z$V1, FUN = length, by = list(id=Z$V2))$x)
> alas, aggregate has been running for ~30 minutes, RSS is 14G, VIRT is
> 24.3G, and no end in sight.
> both V1 and V2 are characters (not factors).
> Is there anything I could do to speed this up?
> Thanks.

You might find you'll get a lot of mileage out of data.table when
working with such large data.frames ...

To get something close to what you're after, you can try:

R> library(data.table)
R> Z <- as.data.table(Z)
R> setkeyv(Z, 'V2')
R> agg <- Z[, list(count=.N), by='V2']

From here you might

R> tab1 <- table(agg$count)

I think that'll get you where you want to be ... I'm ashamed to say
that I haven't really done much with aggregate, since I've mostly used
plyr- and data.table-like tools, so I might be missing your end goal --
a reproducible example with a small data.frame from you would
help here (for me at least).
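For reference, the same distribution of group sizes can also be had in base R with nested `table()` calls, which is far cheaper than aggregate (a sketch on a toy frame; note `table()` drops NAs by default, an assumption worth checking on the real data):

```r
Z <- data.frame(V1 = letters[1:6],
                V2 = c("a", "a", "b", "b", "b", "c"),
                stringsAsFactors = FALSE)

# counts per V2 value, then the distribution of those counts --
# equivalent to table(aggregate(Z$V1, FUN = length, by = list(id = Z$V2))$x)
table(table(Z$V2))
```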

HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] calcular SVD de una matriz que no entra en memoria

2012-09-08 Thread Steve Lianoglou
Hi,

2012/9/8 federico ferreyra :
> Hola como andan? estuve leyendo un poco la documentacion de la libreria 
> BigMemory y no me quedan claro algunas cosas, le planteo mi problema, tengo 
> un matriz en disco que pesa 2gb con coeficientes floats, solo numeros, y 
> tengo que hallar la descomposicion en valores singulares de esta matriz, 
> basicamente la pregunta seria: como hacer para leer la matriz, calcular su 
> DVS y como escribirla en disco? muchas gracias.

Quick aside: You'd get better help if you post your question in English.

That having been said, if my high-school spanish doesn't fail me, it
sounds like you want to run an SVD on a particularly huge matrix where
doing so in-memory will be prohibitive (or impossible).

As far as I know, there are no facilities (packages) to do this in R
just yet, but you might look at the graphchi project:

http://bickson.blogspot.com/2012/08/collaborative-filtering-with-graphchi.html

Again -- it's not an R package, but if you are in a bind, you can
likely build and run the SVD on your machine.

I think it'd be quite handy to wrap that library's functionality in R,
and if no one beats me to it I think I'll eventually do that myself.

Hope that helps,

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] latex \subfloat{} incompatible with sweave/knitr code

2012-08-29 Thread Steve Lianoglou
Hi,

On Wed, Aug 29, 2012 at 6:56 AM, Liviu Andronic  wrote:
> Dear all
> Are LaTeX \subfloat{} commands incompatible with Sweave code? I cannot
> get the following code to compile properly:
> \begin{table}
> \subfloat[asdfa]{<<>>=
> 2+2
> @
>
> }
>
> \caption{asdf}
>
> \end{table}
>
>
> If I replace the Sweave chunk with a random string or a table, the
> compilation works fine. Any ideas what happens? I hit the same trouble
> when running the code chunks through knitr.

This isn't exactly what you want, but I'm using knitr and building and
saving my figures in their own "chunks", then just inlining the
path to the generated figure in the \subfloat{...}. Things are working
fine, e.g. my default settings are to suppress chunk echo/output,
generate pdf figures, and set fig.path='figs/gen-', so:

<<someFig>>=
plot(1:10, 1:10, ...)
@

\begin{figure}[...]
...
  \subfloat[some][caption]{
\includegraphics[...]{figs/gen-someFig.pdf}
  }
...
\end{figure}

does the trick for me.

HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] A LaTeX question -- Hope people won't mind

2012-08-20 Thread Steve Lianoglou
Hi Paul,

On Mon, Aug 20, 2012 at 4:13 PM, Paul Miller  wrote:
> Hello All,
>
> Hope people won't mind my posting a LaTeX question here. I know a lot of 
> people who use R are also using LaTeX. I'm in a bit of a rush to complete a 
> document and am having trouble with one aspect of the formatting.

You might get better help here:

http://tex.stackexchange.com/

But:

> I'm creating a list of tables using:
>
> \listoftables
>
> I also have some table captions that contain the number of patients in an 
> analysis, like:
>
> \caption{Results for Random Forest Model Using Scoring Data (N = 700)}
>
> The tables look great. Trouble is that LaTeX inserts the "(N = 700)" into the 
> text in the List of Tables at the beginning of the document. I'd prefer that 
> it not do so.

I think you can do something like:

\caption[Short caption for listoftables]{Longer caption for body of text ...}

HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] Opinion: Why I find factors convenient to use

2012-08-17 Thread Steve Lianoglou
Hi,

On Fri, Aug 17, 2012 at 1:58 PM, Jeff Newmiller
 wrote:
> I don't know if my recent post on this prompted your post, but I don't see 
> much to argue with in your discussion. I find factors to be useful for 
> managing display and some kinds of analysis.
>
> However, I find them mostly a handicap when importing, merging, and handling 
> data QC. Therefore I delay conversion until late in the game... but usually I 
> do eventually convert in most cases.

Agreed here -- I actually haven't been tuned into any such recent
conversation (if there was one), but if I were a gambling man, I'd bet
that the majority of the problems people have with factors can
probably be boiled down to the fact that the default value for
stringsAsFactors is TRUE.
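The usual workaround, plus a peek at what a factor actually stores (a small sketch; note that since R 4.0.0 the default flipped to `stringsAsFactors = FALSE`):

```r
# opt out of the historical default explicitly
df <- data.frame(s = c("a", "b", "a"), stringsAsFactors = FALSE)
class(df$s)     # "character"

f <- factor(c("a", "b", "a"))
typeof(f)       # "integer" -- codes plus a levels attribute
levels(f)       # "a" "b"
as.integer(f)   # 1 2 1
```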

I like factors -- that said, I am annoyed by them at times, but I
still like them.

Also, Bert mentioned that he thinks they save space over characters --
I believe that this is no longer true, but I'm not certain.

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] r data structures

2012-08-16 Thread Steve Lianoglou
Hi,

On Thu, Aug 16, 2012 at 2:49 PM, Schumacher, Jay S  wrote:
>
>
> hi,
>   i'm trying to understand r data structures.  i see that vectors, matrix, 
> factors and arrays have a "dimension."

Out of curiosity, where do you "see" that vectors and factors have a
dimension? I mean -- I guess they're one dimensional, but ...

>   there seems to be no mention of dimensionality anywhere for lists or 
> dataframes.  can i consider lists and frames to be of fixed dimension 2?

data.frames: sure, I guess
lists: no

What would you consider the dimension of this list to be:

x = list(a=1:10, b='hello', c=matrix(1:100, nrow=10))
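Checking with dim()/length() makes the point concrete (a small sketch):

```r
x <- list(a = 1:10, b = "hello", c = matrix(1:100, nrow = 10))
dim(x)       # NULL -- lists have a length, not a dimension
length(x)    # 3

d <- data.frame(a = 1:3, b = letters[1:3])
dim(d)       # 3 2 -- data.frames do behave two-dimensionally
```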

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] extended R

2012-07-29 Thread Steve Lianoglou
Hi,

On Sun, Jul 29, 2012 at 5:58 AM, hony  wrote:
> I need to create some software and use the R engine in it, but I don't know
> how to use R code in my software the same way www.revolutionanalytics.com does.

If your software is written in C++ perhaps the RInside package will be helpful:

http://cran.r-project.org/web/packages/RInside/index.html

Keep in mind that R is released under the GPL, so depending on what
your "end game" is for your software (are you making a commercial
product?), this may or may not be an issue for you.

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] Help for Fisher's exact test

2012-07-16 Thread Steve Lianoglou
>> whether there is a significant difference between WT and
>> mt of each gene in column A by calculating the p-value.  I will appreciate
>> it very much if you can send me the script of the test in R.  Thank you
>> very much for your time.
>> >
>> >Best,
>> >Feng
>> >
>> >
>> >On Sun, Jul 15, 2012 at 1:28 PM, arun  wrote:
>> >
>> >Hi,
>> >>
>> >>These links might be useful for you:
>> >>
>> >>https://stat.ethz.ch/pipermail/r-help/2009-July/204926.html
>> >>
>> http://stat.ethz.ch/R-manual/R-patched/library/stats/html/fisher.test.html
>> >>
>> >>A.K.
>> >>
>> >>
>> >>
>> >>
>> >>- Original Message -
>> >>From: Guanfeng Wang 
>> >>To: r-help@r-project.org
>> >>Cc:
>> >>Sent: Saturday, July 14, 2012 5:05 PM
>> >>Subject: [R] Help for Fisher's exact test
>> >>
>> >>Hi, R-help,
>> >>I have a group of data from RNA-seq want to be analyzed by  Fisher's
>> >>exact test in R. I want to compare the significant difference of about
>> >>30, individuals in two different samples, and I have no idea how to
>> >>use
>> >>R, so could you  please give me some suggestions or the scripts for
>> >>Fisher's exact test? Thank you very much.
>> >>
>> >>Best,
>> >>Guanfeng Wang
>> >>
>> >>[[alternative HTML version deleted]]
>> >>
>> >>__
>> >>R-help@r-project.org mailing list
>> >>https://stat.ethz.ch/mailman/listinfo/r-help
>> >>PLEASE do read the posting guide
>> http://www.r-project.org/posting-guide.html
>> >>and provide commented, minimal, self-contained, reproducible code.
>> >>
>> >>
>> >
>>
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
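For the archive, a minimal per-gene sketch (the counts below are made up for illustration; the real WT/mt read counts would come from the RNA-seq table):

```r
# 2x2 table for one gene: reads mapping to the gene vs. everything else,
# in WT and mutant (illustrative numbers only)
m <- matrix(c(120, 880,
               60, 940),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("WT", "mt"), c("gene", "other")))

fisher.test(m)$p.value
```

Looping this over the rows of the table (and adjusting with `p.adjust`) would give a per-gene p-value for each of the ~30 genes.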



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] handle large matrix in R

2012-06-12 Thread Steve Lianoglou
Hi,

As Oliver pointed out, you won't be able to fit all that data into RAM
unless you've got some big iron machine.

Besides using a sparse matrix representation, you might also look at
the "Large memory and out-of-memory data" section here:

http://cran.r-project.org/web/views/HighPerformanceComputing.html

Particularly the ff and bigmemory packages.
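A file-backed sketch with bigmemory (assuming the package is installed; the dimensions and file names here are arbitrary):

```r
library(bigmemory)

# The matrix lives on disk; R only pages in the parts it touches
bm <- filebacked.big.matrix(1e4, 1e4, type = "double",
                            backingfile = "big.bin",
                            descriptorfile = "big.desc")
bm[1, 1] <- 1.5
bm[1, 1]
```

The same object can later be re-attached from its descriptor file, so the 40 GB never has to fit in RAM at once.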

-steve

On Tue, Jun 12, 2012 at 6:47 AM, Oliver Ruebenacker  wrote:
>     Hello Hui,
>
> On Tue, Jun 12, 2012 at 2:12 AM, Hui Wang  wrote:
>> Dear all,
>>
>> I've run into a question of handling large matrices in R. I'd like to
>> define a 7e4*7e4 matrix in R on Platform:
>> x86_64-apple-darwin9.8.0/x86_64 (64-bit), but it seems to run out of memory
>> to handle this. Is it due to R memory limiting size or RAM of my laptop? If
>> I use a cluster with larger RAM, will that be able to handle this large
>> matrix in R? Thanks much!
>
>  Do you really mean 7e4 by 7e4? That would be 4.9e9 entries. If each
> entry takes 8 bytes (as it typically would on a 64 bit system), you
> would need close to 40 Gigabyte storage for this matrix. I'm not sure
> there is a laptop on the market with that amount of RAM.
>
>  What do you need such a large matrix for? If most of the elements
> are zero, you don't want a regular matrix to hold the data, but use
> some sort of sparse matrix implementation.
>
>     Take care
>     Oliver
>
> --
> Oliver Ruebenacker
> Bioinformatics Consultant (http://www.knowomics.com/wiki/Oliver_Ruebenacker)
> Knowomics, The Bioinformatics Network (http://www.knowomics.com)
> SBPAX: Turning Bio Knowledge into Math Models (http://www.sbpax.org)
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] Need Help in K-fold validation in Decision tree

2012-05-21 Thread Steve Lianoglou
Hi,

On Mon, May 21, 2012 at 6:01 AM, santoshdvn  wrote:
> Hi ,
>
> I have built decision tree using  rpart . I want to do k Fold validation on
> the decision tree .
>
> Could you help how can i do that .. please tell the package which required
> for K fold validation.

I think you'll find the caret package, along with its vignettes, very helpful:

http://cran.r-project.org/web/packages/caret/index.html

If you google for "caret machine learning" you'll find many useful
links. For instance:

* The JSS Publication:
http://www.jstatsoft.org/v28/i05/paper

* Slides from a presentation of caret:
http://files.meetup.com/1542972/Max_caret_NYCPA.pdf
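A small sketch of 10-fold cross-validation for an rpart tree via caret (assuming the caret and rpart packages are installed; iris stands in for the real data):

```r
library(caret)

ctrl <- trainControl(method = "cv", number = 10)
fit <- train(Species ~ ., data = iris,
             method = "rpart", trControl = ctrl)
fit$results   # accuracy/kappa per complexity value, averaged over folds
```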


HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] pass objects into "..." (dot dot dot)

2012-05-15 Thread Steve Lianoglou
Hi,

On Tue, May 15, 2012 at 1:38 PM, Ben quant  wrote:
> Thank you for that. Sorry, I don't know how to use that to solve the issue.
> I need to pass a handful (an unknown length) of objects into "...". I see
> how you can get the count of what is in "...", but I'm not seeing how
> knowing the length in "..." will help me.

Hmm ... ok, I see. The interval_intersection function signature
suggests that this should work:

myIntersection <- function(...) {
  do.call(interval_intersection, unname(list(...)))
}

Or you can just make a list of intervals and do the same do.call mojo, ie:

do.call(interval_intersection, my.interval.list)
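A self-contained illustration of the same do.call pattern, with a stand-in for interval_intersection (which, as far as I can tell, comes from the `sets` package; this toy version just intersects integer vectors):

```r
# Stand-in: intersect any number of vectors
combine <- function(...) Reduce(intersect, list(...))

# Forward an unknown number of objects from ... into another function
myIntersection <- function(...) do.call(combine, unname(list(...)))

myIntersection(1:10, 5:15, 7:12)     # 7 8 9 10
do.call(combine, list(1:10, 5:15))   # same idea with a prebuilt list
```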

yay/nay?

-steve


-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] pass objects into "..." (dot dot dot)

2012-05-15 Thread Steve Lianoglou
Hi,

On Tue, May 15, 2012 at 12:46 PM, Ben quant  wrote:
> Hello,
>
> Thanks in advance for any help!
>
> How do I pass an unknown number of objects into the "..." (dot dot dot)
> parameter? Put another way, is there some standard way to pass multiple
> objects into "..." to "fool" the function into thinking the objects are
> passed in separately/explicitly with common separation (like "x,y,z" when
> x, y and z are objects to be passed into "...")?

Calling `list(...)` will return a list as long as there are elements
caught in `...`

Does this help?

R> howMany <- function(...) {
  args <- list(...)
  cat("There are", length(args), "items passed in here\n")
}

R> howMany(1, 2, 3, 4)
There are 4 items passed in here

R> howMany(10, list(1:10))
There are 2 items passed in here

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] package ‘adehabitat’ is not available (for R version 2.15.0) on Mac OS X

2012-05-14 Thread Steve Lianoglou
Hi,

Did you simply try:

install.packages("adehabitat")

or, if that doesn't work, maybe:

install.packages("adehabitat", type="source")

Or change your cran mirror?

-steve

On Mon, May 14, 2012 at 5:44 PM, Kristi Glover
 wrote:
>
> Hi R users, I tried to load the package adehabitat for R version 2.15.0, but I 
> got an error. I even downloaded it, saved it to a local drive, and tried to install from 
> the local file, but I still got the following errors:
>  install.packages("adehabitat", repos='C:/adehabitat_1.8.10.tgz')
>
>
> package ‘adehabitat’ is not available (for R version 2.15.0)
>
>> install.packages("adehabitat", 
>> repos='http://cran.skazkaforyou.com/bin/macosx/leopard/contrib/2.15/adehabitat_1.8.10.tgz')
> Warning: unable to access index for repository 
> http://cran.skazkaforyou.com/bin/macosx/leopard/contrib/2.15/adehabitat_1.8.10.tgz/bin/macosx/leopard/contrib/2.15
> Warning message:
> package ‘adehabitat’ is not available (for R version 2.15.0)
> I need to install it, as I have been using a script that uses this package 
> along with several others.
> Would anyone tell me how I can install the package on Mac OS X?
> cheers,
> KG
>
>
>
>
>
>        [[alternative HTML version deleted]]
>
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact



Re: [R] range segment exclusion using range endpoints

2012-05-14 Thread Steve Lianoglou
> > > > s2_rng = c(0.77,10)
>> > > > s3_rng = c(25,35)
>> > > > s4_rng = c(70,80.3)
>> > > > s5_rng = c(90,95)
>> > > >
>> > > > # ex 2
>> > > > # x_rng = c(-50.5,100)
>> > > >
>> > > > # s1_rng = c(-75.3,30)
>> > > >
>> > > > # ex 3
>> > > > # x_rng = c(-75.3,30)
>> > > >
>> > > > # s1_rng = c(-50.5,100)
>> > > >
>> > > > # ex 4
>> > > > # x_rng = c(-100,100)
>> > > >
>> > > > # s1_rng = c(-105,105)
>> > > >
>> > > > # find all the names -- USE A LIST NEXT TIME
>> > > > sNames <- grep("s[0-9]+_rng", ls(), value = TRUE)
>> > > >
>> > > > # initial matrix with the 'x' endpoints
>> > > > queue <- rbind(c(x_rng[1], 1), c(x_rng[2], 1))
>> > > >
>> > > > # add the 's' end points to the list
>> > > > # this will be used to determine how many things are in a queue (or
>> > > areas that
>> > > > # overlap)
>> > > > for (i in sNames){
>> > > +     queue <- rbind(queue
>> > > +                 , c(get(i)[1], 1)  # enter queue
>> > > +                 , c(get(i)[2], -1)  # exit queue
>> > > +                 )
>> > > + }
>> > > > queue <- queue[order(queue[, 1]), ]  # sort
>> > > > queue <- cbind(queue, cumsum(queue[, 2]))  # of people in the queue
>> > > > print(queue)
>> > >         [,1] [,2] [,3]
>> > >  [1,] -100.00    1    1
>> > >  [2,]  -25.50    1    2
>> > >  [3,]    0.77    1    3
>> > >  [4,]   10.00   -1    2
>> > >  [5,]   25.00    1    3
>> > >  [6,]   30.00   -1    2
>> > >  [7,]   35.00   -1    1
>> > >  [8,]   70.00    1    2
>> > >  [9,]   80.30   -1    1
>> > > [10,]   90.00    1    2
>> > > [11,]   95.00   -1    1
>> > > [12,]  100.00    1    2
>> > > >
>> > > > # print out values where the last column is 1
>> > > > for (i in which(queue[, 3] == 1)){
>> > > +     cat("start:", queue[i, 1L], '  end:', queue[i + 1L, 1L], "\n")
>> > > + }
>> > > start: -100   end: -25.5
>> > > start: 35   end: 70
>> > > start: 80.3   end: 90
>> > > start: 95   end: 100
>> > > >
>> > > >
>> > > =
>> > >
>> > > On Sat, May 12, 2012 at 1:54 PM, Ben quant  wrote:
>> > > > Hello,
>> > > >
>> > > > I'm posting this again (with some small edits). I didn't get any
>> replies
>> > > > last time...hoping for some this time. :)
>> > > >
>> > > > Currently I'm only coming up with brute force solutions to this issue
>> > > > (loops). I'm wondering if anyone has a better way to do this. Thank
>> you
>> > > for
>> > > > your help in advance!
>> > > >
>> > > > The problem: I have endpoints of one x range (x_rng) and an unknown
>> > > number
>> > > > of s ranges (s[#]_rng) also defined by the range endpoints. I'd like
>> to
>> > > > remove the x ranges that overlap with the s ranges. The examples
>> below
>> > > > demonstrate what I mean.
>> > > >
>> > > > What is the best way to do this?
>> > > >
>> > > > Ex 1.
>> > > > For:
>> > > > x_rng = c(-100,100)
>> > > >
>> > > > s1_rng = c(-25.5,30)
>> > > > s2_rng = c(0.77,10)
>> > > > s3_rng = c(25,35)
>> > > > s4_rng = c(70,80.3)
>> > > > s5_rng = c(90,95)
>> > > >
>> > > > I would get:
>> > > > -100,-25.5
>> > > > 35,70
>> > > > 80.3,90
>> > > > 95,100
>> > > >
>> > > > Ex 2.
>> > > > For:
>> > > > x_rng = c(-50.5,100)
>> > > >
>> > > > s1_rng = c(-75.3,30)
>> > > >
>> > > > I would get:
>> > > > 30,100
>> > > >
>> > > > Ex 3.
>> > > > For:
>> > > > x_rng = c(-75.3,30)
>> > > >
>> > > > s1_rng = c(-50.5,100)
>> > > >
>> > > > I would get:
>> > > > -75.3,-50.5
>> > > >
>> > > > Ex 4.
>> > > > For:
>> > > > x_rng = c(-100,100)
>> > > >
>> > > > s1_rng = c(-105,105)
>> > > >
>> > > > I would get something like:
>> > > > NA,NA
>> > > > or...
>> > > > NA
>> > > >
>> > > > Ex 5.
>> > > > For:
>> > > > x_rng = c(-100,100)
>> > > >
>> > > > s1_rng = c(-100,100)
>> > > >
>> > > > I would get something like:
>> > > > -100,-100
>> > > > 100,100
>> > > > or just...
>> > > > -100
>> > > >  100
>> > > >
>> > > > PS - You may have noticed that in all of the examples I am including
>> the
>> > > s
>> > > > range endpoints in the desired results, which I can deal with later
>> in my
>> > > > program so its not a problem...  I think leaving in the s range
>> endpoints
>> > > > simplifies the problem.
>> > > >
>> > > > Thanks!
>> > > > Ben
>> > > >
>> > > >        [[alternative HTML version deleted]]
>> > > >
>> > > > __
>> > > > R-help@r-project.org mailing list
>> > > > https://stat.ethz.ch/mailman/listinfo/r-help
>> > > > PLEASE do read the posting guide
>> > > http://www.R-project.org/posting-guide.html
>> > > > and provide commented, minimal, self-contained, reproducible code.
>> > >
>> > >
>> > >
>> > > --
>> > > Jim Holtman
>> > > Data Munger Guru
>> > >
>> > > What is the problem that you are trying to solve?
>> > > Tell me what you want to do, not how you want to do it.
>> > >
>> >
>> >       [[alternative HTML version deleted]]
>> >
>> > __
>> > R-help@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>
>        [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Barplots inside loop - several data errors, workaround needed

2012-05-10 Thread Steve Lianoglou
Hi,

On Thu, May 10, 2012 at 8:35 PM, Lee  wrote:
> Looking at the documentation for try() I am not sure how it would be best
> applied in this situation. My background is not extensively programming.
> Would writing a function first be appropriate?
>
> Also, I'm not sure just a simple error catch would solve my first problem.
> I do, in fact, need it to plot the barplot based on the table which is
> created above. However, R doesn't like the lack of several columns.
>
> Further guidance would be appreciated.

Consider this block of code that you can run in your R workspace:

~
set.seed(123)
random.error <- runif(3, 0, 2) > 1.5
for (throws.error in random.error) {
  cat("I'm about to try something\n")
  result <- try({
cat("  I'm in the middle of trying something\n")
cat("  There is a chance it might result in an error\n")
if (throws.error) {
  stop("Error!")
}
  }, silent=TRUE)
  if (is(result, 'try-error')) {
cat("An error occurred while trying something, but I'm OK\n\n")
  } else {
cat("No error occurred while trying something\n\n")
  }
}

~

and the output it gives:

~
I'm about to try something
  I'm in the middle of trying something
  There is a chance it might result in an error
No error occurred while trying something

I'm about to try something
  I'm in the middle of trying something
  There is a chance it might result in an error
An error occurred while trying something, but I'm OK

I'm about to try something
  I'm in the middle of trying something
  There is a chance it might result in an error
No error occurred while trying something
~

Does that help?
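For completeness, the same pattern can be written with tryCatch(), which runs a handler instead of raising, so no class check on the result is needed (a sketch of mine, not from the original message; `safe.barplot` is a made-up name):

```r
safe.barplot <- function(tbl) {
  tryCatch({
    barplot(tbl)   # may fail, e.g. if tbl is not numeric
    TRUE           # reached only if barplot succeeded
  }, error = function(e) {
    message("barplot failed: ", conditionMessage(e))
    FALSE          # swallow the error and keep the loop alive
  })
}
```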

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to replace NA with zero (0)

2012-05-03 Thread Steve Lianoglou
Also,

On Thu, May 3, 2012 at 1:57 PM, J Toll  wrote:
> On Thu, May 3, 2012 at 10:43 AM, Christopher Kelvin
>  wrote:
>
>> Is there a command i can issue to replace the NA with zero (0) even if it is 
>> after generating the data?
>
> Chris,
>
> I didn't try your example code, so this suggestion is far more
> general, but you might try something along the lines of:
>
> x[which(is.na(x))] <- 0

Random note from left field: the call to `which` is unnecessary here.
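That is, logical subscripting works directly (a quick sketch of mine):

```r
x <- c(1, NA, 3, NA, 5)
x[is.na(x)] <- 0        # logical index; no which() needed
x                       # 1 0 3 0 5
```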

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] rbind-ing numeric matrices

2012-05-01 Thread Steve Lianoglou
Hi,

On Tue, May 1, 2012 at 11:52 AM, Nick Switanek  wrote:
> Good morning,
>
> I'm running into trouble rbind-ing numeric matrices with differing numbers
> of rows. In particular, there seem to be issues whenever a one-row numeric
> matrix is involved.
>
> Assume A is a numeric matrix with 1 row and Y columns and B is a numeric
> matrix with X rows and Y columns. Let C be the result of rbinding A and B.
> Then C is a numeric matrix with X + 1 rows and Y columns, only instead of
> the rows of B being "stacked" beneath the row of A as expected, the first Y
> elements of the 1st column of B are placed in the 2nd row of C, the
> remaining values of B are discarded, and NULL values fill out the rest of
> the matrix C.
>
> The number of columns of A and B match. The colnames of A and B match. Both
> are numeric matrices. I've pored over the rbind/cbind documentation but
> can't identify why I'm getting this behavior from rbind. I'd be extremely
> grateful for your suggestions or thoughts.

If everything you say is true (and I'm understanding what you're
saying), there must be something else going on with your data.
Consider:

R> m1 <- matrix(-(1:5), nrow=1)
R> m2 <- matrix(1:20, ncol=5)
R> rbind(m1, m2)
 [,1] [,2] [,3] [,4] [,5]
[1,]   -1   -2   -3   -4   -5
[2,]    1    5    9   13   17
[3,]    2    6   10   14   18
[4,]    3    7   11   15   19
[5,]    4    8   12   16   20

Can you provide a small example of your data that reproduces the
problem you're seeing?

Construct these objects in your workspace and copy/paste the output of
dput on your m1 and m2 matrices so we can easily work w/ them.
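If it helps, `dput`/`deparse` emit code that reconstructs an object exactly, which is why they're so handy for bug reports -- a quick sketch of mine:

```r
m1 <- matrix(-(1:5), nrow = 1)
dput(m1)    # prints a structure(...) call that recreates m1

# Round-tripping through the deparsed text gives back an identical object:
m1.copy <- eval(parse(text = paste(deparse(m1), collapse = "")))
identical(m1, m1.copy)   # TRUE
```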

Cheers,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Why does my R compiler repeat my program whenever it is compiled?

2012-04-30 Thread Steve Lianoglou
Hi,

This looks like an RStudio question that's not specific to R itself.

There are support forums at rstudio.org where you should ask your
question; I think you'll get better help there.



On Monday, April 30, 2012, jpm miao wrote:

> Hi,
>
>   I am using RStudio as my R editor.
>
>   After someday I accidentally hit something, the whole program is
> repeated in the Console whenever I compile it. How can I fix it so that the
> whole program won't be repeated in the future?
>
>   Thanks,
>
> miao
>
>[[alternative HTML version deleted]]
>
> __
> R-help@r-project.org  mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>


-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

[[alternative HTML version deleted]]

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Merge function - Return NON matches

2012-04-26 Thread Steve Lianoglou
Hi,

As Sarah reiterated -- it'd *really* be helpful if you'd give us data
we can actually work with.

That having been said:

On Thu, Apr 26, 2012 at 4:12 PM, RHelpPlease  wrote:
> Hi again,
> I tried the sample code like this:
>
>> merged_clmno <- subset(bestPartAreadmin, !CLAIM_NO %in% hrc78_clm_no)
>> dim(merged_clmno)
> [1] 13068    93
>
> Note that:
>> dim(bestPartAreadmin)
> [1] 13068    93
>
> So, no change between the original data.frame (bestPartAreadmin) & the
> (should be) less-rows merged_clmno data.frame.

Your original email said you had a "list" that contains CLAIM_NO's
you want to exclude.

Is `hrc78_clm_no` this "list" -- does it only have claim_no's? Passing
a list into the subset call after `%in%` won't work.

If you do `no <- unlist(hrc78_clm_no)`, do you get a character vector
of claim numbers you want to exclude? If so, then `subset(whatever,
!CLAIM_NO %in% no)` should work.
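In case a toy run helps, here is that pattern on made-up data (the object names are invented, not the poster's):

```r
df  <- data.frame(CLAIM_NO = c("20", "83", "1440", "7002"),
                  amount   = c(10, 20, 30, 40),
                  stringsAsFactors = FALSE)
bad  <- list(c("83", "1440"))          # a "list" holding claim numbers
no   <- unlist(bad)                    # flatten to a character vector
kept <- subset(df, !CLAIM_NO %in% no)  # keep only rows with NO match
kept$CLAIM_NO                          # "20" "7002"
```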

HTH,
-steve


>
> Any further help is most appreciated!
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Merge-function-Return-NON-matches-tp4590755p4590851.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Merge function - Return NON matches

2012-04-26 Thread Steve Lianoglou
Hi,

To increase the chances of you getting help on this one, please give
example data (a small data.frame, a small list) that you are trying to
do this on, and also show the desired output. Whip these variables up
in your R workspace and paste the output of `dput` for each into your
follow up email.

It's hard (for me, anyways) to get what you're after ... I'm guessing
something that ends up looking like this will end up being one
solution:

subset(your.df, !CLAIM_NO %in% `something`)

but it's hard for me to tell from where I'm sitting.

-steve


On Thu, Apr 26, 2012 at 3:33 PM, RHelpPlease  wrote:
> Hi there,
> I wish to merge a common variable between a list and a data.frame & return
> rows via the data.frame where there is NO match.  Here are some details:
>
> The list, where the variable/col.name = CLAIM_NO
> CLAIM_NO
> 20
> 83
> 1440
> 4439
> 7002
> ...
>
>> dim(hrc78_clm_no)
> [1] 6678    1
>
> The data.frame, where there exists a variable with the same name, CLAIM_NO.
>> dim(bestPartAreadmin)
> [1] 13068    93
>
> I wish to merge the two together & only return a data.frame where there is
> NO match in the CLAIM_NO between both files.
>
> I've read & tried code via the "merge" function.  If "merge" can do this,
> I'm missing something with the available options.
>
> I'm figuring something like:
>
> clm_no_nomatch <- merge(hrc78_clm_no, bestPartAreadmin, by = "CLAIM_NO",  ..
> .. ..)
>
> Your help is most appreciated!
>
>
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/Merge-function-Return-NON-matches-tp4590755p4590755.html
> Sent from the R help mailing list archive at Nabble.com.
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] kernlab kpca code

2012-04-26 Thread Steve Lianoglou
Hi Jessica,

On Thu, Apr 26, 2012 at 11:59 AM, Jessica Streicher
 wrote:
> Hi!
>
> how do i get to the source code of kpca or even better predict.kpca(which it 
> tells me doesn't exist but should) ?

Probably you have to do kernlab:::predict.kpca from your R workspace,
but why not just download the source package and have at it?

http://cran.r-project.org/src/contrib/kernlab_0.9-14.tar.gz

HTH,
-steve

>
> (And if anyone has too much time:
> Now if i got that right, the @pcv attribute consists of the principal 
> components, and for kpca, these are defined as projections of some random 
> point x, which was transformed into the other feature space -> f(x), 
> projected onto the actual PC (eigenvector of Covariance). This can be 
> computed as the sum of the (eigenvectors of the Kernel matrix * the kernel 
> function(sample_i,x))
>
> Now assume i have some new points and want to project them, how can i do that 
> with only having @pcv?
> Wouldn't i rather need the eigenvectors of K?
> )
>
>
>
>
>
>
>
>        [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] On the Design of the R Language

2012-04-25 Thread Steve Lianoglou
Hi,

On Wed, Apr 25, 2012 at 4:06 PM, Bert Gunter  wrote:
> Thanks Michael:
> Interesting!
>
> Is it legitimate to comment on this in this list? It would only be my
> opinions, not real R-Help stuff.

FWIW, I'd be interested in hearing opinions about it from R-folk ...

> Where would be a better place to post
> such UN-expert opinion?

I didn't realize you were also an expert on foreign affairs?  Nice.

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R shell script

2012-04-25 Thread Steve Lianoglou
Hi,

On Wed, Apr 25, 2012 at 11:11 AM, aoife doherty
 wrote:
> Thanks for replying.
>
> My problem is that i have say 50 input files, that i wanted to run a
> particular command on, get 50 output files, and then when i close R, have
> them in my directory?
>
> so for example if i say:
>
>>R
>
>>library(MASS)
>
>>list.files(pattern = ".out")
>
>>sapply(list.files(pattern = "*.out"), function(x) wilcox.test( ... ))
>
> << when i close R the outputs are still there>>>
>
> i thought this might be easier in a shell way?

In this case, just make your function write a text file -- you have to
figure out what you want to save and serialize it to text. Or you can
write as many output rds (or rda) files as you do tests, for instance:

filez <- list.files(pattern="*.out")
for (f in filez) {
  ## something to load the data in file `f` I presume
  w <- wilcox.test(... on the data you loaded ...)

  saveRDS(w, gsub('.out', '.rds', f))  ## if you want to save the object
}

or
info <- lapply(filez, function(x) {
  ## load the file
  w <- wilcox.test(... on the data you loaded ...)
  data.frame(file.name=x, statistic=w$statistic, p.value=w$p.value,
... anything else you want?)
})
result <- do.call(rbind, info)
write.table(result, 'wilcox.results.txt', ...)

HTH,
-steve

>
>
>
> On Wed, Apr 25, 2012 at 4:03 PM, R. Michael Weylandt <
> michael.weyla...@gmail.com> wrote:
>
>> You can do this in bash but why not just do it in R directly? You probably
>> need
>>
>> list.files(pattern = ".out")
>>
>> to get started. Then just wrap your script in a function and pass it
>> to (s|l)apply something like:
>>
>> sapply(list.files(pattern = "*.out"), function(x) wilcox.test( ... ))
>>
>> Michael
>>
>> On Wed, Apr 25, 2012 at 6:47 AM, aoife doherty
>>  wrote:
>> > Hey guys,
>> > Does anyone have an example of a REALLY simple shell script in R.
>> >
>> > Basically i want to run this command:
>> >
>> > library(MASS)
>> >
>> wilcox.test(list1,list2,paired=TRUE,alternative=c("greater"),correct=TRUE,exact=FALSE)
>> >
>> > in a shell script something like this:
>> >
>> > #!/bin/bash
>> > R
>> > library(MASS)
>> > for i in *.out
>> > do
>> > wilcox.test($i,${i/out}.out2,paired=TRUE) >> $i.out
>> > done
>> >
>> >
>> > that i can run on a command line this this:
>> > sh R.sh
>> >
>> >
>> > because i've SO many files to run this command on.
>> >
>> >
>> > I've been googling, but i'm having trouble of just finding a simple
>> example
>> > explaining how to make this shell script.
>> >
>> > Any help appreciated :)
>> > Aoife
>> >
>> >        [[alternative HTML version deleted]]
>> >
>> > ______
>> > R-help@r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>
>        [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] R shell script

2012-04-25 Thread Steve Lianoglou
Check out the vignette for the optparse library:

http://cran.r-project.org/web/packages/optparse/vignettes/optparse.pdf

Super helpful library if you plan on making any semi-interesting
command line scripts w/ R.
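A minimal sketch of what that looks like (flag names are invented, and it assumes the optparse package is installed):

```r
#!/usr/bin/env Rscript
library(optparse)

parser <- OptionParser(option_list = list(
  make_option(c("-p", "--pattern"), type = "character", default = "*.out",
              help = "file glob to process [default %default]"),
  make_option(c("-v", "--verbose"), action = "store_true", default = FALSE,
              help = "print progress messages")
))
opt <- parse_args(parser)  # reads commandArgs(trailingOnly = TRUE)

if (opt$verbose) message("looking for files matching ", opt$pattern)
```

Saved as an executable file, this runs as `./script.R --pattern '*.out' -v`.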

-steve

On Wed, Apr 25, 2012 at 6:47 AM, aoife doherty
 wrote:
> Hey guys,
> Does anyone have an example of a REALLY simple shell script in R.
>
> Basically i want to run this command:
>
> library(MASS)
> wilcox.test(list1,list2,paired=TRUE,alternative=c("greater"),correct=TRUE,exact=FALSE)
>
> in a shell script something like this:
>
> #!/bin/bash
> R
> library(MASS)
> for i in *.out
> do
> wilcox.test($i,${i/out}.out2,paired=TRUE) >> $i.out
> done
>
>
> that i can run on a command line this this:
> sh R.sh
>
>
> because i've SO many files to run this command on.
>
>
> I've been googling, but i'm having trouble of just finding a simple example
> explaining how to make this shell script.
>
> Any help appreciated :)
> Aoife
>
>        [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] How to take ID of number > 7.

2012-04-23 Thread Steve Lianoglou
Small mistake in my subset example. I mean to remove the `$ID` part at
the end so that you can play with the whole subset-ted data.frame, and
not just get back the ID column. Instead of this:

R> interesting <- subset(DataFile, log2 >= 7)$ID

do this:

R> interesting <- subset(DataFile, log2 >= 7)

I guess you probably figured that out by now, but just wanted to point that out.

HTH,
-steve

On Sun, Apr 22, 2012 at 4:05 PM, Steve Lianoglou
 wrote:
> On Sun, Apr 22, 2012 at 7:03 AM, Yellow  wrote:
>> I figured out something new that I would like to see if I can do this more
>> easy with R then Excel.
>>
>> I have these huge files with data.
>> For example:
>>
>> DataFile.csv
>> ID Name log2
>> 1 Fantasy 5.651
>> 2 New 7.60518
>> 3 Finding 8.9532
>> 4 Looeka -0.248652
>> 5 Vani 0.3548
>>
>> With like header1: ID, header 2: Name, header 3: log2
>>
>> Now I need to get the $ID out who have a &log2 value higher then 7.
>>
>> I know ho to grab the $log2 values with 7+ numbers.
>>
>> Log2HigherSeven = DataFile$log2 [ DataFile$log2 >= 7]
>>
>> But how can I take thise ID numbers also?
>
> Seems like there were already a few suggestions in this thread, but
> I'm surprised no one has suggested the use of `subset` yet, see
> ?subset:
>
> R> interesting <- subset(DataFile, log2 >= 7)$ID
>
> Now play with the `interesting` data.frame to get the data you need
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] [BioC] Overlay Gene Expression on SNP (copy number) data

2012-04-23 Thread Steve Lianoglou
Hi,

On Mon, Apr 23, 2012 at 7:33 AM, Ekta Jain  wrote:
> Hello,
> Can anyone please suggest any packages in R that can be used to overlay gene 
> expression data on SNP (affymetrix) copy number ?

I guess you mean visually? If so, I'd suggest skimming through the
vignettes of the following packages to see which one might suit you
best:

* Gviz
* ggbio
* GenomeGraphs

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] PCA sensitive to outliers?

2012-04-22 Thread Steve Lianoglou
On Mon, Apr 23, 2012 at 12:01 AM, Michael  wrote:
> yes, but that is not a good Review or Survey... thx

But the packages listed there do have their own documentation and
vignettes. For instance the rrcov package seems to have a nice
vignette about its design as well as methods it implements, and
references to these methods for further reading:

http://cran.r-project.org/web/packages/rrcov/vignettes/rrcov.pdf

You'll see at least a few mentions of PCA, which will lead you to
other package/papers/etc.

Enjoy,

-steve

>
> On Sun, Apr 22, 2012 at 9:47 PM, Bert Gunter  wrote:
>
>> As I believe I already told you, look at the CRAN Robust task view.
>>
>> -- Bert
>>
>> On Sun, Apr 22, 2012 at 6:29 PM, Michael  wrote:
>> > Even in R, there are so many of "robust PCA"... any survey or review of
>> all
>> > these different methods?
>> >
>> > On Sun, Apr 22, 2012 at 6:58 PM, Joshua Wiley > >wrote:
>> >
>> >> On Sun, Apr 22, 2012 at 4:43 PM, Michael  wrote:
>> >> > I actually tried "robustPca" in "pcaMethods" on bioconductor.
>> >> >
>> >> > It keeps giving me the warning "Input data is not complete"...
>> >> >
>> >> > Reading into the function:
>> >> >
>> >> > When there is no "NA"s, it will give this warning...
>> >> >
>> >> > It seems that there is a bug in this code...
>> >> >
>> >> > Is it reliable at all?
>> >> >
>> >> > -
>> >> >
>> >> >
>> >> >> robustPcafunction (Matrix, nPcs = 2, verbose = interactive(), ...)
>> >> > {
>> >> >    nas <- is.na(Matrix)
>> >> >    if (!any(nas) & verbose) {
>> >> >        cat("Input data is not complete.\n")
>> >> >        cat("Scores, R2 and R2cum may be inaccurate, handle with
>> care\n")
>> >> >    }
>> >>
>> >> that seems to issue the notes when there are *not any missing* and
>> >> verbose is TRUE.  I would submit a bug report to the author.
>> >>
>> >> >
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > On Fri, Apr 20, 2012 at 9:58 AM, Kevin Wright 
>> wrote:
>> >> >
>> >> >> You can also have a look at the pcaMethods package on Bioconductor.
>> >> >>
>> >> >> Kevin
>> >> >>
>> >> >>
>> >> >>  On Thu, Apr 19, 2012 at 11:20 PM, Michael 
>> >> wrote:
>> >> >>
>> >> >>>  Hi all,
>> >> >>>
>> >> >>> I found that the PCA gave chaotic results when there are big changes
>> >> in a
>> >> >>> few data points.
>> >> >>>
>> >> >>> Are there "improved" versions of PCA in R that can help with this
>> >> problem?
>> >> >>>
>> >> >>> Please give me some pointers...
>> >> >>>
>> >> >>> Thank you!
>> >> >>>
>> >> >>>        [[alternative HTML version deleted]]
>> >> >>>
>> >> >>> __
>> >> >>> R-help@r-project.org mailing list
>> >> >>> https://stat.ethz.ch/mailman/listinfo/r-help
>> >> >>> PLEASE do read the posting guide
>> >> >>> http://www.R-project.org/posting-guide.html<
>> http://www.r-project.org/posting-guide.html>
>> >> <http://www.r-project.org/posting-guide.html>
>> >>  >>> and provide commented, minimal, self-contained, reproducible code.
>> >> >>>
>> >> >>
>> >> >>
>> >> >>
>> >> >> --
>> >> >> Kevin Wright
>> >> >>
>> >> >>
>> >> >
>> >> >        [[alternative HTML version deleted]]
>> >> >
>> >> > ______
>> >> > R-help@r-project.org mailing list
>> >> > https://stat.ethz.ch/mailman/listinfo/r-help
>> >> > PLEASE do read the posting guide
>> >> http://www.R-project.org/posting-guide.html<
>> http://www.r-project.org/posting-guide.html>

Re: [R] How to take ID of number > 7.

2012-04-22 Thread Steve Lianoglou
On Sun, Apr 22, 2012 at 7:03 AM, Yellow  wrote:
> I figured out something new that I would like to see if I can do this more
> easy with R then Excel.
>
> I have these huge files with data.
> For example:
>
> DataFile.csv
> ID Name log2
> 1 Fantasy 5.651
> 2 New 7.60518
> 3 Finding 8.9532
> 4 Looeka -0.248652
> 5 Vani 0.3548
>
> With like header1: ID, header 2: Name, header 3: log2
>
> Now I need to get the $ID out who have a &log2 value higher then 7.
>
> I know ho to grab the $log2 values with 7+ numbers.
>
> Log2HigherSeven = DataFile$log2 [ DataFile$log2 >= 7]
>
> But how can I take thise ID numbers also?

Seems like there were already a few suggestions in this thread, but
I'm surprised no one has suggested the use of `subset` yet, see
?subset:

R> interesting <- subset(DataFile, log2 >= 7)$ID

Now play with the `interesting` data.frame to get the data you need

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Reading first line before using read.table()

2012-04-01 Thread Steve Lianoglou
Hi,

On Sun, Apr 1, 2012 at 8:27 PM, Hurr  wrote:
> So far I have figured out that the following line
> reads our time series files into R OK.
> dtLs$dta <- read.table("C:/TryRRead/datFiles/JFeqfi4h.rta", header = TRUE,
> sep = ",", colClasses = "character")
> But I have to remove a main-title line so
> that the first line is the column titles line.
> This leads to having two sets of data files around when
> we would rather have just one set.
> How can I read just one line from the file to
> get the main title in before using the read.table() call?

Not sure I understand correctly, but would something like this do?

R> title.line <- readLines('file.rta', n=1)
R> dat <- read.table('file.rta', skip=1, header=TRUE, ...)
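To make that self-contained, here is the same two-step read as a runnable
round trip; the file contents and temp file are invented for illustration,
with sep and colClasses matching the read.table() call in your post:

```r
# Write a throwaway file that mimics the layout: title line, header, data
tmp <- tempfile(fileext = ".rta")
writeLines(c("My Main Title",
             "time,value",
             "1,0.5",
             "2,0.7"), tmp)

# Step 1: grab just the main title
title.line <- readLines(tmp, n = 1)

# Step 2: skip the title so the column-titles line becomes the header
dat <- read.table(tmp, skip = 1, header = TRUE, sep = ",",
                  colClasses = "character")

title.line  # "My Main Title"
dat         # two rows, columns "time" and "value", all character
```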

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] SVM. How to use categorical attributes?

2012-03-28 Thread Steve Lianoglou
Sorry -- I should add that I'm pointing out the potential shogun
implementation because I suspect their implementation of a
bag-of-words -like kernel would use the kernel trick, so you won't
have to map all of your data explicitly into some huge feature space
that will blow your memory away.

I'm not 100% sure they have what you're looking for, but as I said ...
it's worth checking out.

-steve

On Wed, Mar 28, 2012 at 9:54 AM, Steve Lianoglou
 wrote:
> Hi,
>
> These suggestions still require you to explicitly compute your feature
> space or kernel matrix first, which might kill you memory wise.
>
> You might consider taking a look at the shogun toolbox:
>
> http://www.shogun-toolbox.org/
>
> With some digging, I'm pretty sure you'll find a bag-of-words type of
kernel there (it's related to the spectrum kernel, which you can find
by searching the code base for something like "commword") ... you
> might consider posting to their mailing list after you give it the
> "good old college try" of sorting this out for yourself for a bit.
>
> The R interface to the toolbox is a bit ... alien, though. I'm working
> on making a nicer one but it's not quite ready for public consumption.
>
> -steve
>
>
> On Wed, Mar 28, 2012 at 7:38 AM, Ulrich Bodenhofer
>  wrote:
>> Alex,
>>
>> To avoid the memory issue, you can directly use a "bag of words" kernel
>> (which corresponds to using the linear kernel on the sparse bag of words
>> matrix Steve suggested). Just a little toy example of how this is done for
>> two samples:
>>
>>> x1 <- c("how", "to", "grow", "tree")
>>> x2 <- c("where", "to", "go", "weekend", "cinema")
>>> k12 <- length(intersect(x1, x2))
>>> k12
>> [1] 1
>>
>> If you run this for every pair of samples (additionally exploiting the
>> symmetry of the resulting matrix), you will get an L x L matrix of kernel
>> values (where L is the number of samples) without the need of having to
>> store the large bag of words matrix. That's exactly one of the beauties of
>> SVMs, in my humble opinion.
>>
>> Just as a side note: the result above is 1 because there is one overlap in
>> the two bags of words, the word "to". Maybe it is a good idea to remove such
>> unspecific words first and, moreover, to do word stemming, as is the
>> standard in analyses like the one you are aiming at.
>>
>> Best regards,
>> Ulrich
>>
>> --
>> View this message in context: 
>> http://r.789695.n4.nabble.com/SVM-How-to-use-categorical-attributes-tp4508460p4512034.html
>> Sent from the R help mailing list archive at Nabble.com.
>>
>> __
>> R-help@r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>
>
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
>  | Memorial Sloan-Kettering Cancer Center
>  | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] SVM. How to use categorical attributes?

2012-03-28 Thread Steve Lianoglou
Hi,

These suggestions still require you to explicitly compute your feature
space or kernel matrix first, which might kill you memory wise.

You might consider taking a look at the shogun toolbox:

http://www.shogun-toolbox.org/

With some digging, I'm pretty sure you'll find a bag-of-words type of
kernel there (it's related to the spectrum kernel, which you can find
by searching the code base for something like "commword") ... you
might consider posting to their mailing list after you give it the
"good old college try" of sorting this out for yourself for a bit.

The R interface to the toolbox is a bit ... alien, though. I'm working
on making a nicer one but it's not quite ready for public consumption.

-steve


On Wed, Mar 28, 2012 at 7:38 AM, Ulrich Bodenhofer
 wrote:
> Alex,
>
> To avoid the memory issue, you can directly use a "bag of words" kernel
> (which corresponds to using the linear kernel on the sparse bag of words
> matrix Steve suggested). Just a little toy example of how this is done for
> two samples:
>
>> x1 <- c("how", "to", "grow", "tree")
>> x2 <- c("where", "to", "go", "weekend", "cinema")
>> k12 <- length(intersect(x1, x2))
>> k12
> [1] 1
>
> If you run this for every pair of samples (additionally exploiting the
> symmetry of the resulting matrix), you will get an L x L matrix of kernel
> values (where L is the number of samples) without the need of having to
> store the large bag of words matrix. That's exactly one of the beauties of
> SVMs, in my humble opinion.
>
> Just as a side note: the result above is 1 because there is one overlap in
> the two bags of words, the word "to". Maybe it is a good idea to remove such
> unspecific words first and, moreover, to do word stemming, as is the
> standard in analyses like the one you are aiming at.
>
> Best regards,
> Ulrich
>
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/SVM-How-to-use-categorical-attributes-tp4508460p4512034.html
> Sent from the R help mailing list archive at Nabble.com.
>
> ______
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
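
To make Ulrich's pairwise computation concrete, here is a runnable sketch
on the four example queries from the original post: an L x L kernel matrix
of word-overlap counts, filling only the upper triangle and mirroring it
since the matrix is symmetric.

```r
# Toy queries: each sample is a bag (set) of words
queries <- list(
  c("how", "to", "grow", "tree"),
  c("smartfone", "htc", "buy", "price"),
  c("buy", "house", "realty", "london"),
  c("where", "to", "go", "weekend", "cinema")
)

L <- length(queries)
K <- matrix(0, L, L)

# Fill the upper triangle (plus diagonal) and mirror: K is symmetric
for (i in seq_len(L)) {
  for (j in i:L) {
    K[i, j] <- length(intersect(queries[[i]], queries[[j]]))
    K[j, i] <- K[i, j]
  }
}
K
# e.g. K[1, 4] is 1: queries 1 and 4 share only the word "to"
```

An SVM implementation that accepts a precomputed kernel matrix can then
consume K directly, so the full bag-of-words matrix never needs to be
materialized.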



-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] SVM. How to use categorical attributes?

2012-03-27 Thread Steve Lianoglou
Hi,

On Tue, Mar 27, 2012 at 6:05 AM, Alekseiy Beloshitskiy
 wrote:
> Hi All,
>
> Here is the case. I want to build a classification model (SVM). Some of the
> variables for this model are categorical attributes which represent words
> (usually 3-10 words - a query for a Google search). For example:
> search_id | query_words                        |..| result
> ---+--+--+
> 1            | how,to,grow,tree                  |..| 4
> 2            | smartfone,htc,buy,price         |..| 7
> 3            | buy,house,realty,london         |..| 6
> 4            | where,to,go,weekend,cinema |..| 4
> ...
> As you can see, words in the query are disordered and may occur in different 
> queries. Total number of unique words for all queries is several thousands.
> The question is how to represent this variable (query_words) to use for SVM.
>
> Thank you for any advice!

One approach is to wire up a "bag of words" type of design matrix.

That is to say the matrix has as many columns as there are unique
words. Each row is an observation (query), and the words that appear
in the query have a value of 1 (or you can count the number of times
each word appears).

You can maybe get smarter and try to group like words together, but
... now you'll have two problems ...

Hope you have lots of data!
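
A minimal sketch of that design matrix for the four example queries
(binary indicators here; replace `%in%` with a count if you want word
frequencies instead):

```r
queries <- list(
  c("how", "to", "grow", "tree"),
  c("smartfone", "htc", "buy", "price"),
  c("buy", "house", "realty", "london"),
  c("where", "to", "go", "weekend", "cinema")
)

# One column per unique word across all queries
vocab <- sort(unique(unlist(queries)))

# One row per query; entry is 1 if that word occurs in the query
X <- t(sapply(queries, function(q) as.integer(vocab %in% q)))
colnames(X) <- vocab
dim(X)  # 4 x 15
```

With several thousand unique words this matrix gets wide and mostly
zero, so a sparse representation (e.g. from the Matrix package) is worth
considering before handing it to an SVM.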

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] installing R 2.14.2

2012-03-27 Thread Steve Lianoglou
Hi,

On Tue, Mar 27, 2012 at 1:03 PM, Heba S  wrote:
>
> Hello, I am trying to install a newer version of R (R 2.14.2) from this
> link: http://cran.r-project.org/bin/macosx/
> However I am getting an error that it can not be installed on my computer. My
> Mac is version 10.6.8. Can you please advise me what the problem is? I need
> the newer version to install the ggm package.

If you want any meaningful help, you'll have to provide the exact
error that you're getting, so please reproduce the error message
(verbatim) in your follow up email.

Also let us know when during the installation process the error occurs.

Thanks,

-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] lasso constraint

2012-03-27 Thread Steve Lianoglou
Hi,

On Tue, Mar 27, 2012 at 10:35 AM, yx78  wrote:
> In the package lasso2, there is a Prostate Data. To find coefficients in the
> prostate cancer example we could impose L1 constraint on the parameters.
>
> code is:
> data(Prostate)
>  p.mean <- apply(Prostate, 2, mean)
>  pros <- sweep(Prostate, 2, p.mean, "-")
>  p.std <- apply(pros, 2, var)
>  pros <- sweep(pros, 2, sqrt(p.std), "/")
>  pros[, "lpsa"] <- Prostate[, "lpsa"]
> l1ce(lpsa ~ ., pros, bound = 0.44)
>
> I can't figure out where the 0.44 comes from. The paper said it was chosen
> by generalized cross-validation and that it is the optimal choice.

Yes, this is exactly how the "optimal" value for bound would be found.

Using the lasso2 package, you'll likely have to do a grid search over
possible values for `bound` in a cross validation setting and you pick
the one that fits the model best on the held out data over all your CV
folds.

If I were you, I'd use the glmnet package since it can calculate the
entire regularization path without having to do a grid search over the
bound (or lambda), making cross validation easier.
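
With glmnet, that cross validation is essentially one call. A sketch on
simulated data (the data and seed are made up for illustration;
cv.glmnet builds its own lambda grid):

```r
library(glmnet)

set.seed(1)
n <- 100; p <- 8
x <- matrix(rnorm(n * p), n, p)
y <- 2 * x[, 1] + rnorm(n)   # only the first predictor matters

# 10-fold CV over the whole lasso regularization path in one shot
cv <- cv.glmnet(x, y, alpha = 1)  # alpha = 1 is the lasso penalty

cv$lambda.min                # lambda that minimizes CV error
coef(cv, s = "lambda.min")   # coefficients at that lambda
```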

If you're confused about how you might use cross validation to find
the optimal value of the parameter(s) of the model you are building,
then it's time to pull yourself away from the keyboaRd and start doing
some reading, or (as Bert will likely tell you) consult your local
statistician.

HTH,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


  1   2   3   4   5   6   7   8   >