Re: [R] Data cleaning & Data preparation, what do R users want?

2017-11-30 Thread Dominik Schneider
I would agree that getting data into R from various sources is the biggest
pain point. Even if there is an API, the results are not always consistent
and you have to do lots of dimension checking to get it right. Or there
isn't an open API at all and you have to hack it by web scraping or
otherwise: http://enpiar.com/2017/08/11/one-hour-package/

On Thu, Nov 30, 2017 at 1:00 AM, Jim Lemon  wrote:

> Hi again,
> Typo in the last email. Should read "about 40 standard deviations".
>
> Jim
>
> On Thu, Nov 30, 2017 at 10:54 AM, Jim Lemon  wrote:
> > Hi Robert,
> > People want different levels of automation in the software they use.
> > What concerns many of us is the desire for the function
> > "figure-out-what-this-data-is-import-it-and-get-rid-of-bad-values".
> > Such users typically want something that justifies its use by being
> > written by someone who seems to know what they're doing and lots of
> > other people use it. One advantage of many R functions is their
> > modular construction. This encourages users to at least consider the
> > steps that are taken rather than just accept what comes out of that
> > long tube.
> >
> > Take the contentious problem of outlier identification. If I just let
> > the black box peel off some values, I don't know what I have lost. On
> > the other hand, if I import data and examine it with a summary
> > function, I may find that one woman has a height of 5.2 meters. I can
> > range check by looking up the Guinness Book of Records. It's an
> > outlier. I can estimate the probability of such a height. Hmm, about
> > 40 standard deviations above the mean. It's an outlier. I can attempt a
> > Sherlock Holmes. "Watson, I conclude that an imperial measure (5'2")
> > has been recorded as a metric value". It's not an outlier.
> >
> > The more R gravitates toward "black box" functions, the more some
> > users are encouraged to let them do the work. You pays your money and
> > you takes your chances.
> >
> > Jim
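A minimal R sketch of the checks Jim describes, with made-up numbers (the
mean and SD below are illustrative assumptions, not from any real dataset):

height <- c(rnorm(999, mean = 1.62, sd = 0.09), 5.2)  # one suspect 5.2 m entry
summary(height)          # the implausible maximum stands out immediately
(5.2 - 1.62) / 0.09      # roughly 40 standard deviations above the mean
5 * 0.3048 + 2 * 0.0254  # 5'2" in metres: 1.5748, the likely true value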
> >
> >
> > On Thu, Nov 30, 2017 at 3:37 AM, Robert Wilkins 
> wrote:
> >> R has a very wide audience: clinical research, astronomy, psychology,
> >> and so on.
> >> I would consider data analysis work to have three stages: data
> >> preparation, statistical analysis, and producing the report.
> >> My question concerns the first of these, the process of getting the data
> >> ready for analysis and reporting, sometimes called "data cleaning" or
> >> "data munging" or "data wrangling".
> >>
> >> So as regards tools for data preparation, speaking to the highly diverse
> >> audience mentioned, here is my question:
> >>
> >> What do you want?
> >> Or are you already quite happy with the range of tools that is currently
> >> before you?
> >>
> >> [BTW,  I posed the same question last week to the r-devel list, and was
> >> advised that r-help might be a more suitable audience by one of the
> >> moderators.]
> >>
> >> Robert Wilkins


Re: [R] How to create separate legend for each plot in the function of facet_wrap in ggplot2?

2017-11-10 Thread Dominik Schneider
That's not what facet_wrap is for; check out the cowplot package for
combining multiple ggplot objects (each with its own legend) into one figure.
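For example, a minimal sketch using the dat object from your message below
(plot_grid comes from cowplot; each panel is built as its own plot so it
keeps its own color scale and legend):

library(ggplot2)
library(cowplot)
pA <- ggplot(subset(dat, group == "A"), aes(X, Y, color = value)) +
  geom_point(size = 2) + coord_equal() + theme_bw() +
  scale_color_gradientn(colours = terrain.colors(7)) +
  theme(legend.position = c(0.85, 0.25))  # keeps the legend inside the panel
pB <- pA %+% subset(dat, group == "B")    # same plot spec, group B's data
plot_grid(pA, pB, labels = c("A", "B"))   # combine into one figure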

On Fri, Nov 10, 2017 at 10:21 AM, Marna Wagley 
wrote:

> Hi R users,
> I need to create more than 20 figures (one for each group) on one page. I
> have a common legend for the 20 figures using facet_wrap. However, the
> range of the values among the groups is very wide. For example, one group
> has values of 0 to 3, but some groups range from 0 to 20, so when I used a
> single common legend for all 20 figures, I could not display the contrast
> of the values in some of the figures. Therefore I want to create the
> figures with *a separate legend* for each; that way I can display the
> gradient of the values in each figure. Any suggestions on how I can do
> this?
>
> The example is given below; *I want to create a separate legend, keeping
> the legend inside each figure*.
>
> library(ggplot2)
>
> dat <- structure(list(X = c(289.6, 289.7, 289.8, 289.9, 290, 290.1,
> 927.8, 927.9, 928, 928.1, 928.2, 928.3), Y = c(789.1, 789.2,
> 789.3, 789.4, 789.5, 789.6, 171.1, 171.2, 171.3, 171.4, 171.5,
> 171.6), value = c(0.05, 0.06, 0.07, 0.09, 0.1, 0.11, 0.06, 0.05,
> 0.05, 0.06, 0.1, 1.5), group = structure(c(1L, 1L, 1L, 1L, 1L,
> 1L, 2L, 2L, 2L, 2L, 2L, 2L), .Label = c("A", "B"), class = "factor")),
> .Names = c("X", "Y", "value", "group"), class = "data.frame",
> row.names = c(NA, -12L))
>
> AB <- ggplot(data = dat, aes(x = X, y = Y, color = value)) +
>   geom_point(size = 2) +
>   coord_equal() + theme_bw() +
>   scale_color_gradientn(colours = terrain.colors(7))
>
> AB + facet_wrap(~group, scales = "free") +
>   theme(strip.text = element_text(size = 8))
>
>
>
>
> Thanks
>
>
> MW


Re: [R] h5r package: cannot find hdf5

2016-12-18 Thread Dominik Schneider
Sorry, I'm not sure what that error means.
Dominik

On Sun, Dec 18, 2016 at 17:21 David Winsemius 
wrote:

> > On Dec 17, 2016, at 5:09 PM, Pedro Montenegro  wrote:
> >
> > I'm new to R and Linux, and I have an issue I didn't see solved on the
> > internet.
> >
> > I'm using Ubuntu Mate and installed R version 2.11 since the current
> > version does not support R-kinetics.
> > What happens is that I have hdf5 headers and libraries installed
> >
> > $ whereis hdf5
> > hdf5: /usr/include/hdf5
> >
> > But when I'm installing the package inside R (install.packages) or
> > outside R (R CMD INSTALL) the program reports the following error:
> >
> >> install.packages('h5r')
> > Warning in install.packages("h5r") :
>
> There is also a Bioconductor package that provides an HDF5 interface:
>
> http://bioconductor.org/packages/release/bioc/html/rhdf5.html
>
> I have no idea whether its config files might include a more extensive
> search for installed versions of HDF5. Unlike h5r, it is available for all
> three major forks of R.
>
> --
> David.
>
> >  argument 'lib' is missing: using
> > '/home/pedro/R/x86_64-unknown-linux-gnu-library/2.11'
> > Warning message:
> > In getDependencies(pkgs, dependencies, available, lib) :
> >  package ‘h5r’ is not available
> >> install.packages('/home/pedro/h5r_1.4.7.tar.gz', repos=NULL, type='source')
> > Warning in install.packages("/home/pedro/h5r_1.4.7.tar.gz", repos = NULL,  :
> >  argument 'lib' is missing: using
> > '/home/pedro/R/x86_64-unknown-linux-gnu-library/2.11'
> > * installing *source* package ‘h5r’ ...
> > checking for gcc... gcc
> > checking for C compiler default output file name... a.out
> > checking whether the C compiler works... yes
> > checking whether we are cross compiling... no
> > checking for suffix of executables...
> > checking for suffix of object files... o
> > checking whether we are using the GNU C compiler... yes
> > checking whether gcc accepts -g... yes
> > checking for gcc option to accept ISO C89... none needed
> >
> > checking for library containing inflate... -lz
> > checking for library containing H5open... no
> > configure: error: Can't find HDF5
> > ERROR: configuration failed for package ‘h5r’
> > * removing ‘/home/pedro/R/x86_64-unknown-linux-gnu-library/2.11/h5r’
> > Warning message:
> > In install.packages("/home/pedro/h5r_1.4.7.tar.gz", repos = NULL,  :
> >  installation of package '/home/pedro/h5r_1.4.7.tar.gz' had non-zero exit
> > status.
> >
> > I tried to reinstall HDF5, R, and even the whole OS and do it all over
> > again. I also tried to add the library to the environment table and to
> > install from the library's directory, and nothing works.
> >
> > Is there someone who had the same error and was able to solve it?
> > Does someone have a clue how to solve it?
> >
> > I'm sorry if there is a similar post around; I've seen some, but I
> > haven't found one where the problem is solved.
> > Best regards!
> >
> > *Pedro*
>
> David Winsemius
> Alameda, CA, USA

Re: [R] h5r package: cannot find hdf5

2016-12-18 Thread Dominik Schneider
Pedro,
I've only worked with netcdf4, but I imagine your issue is similar to ones
I've had. I think you can either:
1. add your hdf5 lib directory to LD_LIBRARY_PATH:
http://grokbase.com/t/r/r-help/10at4wcjfq/r-ncdf4-package-installation-in-r
2. specify the direct paths to the lib and include directories of the
hdf5 library via configure.args; e.g., for RNetCDF:

install.packages("/home/user/Downloads/RNetCDF_1.6.1-2.tar.gz",
                 repos = NULL,
                 type = "source",
                 dependencies = FALSE,
                 configure.args = "--with-netcdf-include=/usr/local/netcdf-4.2.1-build/include
                 --with-netcdf-lib=/usr/local/netcdf-4.2.1-build/lib")

http://stackoverflow.com/questions/11319698/how-to-install-r-packages-rnetcdf-and-ncdf-on-ubuntu
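For h5r itself, the configure log shows it failing to find H5open, so passing
the standard autoconf variables may be enough. A hedged sketch: configure.vars
is a real install.packages() argument, but the HDF5 paths below are
assumptions you should adjust to your system:

install.packages("/home/pedro/h5r_1.4.7.tar.gz",
                 repos = NULL, type = "source",
                 # paths are assumptions -- point these at your HDF5 install
                 configure.vars = "CPPFLAGS=-I/usr/include/hdf5 LDFLAGS=-L/usr/lib/hdf5")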


On Sat, Dec 17, 2016 at 5:09 PM, Pedro Montenegro  wrote:

> I'm new to R and Linux, and I have an issue I didn't see solved on the
> internet.
>
> I'm using Ubuntu Mate and installed R version 2.11 since the current
> version does not support R-kinetics.
> What happens is that I have hdf5 headers and libraries installed
>
> $ whereis hdf5
> hdf5: /usr/include/hdf5
>
> But when I'm installing the package inside R (install.packages) or outside
> R (R CMD INSTALL) the program reports the following error:
>
> > install.packages('h5r')
> Warning in install.packages("h5r") :
>   argument 'lib' is missing: using
> '/home/pedro/R/x86_64-unknown-linux-gnu-library/2.11'
> Warning message:
> In getDependencies(pkgs, dependencies, available, lib) :
>   package ‘h5r’ is not available
> > install.packages('/home/pedro/h5r_1.4.7.tar.gz',repos=NULL,
> type='source')
> Warning in install.packages("/home/pedro/h5r_1.4.7.tar.gz", repos =
> NULL,  :
>   argument 'lib' is missing: using
> '/home/pedro/R/x86_64-unknown-linux-gnu-library/2.11'
> * installing *source* package ‘h5r’ ...
> checking for gcc... gcc
> checking for C compiler default output file name... a.out
> checking whether the C compiler works... yes
> checking whether we are cross compiling... no
> checking for suffix of executables...
> checking for suffix of object files... o
> checking whether we are using the GNU C compiler... yes
> checking whether gcc accepts -g... yes
> checking for gcc option to accept ISO C89... none needed
>
>
> checking for library containing inflate... -lz
> checking for library containing H5open... no
> configure: error: Can't find HDF5
> ERROR: configuration failed for package ‘h5r’
> * removing ‘/home/pedro/R/x86_64-unknown-linux-gnu-library/2.11/h5r’
> Warning message:
> In install.packages("/home/pedro/h5r_1.4.7.tar.gz", repos = NULL,  :
>   installation of package '/home/pedro/h5r_1.4.7.tar.gz' had non-zero exit
> status.
>
> I tried to reinstall HDF5, R, and even the whole OS and do it all over
> again. I also tried to add the library to the environment table and to
> install from the library's directory, and nothing works.
>
> Is there someone who had the same error and was able to solve it?
> Does someone have a clue how to solve it?
>
> I'm sorry if there is a similar post around; I've seen some, but I haven't
> found one where the problem is solved.
> Best regards!
>
> *Pedro*

Re: [R] Question about using ggplot

2016-11-13 Thread Dominik Schneider
Past versions of ggplot2 accepted family as a direct argument, but the
latest ggplot2 (starting with v2, I believe) requires passing such arguments
through method.args = list(...). The online sources you found that use
family directly were written for an older version of ggplot2.
On Sun, Nov 13, 2016 at 8:18 AM,  wrote:

> Hi. I’m a student from South Korea, and I’m studying R by myself.
> While studying, I ran into trouble with ggplot (specifically, with the
> parameter ‘family’):
> > b <- biopsy
> > b$classn[b$class == "benign"] <- 0
> > b$classn[b$class == "malignant"] <- 1
> > ggplot(b, aes(x = V1, y = classn)) + geom_point(position =
> position_jitter(width = 0.3, height = 0.06), alpha = 0.4, shape = 21, size
> = 1.5) + stat_smooth(method = "glm", family = "binomial") #first code
> Warning: Ignoring unknown parameters: family
>
> I am wondering why there is a warning message. The code also produces a
> wrong logistic model (first attached picture), when in fact I expected the
> second attached picture from the above code.
>
>
> The code below yields the model that I expected (second attached picture):
> ggplot(b, aes(x = V1, y = classn)) + geom_point(position =
> position_jitter(width = 0.3, height = 0.06), alpha = 0.4, shape = 21, size
> = 1.5) + stat_smooth(method = "glm", method.args = list(family =
> "binomial"))
>
>
>
>
> What’s wrong with the first code? When I searched online, other people
> didn’t seem to have a problem using the family parameter.
> (I recently updated ggplot2, MASS, and sjPlot, and double-checked the
> versions; I also tried the code with and without quotes.)
>
>
> I hope to get a good response from you.
> Thanks for reading my mail, and if my message has a rude expression, I’m
> sorry for my bad English skills. :(

Re: [R] Output formatting in PDF

2016-10-11 Thread Dominik Schneider
You may be able to do everything you need with the cowplot package.
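For instance, a minimal sketch (p1 and p2 stand in for your two time-series
plots, and the label text and page size are placeholders to adjust):

library(ggplot2)
library(cowplot)
p1 <- ggplot(); p2 <- ggplot()   # placeholders for your two FX-rate plots
title <- ggdraw() + draw_label("Independent analysis of Exchange Rate dynamics",
                               size = 9)   # reduced-size heading text
text  <- ggdraw() + draw_label("General discussion goes here ...", size = 8)
page  <- plot_grid(title, plot_grid(p1, p2, ncol = 2), text,
                   ncol = 1, rel_heights = c(0.4, 3, 1.5))
ggsave("report.pdf", page, width = 8.27, height = 11.69)  # one A4 page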

On Tue, Oct 11, 2016 at 4:26 AM, Preetam Pal  wrote:

> Hey Enrico,
> LaTeX is not possible, actually.
>
> On Tue, Oct 11, 2016 at 2:29 PM, Enrico Schumann 
> wrote:
>
> > On Tue, 11 Oct 2016, Preetam Pal  writes:
> >
> > > Hi,
> > >
> > > Can you please help me with the following output formatting?
> > > I am planning to include 2 plots and some general description in a
> > > one-page PDF document, such that:
> > >
> > >- I'll leave some appropriate margin on the PDF, say 1.5 inches on
> > >the top, right, bottom and left (will decide based on overall
> > >appearance)
> > >- the 2 plots are placed side-by-side (looks best for comparison)
> > >- the margins for each plot can be 4 lines on the top and the bottom
> > >and 2 lines on the left and the right
> > >- each of the 2 plots would have time (0 to 260) along the x-axis
> > >and two time series (daily USD-GBP and USD-EUR FX rates) on the
> > >y-axis, i.e. 2 time series would be plotted on each of the 2 graphs.
> > >I would need a different color for each series to demarcate them
> > >- I need to add some text (e.g. "Independent analysis of Exchange
> > >Rate dynamics") with reduced font size (not high priority, just good
> > >to have a different size)
> > >- the general discussion (maybe a paragraph) would come right below
> > >the 2 plots; I can specify this text as an argument in a function,
> > >maybe. I am not sure how to arrange the entire PDF as per the format
> > >I mentioned above
> > >
> > > I shall really appreciate any help with this. The time series analysis
> > > is not difficult, I can manage that; however, I don't know how to
> > > manage the formatting part, so that the one-page output looks decently
> > > presentable. Thanks.
> > >
> > > Regards,
> > > Preetam
> >
> > If using LaTeX is an option, I would suggest
> > ?Sweave. There are many tutorials on the web that
> > should get you started.
> >
> >
> > --
> > Enrico Schumann
> > Lucerne, Switzerland
> > http://enricoschumann.net
> >
>
>
>
> --
> Preetam Pal
> (+91)-9432212774
> M-Stat 2nd Year, Room No. N-114
> Statistics Division, C.V. Raman Hall
> Indian Statistical Institute, B.H.O.S.
> Kolkata.


Re: [R] To submit R jobs via SLURM

2016-10-03 Thread Dominik Schneider
I typically call Rscript inside an sbatch file.

*batch_r.sh*
#! /bin/bash

cd /home//aso_regression_project #make sure you change to your
correct working directory
Rscript scripts/run_splitsample-modeling.R


and on the commandline of the login node:
sbatch batch_r.sh


There are lots of SLURM options you can specify; search the SLURM docs for
them. Just add these below the first line and modify as needed:
#SBATCH --qos=
#SBATCH --mem=
#SBATCH --mail-type=begin,end,abort
#SBATCH --mail-user=use...@email.com
#SBATCH --time=
#SBATCH --nodes=
#SBATCH --job-name=myjob
#SBATCH --output=/home//run-%A_%a.Sout #special output filename


Dominik


On Mon, Oct 3, 2016 at 3:03 AM, Sema Atasever  wrote:

> Dear Sir / Madam,
>
> I have an R script file that includes the lines below.
>
> How can I submit this R job via SLURM? Thanks in advance.
>
> *testscript.R*
> data = read.table("seqDist.50", header=FALSE)[-1]
> attach(data)
> d = as.matrix(data)
> library(cluster)
> cluster.pam = pam(d, 6)
> table(cluster.pam$clustering)
>
> filenameclu = paste("outputfile", ".txt")
> write.table(cluster.pam$clustering, file=filenameclu, sep=",")


Re: [R] Faster Subsetting

2016-09-28 Thread Dominik Schneider
I regularly crunch through this amount of data with the tidyverse. You can
also try the data.table package. Both are optimized for speed, as long as
you have the memory.
Dominik
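A minimal sketch of the data.table route, using the tmp and idList objects
from the example below (process_one is a hypothetical stand-in for your set
of per-id functions):

library(data.table)
dt <- as.data.table(tmp)
setkey(dt, id)      # sort by id once; later subsets use binary search, not a scan
dt[.(idList[1])]    # keyed subset for a single id
# or run the per-id step over all ids in one grouped pass:
process_one <- function(d) d            # stand-in for your real functions
res <- dt[, process_one(.SD), by = id]  # .SD is each id's sub-table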

On Wed, Sep 28, 2016 at 10:09 AM, Doran, Harold  wrote:

> I have an extremely large data frame (~13 million rows) that resembles the
> structure of the object tmp below in the reproducible code. In my real
> data, the variable, 'id' may or may not be ordered, but I think that is
> irrelevant.
>
> I have a process that requires subsetting the data by id and then running
> each smaller data frame through a set of functions. One example below uses
> indexing and the other uses an explicit call to subset(), both return the
> same result, but indexing is faster.
>
> The problem is that in my real data, indexing must parse through millions
> of rows to evaluate the condition, and this is expensive and a bottleneck
> in my code. I'm curious whether anyone can recommend an improvement that
> would be less expensive and faster.
>
> Thank you
> Harold
>
>
> tmp <- data.frame(id = rep(1:200, each = 10), foo = rnorm(2000))
>
> idList <- unique(tmp$id)
>
> ### Fast, but not fast enough
> system.time(replicate(500, tmp[which(tmp$id == idList[1]),]))
>
> ### Not fast at all, a big bottleneck
> system.time(replicate(500, subset(tmp, id == idList[1])))




Re: [R] ggplot2: geom_segment does not produce the color I desire?

2016-09-17 Thread Dominik Schneider
ggplot will assign, or map if you will, the color based on the default
color scale when color is specified with the mapping argument such as
mapping = aes(color=...). You have two options:

1. If you want the color of your arrow to be based on a column in your
data, then manually scale the color with scale_colour_manual(values = c("green")):

ggplot() +
  geom_segment(mapping = aes(x = as.Date(test[, "date"]), y = y1,
                             xend = as.Date(test[, "date"]), yend = y2,
                             color = co),
               data = test, arrow = arrow()) +
  scale_colour_manual(values = c("green"))

2. If the color doesn't need to be "mapped" based on your data, then you
can simply specify colour *outside* the aes(), like this:

ggplot() +
  geom_segment(mapping = aes(x = as.Date(test[, "date"]), y = y1,
                             xend = as.Date(test[, "date"]), yend = y2),
               color = "green", data = test, arrow = arrow())

Keep in mind that only the first option will produce a legend, if you need
one.



On Friday, September 16, 2016, John  wrote:

> Hi,
>
> I have a dataset "test". I tried to produce a "green" arrow, but it gives
> a "red" arrow (as attached). Could someone tell me how I can fix it? Thanks,
>
> > test
>         date    co       y1       y2
> 5 2011-11-28 green 196.6559 1.600267
> > dput(test)
> structure(list(date = structure(15306, class = "Date"), co = "green",
> y1 = 196.655872, y2 = 1.600267), .Names = c("date", "co",
> "y1", "y2"), class = "data.frame", row.names = 5L)
> > ggplot()+geom_segment(mapping = aes(x = as.Date(test[,"date"]), y =
> y1, xend = as.Date(test[,"date"]), yend = y2, color=co), data=test,
> arrow=arrow())
> >
>



Re: [R] glmnet vignette question

2016-09-17 Thread Dominik Schneider
> Is there a way to extract MSE for a lambda, e.g. lambda.1se?
Never mind this specific question; it's now obvious. My overall question
stands, however.
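For the archive, a minimal sketch of that extraction, with toy data standing
in for the real predictors and response:

library(glmnet)
set.seed(1)
x <- matrix(rnorm(100 * 15), 100, 15)  # toy stand-in for your N >> p data
y <- rnorm(100)
cvfit <- cv.glmnet(x, y, alpha = 0)    # ridge with 10-fold CV (the default)
cvfit$cvm[cvfit$lambda == cvfit$lambda.1se]  # CV mean-squared error at lambda.1se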

On Fri, Sep 16, 2016 at 10:10 AM, Dominik Schneider <
dominik.schnei...@colorado.edu> wrote:

> I'm doing some linear modeling and am new to the ridge/lasso/elasticnet
> procedures. In my case I have N>>p (p=15 based on variables used in past
> literature and some physical reasoning) so my understanding is that I
> should be interested in ridge regression to avoid the issue of
> multicollinearity of predictors.  Lasso is useful when p>>N.
>
> In the past I have performed step-wise regression with stepAIC in both
> directions to choose my variables and then used VIF to determine if any of
> these variables are correlated. My understanding is that ridge regression
> is a more robust approach for this workflow.
>
> Reading the glmnet_beta vignette, it describes the alpha parameter, where
> alpha=1 is a lasso regression and alpha=0 is a ridge regression. Further
> down, the authors suggest 10-fold cross-validation to determine an alpha
> value and, based on the plots shown, say that alpha=1 does best. However,
> all the models look like they approach the same MSE and alpha=0 is the
> lowest curve for all lambda (but maybe this second point doesn't matter?).
> With my data I get a very similar looking set of curves so I'm trying to
> decide if I should stick with alpha=1 instead of alpha=0. Is there a way to
> extract MSE for a lambda, e.g. lambda.1se?
>
> Any advice or clarification is appreciated. Thanks.
> Dominik


[R] glmnet vignette question

2016-09-17 Thread Dominik Schneider
I'm doing some linear modeling and am new to the ridge/lasso/elasticnet
procedures. In my case I have N>>p (p=15 based on variables used in past
literature and some physical reasoning) so my understanding is that I
should be interested in ridge regression to avoid the issue of
multicollinearity of predictors.  Lasso is useful when p>>N.

In the past I have performed step-wise regression with stepAIC in both
directions to choose my variables and then used VIF to determine if any of
these variables are correlated. My understanding is that ridge regression
is a more robust approach for this workflow.

Reading the glmnet_beta vignette, it describes the alpha parameter, where
alpha=1 is a lasso regression and alpha=0 is a ridge regression. Further
down, the authors suggest 10-fold cross-validation to determine an alpha
value and, based on the plots shown, say that alpha=1 does best. However,
all the models look like they approach the same MSE and alpha=0 is the
lowest curve for all lambda (but maybe this second point doesn't matter?).
With my data I get a very similar looking set of curves so I'm trying to
decide if I should stick with alpha=1 instead of alpha=0. Is there a way to
extract MSE for a lambda, e.g. lambda.1se?

Any advice or clarification is appreciated. Thanks.
Dominik



Re: [R] physical constraint with gam

2016-05-16 Thread Dominik Schneider
Thanks for the clarification!

On Sat, May 14, 2016 at 1:24 AM, Simon Wood  wrote:

> On 12/05/16 02:29, Dominik Schneider wrote:
>
> Hi again,
> I'm looking for some clarification on 2 things.
> 1. On that last note, I realize that s(x1,x2) would be the other obvious
> interaction to compare with - and I see that you recommend te(x1,x2) if
> they are not on the same scale.
>
> - yes that's right, s(x1,x2) gives an isotropic smooth, which is usually
> only appropriate if x1 and x2 are naturally on the same scale.
>
> 2. If s(x1,by=x1) gives you a "parameter" value similar to a GLM when you
> plot s(x1):x1, why does my function above return the same yhat as
> predict(mdl,type='response') ?  Shouldn't each of the terms need to be
> multiplied by the variable value before applying
> rowSums()+attr(sterms,'constant') ??
>
> predict returns s(x1)*x1 (plot.gam just plots s(x1), because in general
> s(x1,by=x2) is not smooth). If you want to get s(x1) on its own you need to
> do something like this:
>
> x2 <- x1 ## copy x1
> m <- gam(y~s(x1,by=x2)) ## model implementing s(x1,by=x1) using copy of x1
> predict(m,data.frame(x1=x1,x2=rep(1,length(x2))),type="terms") ## now
> predicted s(x1)*x2 = s(x1)
>
> best,
> Simon
>
>
> Thanks again
> Dominik
>
> On Wed, May 11, 2016 at 10:11 AM, Dominik Schneider <
> dominik.schnei...@colorado.edu> wrote:
>
>> Hi Simon, Thanks for this explanation.
>> To make sure I understand, another way of explaining the y axis in my
>> original example is that it is the contribution to snowdepth relative to
>> the other variables (the example only had fsca, but my actual case has a
>> couple others). i.e. a negative s(fsca) of -0.5 simply means snowdepth 0.5
>> units below the intercept+s(x_i), where s(x_i) could also be negative in
>> the case where total snowdepth is less than the intercept value.
>>
>> The use of by=fsca is really useful for interpreting the marginal impact
>> of the different variables. With my actual data, the term s(fsca):fsca is
>> never negative, which is much more intuitive. Is it appropriate to compare
>> magnitudes of e.g. s(x1):x1 / mean(x1) and s(x2):x2 / mean(x2), where
>> mean(x_i) is the mean of the actual data?
>>
>> Lastly, how would these two differ: s(x1,by=x2), or
>> s(x1,by=x1)*s(x2,by=x2)? Interactions are surely present and I'm not
>> sure if a linear combination is enough.
>>
>> Thanks!
>> Dominik
>>
>>
>> On Wed, May 11, 2016 at 3:11 AM, Simon Wood < 
>> simon.w...@bath.edu> wrote:
>>
>>> The spline having a positive value is not the same as a glm coefficient
>>> having a positive value. When you plot a smooth, say s(x), that is
>>> equivalent to plotting the line 'beta * x' in a GLM. It is not equivalent
>>> to plotting 'beta'. The smooths in a gam are (usually) subject to
>>> `sum-to-zero' identifiability constraints to avoid confounding via the
>>> intercept, so they are bound to be negative over some part of the covariate
>>> range. For example, if I have a model y ~ s(x) + s(z), I can't estimate the
>>> mean level for s(x) and the mean level for s(z) as they are completely
>>> confounded, and confounded with the model intercept term.
>>>
>>> I suppose that if you want to interpret the smooths as glm parameters
>>> varying with the covariate they relate to then you can do, by setting the
>>> model up as a varying coefficient model, using the `by' argument to 's'...
>>>
>>> gam(snowdepth~s(fsca,by=fsca),data=dat)
>>>
>>>
>>> this model is `snowdepth_i = f(fsca_i) * fsca_i + e_i' . s(fsca,by=fsca)
>>> is not confounded with the intercept, so no constraint is needed or
>>> applied, and you can now interpret the smooth like a local GLM coefficient.
>>>
>>> best,
>>> Simon
>>>
>>>
>>>
>>>
>>> On 11/05/16 01:30, Dominik Schneider wrote:
>>>
>>>> Hi,
>>>> Just getting into using GAM using the mgcv package. I've generated some
>>>> models and extracted the splines for each of the variables and started
>>>> visualizing them. I'm noticing that one of my variables is physically
>>>> unrealistic.
>>>>
>>>> In the example below, my interpretation of the following plot is that the
>>>> y-axis is basically the equivalent of a "parameter" value of a GLM; in GAM
>>>> this value can change as the functional relationship changes between x
>>>> and y.

Re: [R] physical constraint with gam

2016-05-11 Thread Dominik Schneider
Hi again,
I'm looking for some clarification on 2 things.
1. On that last note, I realize that s(x1,x2) would be the other obvious
interaction to compare with - and I see that you recommend te(x1,x2) if
they are not on the same scale.
2. If s(x1,by=x1) gives you a "parameter" value similar to a GLM when you
plot s(x1):x1, why does my function above return the same yhat as
predict(mdl,type='response') ?  Shouldn't each of the terms need to be
multiplied by the variable value before applying
rowSums()+attr(sterms,'constant') ??
Thanks again
Dominik

On Wed, May 11, 2016 at 10:11 AM, Dominik Schneider <
dominik.schnei...@colorado.edu> wrote:

> Hi Simon, Thanks for this explanation.
> To make sure I understand, another way of explaining the y axis in my
> original example is that it is the contribution to snowdepth relative to
> the other variables (the example only had fsca, but my actual case has a
> couple others). i.e. a negative s(fsca) of -0.5 simply means snowdepth 0.5
> units below the intercept+s(x_i), where s(x_i) could also be negative in
> the case where total snowdepth is less than the intercept value.
>
> The use of by=fsca is really useful for interpreting the marginal impact
> of the different variables. With my actual data, the term s(fsca):fsca is
> never negative, which is much more intuitive. Is it appropriate to compare
> magnitudes of e.g. s(x1):x1 / mean(x1) and s(x2):x2 / mean(x2), where
> mean(x_i) is the mean of the actual data?
>
> Lastly, how would these two differ: s(x1,by=x2), or
> s(x1,by=x1)*s(x2,by=x2)? Interactions are surely present and I'm not
> sure if a linear combination is enough.
>
> Thanks!
> Dominik
>
>
> On Wed, May 11, 2016 at 3:11 AM, Simon Wood  wrote:
>
>> The spline having a positive value is not the same as a glm coefficient
>> having a positive value. When you plot a smooth, say s(x), that is
>> equivalent to plotting the line 'beta * x' in a GLM. It is not equivalent
>> to plotting 'beta'. The smooths in a gam are (usually) subject to
>> `sum-to-zero' identifiability constraints to avoid confounding via the
>> intercept, so they are bound to be negative over some part of the covariate
>> range. For example, if I have a model y ~ s(x) + s(z), I can't estimate the
>> mean level for s(x) and the mean level for s(z) as they are completely
>> confounded, and confounded with the model intercept term.
>>
>> I suppose that if you want to interpret the smooths as glm parameters
>> varying with the covariate they relate to then you can do, by setting the
>> model up as a varying coefficient model, using the `by' argument to 's'...
>>
>> gam(snowdepth~s(fsca,by=fsca),data=dat)
>>
>>
>> this model is `snowdepth_i = f(fsca_i) * fsca_i + e_i' . s(fsca,by=fsca)
>> is not confounded with the intercept, so no constraint is needed or
>> applied, and you can now interpret the smooth like a local GLM coefficient.
>>
>> best,
>> Simon
>>
>>
>>
>>
>> On 11/05/16 01:30, Dominik Schneider wrote:
>>
>>> Hi,
>>> Just getting into using GAM using the mgcv package. I've generated some
>>> models and extracted the splines for each of the variables and started
>>> visualizing them. I'm noticing that one of my variables is physically
>>> unrealistic.
>>>
>>> In the example below, my interpretation of the following plot is that the
>>> y-axis is basically the equivalent of a "parameter" value of a GLM; in
>>> GAM
>>> this value can change as the functional relationship changes between x
>>> and
>>> y. In my case, I am predicting snowdepth based on the fractional snow
>>> covered area. In no case will snowdepth realistically decrease for a unit
>>> increase in fsca so my question is: *Is there a way to constrain the
>>> spline
>>> to positive values? *
>>>
>>> Thanks
>>> Dominik
>>>
>>> library(mgcv)
>>> library(dplyr)
>>> library(ggplot2)
>>> extract_splines=function(mdl){
>>>sterms=predict(mdl,type='terms')
>>>datplot=cbind(sterms,mdl$model) %>% tbl_df
>>>datplot$intercept=attr(sterms,'constant')
>>>datplot$yhat=rowSums(sterms)+attr(sterms,'constant')
>>>return(datplot)
>>> }
>>> dat=data_frame(snowdepth=runif(100,min =
>>> 0.001,max=6.7),fsca=runif(100,0.01,.99))
>>> mdl=gam(snowdepth~s(fsca),data=dat)
>>> termdF=extract_splines(mdl)
>>> ggplot(termdF)+
>>>   geom_line(aes(x=fsca,y=`s(fsca)`))

Re: [R] physical constraint with gam

2016-05-11 Thread Dominik Schneider
Hi Simon, Thanks for this explanation.
To make sure I understand, another way of explaining the y axis in my
original example is that it is the contribution to snowdepth relative to
the other variables (the example only had fsca, but my actual case has a
couple others). i.e. a negative s(fsca) of -0.5 simply means snowdepth 0.5
units below the intercept+s(x_i), where s(x_i) could also be negative in
the case where total snowdepth is less than the intercept value.

The use of by=fsca is really useful for interpreting the marginal impact of
the different variables. With my actual data, the term s(fsca):fsca is
never negative, which is much more intuitive. Is it appropriate to compare
magnitudes of e.g. s(x1):x1 / mean(x1) and s(x2):x2 / mean(x2), where
mean(x_i) is the mean of the actual data?

Lastly, how would these two differ: s(x1,by=x2), or s(x1,by=x1)*s(x2,by=x2)?
Interactions are surely present and I'm not sure if a linear combination is
enough.

Thanks!
Dominik


On Wed, May 11, 2016 at 3:11 AM, Simon Wood  wrote:

> The spline having a positive value is not the same as a glm coefficient
> having a positive value. When you plot a smooth, say s(x), that is
> equivalent to plotting the line 'beta * x' in a GLM. It is not equivalent
> to plotting 'beta'. The smooths in a gam are (usually) subject to
> `sum-to-zero' identifiability constraints to avoid confounding via the
> intercept, so they are bound to be negative over some part of the covariate
> range. For example, if I have a model y ~ s(x) + s(z), I can't estimate the
> mean level for s(x) and the mean level for s(z) as they are completely
> confounded, and confounded with the model intercept term.
>
> I suppose that if you want to interpret the smooths as glm parameters
> varying with the covariate they relate to then you can do, by setting the
> model up as a varying coefficient model, using the `by' argument to 's'...
>
> gam(snowdepth~s(fsca,by=fsca),data=dat)
>
>
> this model is `snowdepth_i = f(fsca_i) * fsca_i + e_i' . s(fsca,by=fsca)
> is not confounded with the intercept, so no constraint is needed or
> applied, and you can now interpret the smooth like a local GLM coefficient.
>
> best,
> Simon
>
>
>
>
> On 11/05/16 01:30, Dominik Schneider wrote:
>
>> Hi,
>> Just getting into using GAM using the mgcv package. I've generated some
>> models and extracted the splines for each of the variables and started
>> visualizing them. I'm noticing that one of my variables is physically
>> unrealistic.
>>
>> In the example below, my interpretation of the following plot is that the
>> y-axis is basically the equivalent of a "parameter" value of a GLM; in GAM
>> this value can change as the functional relationship changes between x and
>> y. In my case, I am predicting snowdepth based on the fractional snow
>> covered area. In no case will snowdepth realistically decrease for a unit
>> increase in fsca so my question is: *Is there a way to constrain the
>> spline
>> to positive values? *
>>
>> Thanks
>> Dominik
>>
>> library(mgcv)
>> library(dplyr)
>> library(ggplot2)
>> extract_splines=function(mdl){
>>sterms=predict(mdl,type='terms')
>>datplot=cbind(sterms,mdl$model) %>% tbl_df
>>datplot$intercept=attr(sterms,'constant')
>>datplot$yhat=rowSums(sterms)+attr(sterms,'constant')
>>return(datplot)
>> }
>> dat=data_frame(snowdepth=runif(100,min =
>> 0.001,max=6.7),fsca=runif(100,0.01,.99))
>> mdl=gam(snowdepth~s(fsca),data=dat)
>> termdF=extract_splines(mdl)
>> ggplot(termdF)+
>>geom_line(aes(x=fsca,y=`s(fsca)`))
>>
> --
> Simon Wood, School of Mathematics, University of Bristol BS8 1TW UK
> +44 (0)117 33 18273 http://www.maths.bris.ac.uk/~sw15190
>
>



[R] physical constraint with gam

2016-05-11 Thread Dominik Schneider
Hi,
Just getting into using GAM using the mgcv package. I've generated some
models and extracted the splines for each of the variables and started
visualizing them. I'm noticing that one of my variables is physically
unrealistic.

In the example below, my interpretation of the following plot is that the
y-axis is basically the equivalent of a "parameter" value of a GLM; in GAM
this value can change as the functional relationship changes between x and
y. In my case, I am predicting snowdepth based on the fractional snow
covered area. In no case will snowdepth realistically decrease for a unit
increase in fsca so my question is: *Is there a way to constrain the spline
to positive values? *

Thanks
Dominik

library(mgcv)
library(dplyr)
library(ggplot2)
extract_splines=function(mdl){
  sterms=predict(mdl,type='terms')
  datplot=cbind(sterms,mdl$model) %>% tbl_df
  datplot$intercept=attr(sterms,'constant')
  datplot$yhat=rowSums(sterms)+attr(sterms,'constant')
  return(datplot)
}
dat=data_frame(snowdepth=runif(100,min =
0.001,max=6.7),fsca=runif(100,0.01,.99))
mdl=gam(snowdepth~s(fsca),data=dat)
termdF=extract_splines(mdl)
ggplot(termdF)+
  geom_line(aes(x=fsca,y=`s(fsca)`))
