Re: [R] unique dates per ID

2016-11-14 Thread Jim Lemon
Hi Farnoosh,
Try this:

newdf<-NULL
for(id in unique(df$Subject)) {
 # rows belonging to this subject
 whichsub<-df$Subject==id
 # keep the first row for each date; rbind(NULL, x) just returns x
 newdf<-rbind(newdf,df[whichsub,][!duplicated(df$dates[whichsub]),])
}
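
For what it's worth, the same result can probably be had without a loop. This is only a sketch built on the example data from the original post; `dedup` is a name made up here, and it keeps the first department seen for each Subject/date pair, which the poster said was acceptable.

```r
# Example data copied from the original question.
df <- data.frame(
  Subject = c("2", "2", "2", "3", "3", "3", "4", "4", "5", "5", "5", "5"),
  dates = c("2011-01-01", "2011-01-01", "2011-01-03", "2011-01-04",
            "2011-01-05", "2011-01-06", "2011-01-07", "2011-01-07",
            "2011-01-09", "2011-01-10", "2011-01-11", "2011-01-11"),
  deps = c("A", "B", "CC", "C", "CC", "A", "F", "DD", "A", "F", "FF", "D")
)

# duplicated() on the two key columns flags every repeat of a
# Subject/date pair; negating it keeps the first row of each pair.
dedup <- df[!duplicated(df[c("Subject", "dates")]), ]
dedup
```

Checking both columns together matters when the same date can appear under different subjects.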

Jim


On Tue, Nov 15, 2016 at 9:38 AM, Farnoosh Sheikhi via R-help
 wrote:
> Hi,
> I have a data set like below:
> Subject <- c("2", "2", "2", "3", "3", "3", "4", "4", "5", "5", "5", "5")
> dates <- c("2011-01-01", "2011-01-01", "2011-01-03", "2011-01-04",
>   "2011-01-05", "2011-01-06", "2011-01-07", "2011-01-07", "2011-01-09",
>   "2011-01-10", "2011-01-11", "2011-01-11")
> deps <- c("A", "B", "CC", "C", "CC", "A", "F", "DD", "A", "F", "FF", "D")
> df <- data.frame(Subject, dates, deps); df
> I want to choose unique dates per ID, so that there are no duplicate dates
> per ID. I don't mind which department is picked. I really appreciate any help.
> Best, Farnoosh
>
>
> [[alternative HTML version deleted]]
>
> __
> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.



Re: [R] unique dates per ID

2016-11-14 Thread Ulrik Stervbo
Hi Farnoosh,

You can use unique() from base R or distinct() from the dplyr package.

Best
Ulrik

On Tue, 15 Nov 2016 at 06:59 Farnoosh Sheikhi via R-help <
r-help@r-project.org> wrote:

> Hi,
> I have a data set like below:
> Subject <- c("2", "2", "2", "3", "3", "3", "4", "4", "5", "5", "5", "5")
> dates <- c("2011-01-01", "2011-01-01", "2011-01-03", "2011-01-04",
>   "2011-01-05", "2011-01-06", "2011-01-07", "2011-01-07", "2011-01-09",
>   "2011-01-10", "2011-01-11", "2011-01-11")
> deps <- c("A", "B", "CC", "C", "CC", "A", "F", "DD", "A", "F", "FF", "D")
> df <- data.frame(Subject, dates, deps); df
> I want to choose unique dates per ID, so that there are no duplicate dates
> per ID. I don't mind which department is picked. I really appreciate any help.
> Best, Farnoosh


[R] unique dates per ID

2016-11-14 Thread Farnoosh Sheikhi via R-help
Hi,
I have a data set like below:
Subject <- c("2", "2", "2", "3", "3", "3", "4", "4", "5", "5", "5", "5")
dates <- c("2011-01-01", "2011-01-01", "2011-01-03", "2011-01-04",
  "2011-01-05", "2011-01-06", "2011-01-07", "2011-01-07", "2011-01-09",
  "2011-01-10", "2011-01-11", "2011-01-11")
deps <- c("A", "B", "CC", "C", "CC", "A", "F", "DD", "A", "F", "FF", "D")
df <- data.frame(Subject, dates, deps); df
I want to choose unique dates per ID, so that there are no duplicate dates per
ID. I don't mind which department is picked. I really appreciate any help.
Best, Farnoosh



Re: [R] [FORGED] Re: [FORGED] How to remove box in Venn plots (Vennerable package, uses grid) - similar to bty="n" in standard plots

2016-11-14 Thread Paul Murrell

Hi

Glad I could help.

Here's a way you could get rid of the rectangle in the first place ...

library(Vennerable)
groups<-list(set1=1:100, set2=80:120)
V<-Venn(groups)
C<-compute.Venn(V)
X11(w=7,h=7)
grid.newpage()

plot(C, show=list(Universe=FALSE))

It required crawling through the 'Vennerable' source code a bit to find 
that 'show' argument, but that appears to do the trick.


Paul

On 15/11/16 02:08, DE LAS HERAS Jose wrote:

Hi,


The grid.ls() and grid.remove() approach worked beautifully to remove
the box, thank you! Because the box is the first thing to be drawn, it
is the first object shown by grid.ls(), so I can easily add a line of
code to automatically remove the box. Result!
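
For readers who have not used it, the grid.ls() / grid.remove() idea can be sketched with plain grid, without Vennerable. The grob names "box" and "disc" below are made up for the illustration; the names Vennerable assigns to its own grobs will be different and have to be read off the grid.ls() listing.

```r
library(grid)

grid.newpage()
# Two named grobs: a frame we want to drop and a circle we want to keep.
grid.rect(width = 0.9, height = 0.9, name = "box")
grid.circle(r = 0.3, name = "disc")

# List the grobs on the current page by name.
grid.ls()

# Remove the frame by name; the page is redrawn without it.
grid.remove("box")
```

With a real Venn plot the same two calls apply, only the name passed to grid.remove() comes from whatever grid.ls() reports for the rectangle.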


Although I'd still like to know how one chooses not to plot that box in
the first place. I really must study a little the grid package. I've
survived using base R plots and they work very nicely, but it looks like
you can do a lot of cool stuff with grid.


In case you might know the answer and feel like adding a comment here
(I'm already very happy with the grid.ls() approach, thanks! :)) this is
a simple code example without making it look pretty or anything:




library(Vennerable)
groups<-list(set1=1:100, set2=80:120)
V<-Venn(groups)
C<-compute.Venn(V)
X11(w=7,h=7)
grid.newpage()
plot(C)

class(C)
[1] "VennDrawing"
attr(,"package")
[1] "Vennerable"

I don't want the black box around the diagram.
I was able to overwrite it with a white box like this:

vp=viewport(x=0.5, y=0.5, width=0.95, height=0.75)
pushViewport(vp)
grid.rect(gp=gpar(lty=1, col="white", lwd=15))
upViewport() # needs to be executed to return focus upwards

but as the box has different dimensions depending on the actual sets
being drawn, the width and height must be found empirically each time. A
bit tedious if you need to produce a bunch of figures at once.

I was wondering what parameter I could include in the call to 'plot'
that would prevent the box from being drawn. Usually this is achieved
with bty="n", but the method for plotting a "VennDrawing" structure uses
the grid package and I'm lost there at the moment.

Thank you for your help again, grid.ls() etc is a very cool and
flexible approach

Jose





*From:* Paul Murrell 
*Sent:* 13 November 2016 19:57
*To:* DE LAS HERAS Jose; R-help@r-project.org
*Subject:* Re: [FORGED] [R] How to remove box in Venn plots (Vennerable
package, uses grid) - similar to bty="n" in standard plots

Hi

Can you supply some example code?

You might get some joy from grid.ls() to identify the box followed by
grid.remove() to get rid of it;  some example code would allow me to
provide more detailed advice.

Paul

On 12/11/16 05:12, DE LAS HERAS Jose wrote:

I'm using the package Vennerable to make Venn diagrams, but it always
makes a box around the diagram.

Using standard R plots I could eliminate that by indicating


bty="n"


but it seems Vennerable uses the Grid package to generate its plots
and I'm not really familiar enough with Grid. I was looking at the
documentation but I can't seem to find a way to achieve that. I'd
even be happy drawing a white rectangle with wide lines to overplot
the box, but there must be a way to not draw the box in the first
place.


Anybody knows how?


Jose

--

Dr. Jose I. de las Heras The Wellcome Trust Centre for Cell Biology
Swann Building Max Born Crescent University of Edinburgh Edinburgh
EH9 3BF UK

Phone: +44 (0)131 6507090

Fax:  +44 (0)131 6507360



The University of Edinburgh is a charitable body, registered in
Scotland, with registration number SC005336.






--
Dr Paul Murrell
Department of Statistics
The University of Auckland
Private Bag 92019
Auckland
New Zealand
64 9 3737599 x85392
p...@stat.auckland.ac.nz
http://www.stat.auckland.ac.nz/~paul/







[R] Revolutions blog: October 2016 roundup

2016-11-14 Thread David Smith via R-help
Since 2008, Microsoft (formerly Revolution Analytics) staff and guests have 
written about R every weekday at the
Revolutions blog: http://blog.revolutionanalytics.com
and every month I post a summary of articles from the previous month of 
particular interest to readers of r-help.

And in case you missed them, here are some articles related to R from the month 
of October:

A brief summary of the R 3.3.2 release: 
http://blog.revolutionanalytics.com/2016/10/r-332-now-available.html

"Data Science with SQL Server 2016", a free E-book featuring several in-depth R 
examples, is now available for download:
http://blog.revolutionanalytics.com/2016/10/data-science-with-sql-server-2016.html

The ReporteRs package makes it easy to insert R output, tables and graphics 
into Word and Powerpoint templates:
http://blog.revolutionanalytics.com/2016/10/reporters.html

R-hub, an on-line service to build and check R packages on multiple platforms, 
is now in public beta test:
http://blog.revolutionanalytics.com/2016/10/r-hub-public-beta.html

A style guide for R programs from Graham Williams, creator of rattle:
http://blog.revolutionanalytics.com/2016/10/sharing-r-code-with-style.html

The Economist used R and the Emotion API to track emotions of the US 
presidential candidates during the debates:
http://blog.revolutionanalytics.com/2016/10/debate-emotions.html

A new R Graph Gallery by Yan Holtz contains hundreds of data charts and their R 
code:
http://blog.revolutionanalytics.com/2016/10/the-r-graph-gallery-is-back.html

R Tools for Visual Studio 0.5 adds support for publishing R code as a SQL 
Server stored procedure:
http://blog.revolutionanalytics.com/2016/10/rtvs-05-now-available.html

After an accident, a data scientist estimates the value of a written-off 
vehicle with R:
http://blog.revolutionanalytics.com/2016/10/car-valuation.html

The "Team Data Science Process" and two new open-source projects from 
Microsoft: a visualization and exploration
framework; and a statistical reporting tool based on caret:
http://blog.revolutionanalytics.com/2016/10/the-team-data-science-process.html

An R function for "tilegrams", like US maps with states scaled to electoral 
college votes:
http://blog.revolutionanalytics.com/2016/10/tilegrams-in-r.html

Upcoming data science courses in Zurich, Oslo and Stockholm:
http://blog.revolutionanalytics.com/2016/10/practical-data-science.html

A tutorial on using R on Spark with SparkR, sparklyr, and RevoScaleR:
http://blog.revolutionanalytics.com/2016/10/tutorial-scalable-r-on-spark.html

An animated globe showing the impact of climate change, created with R:
http://blog.revolutionanalytics.com/2016/10/warming-globe.html

The ggiraph package makes it easy to add interactivity to ggplot2 graphics on 
the web:
http://blog.revolutionanalytics.com/2016/10/make-ggplot-graphics2-interactive-with-ggiraph.html

The haven package supports reading SAS, SPSS, Stata and other data file formats 
into R:
http://blog.revolutionanalytics.com/2016/10/import-data-to-r-from-other-statistics-tools-with-haven.html

More than half of published papers in Psychology contain at least one 
statistical reporting error, the statcheck package
reveals: http://blog.revolutionanalytics.com/2016/10/statcheck.html

Build data pipelines with Azure Data Factory and Microsoft R Server:
http://blog.revolutionanalytics.com/2016/10/r-server-data-factory.html

R used to analyze the scripts of "The Simpsons", and create a chart in the 
cartoon's unique style:
http://blog.revolutionanalytics.com/2016/10/homer-not-bart-is-the-star-of-the-simpsons.html

General interest stories (not related to R) in the past month included: rules 
for rulers
(http://blog.revolutionanalytics.com/2016/10/because-its-friday-dictators.html),
 a Hitchcock-Kubrick video mashup
(http://blog.revolutionanalytics.com/2016/10/because-its-friday-hitchcock-vs-kubrick.html),
 the Earth from the Moon
(http://blog.revolutionanalytics.com/2016/10/because-its-friday-earthrise.html),
 and the Dear Data project
(http://blog.revolutionanalytics.com/2016/10/because-its-friday-dear-data.html).

If you're looking for more articles about R, you can find summaries from 
previous months at
http://blog.revolutionanalytics.com/roundups/. You can receive daily blog posts 
via email using services like
blogtrottr.com.

As always, thanks for the comments and please keep sending suggestions to me at 
david...@microsoft.com or via Twitter
(I'm @revodavid).

Cheers,
# David

-- 
David M Smith 
R Community Lead, Microsoft  
Tel: +1 (312) 9205766 (Chicago IL, USA)
Twitter: @revodavid | Blog:  http://blog.revolutionanalytics.com



Re: [R] Frequency of a character in a string

2016-11-14 Thread Hervé Pagès



On 11/14/2016 12:44 PM, Bert Gunter wrote:

(Sheepishly)...

Yes, thank you Hervé. It would have been nice if I had given correct
solutions. Of course fixed = TRUE could not have worked with the "[^a]"
character class!
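
To make the point concrete, here is a small sketch on a toy string (the string "banana" and the variable names are mine, not from the thread). With fixed = TRUE the pattern is taken literally, so nothing is removed and nchar() just reports the full length.

```r
x <- "banana"

# fixed = TRUE: "[^a]" is a literal four-character string that does not
# occur in x, so gsub() removes nothing.
literal_len <- nchar(gsub("[^a]", "", x, fixed = TRUE))

# fixed = FALSE: "[^a]" is a character class matching every non-"a"
# character, so only the "a"s survive.
regex_count <- nchar(gsub("[^a]", "", x, fixed = FALSE))

literal_len   # 6, the length of x, not a count of "a"
regex_count   # 3
```

That explains why the earlier fixed = TRUE timing looked so good: it was doing no real work.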

Here's what I found with a 10 element vector each member of which is a
1e5 length string:


system.time((lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))

   user  system elapsed
  0.013   0.000   0.013


system.time(nchar(gsub("[^a]", "", x,fixed = FALSE)))

   user  system elapsed
  0.251   0.000   0.252
## WAY slower



system.time(nchar(x) - nchar(gsub("a", "", x,fixed = TRUE)))

   user  system elapsed
  0.007   0.000   0.007
## twice as fast



Clearly and unsurprisingly, the message is to avoid fixed = FALSE;
after that, it seems mostly to be: who cares?!


Another message is to pay attention to the "cost" of generating
big intermediate objects like the list returned by strsplit(). On a
big character vector made of 5000 strings of about 1e5 random letters
each, the strsplit-based solution uses more than 2Gb of RAM on my
Ubuntu system. The gsub( , fixed=TRUE) solution uses less than 1Gb.

Cheers,
H.




Cheers,
Bert



Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Nov 14, 2016 at 12:26 PM, Hervé Pagès  wrote:

Hi,

FWIW using gsub( , fixed=TRUE) is faster than using gsub( , fixed=FALSE)
or strsplit( , fixed=TRUE):

  set.seed(1)
  Vec <- paste(sample(letters, 500, replace = TRUE), collapse = "")

  system.time(res1 <- nchar(gsub("[^a]", "", Vec)))
  #  user  system elapsed
  # 0.585   0.000   0.586

  system.time(res2 <- lengths(strsplit(Vec,"a",fixed=TRUE)) - 1L)
  #  user  system elapsed
  # 0.061   0.000   0.061

  system.time(res3 <- nchar(Vec) - nchar(gsub("a", "", Vec, fixed=TRUE)))
  #  user  system elapsed
  # 0.039   0.000   0.039

  identical(res1, res2)
  # [1] TRUE
  identical(res1, res3)
  # [1] TRUE

The gsub( , fixed=TRUE) solution also uses slightly less memory than the
strsplit( , fixed=TRUE) solution.
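
In case it is useful, the gsub( , fixed=TRUE) idiom wraps up naturally as a small vectorised helper. This is just a sketch; `count_char` is a hypothetical name and it assumes a single character is being counted.

```r
# Count occurrences of a single character `ch` in each element of `x`
# by measuring how much shorter each string gets once `ch` is deleted.
count_char <- function(x, ch) {
  stopifnot(length(ch) == 1L, nchar(ch) == 1L)
  nchar(x) - nchar(gsub(ch, "", x, fixed = TRUE))
}

counts <- count_char(c("banana", "xyz", ""), "a")
counts   # 3 0 0
```

No list is built along the way, which is presumably why this form is also the lightest on memory.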

Cheers,
H.


On 11/14/2016 11:55 AM, Charles C. Berry wrote:


On Mon, 14 Nov 2016, Marc Schwartz wrote:




On Nov 14, 2016, at 11:26 AM, Charles C. Berry  wrote:

On Mon, 14 Nov 2016, Bert Gunter wrote:


[stuff deleted]


Hi,

Both gsub() and strsplit() are using regex based pattern matching
internally. That being said, they are ultimately calling .Internal
code, so both are pretty fast.

For comparison:

## Create a 1,000,000 character vector
set.seed(1)
Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")


nchar(Vec)


[1] 1000000

## Split the vector into single characters and tabulate


table(strsplit(Vec, split = "")[[1]])



    a     b     c     d     e     f     g     h     i     j     k     l
38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
    m     n     o     p     q     r     s     t     u     v     w     x
38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
    y     z
38265 38299


## Get just the count of "a"


table(strsplit(Vec, split = "")[[1]])["a"]


   a
38664


nchar(gsub("[^a]", "", Vec))


[1] 38664


## Check performance


system.time(table(strsplit(Vec, split = "")[[1]])["a"])


  user  system elapsed
 0.100   0.007   0.107


system.time(nchar(gsub("[^a]", "", Vec)))


  user  system elapsed
 0.270   0.001   0.272


So, the above would suggest that using strsplit() is somewhat faster
than using gsub(). However, as Chuck notes, in the absence of more
exhaustive benchmarking, the difference may or may not be more
generalizable.




Whether splitting on fixed strings rather than treating them as
regex'es (i.e.`fixed=TRUE') makes a big difference seems to depend on
what you split:

First repeating what Marc did...


system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])


   user  system elapsed
  0.132   0.010   0.139


system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])


   user  system elapsed
  0.130   0.010   0.138

... fixed=TRUE hardly matters. But the idiom I proposed...


system.time(sum(lengths(strsplit(paste0("X", Vec,
"X"),"a",fixed=TRUE)) - 1))


   user  system elapsed
  0.017   0.000   0.018


system.time(sum(lengths(strsplit(paste0("X", Vec,
"X"),"a",fixed=FALSE)) - 1))


   user  system elapsed
  0.104   0.000   0.104





... is 5 times faster with fixed=TRUE for this case.

This result matches Marc's count:


sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)


[1] 38664
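
A side note on why the idiom pads the string with "X": strsplit() drops a trailing empty piece, so a string that ends in the target character is undercounted without the padding. A toy sketch (the string "bca" is just an example of mine):

```r
s <- "bca"

# Without padding: the split yields only "bc", so the count comes out as 0.
unpadded <- lengths(strsplit(s, "a", fixed = TRUE)) - 1

# With padding: "XbcaX" splits into "Xbc" and "X", giving the right count.
padded <- lengths(strsplit(paste0("X", s, "X"), "a", fixed = TRUE)) - 1

unpadded   # 0, misses the final "a"
padded     # 1
```

Strictly only the trailing "X" is needed, but padding both ends does no harm.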





Chuck




--
Hervé Pagès

Program in Computational 

Re: [R] Issues with the way Apply handled NA's

2016-11-14 Thread David L Carlson
This behavior is documented in the manual page:

> prod(NULL)
[1] 1

You can check for an empty vector as follows:

plabor <- structure(list(colA = c(6, NA, 3, 4), colB = c(25, NA, 2, 7), 
colC = c(3, NA, 19, NA)), .Names = c("colA", "colB", "colC"),
class = "data.frame", row.names = c(NA, -4L))
# Use dput() to send data to the list

plabor$colD = apply(plabor[c("colA","colB","colC")], 1, prod, na.rm=TRUE)
# Use TRUE and FALSE since the abbreviations T and F are not reserved and
# could be redefined

vals <- apply(plabor[c("colA","colB","colC")],1,function(x) length(na.omit(x)))
vals
# [1] 3 0 3 2
plabor$colD <- ifelse(vals>0, plabor$colD, NA)
plabor

#   colA colB colC colD
# 1    6   25    3  450
# 2   NA   NA   NA   NA
# 3    3    2   19  114
# 4    4    7   NA   28
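
If preferred, the two steps can be folded into one small function handed to apply(). This is only one possible sketch; `prod_or_na` is a name invented here.

```r
plabor <- data.frame(colA = c(6, NA, 3, 4),
                     colB = c(25, NA, 2, 7),
                     colC = c(3, NA, 19, NA))

# Return NA when every value in the row is missing; otherwise take the
# product of the values that are present.
prod_or_na <- function(x) {
  if (all(is.na(x))) NA_real_ else prod(x, na.rm = TRUE)
}

plabor$colD <- apply(plabor[c("colA", "colB", "colC")], 1, prod_or_na)
plabor
```

The all-NA row then comes out as NA instead of the empty product 1, while partially missing rows are still multiplied over the values they have.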

-
David L Carlson
Department of Anthropology
Texas A University
College Station, TX 77840-4352


-Original Message-
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Olu Ola via 
R-help
Sent: Monday, November 14, 2016 2:52 PM
To: R-help Mailing List
Subject: [R] Issues with the way Apply handled NA's

Hello, I have a data set called plabor that has the following format:

| ColA | ColB | Colc |
| 6 | 25 | 3 |
| NA | NA | NA |
| 3 | 2 | 19 |
| 4 | 7 | NA |


I wanted to find the product of the three columns for each of the rows and I 
used the apply function as follows:
plabor$colD = apply(plabor[c("colA","colB","colc")],1,prod,na.rm=T)
The results are as follows:

| ColA | ColB | Colc | colD |
| 6 | 25 | 3 | 450 |
| NA | NA | NA | 1 |
| 3 | 2 | 19 | 114 |
| 4 | 7 | NA | 28 |


The second row's result is 1 instead of being ignored.
How do I deal with this issue, given that I do not want to exclude the data 
points that are all NA?
Regards


Re: [R] Principle Component Analysis: Ranking Animal Size Based On Combined Metrics

2016-11-14 Thread Sidoti, Salvatore A.
Fascinating! So it appears that I can simply take the geometric mean of all 4 
metrics (unscaled), including weight, then designate that value as a relative 
measure of "size" within my sample population. The justification for using the 
geometric mean is shown by the high correlation between PC1 and the size values:

           pc1         gm
pc1  1.0000000 -0.8458024
gm  -0.8458024  1.0000000

Pearson's product-moment correlation

data:  pc1 and gm
t = -10.869, df = 47, p-value = 2.032e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9104585 -0.7407939
sample estimates:
       cor 
-0.8458024

Salvatore A. Sidoti
PhD Student
Behavioral Ecology

-Original Message-
From: David L Carlson [mailto:dcarl...@tamu.edu] 
Sent: Monday, November 14, 2016 11:07 AM
To: Sidoti, Salvatore A. ; Jim Lemon 
; r-help mailing list 
Subject: RE: [R] Principle Component Analysis: Ranking Animal Size Based On 
Combined Metrics

The first principal component should be your estimate of "size" since it 
captures the correlations between all 4 variables. The second principal 
component must be orthogonal to the first so that if the first is "size", the 
second pc is independent of size, perhaps some measure of "shape". As would be 
expected, the first principal component is highly correlated with the geometric 
mean of the three linear measurements and moderately correlated with weight:

> gm <- apply(df[, -1], 1, prod)^(1/3)
> pc1 <- prcomp(df, scale.=TRUE)$x[, 1]
> plot(pc1, gm)
> cor(cbind(pc1, gm, wgt=df$weight))
           pc1         gm        wgt
pc1  1.0000000 -0.9716317 -0.5943594
gm  -0.9716317  1.0000000  0.3967369
wgt -0.5943594  0.3967369  1.0000000
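
As an aside, the row-wise geometric mean can be written either as a product raised to 1/k or as the exponential of the mean log, and the latter is usually the safer form when there are many columns. The sketch below uses made-up measurements purely for illustration (the matrix `m` is hypothetical, not the poster's data).

```r
# Hypothetical measurements for three animals (made-up numbers).
m <- cbind(interoc = c(2, 3, 4),
           clength = c(8, 9, 16),
           cwidth  = c(4, 8, 2))

# Product form, as in the code above: k-th root of the row product.
gm_prod <- apply(m, 1, prod)^(1 / ncol(m))

# Log form: exp of the row mean of the logs (needs strictly positive data).
gm_log <- exp(rowMeans(log(m)))

gm_prod
gm_log
```

Both give the same values up to floating-point error; the first two rows here work out to 4 and 6.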

-
David L Carlson
Department of Anthropology
Texas A&M University
College Station, TX 77840-4352

-Original Message-
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Sidoti, 
Salvatore A.
Sent: Sunday, November 13, 2016 7:38 PM
To: Jim Lemon; r-help mailing list
Subject: Re: [R] Principle Component Analysis: Ranking Animal Size Based On 
Combined Metrics

Hi Jim,

Nice to see you again! First of all, apologies to all for bending the rules a 
bit with respect to the mailing list. I know this is a list for R programming 
specifically, and I have received some great advice in this regard in the past. 
I just thought this was an interesting applied problem that would generate some 
discussion about PCA in R.

Yes, that is an excellent question! Indeed, why not just volume? Since this is 
still a work in progress and we have not published as of yet, I would rather 
not be more specific about the type of animal at this time ;>}. Nonetheless, I 
can say that the animals I study change "size" depending on their feeding and 
hydration state. The abdomen in particular undergoes drastic size changes. That 
being said, there are key anatomical features that remain fixed in the adult.

Now, there *might* be a way to work volume into the PCA. Volume is not 
a reliable metric while the animal is alive, since the abdomen size is so 
changeable, but what about preserved specimens? I have many that have been 
marinating in ethanol for months. Wouldn't the tissues have equilibrated by 
now? Probably... I could measure volume by displacement or suspension, I 
suppose.

In the meantime, here's a few thoughts:

1)  Use the contribution % (known as C% hereafter) of each variable on 
principal components 1 and 2.

2)  The total contribution of a variable that explains the variations 
retained by PC1 and PC2 is calculated by:

sum(C%1 * eigenvalue1, C%2 * eigenvalue2)

3) Use scale() to mean-center the columns of the data set.

4) Use these total contributions as the weights of an arithmetic mean.

For example, we have an animal with the following data (mean-centered):
weight:   1.334
interoc: -0.225
clength:  0.046
cwidth:  -0.847

The contributions of these variables on PC1 and PC2 are (% changed to 
proportions):
weight:  0.556
interoc: 0.357
clength: 0.493
cwidth:  0.291

To calculate size:
1.334(0.556) - 0.225(0.357) + 0.046(0.493) - 0.847(0.291) = 0.43758
Then divide by the sum of the weights:
0.43758 / 1.697 = 0.257855 = "animal size"
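
The arithmetic above is just a weighted mean, so in R it reduces to a single call to weighted.mean(). A sketch using the numbers from this example (the vector names `z` and `w` are mine):

```r
# Mean-centered measurements for the example animal.
z <- c(weight = 1.334, interoc = -0.225, clength = 0.046, cwidth = -0.847)

# Contributions of each variable, used as weights.
w <- c(weight = 0.556, interoc = 0.357, clength = 0.493, cwidth = 0.291)

# sum(z * w) / sum(w)
size <- weighted.mean(z, w)
size
```

This reproduces the value of about 0.2579 worked out by hand.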

This value can then be used to rank the animal according to its size for 
further analysis...

Does this sound like a reasonable application of my PCA data?

Salvatore A. Sidoti
PhD Student
Behavioral Ecology

-Original Message-
From: Jim Lemon [mailto:drjimle...@gmail.com]
Sent: Sunday, November 13, 2016 3:53 PM
To: Sidoti, Salvatore A. ; r-help mailing list 

Subject: Re: [R] Principle Component Analysis: Ranking Animal Size Based On 
Combined Metrics

Hi Salvatore,
If by "size" you mean volume, why not directly measure the volume of your 
animals? They 

Re: [R] Question about expression parser for "return" statement

2016-11-14 Thread Wolf, Steven
Just to add on a bit, please note that the return is superfluous.  If you write 
this:


normalDensityFunction = function(x, Mean, Variance) {

# no  "return" value given at all

(1/sqrt(2*pi*Variance))*exp(-(1/2)*((x - Mean)^2)/Variance)

}

normalDensityFunction(2,0,1)

...you get the right answer again.

This is not "best practices", and Duncan will probably give you 10 reasons why 
you should never do it this way.  But if the parentheses behavior bothers you 
enough, you can subvert it.  This probably won't work so well if you try to 
make any more complicated output.

Caveat Emptor.

-SW


--
Steven Wolf, PhD
Assistant Professor
Department of Physics
STEM CoRE -- STEM Collaborative for Research in Education
http://www.ecu.edu/cs-acad/aa/StemCore
East Carolina University
Phone: 252-737-5229




On Sun, 2016-11-13 at 13:35 -0500, Duncan Murdoch wrote:

On 13/11/2016 7:58 AM, Duncan Murdoch wrote:


On 13/11/2016 6:47 AM, Duncan Murdoch wrote:


On 13/11/2016 12:50 AM, Dave DeBarr wrote:


I've noticed that if I don't include parentheses around the intended return
value for the "return" statement, R will assume the first parenthetical
expression is the intended return value ... even if that parenthetical
expression is only part of a larger expression.

Is this intentional?



Yes, return is just a function call that has side effects.  As far as
the parser is concerned,

return ((1/sqrt(2*pi*Variance))*exp(-(1/2)*((x - Mean)^2)/Variance))

is basically the same as

f((1/sqrt(2*pi*Variance))*exp(-(1/2)*((x - Mean)^2)/Variance))
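
A cut-down illustration of the same parse, using toy functions whose names are invented here:

```r
# Parsed as (return(x + 1)) * 2: return() exits the function with x + 1,
# so the "* 2" is never evaluated.
f_bare <- function(x) {
  return (x + 1) * 2
}

# The whole expression is inside the call, so it is what gets returned.
f_wrapped <- function(x) {
  return((x + 1) * 2)
}

f_bare(1)     # 2
f_wrapped(1)  # 4
```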



By the way, out of curiosity I took a look at the source of CRAN
packages to see if this actually occurs.  It turns out that "return" is
used as a variable name often enough to make automatic tests tricky, so
I don't know the answer to my question.  However, I did turn up a number
of cases where people have code like this:

 if (name == "") return;

(from the bio.infer package), which never calls return(), so doesn't
actually do what the author likely intended
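
To see the effect in isolation, here is a small sketch with hypothetical function names (not the bio.infer code):

```r
# `return` without parentheses just evaluates the function object and
# discards it, so execution carries on to the next expression.
check_name <- function(name) {
  if (name == "") return
  "fell through"
}

# With parentheses the function really does exit early.
check_name_fixed <- function(name) {
  if (name == "") return(invisible(NULL))
  "fell through"
}

check_name("")        # "fell through"
check_name_fixed("")  # NULL, returned invisibly
```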



I searched the R sources and the sources of CRAN packages, and found
this is a reasonably common problem:  it's in 111 packages, including
one in base R.  I'll be emailing the maintainers to let them know.

I'll see about putting a check for this into R CMD check.

Duncan Murdoch






[R] I have a python API script that works and would like to translate it to R

2016-11-14 Thread Alemu Tadesse
Hi R-Geeks,

I have a Python REST API script that works very well. I am learning R and
would like to translate it to R. I am wondering if there is someone who
uses APIs, knows both languages (Python and R) and is willing to help me, so
that I can share the Python script.

Thanks,

AT



Re: [R] Zenga - inequality index - Do you know any package to compute it?

2016-11-14 Thread Jorge Cimentada
A simple google search directs me to the 'convey' package in CRAN which has
a function called svyzenga. Here
 for more
details. Maybe that's what you want.



[R] Zenga - inequality index - Do you know any package to compute it?

2016-11-14 Thread fe_jasa

-- 
><><><><><><><><><><><><><><>
João Sousa Andrade
jasa04011...@gmail.com
><><><><><><><><><><><><><><>







Re: [R] [FORGED] Issues with the way Apply handled NA's

2016-11-14 Thread Rolf Turner

On 15/11/16 09:52, Olu Ola via R-help wrote:

Hello, I have a data set called plabor that has the following format:

| ColA | ColB | Colc |
| 6 | 25 | 3 |
| NA | NA | NA |
| 3 | 2 | 19 |
| 4 | 7 | NA |


I wanted to find the product of the three columns for each of the rows and I 
used the apply function as follows:
plabor$colD = apply(plabor[c("colA","colB","colc")],1,prod,na.rm=T)
The results are as follows:

| ColA | ColB | Colc | colD |
| 6 | 25 | 3 | 450 |
| NA | NA | NA | 1 |
| 3 | 2 | 19 | 114 |
| 4 | 7 | NA | 28 |


The second row's result is 1 instead of being ignored.
How do I deal with this issue, given that I do not want to exclude the data 
points that are all NA?


What do you mean by "ignored"?  If you really want rows of your matrix 
that are all NA to be omitted from consideration, delete such rows from 
your matrix a priori.


If you want the product of such rows to be NA rather than 1, use a 
"customised" function rather than prod() in your apply, with an 
appropriate if-else construction in the customised function.


It's very easy, and I won't tell you the details because I think it's 
time you actually learned something about R (given that you are using 
R), and one learns by doing.


cheers,

Rolf Turner

--
Technical Editor ANZJS
Department of Statistics
University of Auckland
Phone: +64-9-373-7599 ext. 88276



Re: [R] Frequency of a character in a string

2016-11-14 Thread William Dunlap via R-help
Here is another variant, v3, and a change to your first example
so it returns the same value as your second example.

> set.seed(1001)
> x <- sapply(1:100,
    function(x)paste0(sample(letters,rpois(1,1e5),rep=TRUE),collapse = ""))
> system.time(v1 <- lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1)
   user  system elapsed
   0.47    0.00    0.49
> system.time(v2 <- nchar(gsub("[^a]", "", x)))
   user  system elapsed
   2.53    0.00    2.53
> system.time(v3 <- nchar(x) - nchar(gsub("a", "", x, fixed=TRUE)))
   user  system elapsed
   0.08    0.00    0.08
>
> all.equal(v1,v2)
[1] TRUE
> all.equal(v1,v3)
[1] TRUE
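
The three variants timed above can be wrapped as small functions and checked on inputs where the answer is easy to see. This is a sketch only: the function names are made up for illustration, and count_split assumes the character being counted is not "X".

```r
# Three ways to count a single character `ch` in each element of `x`
count_split <- function(x, ch) {
  # Padding with "X" stops strsplit() from dropping a trailing empty piece
  lengths(strsplit(paste0("X", x, "X"), ch, fixed = TRUE)) - 1L
}
count_regex <- function(x, ch) {
  # Delete everything that is not `ch`, then measure what is left
  nchar(gsub(paste0("[^", ch, "]"), "", x))
}
count_fixed <- function(x, ch) {
  # Delete `ch` literally and compare the lengths before and after
  nchar(x) - nchar(gsub(ch, "", x, fixed = TRUE))
}

x <- c("abbababba", "bbabbabbaaaba", "bbb", "", "aaa")
count_split(x, "a")   # 4 6 0 0 3
count_regex(x, "a")   # 4 6 0 0 3
count_fixed(x, "a")   # 4 6 0 0 3

# Without the padding, a string ending in "a" is undercounted
lengths(strsplit("abba", "a", fixed = TRUE)) - 1L   # 1, not 2
```

The last line shows why the paste0("X", x, "X") wrapper matters: strsplit() discards the empty string after a trailing match.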


Bill Dunlap
TIBCO Software
wdunlap tibco.com

Re: [R] Text categories based on the sentences

2016-11-14 Thread Jim Lemon
Hi Venky,
Unfortunately the MindReader package produces the following:

1. I want ice cream                      Desire
2. I like banana very much               Pleasure
3. Tomorrow i will eat chicken           Expectation
4. Yesterday i went to birthday party    Reminiscence
5. I lost my mobile last week            Disappointment

I recall that there is a computer named Watson that might be of assistance.

Jim

On Tue, Nov 15, 2016 at 12:16 AM, Venky  wrote:
> Hi team,
>
> I have a data set that contains one variable, "*Description*":
>
> *Description*                            *Category*
>
> 1. i want ice cream                      food
> 2. i like banana very much               fruit
> 3. tomorrow i will eat chicken           food
> 4. yesterday i went to birthday party    festival
> 5. i lost my mobile last week            mobile
>
> Please remember that I have only the "*Description*" variable. How can I
> get the category column based on the sentences in the *Description* column?
>
> Kindly help me with this.
>
> Thanks in advance.
>
>
>
>
>
>
> Thanks and Regards
> Venkatesan
>


[R] Issues with the way Apply handled NA's

2016-11-14 Thread Olu Ola via R-help
Hello, I have a data set called plabor with the following format:

| ColA | ColB | Colc |
| 6 | 25 | 3 |
| NA | NA | NA |
| 3 | 2 | 19 |
| 4 | 7 | NA |


I wanted to find the product of the three columns for each of the rows and I
used the apply function as follows:
plabor$colD = apply(plabor[c("colA","colB","colc")],1,prod,na.rm=T)
The results are as follows:

| ColA | ColB | Colc | colD |
| 6 | 25 | 3 | 450 |
| NA | NA | NA | 1 |
| 3 | 2 | 19 | 114 |
| 4 | 7 | NA | 28 |


The second row's result is 1 instead of being ignored.
How do I deal with this, given that I do not want to exclude the rows
that are all NA?
Regards


Re: [R] Frequency of a character in a string

2016-11-14 Thread Bert Gunter
(Sheepishly)...

Yes, thank you Hervé. It would have been nice if I had given correct
solutions. Of course, fixed = TRUE could not have worked with the "[^a]"
character class!
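
A small check makes the mistake visible. This is a sketch using the two example strings from earlier in the thread; the comments describe what the calls do.

```r
x <- c("abbababba", "bbabbabbaaaba")

# With fixed = TRUE the pattern "[^a]" is a literal four-character string.
# It does not occur in x, so nothing is removed and we just get nchar(x).
nchar(gsub("[^a]", "", x, fixed = TRUE))   # 9 13

# As a regular expression it removes every character that is not "a"
nchar(gsub("[^a]", "", x))                 # 4 6
```

So the earlier fixed = TRUE timing was fast only because gsub() had nothing to replace.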

Here's what I found with a 10 element vector each member of which is a
1e5 length string:

> system.time((lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))
   user  system elapsed
  0.013   0.000   0.013

> system.time(nchar(gsub("[^a]", "", x,fixed = FALSE)))
   user  system elapsed
  0.251   0.000   0.252
## WAY slower


> system.time(nchar(x) - nchar(gsub("a", "", x,fixed = TRUE)))
   user  system elapsed
  0.007   0.000   0.007
## twice as fast



Clearly and unsurprisingly, the message is to avoid fixed = FALSE;
after that, it seems mostly to be: who cares?!


Cheers,
Bert



Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )



Re: [R] Frequency of a character in a string

2016-11-14 Thread Hervé Pagès

Hi,

FWIW using gsub( , fixed=TRUE) is faster than using gsub( , fixed=FALSE)
or strsplit( , fixed=TRUE):

  set.seed(1)
  Vec <- paste(sample(letters, 500, replace = TRUE), collapse = "")

  system.time(res1 <- nchar(gsub("[^a]", "", Vec)))
  #  user  system elapsed
  # 0.585   0.000   0.586

  system.time(res2 <- lengths(strsplit(Vec,"a",fixed=TRUE)) - 1L)
  #  user  system elapsed
  # 0.061   0.000   0.061

  system.time(res3 <- nchar(Vec) - nchar(gsub("a", "", Vec, fixed=TRUE)))
  #  user  system elapsed
  # 0.039   0.000   0.039

  identical(res1, res2)
  # [1] TRUE
  identical(res1, res3)
  # [1] TRUE

The gsub( , fixed=TRUE) solution also uses slightly less memory than the
strsplit( , fixed=TRUE) solution.

Cheers,
H.



--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpa...@fredhutch.org
Phone:  (206) 667-5791
Fax:(206) 667-1319



Re: [R] Frequency of a character in a string

2016-11-14 Thread Bert Gunter
Chuck, Marc, and anyone else who still has interest in this odd little
discussion ...

Yes, and with fixed = TRUE my approach took 1/3 as much time as
Chuck's with a 10 element vector each element of which is a character
string of length 1e5:

> set.seed(1001)
> x <- sapply(1:10, function(x)paste0(sample(letters,1e5,rep=TRUE),collapse = ""))

> system.time(sum(lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1))
   user  system elapsed
  0.012   0.000   0.012
> system.time(nchar(gsub("[^a]", "", x,fixed = TRUE)))
   user  system elapsed
  0.004   0.000   0.004

Best,
Bert


Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )




Re: [R] Frequency of a character in a string

2016-11-14 Thread Charles C. Berry

On Mon, 14 Nov 2016, Marc Schwartz wrote:

>> On Nov 14, 2016, at 11:26 AM, Charles C. Berry wrote:
>>
>> On Mon, 14 Nov 2016, Bert Gunter wrote:

[stuff deleted]

> Hi,
>
> Both gsub() and strsplit() are using regex based pattern matching
> internally. That being said, they are ultimately calling .Internal code,
> so both are pretty fast.
>
> For comparison:
>
> ## Create a 1,000,000 character vector
> set.seed(1)
> Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")
>
> > nchar(Vec)
> [1] 1000000
>
> ## Split the vector into single characters and tabulate
> > table(strsplit(Vec, split = "")[[1]])
>
>     a     b     c     d     e     f     g     h     i     j     k     l
> 38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
>     m     n     o     p     q     r     s     t     u     v     w     x
> 38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
>     y     z
> 38265 38299
>
> ## Get just the count of "a"
> > table(strsplit(Vec, split = "")[[1]])["a"]
>     a
> 38664
>
> > nchar(gsub("[^a]", "", Vec))
> [1] 38664
>
> ## Check performance
> > system.time(table(strsplit(Vec, split = "")[[1]])["a"])
>    user  system elapsed
>   0.100   0.007   0.107
>
> > system.time(nchar(gsub("[^a]", "", Vec)))
>    user  system elapsed
>   0.270   0.001   0.272
>
> So, the above would suggest that using strsplit() is somewhat faster
> than using gsub(). However, as Chuck notes, in the absence of more
> exhaustive benchmarking, the difference may or may not be more
> generalizable.


Whether splitting on fixed strings rather than treating them as
regex'es (i.e.`fixed=TRUE') makes a big difference seems to depend on
what you split:

First repeating what Marc did...

> system.time(table(strsplit(Vec, split = "",fixed=TRUE)[[1]])["a"])
   user  system elapsed
  0.132   0.010   0.139
> system.time(table(strsplit(Vec, split = "",fixed=FALSE)[[1]])["a"])
   user  system elapsed
  0.130   0.010   0.138

... fixed=TRUE hardly matters. But the idiom I proposed...

> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=TRUE)) - 1))
   user  system elapsed
  0.017   0.000   0.018
> system.time(sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1))
   user  system elapsed
  0.104   0.000   0.104

... is 5 times faster with fixed=TRUE for this case.

This result matches Marc's count:

> sum(lengths(strsplit(paste0("X", Vec, "X"),"a",fixed=FALSE)) - 1)
[1] 38664

Chuck



Re: [R] Principle Component Analysis: Ranking Animal Size Based On Combined Metrics

2016-11-14 Thread David L Carlson
Usually you want to use the geometric mean on variables measured on the same 
scale, but in your case, transforming weight didn't change much. Adding cube 
root transformation as another approach (since weight should increase as the 
cube of the linear measures), the correlations with the 3 linear measurements 
are about the same for the transformed values and the 3 transformations are 
very strongly correlated:

> wgt.log <- log(df$weight)
> wgt.cube <- df$weight^(1/3)
> cor(cbind(weight=df$weight, wgt.log, wgt.cube), df[, -1])
   interoccwidth   clength
weight   0.3048239 0.2545593 0.4884446
wgt.log  0.3096511 0.2807528 0.4841830
wgt.cube 0.3077714 0.2724312 0.4863528
> cor(cbind(weight=df$weight, wgt.log, wgt.cube))
            weight   wgt.log  wgt.cube
weight   1.0000000 0.9862879 0.9939102
wgt.log  0.9862879 1.0000000 0.9984574
wgt.cube 0.9939102 0.9984574 1.0000000

David C

-Original Message-
From: Sidoti, Salvatore A. [mailto:sidoti...@buckeyemail.osu.edu] 
Sent: Monday, November 14, 2016 11:41 AM
To: David L Carlson; Jim Lemon; r-help mailing list
Subject: RE: [R] Principle Component Analysis: Ranking Animal Size Based On 
Combined Metrics

Fascinating! So it appears that I can simply take the geometric mean of all 4 
metrics (unscaled), including weight, then designate that value as a relative 
measure of "size" within my sample population. The justification for using the 
geometric mean is shown by the high correlation between PC1 and the size values:

           pc1         gm
pc1  1.0000000 -0.8458024
gm  -0.8458024  1.0000000

Pearson's product-moment correlation
data:  pc1 and gm
t = -10.869, df = 47, p-value = 2.032e-14
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.9104585 -0.7407939
sample estimates:
   cor 
-0.8458024

Salvatore A. Sidoti
PhD Student
Behavioral Ecology

-Original Message-
From: David L Carlson [mailto:dcarl...@tamu.edu] 
Sent: Monday, November 14, 2016 11:07 AM
To: Sidoti, Salvatore A. ; Jim Lemon 
; r-help mailing list 
Subject: RE: [R] Principle Component Analysis: Ranking Animal Size Based On 
Combined Metrics

The first principal component should be your estimate of "size" since it 
captures the correlations between all 4 variables. The second principal 
component must be orthogonal to the first so that if the first is "size", the 
second pc is independent of size, perhaps some measure of "shape". As would be 
expected, the first principal component is highly correlated with the geometric 
mean of the three linear measurements and moderately correlated with weight:

> gm <- apply(df[, -1], 1, prod)^(1/3)
> pc1 <- prcomp(df, scale.=TRUE)$x[, 1]
> plot(pc1, gm)
> cor(cbind(pc1, gm, wgt=df$weight))
           pc1         gm        wgt
pc1  1.0000000 -0.9716317 -0.5943594
gm  -0.9716317  1.0000000  0.3967369
wgt -0.5943594  0.3967369  1.0000000
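
For anyone who wants to try the comparison without the original measurements, here is a minimal sketch on simulated data. Everything in it (the data frame sim, its column names, the noise levels) is made up for illustration. It also shows that the sign of a principal component is arbitrary, which is consistent with the negative correlation between pc1 and gm above.

```r
set.seed(42)
n <- 50
# One latent "size" drives three linear measurements and (cubed) weight
size <- rlnorm(n, meanlog = 0, sdlog = 0.3)
sim <- data.frame(
  weight = size^3 * exp(rnorm(n, sd = 0.10)),
  len1   = size   * exp(rnorm(n, sd = 0.05)),
  len2   = size   * exp(rnorm(n, sd = 0.05)),
  len3   = size   * exp(rnorm(n, sd = 0.05))
)

# Geometric mean of the three linear measurements, computed two ways
gm  <- apply(sim[, -1], 1, prod)^(1/3)
gm2 <- exp(rowMeans(log(sim[, -1])))   # same quantity, on the log scale
all.equal(unname(gm), unname(gm2))

# First principal component of the standardised variables
pc1 <- prcomp(sim, scale. = TRUE)$x[, 1]

# The sign of a principal component is arbitrary, so compare magnitudes
abs(cor(pc1, gm))
```

The log-scale version of the geometric mean is usually the safer choice when there are many columns, because the raw product can overflow or underflow.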

-
David L Carlson
Department of Anthropology
Texas A University
College Station, TX 77840-4352

-Original Message-
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Sidoti, 
Salvatore A.
Sent: Sunday, November 13, 2016 7:38 PM
To: Jim Lemon; r-help mailing list
Subject: Re: [R] Principle Component Analysis: Ranking Animal Size Based On 
Combined Metrics

Hi Jim,

Nice to see you again! First of all, apologies to all for bending the rules a 
bit with respect to the mailing list. I know this is a list for R programming 
specifically, and I have received some great advice in this regard in the past. 
I just thought this was an interesting applied problem that would generate some 
discussion about PCA in R.

Yes, that is an excellent question! Indeed, why not just volume? Since this is 
still a work in progress and we have not published as of yet, I would rather 
not be more specific about the type of animal at this time ;>}. Nonetheless, I 
can say that the animals I study change "size" depending on their feeding and 
hydration state. The abdomen in particular undergoes drastic size changes. That 
being said, there are key anatomical features that remain fixed in the adult.

Now, there *might* be a way to work volume into the PCA. Volume is not a 
reliable metric in live animals, since the abdomen size is so changeable, 
but what about preserved specimens? I have many that have been 
marinating in ethanol for months. Wouldn't the tissues have equilibrated by 
now? Probably... I could measure volume by displacement or suspension, I 
suppose.

In the meantime, here's a few thoughts:

1)  Use the contribution % (known as C% hereafter) of each variable on 
principal components 1 and 2.

2)  The total contribution of a variable that explains the variations 
retained by PC1 and PC2 is calculated by:

sum(C%1 * eigenvalue1, C%2 * eigenvalue2)

3) Scale() to mean-center the columns of 

Re: [R] Question about expression parser for "return" statement

2016-11-14 Thread Jeff Newmiller
Sorry, I missed the operation-after-function call aspect of the OP question.

However, I think my policy of avoiding the return function as much as possible 
serves as an effective antibugging strategy for this problem, in addition to 
its other benefits.
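
For the archive, the pitfall under discussion is easy to reproduce. The function names below are invented for this sketch.

```r
# return() is an ordinary function call, so `return (a) * b` is parsed as
# `return(a) * b`: the function exits with a before the multiplication
bad <- function(a, b) {
  return (a) * b
}

# Leaving return() out entirely: the value of the last expression is returned
good <- function(a, b) {
  a * b
}

bad(2, 5)    # 2
good(2, 5)   # 10

# The ordering also matters with invisible(): wrap the value, not the call
hidden <- function(x) return(invisible(x))
withVisible(hidden(1))$visible   # FALSE
```

Avoiding return() at the end of a function, as suggested above, removes the first problem by construction.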
-- 
Sent from my phone. Please excuse my brevity.

On November 14, 2016 2:12:49 AM PST, Duncan Murdoch wrote:
> On 13/11/2016 9:42 PM, Jeff Newmiller wrote:
>> I find your response here inconsistent... either including `return`
>> causes a "wasted" function call to occur (same result achieved slower)
>> or the parser has an optimization in it to prevent the wasted function
>> call (only behaviorally the same).
>
> I don't understand what you are finding inconsistent.  I wasn't talking
> about wasting anything.  I was just saying that expressions like
>
> return (a)*b
>
> are evaluated by calling return(a) first, because return() is a
> function, and then they'll never get to the multiplication.
>
> BTW, there don't appear to be many instances of this particular bug in
> CRAN packages, though I don't have a reliable test for it yet.  The most
> common error seems to be using just "return", as mentioned before.  The
> fix for that is to add parens, e.g. "return()".  The next most common is
> something like
>
> invisible(return(x))
>
> which returns x before making it invisible.  The fix for this is to use
>
> return(invisible(x))
>
>> I carefully avoid using the return function in R. Both because using
>> it before the end of a function usually makes the logic harder to
>> follow and because I am under the impression that using it at the end
>> of the function is a small but pointless waste of CPU cycles. That some
>> people might be prone to writing a C-like use of "return;" which causes
>> a function object to be returned only increases my aversion to using
>> it.
>
> Sometimes it is fine to use return(x), but it shouldn't be used
> routinely.
>
> Duncan Murdoch
>
>> -- Sent from my phone. Please excuse my brevity.
>> On November 13, 2016 3:47:10 AM PST, Duncan Murdoch wrote:
>>> On 13/11/2016 12:50 AM, Dave DeBarr wrote:
>>>> I've noticed that if I don't include parentheses around the intended
>>>> return value for the "return" statement, R will assume the first
>>>> parenthetical expression is the intended return value ... even if that
>>>> parenthetical expression is only part of a larger expression.
>>>>
>>>> Is this intentional?
>>>
>>> Yes, return is just a function call that has side effects.  As far as
>>> the parser is concerned,
>>>
>>> return ((1/sqrt(2*pi*Variance))*exp(-(1/2)*((x - Mean)^2)/Variance))
>>>
>>> is basically the same as
>>>
>>> f((1/sqrt(2*pi*Variance))*exp(-(1/2)*((x - Mean)^2)/Variance))
>>>
>>> Duncan Murdoch



Re: [R] Frequency of a character in a string

2016-11-14 Thread Marc Schwartz

> On Nov 14, 2016, at 11:26 AM, Charles C. Berry  wrote:
> 
> On Mon, 14 Nov 2016, Bert Gunter wrote:
> 
>> Yes, but it need some help, since nchar gives the length of the
>> *entire* string; e.g.
>> 
>> ## to count "a" 's  :
>> 
>>> x <-(c("abbababba","bbabbabbaaaba"))
>>> nchar(gsub("[^a]","",x))
>> [1] 4 6
>> 
>> This is one of about 8 zillion ways to do this in base R if you don't
>> want to use a specialized package.
>> 
>> Just for curiosity: Can anyone comment on what is the most efficient
>> way to do this using base R pattern matching?
>> 
> 
> Most efficient? There probably is no uniformly most efficient way to do this 
> as the timing will depend on the distribution of "a" in the atoms of any 
> vector as well as the length of the vector.
> 
> But here is one way to avoid the regular expression matching:
> 
> lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1
> 
> 
> Chuck
> 


Hi,

Both gsub() and strsplit() are using regex based pattern matching internally. 
That being said, they are ultimately calling .Internal code, so both are pretty 
fast.

For comparison:

## Create a 1,000,000 character vector
set.seed(1)
Vec <- paste(sample(letters, 1000000, replace = TRUE), collapse = "")

> nchar(Vec)
[1] 1000000

## Split the vector into single characters and tabulate
> table(strsplit(Vec, split = "")[[1]])

    a     b     c     d     e     f     g     h     i     j     k     l
38664 38442 38282 38496 38540 38623 38548 38288 38143 38493 38184 38621
    m     n     o     p     q     r     s     t     u     v     w     x
38306 38725 38705 38144 38529 38809 38575 38355 38386 38364 38904 38310
    y     z
38265 38299


## Get just the count of "a"
> table(strsplit(Vec, split = "")[[1]])["a"]
    a
38664

> nchar(gsub("[^a]", "", Vec))
[1] 38664


## Check performance
> system.time(table(strsplit(Vec, split = "")[[1]])["a"])
   user  system elapsed
  0.100   0.007   0.107

> system.time(nchar(gsub("[^a]", "", Vec)))
   user  system elapsed
  0.270   0.001   0.272


So, the above would suggest that using strsplit() is somewhat faster than using 
gsub(). However, as Chuck notes, in the absence of more exhaustive 
benchmarking, the difference may or may not be more generalizable.
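
As a hedged illustration of the approaches mentioned in this thread, the sketch below wraps three base R ways of counting one character per string and checks that they agree on Bert's example. The function names (count_gsub, count_strsplit, count_gregexpr) are made up for this example, and `ch` is assumed to be a single plain letter (no regex metacharacters, and not "X", which the strsplit variant uses as padding).

```r
# Hypothetical helper names; assumes `ch` is one plain letter other than "X".

# 1. Delete everything that is not `ch`, then measure what is left (regex).
count_gsub <- function(x, ch) {
  nchar(gsub(paste0("[^", ch, "]"), "", x))
}

# 2. Split on `ch` as a fixed string; padding with "X" keeps leading and
#    trailing matches from being dropped, so pieces - 1 = number of matches.
count_strsplit <- function(x, ch) {
  lengths(strsplit(paste0("X", x, "X"), ch, fixed = TRUE)) - 1
}

# 3. Locate every fixed-string match with gregexpr() and count the matches.
count_gregexpr <- function(x, ch) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}

x <- c("abbababba", "bbabbabbaaaba")
count_gsub(x, "a")      # 4 6
count_strsplit(x, "a")  # 4 6
count_gregexpr(x, "a")  # 4 6
```

Wrapping any of these in system.time() on a longer input gives a rough comparison, but as noted above the ranking may change with the data.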

Regards,

Marc Schwartz



Re: [R] Question about expression parser for "return" statement

2016-11-14 Thread Wolf, Steven
I stand corrected.  I have been chided in the past for not explicitly returning 
my output by someone claiming it is not best practices.

-Steve

On Mon, 2016-11-14 at 12:22 -0500, Duncan Murdoch wrote:
> On 14/11/2016 11:26 AM, Wolf, Steven wrote:
>> Just to add on a bit, please note that the return is superfluous.  If
>> you write this:
>>
>> normalDensityFunction = function(x, Mean, Variance) {
>>  # no  "return" value given at all
>>  (1/sqrt(2*pi*Variance))*exp(-(1/2)*((x - Mean)^2)/Variance)
>> }
>> normalDensityFunction(2,0,1)
>>
>> ...you get the right answer again.
>>
>> This is not "best practices", and Duncan will probably give you 10
>> reasons why you should never do it this way.  But if the parentheses
>> behavior bothers you enough, you can subvert it.  This probably won't
>> work so well if you try to make any more complicated output.
>
> Why do you say that's not best practice?  I would say that's preferable
> to an explicit return().
>
> Duncan
>
>> Caveat Emptor.
>>
>> -SW





Re: [R] Frequency of a character in a string

2016-11-14 Thread Charles C. Berry

On Mon, 14 Nov 2016, Bert Gunter wrote:


> Yes, but it needs some help, since nchar gives the length of the
> *entire* string; e.g.
>
> ## to count "a" 's  :
>
> > x <-(c("abbababba","bbabbabbaaaba"))
> > nchar(gsub("[^a]","",x))
> [1] 4 6
>
> This is one of about 8 zillion ways to do this in base R if you don't
> want to use a specialized package.
>
> Just for curiosity: Can anyone comment on what is the most efficient
> way to do this using base R pattern matching?



Most efficient? There probably is no uniformly most efficient way to do 
this as the timing will depend on the distribution of "a" in the atoms of 
any vector as well as the length of the vector.


But here is one way to avoid the regular expression matching:

lengths(strsplit(paste0("X", x, "X"),"a",fixed=TRUE)) - 1


Chuck



Re: [R] Question about expression parser for "return" statement

2016-11-14 Thread Duncan Murdoch

On 14/11/2016 11:26 AM, Wolf, Steven wrote:
Just to add on a bit, please note that the return is superfluous.  If 
you write this:


normalDensityFunction = function(x, Mean, Variance) {
 # no  "return" value given at all
 (1/sqrt(2*pi*Variance))*exp(-(1/2)*((x - Mean)^2)/Variance)
}
normalDensityFunction(2,0,1)

...you get the right answer again.

This is not "best practices", and Duncan will probably give you 10 
reasons why you should never do it this way.  But if the parentheses 
behavior bothers you enough, you can subvert it.  This probably won't 
work so well if you try to make any more complicated output.


Why do you say that's not best practice?  I would say that's preferable 
to an explicit return().


Duncan
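
A quick way to see that the implicit return gives the right answer is to compare the function against base R's dnorm(). This is only a sketch; note that dnorm() takes the standard deviation, so the variance is passed as sqrt(Variance).

```r
# Same function as quoted above, relying on the value of the last expression.
normalDensityFunction <- function(x, Mean, Variance) {
  (1 / sqrt(2 * pi * Variance)) * exp(-(1 / 2) * ((x - Mean)^2) / Variance)
}

# Single value, standard normal.
all.equal(normalDensityFunction(2, 0, 1), dnorm(2, mean = 0, sd = 1))  # TRUE

# Vector input with a non-unit variance; dnorm() wants sd = sqrt(Variance).
xs <- seq(-3, 3, by = 0.5)
all.equal(normalDensityFunction(xs, 1, 4), dnorm(xs, mean = 1, sd = sqrt(4)))  # TRUE
```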


Caveat Emptor.

-SW

--
Steven Wolf, PhD
Assistant Professor
Department of Physics
STEM CoRE -- STEM Collaborative for Research in Education
http://www.ecu.edu/cs-acad/aa/StemCore
East Carolina University
Phone: 252-737-5229



On Sun, 2016-11-13 at 13:35 -0500, Duncan Murdoch wrote:

On 13/11/2016 7:58 AM, Duncan Murdoch wrote:

On 13/11/2016 6:47 AM, Duncan Murdoch wrote:

On 13/11/2016 12:50 AM, Dave DeBarr wrote:
I've noticed that if I don't include parentheses around the intended
return value for the "return" statement, R will assume the first
parenthetical expression is the intended return value ... even if that
parenthetical expression is only part of a larger expression.

Is this intentional?

Yes, return is just a function call that has side effects.  As far as
the parser is concerned,

return ((1/sqrt(2*pi*Variance))*exp(-(1/2)*((x - Mean)^2)/Variance))

is basically the same as

f((1/sqrt(2*pi*Variance))*exp(-(1/2)*((x - Mean)^2)/Variance))

By the way, out of curiosity I took a look at the source of CRAN
packages to see if this actually occurs.  It turns out that "return" is
used as a variable name often enough to make automatic tests tricky, so
I don't know the answer to my question.  However, I did turn up a number
of cases where people have code like this:

if (name == "") return;

(from the bio.infer package), which never calls return(), so doesn't
actually do what the author likely intended.



I searched the R sources and the sources of CRAN packages, and found
this is a reasonably common problem:  it's in 111 packages, including
one in base R.  I'll be emailing the maintainers to let them know.

I'll see about putting a check for this into R CMD check.

Duncan Murdoch
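
The two pitfalls discussed in this thread can be reproduced with a few toy functions. This is a minimal sketch; the names f_partial, f_whole, f_implicit and g are invented for illustration and do not come from any package.

```r
# return is an ordinary function call, so only its argument is returned.
f_partial <- function(x) {
  return (x + 1) * 10   # parsed as return(x + 1) * 10; the "* 10" never runs
}

f_whole <- function(x) {
  return((x + 1) * 10)  # the whole expression is the argument of return()
}

f_implicit <- function(x) {
  (x + 1) * 10          # value of the last expression is returned
}

# A bare `return` without parentheses only looks up the function object;
# it does not call it, so execution continues past the if().
g <- function(name) {
  if (name == "") return
  "not empty"
}

f_partial(1)   # 2
f_whole(1)     # 20
f_implicit(1)  # 20
g("")          # "not empty"
```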






Re: [R] Question about ‘The R Project’.

2016-11-14 Thread Hadley Wickham
>> We have a question about ‘The R Project’.
>>
>> It looks like it’s an open source software, but the document from the 
>> website shows that it’s free of use not free of price.
>>
>> Please, confirm us the if it cost fees to use it for commercial use.
>>
>> If needed, could you inform us the price for it, too?
>>
>> Best regards,
>> Jane Kim.
>>
>
> Can I use R for commercial purposes?
> https://cran.r-project.org/doc/FAQ/R-FAQ.html#Can-I-use-R-for-commercial-purposes_003f
>  
> 
>
> If you mean RStudio you have to pay for commercial use. RStudio and R are 
> different.
> https://www.rstudio.com/pricing/ 

That's not true, as RStudio is also open source.  You don't have to
pay to use it commercially, but you might want to pay to use it
commercially because we provide additional features of use to people in
bigger companies.

Hadley

-- 
http://hadley.nz


Re: [R] Question about ‘The R Project’.

2016-11-14 Thread John McKown
On Mon, Nov 14, 2016 at 2:00 AM, 김세희  wrote:

> Hello,
>
> I’m Jane Kim from Zenith and Company.
>
> We have a question about ‘The R Project’.
>
> It looks like it’s an open source software, but the document from the
> website shows that it’s free of use not free of price.
>
> Please, confirm us the if it cost fees to use it for commercial use.


> If needed, could you inform us the price for it, too?
>
> Best regards,
> Jane Kim.
>
>
According to the R FAQ
(https://cran.r-project.org/doc/FAQ/R-FAQ.html#Can-I-use-R-for-commercial-purposes_003f),
R is released under the GNU General Public License (GPL), version 2. This
means it is "libre" (the source is available). It is also "gratis" (it is
free from cost to use, although it is permissible to charge a one-time
"distribution" fee, such as might be done if you want someone to mail you a
DVD with the source on it).

[quote]

2.11 Can I use R for commercial purposes?

R is released under the GNU General Public License (GPL), version 2. If
you have any questions regarding the legality of using R in any particular
situation you should bring it up with your legal counsel. We are in no
position to offer legal advice.

It is the opinion of the R Core Team that one can use R for commercial
purposes (e.g., in business or in consulting). The GPL, like all Open
Source licenses, permits all and any use of the package. It only restricts
distribution of R or of other programs containing code from R. This is made
clear in clause 6 (“No Discrimination Against Fields of Endeavor”) of the Open
Source Definition:

The license must not restrict anyone from making use of the program in a
specific field of endeavor. For example, it may not restrict the program
from being used in a business, or from being used for genetic research.

It is also explicitly stated in clause 0 of the GPL, which says in part

Activities other than copying, distribution and modification are not
covered by this License; they are outside its scope. The act of running the
Program is not restricted, and the output from the Program is covered only
if its contents constitute a work based on the Program.

Most add-on packages, including all recommended ones, also explicitly allow
commercial use in this way. A few packages are restricted to
“non-commercial use”; you should contact the author to clarify whether
these may be used or seek the advice of your legal counsel.

None of the discussion in this section constitutes legal advice. The R Core
Team does not provide legal advice under any circumstances.

[quote/]



-- 
Heisenberg may have been here.

Unicode: http://xkcd.com/1726/

Maranatha! <><
John McKown


Re: [R] Frequency of a character in a string

2016-11-14 Thread Bert Gunter
Yes, but it needs some help, since nchar gives the length of the
*entire* string; e.g.

## to count "a" 's  :

> x <-(c("abbababba","bbabbabbaaaba"))
> nchar(gsub("[^a]","",x))
[1] 4 6

This is one of about 8 zillion ways to do this in base R if you don't
want to use a specialized package.

Just for curiosity: Can anyone comment on what is the most efficient
way to do this using base R pattern matching?

Cheers,
Bert
Bert Gunter

"The trouble with having an open mind is that people keep coming along
and sticking things into it."
-- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )


On Mon, Nov 14, 2016 at 5:52 AM, Brijesh Mishra
 wrote:
> ?nchar in the base R should also help...
>
> On Mon, Nov 14, 2016 at 2:26 PM, Ismail SEZEN  wrote:
>
>>
>> > On 14 Nov 2016, at 11:44, Ferri Leberl  wrote:
>> >
>> >
>> > Dear All,
>> > Is there a function to count the occurences of a certain character in a
>> string resp. in a vector of strings?
>> > Thank you in advance!
>> > Yours, Ferri
>> >
>>
>> library(stringr)
>> ?str_count
>>


Re: [R] Question about ‘The R Project’.

2016-11-14 Thread Ismail SEZEN

> On 14 Nov 2016, at 11:00, 김세희  wrote:
> 
> Hello,
> 
> I’m Jane Kim from Zenith and Company.
> 
> We have a question about ‘The R Project’.
> 
> It looks like it’s an open source software, but the document from the website 
> shows that it’s free of use not free of price.
> 
> Please, confirm us the if it cost fees to use it for commercial use.
> 
> If needed, could you inform us the price for it, too?
> 
> Best regards,
> Jane Kim.
> 

Can I use R for commercial purposes?
https://cran.r-project.org/doc/FAQ/R-FAQ.html#Can-I-use-R-for-commercial-purposes_003f
 


If you mean RStudio you have to pay for commercial use. RStudio and R are 
different.
https://www.rstudio.com/pricing/ 





Re: [R] Principle Component Analysis: Ranking Animal Size Based On Combined Metrics

2016-11-14 Thread David L Carlson
The first principal component should be your estimate of "size" since it 
captures the correlations between all 4 variables. The second principal 
component must be orthogonal to the first so that if the first is "size", the 
second pc is independent of size, perhaps some measure of "shape". As would be 
expected, the first principal component is highly correlated with the geometric 
mean of the three linear measurements and moderately correlated with weight:

> gm <- apply(df[, -1], 1, prod)^(1/3)
> pc1 <- prcomp(df, scale.=TRUE)$x[, 1]
> plot(pc1, gm)
> cor(cbind(pc1, gm, wgt=df$weight))
           pc1         gm        wgt
pc1  1.0000000 -0.9716317 -0.5943594
gm  -0.9716317  1.0000000  0.3967369
wgt -0.5943594  0.3967369  1.0000000
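
Since the original data frame is not shown in the thread, the following sketch repeats the same steps on simulated data. Everything about the simulation (the latent `size` variable, the coefficients, the sample size) is an assumption made only so the code runs on its own; the column names follow those used elsewhere in the thread. Because the sign of a principal component is arbitrary, the scores are flipped if needed so that larger values mean larger animals before ranking.

```r
set.seed(42)
n <- 60
size <- rnorm(n)  # hypothetical latent "size" driving all four measurements

# Simulated stand-in for the poster's data: weight first, then three lengths.
sim <- data.frame(
  weight  = exp(3.0 + 0.30 * size + rnorm(n, sd = 0.05)),
  interoc = exp(0.5 + 0.10 * size + rnorm(n, sd = 0.02)),
  clength = exp(1.5 + 0.10 * size + rnorm(n, sd = 0.02)),
  cwidth  = exp(1.2 + 0.10 * size + rnorm(n, sd = 0.02))
)

# Geometric mean of the three linear measurements (all columns but weight).
gm <- apply(sim[, -1], 1, prod)^(1/3)

# Scores on the first principal component of the standardized variables.
pc1 <- prcomp(sim, scale. = TRUE)$x[, 1]

round(cor(cbind(pc1, gm, wgt = sim$weight)), 3)

# Orient PC1 so it increases with the geometric mean, then rank the animals.
size_rank <- rank(pc1 * sign(cor(pc1, gm)))
```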

-
David L Carlson
Department of Anthropology
Texas A University
College Station, TX 77840-4352

-Original Message-
From: R-help [mailto:r-help-boun...@r-project.org] On Behalf Of Sidoti, 
Salvatore A.
Sent: Sunday, November 13, 2016 7:38 PM
To: Jim Lemon; r-help mailing list
Subject: Re: [R] Principle Component Analysis: Ranking Animal Size Based On 
Combined Metrics

Hi Jim,

Nice to see you again! First of all, apologies to all for bending the rules a 
bit with respect to the mailing list. I know this is a list for R programming 
specifically, and I have received some great advice in this regard in the past. 
I just thought this was an interesting applied problem that would generate some 
discussion about PCA in R.

Yes, that is an excellent question! Indeed, why not just volume? Since this is 
still a work in progress and we have not published as of yet, I would rather 
not be more specific about the type of animal at this time ;>}. Nonetheless, I 
can say that the animals I study change "size" depending on their feeding and 
hydration state. The abdomen in particular undergoes drastic size changes. That 
being said, there are key anatomical features that remain fixed in the adult.

Now, there *might* be a way to work volume into the PCA. Although volume is not 
a reliable metric since the abdomen size is so changeable while the animal is 
alive, but what about preserved specimens? I have many that have been 
marinating in ethanol for months. Wouldn't the tissues have equilibrated by 
now? Probably... I could measure volume by displacement or suspension, I 
suppose.

In the meantime, here's a few thoughts:

1) Use the contribution % (known as C% hereafter) of each variable on 
principal components 1 and 2.

2) The total contribution of a variable that explains the variations 
retained by PC1 and PC2 is calculated by:

sum(C%1 * eigenvalue1, C%2 * eigenvalue2)

3) Use scale() to mean-center the columns of the data set.

4) Use these total contributions as the weights of an arithmetic mean.

For example, we have an animal with the following data (mean-centered):
weight:   1.334
interoc: -0.225
clength:  0.046
cwidth:  -0.847

The contributions of these variables on PC1 and PC2 are (% changed to 
proportions):
weight:   0.556
interoc:  0.357
clength:  0.493
cwidth:   0.291

To calculate size:
1.334(0.556) - 0.225(0.357) + 0.046(0.493) - 0.847(0.291) = 0.43758
Then divide by the sum of the weights:
0.43758 / 1.697 = 0.257855 = "animal size"

This value can then be used to rank the animal according to its size for 
further analysis...
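
The arithmetic above is a weighted mean, so it can be reproduced in R with weighted.mean(). This sketch only re-does the calculation with the numbers given in the message; it does not say whether the weighting scheme itself is a sound use of PCA.

```r
# Mean-centered measurements for one animal (values from the message above).
x <- c(weight = 1.334, interoc = -0.225, clength = 0.046, cwidth = -0.847)

# Contributions of each variable on PC1 and PC2, as proportions.
w <- c(weight = 0.556, interoc = 0.357, clength = 0.493, cwidth = 0.291)

sum(x * w)           # 0.43758, the weighted sum
sum(w)               # 1.697, the sum of the weights
weighted.mean(x, w)  # 0.257855 (to six decimals), the "animal size"
```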

Does this sound like a reasonable application of my PCA data?

Salvatore A. Sidoti
PhD Student
Behavioral Ecology

-Original Message-
From: Jim Lemon [mailto:drjimle...@gmail.com] 
Sent: Sunday, November 13, 2016 3:53 PM
To: Sidoti, Salvatore A. ; r-help mailing list 

Subject: Re: [R] Principle Component Analysis: Ranking Animal Size Based On 
Combined Metrics

Hi Salvatore,
If by "size" you mean volume, why not directly measure the volume of your 
animals? They appear to be fairly small. Sometimes working out what the 
critical value actually means can inform the way to measure it.

Jim


On Sun, Nov 13, 2016 at 4:46 PM, Sidoti, Salvatore A.
 wrote:
> Let's say I perform 4 measurements on an animal: three are linear 
> measurements in millimeters and the fourth is its weight in milligrams. So, 
> we have a data set with mixed units.
>
> Based on these four correlated measurements, I would like to obtain one 
> "score" or value that describes an individual animal's size. I considered 
> simply taking the geometric mean of these 4 measurements, and that would give 
> me a "score" - larger values would be for larger animals, etc.
>
> However, this assumes that all 4 of these measurements contribute equally to 
> an animal's size. Of course, more than likely this is not the case. I then 
> performed a PCA to discover how much influence each variable had on the 
> overall data set. I was hoping to use this analysis to refine my original 
> approach.
>
> I honestly do not know 

[R] [R-pkgs] Major update of package actuar

2016-11-14 Thread Vincent Goulet
Dear useRs,

I'm happy to announce a substantial update of package actuar that bumps the 
version number to 2.0-0. This release focuses on additional support for 
continuous and discrete distributions, new functions to simulate data from 
compound models and mixtures, and revised and improved documentation.

A slightly shortened version of the NEWS file follows:

NEW FEATURES

• Support for the inverse Gaussian distribution. The pdf, cdf and
  quantile functions are C (read: faster) implementations of otherwise 
  equivalent functions in package ‘statmod’.

• Support for the Gumbel extreme value distribution.

• Extended range of admissible values for many limited
  expected value functions thanks to new C-level functions
  ‘expint’, ‘betaint’ and ‘gammaint’. These provide special
  integrals presented in the introduction of Appendix A of
  Klugman et al. (2012); see also ‘vignette("distributions")’.

  Affected functions are: ‘levtrbeta’, ‘levgenpareto’,
  ‘levburr’, ‘levinvburr’, ‘levpareto’, ‘levinvpareto’,
  ‘levllogis’, ‘levparalogis’, ‘levinvparalogis’ in the
  Transformed Beta family, and ‘levinvtrgamma’, ‘levinvgamma’,
  ‘levinvweibull’ in the Transformed Gamma family.

• Functions ‘expint’, ‘betaint’ and ‘gammaint’ to compute
  the special integrals mentioned above. These are merely
  convenience R interfaces to the C level functions. They are
  _not_ exported by the package.

• Support for the Poisson-inverse Gaussian discrete distribution.

• Support for the logarithmic (or log-series) and zero-modified
  logarithmic distributions.

• Support for the zero-truncated and zero-modified Poisson
  distributions.

• Support for the zero-truncated and zero-modified negative 
  binomial distributions.

• Support for the zero-truncated and zero-modified geometric
  distributions.

• Support for the zero-truncated and zero-modified binomial
  distributions.

• New vignette ‘"distributions"’ that reviews in great detail the
  continuous and discrete distributions provided in the
  package, along with implementation details.

• ‘aggregateDist’ now accepts ‘"zero-truncated binomial"’,
  ‘"zero-truncated geometric"’, ‘"zero-truncated negative
  binomial"’, ‘"zero-truncated poisson"’, ‘"zero-modified
  binomial"’, ‘"zero-modified geometric"’, ‘"zero-modified
  negative binomial"’, ‘"zero-modified poisson"’ and
  ‘"zero-modified logarithmic"’ for argument ‘model.freq’ with
  the ‘"recursive"’ method.

• New function ‘rmixture’ to generate random variates from
  discrete mixtures, that is from random variables with
  densities of the form f(x) = p_1 f_1(x) + ... + p_n f_n(x).

• New function ‘rcompound’ to generate random variates from (non
  hierarchical) compound models of the form S = X_1 + ... + X_N.
  Function ‘simul’ could already do that, but ‘rcompound’ is
  substantially faster for non hierarchical models.

• New function ‘rcomppois’ that is a simplified version of
  ‘rcompound’ for the very common compound Poisson case.

• Function ‘simul’ now accepts an atomic (named or not) vector for
  argument ‘nodes’ when simulating from a non hierarchical
  compound model. But really, one should use ‘rcompound’ for
  such cases.

• New alias ‘rcomphierarc’ for ‘simul’ that better fits within
  the usual naming scheme of random generation functions.

• Functions ‘grouped.data’ and ‘ogive’ now accept individual
  data in argument. The former will group the data using
  ‘hist’ (therefore, all the algorithms to compute the number
  of breakpoints available in ‘hist’ are also available in
  ‘grouped.data’). ‘ogive’ will first create a grouped data
  object and then compute the ogive.

  While there is no guarantee that the two functions are
  backward compatible (the number and position of the
  arguments have changed), standard calls should not be
  affected.
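
To make the two formulas in the list above concrete, here is a rough base R sketch of what sampling from a discrete mixture f(x) = p_1 f_1(x) + ... + p_n f_n(x) and from a compound model S = X_1 + ... + X_N involves. It deliberately does not call ‘rmixture’ or ‘rcompound’; the component distributions and parameter values are arbitrary choices made for illustration.

```r
set.seed(123)
n <- 10000

# Discrete mixture of two exponentials: pick a component for each draw with
# probabilities p, then sample from that component's distribution.
p    <- c(0.3, 0.7)
rate <- c(1, 1/10)
component <- sample(seq_along(p), n, replace = TRUE, prob = p)
x_mix <- rexp(n, rate = rate[component])

# Compound Poisson: draw the claim count N, then sum N severities.
# rgamma(0, ...) returns an empty vector, so S is 0 whenever N is 0.
N <- rpois(n, lambda = 2)
S <- vapply(N, function(k) sum(rgamma(k, shape = 2, rate = 1)), numeric(1))

mean(x_mix)  # theoretical mean: 0.3 * 1 + 0.7 * 10 = 7.3
mean(S)      # theoretical mean: E[N] * E[X] = 2 * 2 = 4
```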

USER VISIBLE CHANGES

• The material on probability laws in vignette ‘"lossdist"’
  has been moved to the new vignette ‘"distributions"’ (see
  the previous section).

• The first argument of the ‘mgf’ functions has changed
  from ‘x’ to ‘t’. This is a more common notation for moment
  generating functions.

• In ‘aggregateDist’ with the ‘"recursive"’ method, if the
  length of ‘p0’ is greater than one, only the first element
  is used, with a warning.

• ‘aggregateDist’ with the ‘"recursive"’ method and
  ‘model.freq = "logarithmic"’ now uses the new ‘dlogarithmic’
  family of functions. Therefore, parametrization has changed
  from the one of Klugman et al. (2012) to the standard
  parametrization for the logarithmic distribution. Basically,
  any value of ‘prob’ for the logarithmic parameter in
  previous versions of ‘actuar’ should now be ‘1 - prob’.

• The aim of vignette ‘"simulation"’ is changed from
  “simulation of compound hierarchical models” to “simulation
  of insurance data with ‘actuar’” as it also covers the new
  functions ‘rmixture’ and ‘rcompound’.

• Vignette ‘"lossdist"’ is renamed to ‘"modeling"’ and it is
  revised to cover the new 

[R] Question about ‘The R Project’.

2016-11-14 Thread 김세희
Hello,

I’m Jane Kim from Zenith and Company.

We have a question about ‘The R Project’.

It looks like it’s an open source software, but the document from the website 
shows that it’s free of use not free of price.

Please, confirm us the if it cost fees to use it for commercial use.

If needed, could you inform us the price for it, too?

Best regards,
Jane Kim.


The Zenith does not collect and distribute personal information without 
the consent of the recipient. This e-mail may contain confidential information 
and content of this e-mail (including any attachments) is strictly confidential 
and may be commercially sensitive. If you are not the named addressee, you 
should not disseminate, distribute or copy this e-mail. Please notify the 
sender immediately by e-mail if you have received it by mistake and delete it 
from your system.


Re: [R] Function argument and scope

2016-11-14 Thread Bernardo Doré
Thank you all for replying so quickly.

@Jim
You are right, I ran into that. You can see as.character() being called to
remedy the situation you described. I dropped the factors from the data
frame in a line outside the function. Creating the dataframe with
stringsAsFactors = F is the easiest way to go. Tks.

@Thomas, @Jeremiah
R really creates a copy and forgets the reference. That was clearly visible
in the debugger. In an earlier version of this I was returning a list of
the computed values. I will go back to that.

Tks a lot for helping me with this.
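
For readers following the thread, the "return the computed values" approach could look roughly like the sketch below. The function name splitLastWord and the multi-word test value "a b c" are invented for this example; the sketch also tests length(words) > 1 directly, since in the original code the comparison sits inside length().

```r
# Hypothetical rewrite: return the modified copy (and the removed word)
# instead of expecting the caller's data frame to change in place.
splitLastWord <- function(x) {
  words <- strsplit(as.character(x[1, 1]), split = " ", fixed = TRUE)[[1]]
  predictedWord <- NA_character_
  if (length(words) > 1) {
    predictedWord <- words[length(words)]
    x[1, 1] <- paste(words[-length(words)], collapse = " ")
  }
  list(data = x, predictedWord = predictedWord)
}

test <- data.frame(var1 = c("a b c", "b", "c"),
                   var2 = c("d", "e", "f"),
                   stringsAsFactors = FALSE)

res  <- splitLastWord(test)
test <- res$data    # assign the result back explicitly
test[1, 1]          # "a b"
res$predictedWord   # "c"
```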

Bernardo Doré



>
> On Sun, Nov 13, 2016 at 2:09 PM, Bernardo Doré  wrote:
>
>> Hello list,
>>
>> my first post but I've been using this list as a help source for a while.
>> Couldn't live without it.
>>
>> I am writing a function that takes a dataframe as an argument and in the
>> end I intend to assign the result of some computation back to the
>> dataframe. This is what I have so far:
>>
>> myFunction <- function(x){
>>   y <- x[1,1]
>>   z <- strsplit(as.character(y), split = " ")
>>   if(length(z[[1]] > 1)){
>> predictedWord <- z[[1]][length(z[[1]])]
>> z <- z[[1]][-c(length(z[[1]]))]
>> z <- paste(z, collapse = " ")
>>   }
>>   x[1,1] <- z
>> }
>>
>> And lets say I create my dataframe like this:
>> test <- data.frame(var1=c("a","b","c"),var2=c("d","e","f"))
>>
>> and then call
>> myFunction(test)
>>
>> The problem is when I assign x[1,1] to y in the first operation inside the
>> function, x becomes a dataframe inside the function scope and loses the
>> reference to the dataframe "test" passed as argument. In the end when I
>> assign z to what should be row 1 and column 1 of the "test" dataframe, it
>> assigns to x inside the function scope and no modification is made on
>> "test".
>>
>> I hope the problem statement is clear.
>>
>> Thank you,
>>
>> Bernardo Doré
>>
>
>
>


Re: [R] Frequency of a character in a string

2016-11-14 Thread Brijesh Mishra
?nchar in the base R should also help...

On Mon, Nov 14, 2016 at 2:26 PM, Ismail SEZEN  wrote:

>
> > On 14 Nov 2016, at 11:44, Ferri Leberl  wrote:
> >
> >
> > Dear All,
> > Is there a function to count the occurences of a certain character in a
> string resp. in a vector of strings?
> > Thank you in advance!
> > Yours, Ferri
> >
>
> library(stringr)
> ?str_count
>
>



[R] Text categories based on the sentences

2016-11-14 Thread Venky
Hi team,

I have a data set that contains one variable, "Description":

Description                                 Category

1. i want ice cream                         food
2. i like banana very much                  fruit
3. tomorrow i will eat chicken              food
4. yesterday i went to birthday party       festival
5. i lost my mobile last week               mobile

Please note that I have only the "Description" variable. How can I
get the Category column based on the sentences in the Description column?

Kindly help with this.

Thanks in advance.






Thanks and Regards
Venkatesan



Re: [R] [FORGED] How to remove box in Venn plots (Vennerable package, uses grid) - similar to bty="n" in standard plots

2016-11-14 Thread DE LAS HERAS Jose
Hi,


The grid.ls() and grid.remove() approach worked beautifully to remove the box, 
thank you! Because the box is the first thing to be drawn, it is the first 
object shown by grid.ls(), so I can easily add a line of code to automatically 
remove the box. Result!
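
For anyone reading along, the grid.ls() and grid.remove() approach can be tried without Vennerable on a toy drawing. This is only a sketch: the grob names "box" and "set1" are made up here, whereas in a real Venn plot the names are generated by the package and have to be read from the grid.ls() listing. A null PDF device is opened so the code also runs non-interactively.

```r
library(grid)

pdf(NULL)                             # off-screen device for the sketch
grid.newpage()
grid.rect(name = "box")               # stands in for the unwanted frame
grid.circle(r = 0.3, name = "set1")   # stands in for the diagram itself

# List the grobs on the display list without printing them.
listing <- grid.ls(print = FALSE)
first   <- listing$name[1]            # the box was drawn first, so it is listed first

grid.remove(first)                    # drop it and redraw the rest

remaining <- grid.ls(print = FALSE)$name
dev.off()
```

After the call to grid.remove(), only the circle should be left in the listing.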


Although I'd still like to know how one chooses not to plot that box in the 
first place. I really must study the grid package a little. I've survived using 
base R plots and they work very nicely, but it looks like you can do a lot of 
cool stuff with grid.


In case you might know the answer and feel like adding a comment here (I'm 
already very happy with the grid.ls() approach, thanks! :)) this is a simple 
code example without making it look pretty or anything:




library(Vennerable)
groups<-list(set1=1:100, set2=80:120)
V<-Venn(groups)
C<-compute.Venn(V)
X11(w=7,h=7)
grid.newpage()
plot(C)

class(C)
[1] "VennDrawing"
attr(,"package")
[1] "Vennerable"

I don't want the black box around the diagram.
I was able to overwrite it with a white box like this:

vp=viewport(x=0.5, y=0.5, width=0.95, height=0.75)
pushViewport(vp)
grid.rect(gp=gpar(lty=1, col="white", lwd=15))
upViewport() # needs to be executed to return focus upwards

but as the box has different dimensions depending on the actual sets being 
drawn, the width and height must be found empirically each time. A bit boring 
if you need to produce a bunch of figures at once.

I was wondering what parameter I could include in the call to 'plot' that would 
prevent the box from being drawn. Usually this is achieved with bty="n", but 
the method for plotting a "VennDrawing" structure uses the grid package and I'm 
lost there at the moment.

Thank you for your help again, grid.ls() etc is a very cool and flexible 
approach

Jose





From: Paul Murrell 
Sent: 13 November 2016 19:57
To: DE LAS HERAS Jose; R-help@r-project.org
Subject: Re: [FORGED] [R] How to remove box in Venn plots (Vennerable package, 
uses grid) - similar to bty="n" in standard plots

Hi

Can you supply some example code?

You might get some joy from grid.ls() to identify the box followed by
grid.remove() to get rid of it;  some example code would allow me to
provide more detailed advice.

Paul

On 12/11/16 05:12, DE LAS HERAS Jose wrote:
> I'm using the package Vennerable to make Venn diagrams, but it always
> makes a box around the diagram.
>
> Using standard R plots I could eliminate that by indicating
>
>
> bty="n"
>
>
> but it seems Vennerable uses the Grid package to generate its plots
> and I'm not really familiar enough with Grid. I was looking at the
> documentation but I can't seem to find a way to achieve that. I'd
> even be happy drawing a white rectangle with wide lines to overplot
> the box, but there must be a way to not draw the box in the first
> place.
>
>
> Anybody knows how?
>
>
> Jose
>
> --
>
> Dr. Jose I. de las Heras The Wellcome Trust Centre for Cell Biology
> Swann Building Max Born Crescent University of Edinburgh Edinburgh
> EH9 3BF UK
>
> Phone: +44 (0)131 6507090
>
> Fax:  +44 (0)131 6507360
>
>
>
> The University of Edinburgh is a charitable body, registered in
> Scotland, with registration number SC005336.
>
>
>
> __ R-help@r-project.org
> mailing list -- To UNSUBSCRIBE and more, see
> https://stat.ethz.ch/mailman/listinfo/r-help PLEASE do read the
> posting guide http://www.R-project.org/posting-guide.html and provide
> commented, minimal, self-contained, reproducible code.
>

--
Dr Paul Murrell
Department of Statistics
The University of Auckland
Private Bag 92019
Auckland
New Zealand
64 9 3737599 x85392
p...@stat.auckland.ac.nz
http://www.stat.auckland.ac.nz/~paul/




[R] Discarding Models in Caret During Model Training

2016-11-14 Thread Lorenzo Isella

Dear All,
Maybe some of you have come across this problem.
Let's say that you use caret for hyperparameter tuning.
You train several models and then select the best performing one
according to some performance metric.
My problem is that I would sometimes like to tune a very large number of
models (on the order of hundreds). Time is not a problem, but I run
out of memory.
My question is this: the performance of each model is calculated while it
is running, and I am only interested in the best performing model (or, to
be generous, say the 5 best performing models).
Would it be possible to script something that ranks the models while
they are being generated, automatically updates the list of the
5 best performing models, and deletes all the others (for which,
frankly speaking, I have no use)?
Is there a flaw in my idea? It would not save me time, but it would
certainly save a lot of memory.
Any suggestion is helpful.
Regards

Lorenzo
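
One way the idea above could be sketched in base R. This does not use caret:
keep_top_k(), fit_fun and score_fun are hypothetical names, lm() on mtcars
stands in for the real models, and a lower score (here AIC) is assumed to be
better.

```r
# Fit one model per row of `grid`, but never hold more than k + 1 fitted
# models at once: after each fit the list is re-ranked and trimmed to k.
keep_top_k <- function(grid, fit_fun, score_fun, k = 5) {
  best <- list()   # each element: list(params, score, model)
  for (i in seq_len(nrow(grid))) {
    params <- grid[i, , drop = FALSE]
    model  <- fit_fun(params)
    best[[length(best) + 1]] <- list(params = params,
                                     score  = score_fun(model),
                                     model  = model)
    scores <- vapply(best, function(b) b$score, numeric(1))
    # Keep the k lowest scores; dropped models can be garbage collected.
    best <- best[order(scores)][seq_len(min(k, length(best)))]
  }
  best
}

grid <- data.frame(degree = 1:8)   # toy tuning grid: polynomial degree
top5 <- keep_top_k(
  grid,
  fit_fun   = function(p) lm(mpg ~ poly(hp, p$degree), data = mtcars),
  score_fun = AIC,
  k = 5
)
sapply(top5, function(b) b$params$degree)   # degrees of the retained models
```

The loop trades nothing in run time, as every model is still fitted once, but
the memory held at any moment is bounded by k + 1 models.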



Re: [R] Question about expression parser for "return" statement

2016-11-14 Thread Duncan Murdoch

On 13/11/2016 9:42 PM, Jeff Newmiller wrote:

> I find your response here inconsistent... either including `return` causes a
> "wasted" function call to occur (same result achieved slower) or the parser
> has an optimization in it to prevent the wasted function call (only
> behaviorally the same).


I don't understand what you are finding inconsistent.  I wasn't talking 
about wasting anything.  I was just saying that expressions like


return (a)*b

are evaluated by calling return(a) first, because return() is a 
function, and then they'll never get to the multiplication.
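
A small illustration of the point above. The function names f and g are
arbitrary.

```r
# "return (a)*b" is parsed as the call return(a) followed by "* b".
# return(a) leaves the function at once, so the multiplication never runs.
f <- function(a, b) {
  return (a)*b
}

# With the whole expression inside the parentheses the product is returned.
g <- function(a, b) {
  return((a)*b)
}

f(2, 5)   # 2
g(2, 5)   # 10
```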


BTW, there don't appear to be many instances of this particular bug in 
CRAN packages, though I don't have a reliable test for it yet.  The most 
common error seems to be using just "return", as mentioned before.  The 
fix for that is to add parens, e.g. "return()".  The next most common is 
something like


invisible(return(x))

which returns x before making it invisible.  The fix for this is to use

return(invisible(x))
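
The two mistakes described above can be checked with withVisible(), which
reports whether a result would be auto-printed. The function names h1, h2 and
k are arbitrary.

```r
h1 <- function(x) invisible(return(x))   # return() fires before invisible()
h2 <- function(x) return(invisible(x))   # invisibility is set, then returned
k  <- function() { return }              # bare "return": the function itself

vis1 <- withVisible(h1(1))$visible   # TRUE: the value would be printed
vis2 <- withVisible(h2(1))$visible   # FALSE: the value is returned invisibly
is.function(k())                     # TRUE: k() returned the function `return`
```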



> I carefully avoid using the return function in R. Both because using it
> before the end of a function usually makes the logic harder to follow and
> because I am under the impression that using it at the end of the function
> is a small but pointless waste of CPU cycles. That some people might be
> prone to writing a C-like use of "return;" which causes a function object
> to be returned only increases my aversion to using it.


Sometimes it is fine to use return(x), but it shouldn't be used routinely.

Duncan Murdoch


> --
> Sent from my phone. Please excuse my brevity.
>
> On November 13, 2016 3:47:10 AM PST, Duncan Murdoch wrote:
>> On 13/11/2016 12:50 AM, Dave DeBarr wrote:
>>> I've noticed that if I don't include parentheses around the intended
>>> return value for the "return" statement, R will assume the first
>>> parenthetical expression is the intended return value ... even if that
>>> parenthetical expression is only part of a larger expression.
>>>
>>> Is this intentional?
>>
>> Yes, return is just a function call that has side effects.  As far as
>> the parser is concerned,
>>
>> return ((1/sqrt(2*pi*Variance))*exp(-(1/2)*((x - Mean)^2)/Variance))
>>
>> is basically the same as
>>
>> f((1/sqrt(2*pi*Variance))*exp(-(1/2)*((x - Mean)^2)/Variance))
>>
>> Duncan Murdoch




Re: [R] Frequency of a character in a string

2016-11-14 Thread Ismail SEZEN

> On 14 Nov 2016, at 11:44, Ferri Leberl  wrote:
> 
> 
> Dear All,
> Is there a function to count the occurrences of a certain character in a
> string or in a vector of strings?
> Thank you in advance!
> Yours, Ferri
> 

library(stringr)
?str_count
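
A short sketch of what this looks like in practice. The helper count_char() is
a hypothetical base R equivalent, added so the example also runs without
stringr; the stringr call is only made if the package is installed.

```r
x <- c("banana", "apple", "kiwi")

# Base R: find every literal match of `ch`, then count matches per string.
count_char <- function(x, ch) {
  lengths(regmatches(x, gregexpr(ch, x, fixed = TRUE)))
}
count_char(x, "a")   # 3 1 0

# The same count with stringr; fixed() treats the pattern as a literal.
if (requireNamespace("stringr", quietly = TRUE)) {
  stringr::str_count(x, stringr::fixed("a"))
}
```

Both versions are vectorised, so they cover the single string and the vector
of strings in one call.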



[R] Frequency of a character in a string

2016-11-14 Thread Ferri Leberl

Dear All,
Is there a function to count the occurrences of a certain character in a string 
or in a vector of strings?
Thank you in advance!
Yours, Ferri



Re: [R] question on mean, sum

2016-11-14 Thread Jim Lemon
Hi mokuram,

As others have noted, you will profit from a bit more knowledge about
"extraction":

sum(mtcars)
[1] 13942.2

This works because sum() has a method for data frames: when all of the
columns are numeric, the values in every column are added together. Note
that "extracting" the first column of the "mtcars" data frame with single
brackets still gives you a data frame, not a vector:

mtcars[1]
                mpg
Mazda RX4      21.0
Mazda RX4 Wag  21.0
...
Volvo 142E     21.4

is.data.frame(mtcars[1])
[1] TRUE

mean(mtcars)
[1] NA
Warning message:
In mean.default(mtcars) :
argument is not numeric or logical: returning NA

Here you are trying to take the mean of the entire data frame. Unlike
sum(), the mean function has no method for data frames. It works on a
single column extracted as a vector (for example mtcars$mpg or
mtcars[[1]]), but not on a data frame. However, if you coerce the data
frame to a matrix:

mean(as.matrix(mtcars))
[1] 39.60853

That's the good news. The bad news is that the result is meaningless.
The same thing goes for:

sd(as.matrix(mtcars))
[1] 84.20792

It is almost always a good idea to read the error messages carefully
and try to understand them.
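
For what it is worth, here is a short sketch of the extraction forms discussed
above and of the per-column summaries that are usually wanted instead. The
variable names col_df and col_vec are arbitrary.

```r
col_df  <- mtcars[1]     # single brackets: still a data frame
col_vec <- mtcars[[1]]   # double brackets (or mtcars$mpg): a numeric vector

is.data.frame(col_df)    # TRUE
is.numeric(col_vec)      # TRUE

mean(col_vec)            # mean of one column
sapply(mtcars, mean)     # one mean per column
colMeans(mtcars)         # the same values, computed in one call
sapply(mtcars, sd)       # one standard deviation per column
```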

Jim


On Mon, Nov 14, 2016 at 2:01 PM,   wrote:
>  Hi,
> I am working on functions such as sum(), mean() ...
>> sum(mtcars)
> [1] 13942.2
>> mean(mtcars)
> [1] NA
> Warning message:
> In mean.default(mtcars) : NA
>> sd(mtcars)
> Error in is.data.frame(x) : ()'double'
> Why do I get different replies? Is this a bug in the current version of R?
> My version info: version.string R version 3.3.1 (2016-06-21)
>
> Thank you very much for the help.
> mokuram
>



Re: [R-es] Madrid R Users Group - Meeting 10-Nov...

2016-11-14 Thread miguel.angel.rodriguez.muinos
Thank you very much, Carlos!


On 12/11/2016 at 1:10, Carlos Ortega wrote:
> Hello,
>
> In case it is of interest to you.
>
> The material (videos and presentations) from last Thursday's meeting of the
> Madrid Group is now available here:
>
> http://madrid.r-es.org/39-jueves-10-de-noviembre-2016/
>
> Thanks,
> Carlos Ortega










___
R-help-es mailing list
R-help-es@r-project.org
https://stat.ethz.ch/mailman/listinfo/r-help-es


Re: [R] Function argument and scope

2016-11-14 Thread jeremiah rounds
Hi,

I didn't run the code, because someone else said it might do what
you intended and your problem description was complete in itself.

The issue is that R copies on modification.  You are thinking as if you had a
reference, which you do not.  Reference semantics are not very R-like in
style, but they can certainly be had if you want them by changing the class
of the input (see new.env()).  A typical R style would be to make the
modifications to the input argument, return it, and then assign it back to
the input object.

e.g.
test = myFunction(test)
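
A sketch of that return-and-reassign style, under a few assumptions of mine:
the first cell holds "a b c" rather than "a" so that the effect is visible,
stringsAsFactors = FALSE is set so the column is character, and the function
ends with x so that the data frame (not z) is what comes back. The
parenthesis in the if() condition is also moved to close before "> 1".

```r
# Drops the last word of cell [1, 1] and returns the modified copy.
myFunction <- function(x) {
  words <- strsplit(as.character(x[1, 1]), split = " ")[[1]]
  if (length(words) > 1) {
    words <- words[-length(words)]   # remove the last word
  }
  x[1, 1] <- paste(words, collapse = " ")
  x                                  # return the modified data frame
}

test <- data.frame(var1 = c("a b c", "b", "c"),
                   var2 = c("d", "e", "f"),
                   stringsAsFactors = FALSE)

result <- myFunction(test)   # works on a copy; 'test' is untouched
test[1, 1]                   # still "a b c"
test <- result               # assign the result back
test[1, 1]                   # now "a b"
```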

If you really have some reason to want to change the data.frame in a
function without re-assigning it, then check out data.table, which can
modify its tables by reference (for example with := or set()).
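
For completeness, a base R sketch of the new.env() route mentioned earlier,
which needs no extra package. The names holder and dropLastWord are
hypothetical, and the sample data are my own.

```r
# An environment is passed by reference, so changes made inside the
# function are visible to the caller without any re-assignment.
holder <- new.env()
holder$df <- data.frame(var1 = c("a b c", "b", "c"),
                        var2 = c("d", "e", "f"),
                        stringsAsFactors = FALSE)

dropLastWord <- function(env) {
  words <- strsplit(env$df[1, 1], split = " ")[[1]]
  if (length(words) > 1) {
    env$df[1, 1] <- paste(words[-length(words)], collapse = " ")
  }
  invisible(NULL)   # nothing useful to return; the work is a side effect
}

dropLastWord(holder)
holder$df[1, 1]     # "a b"
```

The price is that the function now has a side effect, which is why the
return-and-reassign style is the more common choice in R.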

Thanks,



On Sun, Nov 13, 2016 at 2:09 PM, Bernardo Doré  wrote:

> Hello list,
>
> my first post but I've been using this list as a help source for a while.
> Couldn't live without it.
>
> I am writing a function that takes a dataframe as an argument and in the
> end I intend to assign the result of some computation back to the
> dataframe. This is what I have so far:
>
> myFunction <- function(x){
>   y <- x[1,1]
>   z <- strsplit(as.character(y), split = " ")
>   if(length(z[[1]] > 1)){
> predictedWord <- z[[1]][length(z[[1]])]
> z <- z[[1]][-c(length(z[[1]]))]
> z <- paste(z, collapse = " ")
>   }
>   x[1,1] <- z
> }
>
> And lets say I create my dataframe like this:
> test <- data.frame(var1=c("a","b","c"),var2=c("d","e","f"))
>
> and then call
> myFunction(test)
>
> The problem is when I assign x[1,1] to y in the first operation inside the
> function, x becomes a dataframe inside the function scope and loses the
> reference to the dataframe "test" passed as argument. In the end when I
> assign z to what should be row 1 and column 1 of the "test" dataframe, it
> assigns to x inside the function scope and no modification is made on
> "test".
>
> I hope the problem statement is clear.
>
> Thank you,
>
> Bernardo Doré
>