Re: [R] Referencing variable names rather than column numbers

2009-12-05 Thread baptiste auguie
Hi,

Try this,

cor(pollute[ ,c("Pollution","Temp","Industry")])

and ?"[" in particular,
"Character vectors will be matched to the names of the object"
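
A minimal sketch of the same idea, using the built-in iris data as a stand-in for pollute (an assumption on my part, since pollute.txt has to be downloaded separately). It only illustrates that quoted names and column positions select the same columns:

```r
# Stand-in data: the four numeric columns of the built-in iris data set.
dat <- iris[, 1:4]

# Select columns by quoted name; the order follows the character vector.
by_name <- cor(dat[, c("Sepal.Length", "Petal.Length", "Petal.Width")])

# The same columns picked by position give an identical correlation matrix.
by_index <- cor(dat[, c(1, 3, 4)])
stopifnot(identical(by_name, by_index))

# List-style indexing (no comma) also accepts names on a data frame.
stopifnot(identical(cor(dat[c("Sepal.Length", "Petal.Length")]),
                    cor(dat[, c("Sepal.Length", "Petal.Length")])))
```

The names must be quoted: an unquoted name is looked up as an R object, not as a column label.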

HTH,

baptiste

2009/12/5 John-Paul Ferguson ferguson_john-p...@gsb.stanford.edu:
 I apologize for how basic a question this is. I am a Stata user who
 has begun using R, and the syntax differences still trip me up. The
 most basic questions, involving as they do general terms, can be the
 hardest to find solutions for through search.

 Assume for the moment that I have a dataset that contains seven
 variables: Pollution, Temp, Industry, Population, Wind, Rain and
 Wet.days. (This actual dataset is taken from Michael Crawley's
 "Statistics: An Introduction Using R" and is available as
 "pollute.txt" in
 http://www.bio.ic.ac.uk/research/crawley/statistics/data/zipped.zip.)
 Assume I have attached pollute. Then

 cor(pollute)

 will give me the correlation table for these seven variables. If I
 would prefer only to see the correlations between, say, Pollution,
 Temp and Industry, I can get that with

 cor(pollute[,1:3])

 or with

 cor(pollute[1:3])

 Similarly, I can see the correlations between Temp, Population and Rain with

 cor(pollute[,c(2,4,6)])

 or with

 cor(pollute[c(2,4,6)])

 This is fine for a seven-variable dataset. When I have 250 variables,
 though, I start to pale at looking up column indexes over and over. I
 know from reading the list archives that I can extract the column
 index of Industry, for example, by typing

 which("Industry"==names(pollute))

 but doing that before each command seems dire. Trained to using Stata
 as I am, I am inclined to check the correlation of the first three or
 the second, fourth and sixth columns by substituting the column names
 for the column indexes--something like the following:

 cor(pollute[Pollution:Industry])
 cor(pollute[c(Temp,Population,Rain)])

 These however throw errors.

 I know that many commands in R are perfectly happy to take variable
 names--the regression models, for example--but that some do not. And
 so I ask you two general questions:

 1. Is there a syntax for referring to variable names rather than
 column indexes in situations like these?
 2. Is there something that I should look for in a command's help file
 that often indicates whether it can take column names rather than
 indexes?

 Again, apologies for asking something that has likely been asked
 before. I would appreciate any suggestions that you have.

 Best,
 John-Paul Ferguson
 Assistant Professor of Organizational Behavior
 Stanford University Graduate School of Business
 518 Memorial Way, K313
 Stanford, CA 94305

 __
 R-help@r-project.org mailing list
 https://stat.ethz.ch/mailman/listinfo/r-help
 PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
 and provide commented, minimal, self-contained, reproducible code.




Re: [R] Referencing variable names rather than column numbers

2009-12-05 Thread Ista Zahn
As baptiste noted, you can do

cor(pollute[ ,c("Pollution","Temp","Industry")])

But

cor(pollute[,"Pollution":"Industry"])

will not work, because : needs numbers, not character strings. For that you can do

cor(pollute[ ,which(names(pollute)=="Pollution"):which(names(pollute)=="Industry")])
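
If the which() lookup gets repetitive, it can be wrapped in a small function. The sketch below is only an illustration: name_range() is a hypothetical helper, not something in base R, and iris stands in for pollute.

```r
# Hypothetical helper (not part of base R): return the names of all
# columns from `from` to `to`, inclusive, in the data frame's own order.
name_range <- function(df, from, to) {
  pos <- match(c(from, to), names(df))   # positions of the two end points
  if (anyNA(pos)) stop("column name not found")
  names(df)[pos[1]:pos[2]]
}

cols <- name_range(iris, "Sepal.Length", "Petal.Length")
cols
cor(iris[, cols])
```

match() is used instead of which() so that a misspelled name gives NA and a clear error rather than a zero-length index.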

-Ista


-- 
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org



Re: [R] Referencing variable names rather than column numbers

2009-12-05 Thread Jorge Ivan Velez
Dear John-Paul,

Take a look at https://stat.ethz.ch/pipermail/r-help/2009-July/204027.html It
contains different ways to do (in part) what you want.

HTH,
Jorge



Re: [R] Referencing variable names rather than column numbers

2009-12-05 Thread Marc Schwartz
Alternatively, you can use subset(), which supports the : operator  
for the 'select' argument:


 cor(subset(iris, select = Sepal.Length:Petal.Length))
             Sepal.Length Sepal.Width Petal.Length
Sepal.Length    1.0000000  -0.1175698    0.8717538
Sepal.Width    -0.1175698   1.0000000   -0.4284401
Petal.Length    0.8717538  -0.4284401    1.0000000


which is equivalent to:

 cor(iris[, 1:3])
             Sepal.Length Sepal.Width Petal.Length
Sepal.Length    1.0000000  -0.1175698    0.8717538
Sepal.Width    -0.1175698   1.0000000   -0.4284401
Petal.Length    0.8717538  -0.4284401    1.0000000


So for the pollute data:

  cor(subset(pollute, select = Pollution:Industry))

should work.

Note also that the 'select' argument to subset can take non-contiguous  
column names:


# Skip 'Sepal.Width'
 cor(subset(iris, select = c(Sepal.Length, Petal.Length:Petal.Width)))
             Sepal.Length Petal.Length Petal.Width
Sepal.Length    1.0000000    0.8717538   0.8179411
Petal.Length    0.8717538    1.0000000   0.9628654
Petal.Width     0.8179411    0.9628654   1.0000000

So you have the option of specifying, by name, multiple series of  
contiguous and non-contiguous column names.
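
For anyone curious why : works on bare names here, the following is a rough sketch of the idea behind 'select', not the actual subset() source: the column names are mapped to their positions, and the expression is then evaluated with those positions standing in for the names.

```r
# Map each column name to its position, as a named list.
nl <- as.list(seq_along(iris))
names(nl) <- names(iris)

# Evaluating the expression inside that list turns names into positions,
# so ":" and c() end up operating on ordinary integers.
idx <- eval(quote(c(Sepal.Length, Petal.Length:Petal.Width)), nl)
idx

# The positions pick out the same columns that subset() returns.
sub <- subset(iris, select = c(Sepal.Length, Petal.Length:Petal.Width))
stopifnot(identical(names(iris)[idx], names(sub)))
```

This also explains why the same expression fails inside [ ]: there the names are evaluated as ordinary R objects rather than as column positions.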


See ?subset

HTH,

Marc Schwartz




Re: [R] Referencing variable names rather than column numbers

2009-12-05 Thread David Winsemius


John-Paul;

In the time it took me to compose this, I see that others have already
pointed out all of what I had written, so it only remains to offer
yet-another-R-method for ranges of column names.

You could have defined a "targets" vector of names if you know the
starting and ending positions:

?Extract   # or equivalently ?"["
targets <- names(pollute)[1:3]   # colnames is an equivalent function for data frames

targets
pollute[ , targets]
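
A small sketch of some other ways such a names vector could be built, again with iris standing in for pollute (an assumption). The pattern "^Petal" and the variable names are only illustrative:

```r
# Build the vector of names once by position, then reuse it.
targets <- names(iris)[1:3]

# Names can also be collected by pattern rather than by position.
petal_cols <- grep("^Petal", names(iris), value = TRUE)

# Or by dropping a few names from the full set.
no_species <- setdiff(names(iris), "Species")

cor(iris[, targets])
cor(iris[, petal_cols])
cor(iris[, no_species])
```

With 250 variables, selecting by pattern or by exclusion may save more typing than listing every name.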

--

Best;
David.






David Winsemius, MD
Heritage Laboratories
West Hartford, CT



Re: [R] Referencing variable names rather than column numbers

2009-12-05 Thread John-Paul Ferguson
Holy Cats, those were four quick responses! And the question,
basically, is answered:

1. When in doubt, try quoting column names where you would try using
unquoted column indexes.
2. subset() seems, overall, the most flexible analog to Stata's
variable-referencing syntax.

I appreciate the help. I'm encouraging several of my PhD students to
pick up R, given the research that they are doing, but it seems wrong
to make them do that without learning it myself. Humbling to be back
at this level of basic interface interaction, but very good to know
that a resource like this list exists.

Best,
John-Paul
