Re: [R] Referencing variable names rather than column numbers
Hi,

Try this,

    cor(pollute[ , c("Pollution", "Temp", "Industry")])

and ?"[", in particular: "Character vectors will be matched to the names of the object".

HTH,

baptiste

2009/12/5 John-Paul Ferguson ferguson_john-p...@gsb.stanford.edu:

I apologize for how basic a question this is. I am a Stata user who has begun using R, and the syntax differences still trip me up. The most basic questions, involving as they do general terms, can be the hardest to find solutions for through search.

Assume for the moment that I have a dataset that contains seven variables: Pollution, Temp, Industry, Population, Wind, Rain and Wet.days. (This actual dataset is taken from Michael Crawley's Statistics: An Introduction Using R and is available as pollute.txt in http://www.bio.ic.ac.uk/research/crawley/statistics/data/zipped.zip.) Assume I have attached pollute. Then

    cor(pollute)

will give me the correlation table for these seven variables. If I would prefer only to see the correlations between, say, Pollution, Temp and Industry, I can get that with

    cor(pollute[,1:3])

or with

    cor(pollute[1:3])

Similarly, I can see the correlations between Temp, Population and Rain with

    cor(pollute[,c(2,4,6)])

or with

    cor(pollute[c(2,4,6)])

This is fine for a seven-variable dataset. When I have 250 variables, though, I start to pale at looking up column indexes over and over. I know from reading the list archives that I can extract the column index of Industry, for example, by typing

    which("Industry" == names(pollute))

but doing that before each command seems dire. Trained to using Stata as I am, I am inclined to check the correlation of the first three or the second, fourth and sixth columns by substituting the column names for the column indexes--something like the following:

    cor(pollute[Pollution:Industry])
    cor(pollute[c(Temp,Population,Rain)])

These however throw errors. I know that many commands in R are perfectly happy to take variable names--the regression models, for example--but that some do not. And so I ask you two general questions:

1. Is there a syntax for referring to variable names rather than column indexes in situations like these?

2. Is there something that I should look for in a command's help file that often indicates whether it can take column names rather than indexes?

Again, apologies for asking something that has likely been asked before. I would appreciate any suggestions that you have.

Best,
John-Paul Ferguson

Assistant Professor of Organizational Behavior
Stanford University Graduate School of Business
518 Memorial Way, K313
Stanford, CA 94305

__
R-help@r-project.org mailing list
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
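For readers who want to try the quoted-name indexing from the reply above without downloading pollute.txt, here is a minimal sketch. The data frame pollute_demo and every value in it are invented for illustration; only the column names are borrowed from the question.

```r
# Toy stand-in for the pollute data; the values are made up for illustration.
pollute_demo <- data.frame(
  Pollution  = c(24, 30, 56, 28, 14, 46),
  Temp       = c(61.5, 55.6, 55.9, 51.0, 68.4, 47.6),
  Industry   = c(368, 291, 775, 137, 136, 44),
  Population = c(497, 593, 622, 176, 529, 116)
)

# Quoted names are matched against names(pollute_demo).
by_name  <- cor(pollute_demo[, c("Pollution", "Temp", "Industry")])

# The same three columns picked by position.
by_index <- cor(pollute_demo[, 1:3])

# Both routes select the same columns, so the two matrices should match.
identical(by_name, by_index)
```

Unquoted names inside the brackets are evaluated as ordinary R objects (with the data attached, as the data values themselves) rather than matched against column names, which is why the unquoted attempts in the question fail.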
Re: [R] Referencing variable names rather than column numbers
As baptiste noted, you can do

    cor(pollute[ , c("Pollution", "Temp", "Industry")])

But

    cor(pollute[ , "Pollution":"Industry"])

will not work. For that you can do

    cor(pollute[ , which(names(pollute) == "Pollution"):which(names(pollute) == "Industry")])

-Ista

--
Ista Zahn
Graduate student
University of Rochester
Department of Clinical and Social Psychology
http://yourpsyche.org
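The which()-based range in the reply above can be wrapped so it does not have to be retyped each time. The following is only a sketch: name_range is a hypothetical helper written for this illustration, not a function from base R or any package, and it is shown on the built-in iris data rather than pollute.

```r
# Hypothetical helper: names of all columns from `from` through `to`, inclusive.
name_range <- function(df, from, to) {
  pos <- match(c(from, to), names(df))   # positions of the two endpoint columns
  if (anyNA(pos)) stop("column name not found")
  names(df)[pos[1]:pos[2]]
}

# Columns 1 to 3 of iris, requested by name.
name_range(iris, "Sepal.Length", "Petal.Length")

# The returned character vector can index the data frame directly.
cor(iris[, name_range(iris, "Sepal.Length", "Petal.Length")])
```

match() is used here instead of which(... == ...) because it returns NA for a misspelled name, which makes the typo easy to catch before indexing.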
Re: [R] Referencing variable names rather than column numbers
Dear John-Paul,

Take a look at https://stat.ethz.ch/pipermail/r-help/2009-July/204027.html

It contains different ways to do (in part) what you want.

HTH,
Jorge
Re: [R] Referencing variable names rather than column numbers
Alternatively, you can use subset(), which supports the : operator for the 'select' argument:

    cor(subset(iris, select = Sepal.Length:Petal.Length))
                 Sepal.Length Sepal.Width Petal.Length
    Sepal.Length    1.0000000  -0.1175698    0.8717538
    Sepal.Width    -0.1175698   1.0000000   -0.4284401
    Petal.Length    0.8717538  -0.4284401    1.0000000

which is equivalent to:

    cor(iris[, 1:3])
                 Sepal.Length Sepal.Width Petal.Length
    Sepal.Length    1.0000000  -0.1175698    0.8717538
    Sepal.Width    -0.1175698   1.0000000   -0.4284401
    Petal.Length    0.8717538  -0.4284401    1.0000000

So for the pollute data:

    cor(subset(pollute, select = Pollution:Industry))

should work.

Note also that the 'select' argument to subset can take non-contiguous column names:

    # Skip 'Sepal.Width'
    cor(subset(iris, select = c(Sepal.Length, Petal.Length:Petal.Width)))
                 Sepal.Length Petal.Length Petal.Width
    Sepal.Length    1.0000000    0.8717538   0.8179411
    Petal.Length    0.8717538    1.0000000   0.9628654
    Petal.Width     0.8179411    0.9628654   1.0000000

So you have the option of specifying, by name, multiple series of contiguous and non-contiguous column names.

See ?subset

HTH,

Marc Schwartz
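It may help to see why unquoted names and : work inside 'select'. According to ?subset, column names in the selection expression are replaced by the corresponding column numbers before indexing. The sketch below imitates that idea by hand on iris; the object names positions and idx are made up here, and this is an illustration of the mechanism rather than the actual source of subset().

```r
# Map each column name to its position, stored as a named list.
positions <- as.list(seq_along(iris))
names(positions) <- names(iris)

# Evaluate the unquoted expression with names standing in for positions,
# so Petal.Length:Petal.Width becomes 3:4.
idx <- eval(quote(c(Sepal.Length, Petal.Length:Petal.Width)), positions)
idx

# The resulting positions pick the same columns that subset() selects.
identical(
  names(iris)[idx],
  names(subset(iris, select = c(Sepal.Length, Petal.Length:Petal.Width)))
)
```

Because the names become plain integers, anything that works on integer indexes, including : and c(), works on names inside 'select'.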
Re: [R] Referencing variable names rather than column numbers
On Dec 5, 2009, at 11:30 AM, baptiste auguie wrote:

Hi, Try this, cor(pollute[ , c("Pollution", "Temp", "Industry")]) and ?"[", in particular: "Character vectors will be matched to the names of the object".

John-Paul;

In the time it took me to compose this, I see that others have already pointed out all of what I had written, so it only remains to offer yet-another-R-method for ranges of column names. You could have defined a targets vector of names if you know the starting and ending position:

    ?Extract   # or equivalently ?"["
    targets <- names(pollute)[1:3]   # colnames is an equivalent function for dataframe objects
    targets
    pollute[ , targets]

--
Best;
David.

David Winsemius, MD
Heritage Laboratories
West Hartford, CT
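The targets idea above pays off when the same groups of columns are needed in several commands. A minimal sketch on the built-in iris data follows; the vector names first_three and petal_vars are invented for this example.

```r
# Build the name vectors once, from positions or by hand.
first_three <- names(iris)[1:3]
petal_vars  <- c("Petal.Length", "Petal.Width")

# Guard against typos before the vectors are used for indexing.
stopifnot(all(c(first_three, petal_vars) %in% names(iris)))

# Reuse the stored names wherever a column index would otherwise go.
cor(iris[, first_three])
cor(iris[, petal_vars])
summary(iris[, petal_vars])
```

Storing names rather than positions also keeps the selection valid if columns are later reordered or new ones are inserted.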
Re: [R] Referencing variable names rather than column numbers
Holy Cats, those were four quick responses! And the question, basically, is answered:

1. When in doubt, try quoting column names where you would try using unquoted column indexes.

2. subset() seems, overall, the most flexible analog to Stata's variable-referencing syntax.

I appreciate the help. I'm encouraging several of my PhD students to pick up R, given the research that they are doing, but it seems wrong to make them do that without learning it myself. Humbling to be back at this level of basic interface interaction, but very good to know that a resource like this list exists.

Best,
John-Paul