Re: [R] Create single vector after looping through multiple data frames with GREP

2010-10-10 Thread Simon Kiss
Hello all, 

I changed the subject line of the e-mail because the question I'm posing now 
is different from the first one. I hope that this is proper etiquette. 
The original chain is included below.

I've incorporated bits of both Ethan's and Brian's code into the script below, 
but there's one aspect I can't get my head around. I'm totally new to 
programming with control structures. The reproducible code below creates a list 
containing 19 data frames, one for each year of the Most Important Problem 
survey data for Canada.

What I'd like at this stage is a loop that searches through all the data 
frames for rows containing the search term and then binds those rows together 
in a plottable format.

At the bottom of the code below, you'll find my first attempt to make use of a 
search string and to put the result into a plottable format. It only partially 
works: I can only get the numbers for one year, whereas I'd like to get a 
string of numbers for several years. On the upside, grep appears to do the 
trick in terms of selecting rows.

Can anyone suggest a solution?
Yours truly,
Simon Kiss

#This is the reproducible code to set-up all the data frames
library(XML)
#This gets the data from the web and lists them
mylist <- paste("http://www.queensu.ca/cora/_trends/mip_",
                c(1987:2001, 2003:2006), ".htm", sep = "")
alltables <- lapply(mylist, readHTMLTable)

#convert to dataframes
r <- lapply(alltables, function(x) {as.data.frame(x)})

#This is just some house-cleaning; structuring all the tables so they are
#uniform
r[[1]][3] <- r[[1]][2]
r[[1]][2] <- c(" ")
r[[2]][4] <- r[[2]][2]
r[[2]][5] <- r[[2]][3]
r[[2]][2:3] <- c(" ")
r[[3]][4:5] <- r[[3]][3:4]
r[[3]][3] <- c(" ")

#This loop deletes some superfluous columns and rows, turns the first column
#into character strings and the data into numeric
for (i in 1:19) {
  n.rows <- dim(r[[i]])[1]
  #note: ":" binds tighter than "-", so 15:n.rows-3 is (15:n.rows)-3,
  #i.e. rows 12 to n.rows-3
  r[[i]] <- r[[i]][15:n.rows-3, 1:5]
  n.rows <- dim(r[[i]])[1]
  row.names(r[[i]]) <- NULL
  names(r[[i]]) <- c("Response", "Q1", "Q2", "Q3", "Q4")

  r[[i]][, 1] <- as.character(r[[i]][, 1])
  #r[[i]][,2:5]<-as.numeric(as.character(r[[i]][,2:5]))
  r[[i]][, 2:5] <- lapply(r[[i]][, 2:5], function(x) {as.numeric(as.character(x))})
  #n.rows<-dim(r[[i]])[1]
  #r[[i]]<-r[[i]][9
}

#This code is my first attempt at introducing a search string, getting the
#rows, binding and plotting;
economy <- r[[10]][grep('Economy', r[[10]][, 1]), ]
economy_2 <- r[[11]][grep('Economy', r[[11]][, 1]), ]
test <- cbind(economy, economy_2)
plot(as.numeric(test), type = 'l')

#here's another attempt I'm trying
economy <- data.frame
for (i in 15:19) {
  economy[i, ] <- r[[i]][grep('Economy', r[[i]][, 1]), ]
}

Begin forwarded message:

 From: Simon Kiss sjk...@gmail.com
 Date: October 7, 2010 4:59:46 PM EDT
 To: Simon Kiss simonjk...@yahoo.ca
 Subject: Fwd: [R] Converting scraped data
 
 
 
 Begin forwarded message:
 
 From: Ethan Brown ethancbr...@gmail.com
 Date: October 6, 2010 4:22:41 PM GMT-04:00
 To: Simon Kiss sjk...@gmail.com
 Cc: r-help@r-project.org
 Subject: Re: [R] Converting scraped data
 
 Hi Simon,
 
 You'll notice the test data.frame has a whole mix of characters in
 the columns you're interested in, including a "-" for missing values, and
 that those columns are in fact factors.
 
 as.numeric(factor) returns the factor's internal integer codes, not the
 values shown as its levels (see ?levels and ?factor); that's why it's giving
 you those irrelevant integers. I always end up using something like this handy
 code snippet to deal with the situation:
 
 unfactor <- function(factors)
 # From http://psychlab2.ucr.edu/rwiki/index.php/R_Code_Snippets#unfactor
 # Transform a factor back into its factor names
 {
   return(levels(factors)[factors])
 }
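
 To make the codes-versus-labels point concrete, here is a minimal sketch on a made-up factor (the values and the explicit level order are invented for illustration, not taken from the scraped table):

```r
# Toy factor whose labels look numeric; levels are set explicitly so the
# internal codes are predictable: "-" = 1, "10" = 2, "25" = 3
f <- factor(c("10", "25", "-", "10"), levels = c("-", "10", "25"))

codes <- as.numeric(f)   # internal integer codes: 2 3 1 2

# Go through the labels instead; "-" cannot be parsed and becomes NA
vals <- suppressWarnings(as.numeric(as.character(f)))    # 10 25 NA 10

# Same result by converting the levels once and indexing with the factor
vals2 <- suppressWarnings(as.numeric(levels(f))[f])      # 10 25 NA 10
```

 Either conversion recovers the numbers; the second form converts each level only once, which can matter for long vectors with few distinct values.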
 
 Then, to get your data to where you want it, I'd do this:
 
 require(XML)
 theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
 tables <- readHTMLTable(theurl)
 n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
 class(tables)
 test <- data.frame(tables, stringsAsFactors=FALSE)
 
 
 result <- test[11:42, 1:5] #Extract the actual data we want
 names(result) <- c("Response", "Q1", "Q2", "Q3", "Q4")
 for(i in 2:5) {
   # Convert the factor columns to numeric
   result[,i] <- as.numeric(unfactor(result[,i]))
 }
 result
 
 From here you should be able to plot or do whatever else you want.
 
 Hope this helps,
 Ethan Brown
 
 
 On Wed, Oct 6, 2010 at 9:52 AM, Simon Kiss sjk...@gmail.com wrote:
 Dear Colleagues,
 I used this code to scrape data from the URL contained within.  This code
 should be reproducible.
 
 library(XML)
 theurl <- "http://www.queensu.ca/cora/_trends/mip_2006.htm"
 tables <- readHTMLTable(theurl)
 n.rows <- unlist(lapply(tables, function(t) dim(t)[1]))
 class(tables)
 test <- data.frame(tables, stringsAsFactors=FALSE)
 test[16,c(2:5)]
 as.numeric(test[16,c(2:5)])
 quartz()
 plot(c(1:4), test[15, c(2:5)])
 
 Calling the values from the row of interest using test[16, c(2:5)] brings
 them up as represented on the screen, but plotting them or coercing them to
 numeric changes the values, and in a way that doesn't make 

Re: [R] Create single vector after looping through multiple data frames with GREP

2010-10-10 Thread Michael Bedward
Hi Simon,

The function below should do it or at least get you started...

getPlotData <- function (datalist, response, times)
{
  qdata <- sapply(datalist[times],
    function(df) {
      irow <- grepl(response, df$Response)
      df[irow, 2:5]
    }
  )

  # qdata is a matrix with rows Q1:Q4 and cols for times;
  # we turn it into a two col matrix with col 1 = time index
  # and col 2 = value
  time.index <- seq(4 * ncol(qdata))
  out <- cbind(time.index, as.numeric(qdata))
  rownames(out) <- paste(time.index, rownames(qdata), sep=".")
  colnames(out) <- c("time", response)
  out
}

#Example, get data for times 10:15 where Response contains "Economy"
x <- getPlotData(r, "Economy", 10:15)
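
For anyone who wants to try the idea without scraping the site, here is a rough, self-contained sketch of the same pattern (grepl to pick the row in each data frame, then bind the rows). The list `toy` and all its numbers are invented stand-ins for `r`, and the sketch assumes exactly one row matches in each data frame:

```r
# Hypothetical stand-in for the list `r`: three small data frames with the
# same layout (Response, Q1..Q4). All numbers are made up.
toy <- list(
  data.frame(Response = c("Economy", "Health"),
             Q1 = c(10, 5), Q2 = c(11, 6), Q3 = c(12, 7), Q4 = c(13, 8),
             stringsAsFactors = FALSE),
  data.frame(Response = c("Health", "Economy"),
             Q1 = c(4, 20), Q2 = c(3, 21), Q3 = c(2, 22), Q4 = c(1, 23),
             stringsAsFactors = FALSE),
  data.frame(Response = c("Economy", "Crime"),
             Q1 = c(30, 9), Q2 = c(31, 9), Q3 = c(32, 9), Q4 = c(33, 9),
             stringsAsFactors = FALSE)
)

# Pull the matching row (quarter columns only) out of each data frame
rows <- lapply(toy, function(df) df[grepl("Economy", df$Response), 2:5])

# Stack the one-row data frames: one row per year, one column per quarter
stacked <- do.call(rbind, rows)

# Flatten row by row so the quarters run in time order
series <- as.numeric(t(as.matrix(stacked)))
```

With the real data, something like plot(series, type = "l") should then draw the quarters in sequence; if a search term matches more than one row in a year, the rows would need to be narrowed down first.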


Michael

