On Oct 31, 2012, at 9:56 AM, chuck.01 wrote:

> Sorry, I know I should read a little 1st about this, but I am actually just
> helping somebody really quick and need help too. 
> 
> I want to grep all of the names of the .txt files mentioned on this html web
> page:
> 
> http://www.epa.gov/emap/remap/html/three/data/index.html


The code below identifies the lines in that page's source that contain a
quoted URL ending in '.txt':

> lines <- readLines(con = url("http://www.epa.gov/emap/remap/html/three/data/index.html"))
Warning message:
In readLines(con = url("http://www.epa.gov/emap/remap/html/three/data/index.html")) :
  incomplete final line found on 'http://www.epa.gov/emap/remap/html/three/data/index.html'
# You can generally ignore that warning; it only means the page does not end
# with a newline.
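
If the warning is a nuisance, readLines() has a 'warn' argument that turns it off. Here is a minimal sketch that reproduces the situation offline; the temporary file and its contents are made up for illustration and merely stand in for the web page:

```r
# Hypothetical stand-in for the web page: a file whose last line has no
# trailing newline, which is what triggers the "incomplete final line" warning.
tmp <- tempfile(fileext = ".html")
cat("<html>\n<body>last line without newline</body>", file = tmp)

# warn = FALSE suppresses the warning; the lines are read exactly the same way.
quiet <- readLines(tmp, warn = FALSE)
length(quiet)

unlink(tmp)  # remove the temporary file
```

The same argument should work with a url() connection, since it is handled by readLines() itself.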

# The character class includes digits because the paths contain them (e.g. "9396").
> length(grep('"http://[./A-Za-z0-9]+\\.txt"', lines))
[1] 11
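
A quick way to check a pattern like this without fetching the page is to run it against a few hand-written lines. The sample vector below is invented for illustration (example.org is not the EPA site). The point to notice is that the character class has to list digits if the paths contain any:

```r
# Hypothetical sample lines, not taken from the EPA page.
sample_lines <- c(
  '<a href="http://example.org/data/9396/fish/fshmet.txt">fish metrics</a>',
  '<a href="http://example.org/data/index.html">index</a>',
  '<p>No link on this line.</p>'
)

# A quoted http URL made of dots, slashes, letters and digits, ending in .txt
pattern <- '"http://[./A-Za-z0-9]+\\.txt"'

hits <- grep(pattern, sample_lines)  # indices of the lines that match
hits
length(hits)
```

Only the first sample line should match; the second is a quoted URL but ends in .html.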

It is then fairly straightforward to strip the material before and after each
URL with sub():

> sub('(^.*")(http://[./A-Za-z0-9]+\\.txt)(".*$)', "\\2",
+     lines[grep('"http://[./A-Za-z0-9]+\\.txt"', lines)])
 [1] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/benmet.txt"
 [2] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/benthic/bencnt.txt"
 [3] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/location/watchr.txt"
 [4] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/location/habbest.txt"
 [5] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/design/sdesign.txt"
 [6] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/wchem/chmval.txt"
 [7] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshmet.txt"
 [8] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshcnt.txt"
 [9] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/fish/fshnam.txt"
[10] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/tissue/ftmet.txt"
[11] "http://www.epa.gov/emap/html/data/surfwatr/data/mastreams/9396/tissue/ftorg.txt"

> 
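
Since the question asked for the file names rather than the full URLs, one further step is possible. This is only a sketch under assumptions: 'page_lines' is a made-up stand-in for the vector read from the page, and it uses regmatches() with regexpr() as an alternative to the sub() approach, followed by basename() to drop the directory part:

```r
# Hypothetical stand-in for the lines read from the page (invented URLs).
page_lines <- c(
  '<li><a href="http://example.org/data/9396/benthic/benmet.txt">Benthic metrics</a></li>',
  '<li><a href="http://example.org/data/9396/fish/fshcnt.txt">Fish counts</a></li>',
  '<p>See the <a href="http://example.org/about.html">about page</a>.</p>'
)

url_pattern <- 'http://[./A-Za-z0-9]+\\.txt'

# regexpr() locates the first match on each line; regmatches() returns the
# matched text and silently drops lines that have no match.
urls <- regmatches(page_lines, regexpr(url_pattern, page_lines))

# basename() keeps only the part after the last "/".
files <- basename(urls)
files
```

Note that regexpr() finds at most one URL per line; if a line could hold several links, gregexpr() would be the function to reach for instead.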

> Thanks ahead of time.
> 
> 
> 
> --
> View this message in context: 
> http://r.789695.n4.nabble.com/grep-txt-file-names-from-html-tp4648037.html
> Sent from the R help mailing list archive at Nabble.com.
> 
> ______________________________________________
> R-help@r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.

David Winsemius, MD
Alameda, CA, USA
