Re: [R] Downloading a directory of text files into R
Rui,

Many thanks for your reply and code. I was not expecting that so much work would be required. It worked perfectly. The only thing I needed to do was create a Temp folder in the Documents folder.

Thanks again,

Bob

At 03:52 PM 7/26/2023, Rui Barradas wrote:

At 23:06 on 25/07/2023, Bob Green wrote:

Hello,

I am seeking advice on how I can download the 833 files from this site: "http://home.brisnet.org.au/~bgreen/Data/"

I want to download them to perform a textual analysis.

If the 833 files, which are in a directory with two subfolders, were on my computer, I could read them with readtext. Using readtext on the URL I get the error:

> x = readtext("http://home.brisnet.org.au/~bgreen/Data/*")
Error in download_remote(file, ignore_missing, cache, verbosity) :
  Remote URL does not end in known extension. Please download the file manually.

> x = readtext("http://home.brisnet.org.au/~bgreen/Data/Dir/()")
Error in download_remote(file, ignore_missing, cache, verbosity) :
  Remote URL does not end in known extension. Please download the file manually.

Any suggestions are appreciated.

Bob

Hello,

The following code downloads all files in the posted link.

suppressPackageStartupMessages({
  library(rvest)
})

# destination directory, change this at will
# note: dir.create() below is not recursive, so this directory must already exist
dest_dir <- "~/Temp"

# first get the two subfolders from the Data webpage
link <- "http://home.brisnet.org.au/~bgreen/Data/"
page <- read_html(link)
page %>%
  html_elements("a") %>%
  html_text() %>%
  grep("/$", ., value = TRUE) -> sub_folder

# create relevant disk sub-directories, if
# they do not exist yet
for(subf in sub_folder) {
  d <- file.path(dest_dir, subf)
  if(!dir.exists(d)) {
    success <- dir.create(d)
    msg <- paste("created directory", d, "-", success)
    message(msg)
  }
}

# prepare to download the files
dest_dir <- file.path(dest_dir, sub_folder)
source_url <- paste0(link, sub_folder)

success <- mapply(\(src, dest) {
  # read each Data subfolder
  # and get the file names therein
  # then lapply 'download.file' to each filename
  pg <- read_html(src)
  pg %>%
    html_elements("a") %>%
    html_text() %>%
    grep("\\.txt$", ., value = TRUE) %>%
    lapply(\(x) {
      s <- paste0(src, x)
      d <- file.path(dest, x)
      tryCatch(
        download.file(url = s, destfile = d),
        warning = function(w) w,
        error = function(e) e
      )
    })
}, source_url, dest_dir)

lengths(success)
# http://home.brisnet.org.au/~bgreen/Data/Hanson1/
#                                               84
# http://home.brisnet.org.au/~bgreen/Data/Hanson2/
#                                              749

# matches the question's number
sum(lengths(success))
# [1] 833

Hope this helps,

Rui Barradas

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.
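One point about the code above may be worth a quick check: lengths(success) counts the files that were attempted, not the files that arrived, because tryCatch() returns the warning or error object in place of the usual return value of download.file() (0 on success). The following is only a sketch of how one might separate the two cases. The list success_mock and the helper download_ok() are hypothetical names made up for illustration, and the mock stands in for the real 'success' object so that the example runs without network access.

```r
# Hypothetical stand-in for the 'success' list returned by mapply():
# one element per subfolder, each holding one entry per attempted file.
success_mock <- list(
  "Hanson1/" = list(0L, simpleError("cannot open URL")),
  "Hanson2/" = list(0L, 0L, simpleWarning("download had nonzero exit status"))
)

# TRUE where download.file() returned 0 (its success code),
# FALSE where tryCatch() handed back a warning or error condition.
download_ok <- function(results) {
  lapply(results, function(folder) {
    vapply(folder, function(r) is.numeric(r) && r == 0, logical(1))
  })
}

ok <- download_ok(success_mock)
n_ok     <- sapply(ok, sum)                      # successful downloads per subfolder
n_failed <- sapply(ok, function(x) sum(!x))      # failed downloads per subfolder
print(n_ok)
print(n_failed)
```

Applied to the real object, download_ok(success) would give one logical vector per subfolder, which could then be used to retry only the files that failed.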
Re: [R] Downloading a directory of text files into R
You cannot read remote files using name patterns. You can use list.files() with patterns on your local filesystem, and you can use the contributed packages RCurl or httr to retrieve and parse the listing of files returned by the web server. See the example in ?RCurl. Then you can download the individual files in an lapply() or for loop.
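As a rough illustration of the local half of this advice, here is a minimal base-R sketch of list.files() with a pattern, followed by reading each matched file. The directory Data_demo and the three small files are made up purely so that the example is self-contained; with real data one would point list.files() at the folder the downloads were saved to.

```r
# Build a small throwaway directory tree (hypothetical names and contents)
# so the sketch runs anywhere without touching real data.
tmp <- file.path(tempdir(), "Data_demo")
dir.create(file.path(tmp, "Hanson1"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(tmp, "Hanson2"), showWarnings = FALSE)
writeLines("first speech",    file.path(tmp, "Hanson1", "a.txt"))
writeLines("second speech",   file.path(tmp, "Hanson2", "b.txt"))
writeLines("not a text file", file.path(tmp, "Hanson2", "notes.csv"))

# Pattern matching works on the local filesystem: keep only *.txt,
# descending into both subfolders.
txt_files <- list.files(tmp, pattern = "\\.txt$",
                        recursive = TRUE, full.names = TRUE)

# Read each file into a single character string, named by file name.
texts <- vapply(txt_files,
                function(f) paste(readLines(f, warn = FALSE), collapse = "\n"),
                character(1))
names(texts) <- basename(txt_files)
print(texts)
```

The resulting named character vector is one possible starting point for a text analysis; whether readtext or another package is used afterwards is a separate choice.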
Re: [R] Downloading a directory of text files into R
Where is readtext() from?

Some combination of scraping http://home.brisnet.org.au/~bgreen/Data/Hanson1/ and http://home.brisnet.org.au/~bgreen/Data/Hanson2/ to recover the required file names:

library(rvest)
read_html("http://home.brisnet.org.au/~bgreen/Data/Hanson1/") |>
    html_element("body") |>
    html_element("table") |>
    html_table()

will get you most of the way there ... then an lapply() or for loop to download all the bits ...?

--
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
(Acting) Graduate chair, Mathematics & Statistics
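The remaining step, going from recovered file names to a download loop, could be sketched roughly as follows in base R. Everything named here is an assumption for illustration: build_download_plan() and run_download_plan() are hypothetical helpers, and the entries in listing_names are invented placeholders rather than actual file names from the site. The download itself is left commented out so that the example runs without network access.

```r
# Pair each .txt name from a directory listing with its URL and local path.
# Non-.txt entries (e.g. the "Parent Directory" link) are dropped.
build_download_plan <- function(base_url, file_names, dest_dir) {
  keep <- grepl("\\.txt$", file_names)
  data.frame(
    url  = paste0(base_url, file_names[keep]),
    dest = file.path(dest_dir, file_names[keep]),
    stringsAsFactors = FALSE
  )
}

# Download every row of the plan in a plain for loop.
run_download_plan <- function(plan) {
  for (i in seq_len(nrow(plan))) {
    download.file(url = plan$url[i], destfile = plan$dest[i], quiet = TRUE)
  }
  invisible(plan)
}

# Invented placeholder names standing in for a scraped listing.
listing_names <- c("Parent Directory", "speech_001.txt", "speech_002.txt")
plan <- build_download_plan(
  base_url   = "http://home.brisnet.org.au/~bgreen/Data/Hanson1/",
  file_names = listing_names,
  dest_dir   = file.path("~", "Temp", "Hanson1")
)
print(plan)

# run_download_plan(plan)  # uncomment to download; needs network access
#                          # and an existing destination directory
```

Keeping the plan as a data frame makes it easy to inspect the URL and destination pairs before anything is downloaded.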