Re: [R] Downloading a directory of text files into R

2023-07-26 Thread Bob Green

Rui,

Many thanks for your reply and code. I was not 
expecting that so much work would be required. It worked perfectly.

The only thing I needed to do was create a Temp folder in the Documents folder.

Thanks again,


Bob



Re: [R] Downloading a directory of text files into R

2023-07-25 Thread Rui Barradas

At 23:06 on 25/07/2023, Bob Green wrote:

Hello,

I am seeking advice as to how I can download the 833 files from this 
site: "http://home.brisnet.org.au/~bgreen/Data/"


I want to be able to download them to perform a textual analysis.

If the 833 files, which are in a directory with two subfolders, were on 
my computer, I could read them with readtext. Using readtext on the URL, 
I get the error:


 > x = readtext("http://home.brisnet.org.au/~bgreen/Data/*")
Error in download_remote(file, ignore_missing, cache, verbosity) :
   Remote URL does not end in known extension. Please download the file 
manually.


 > x = readtext("http://home.brisnet.org.au/~bgreen/Data/Dir/()")
Error in download_remote(file, ignore_missing, cache, verbosity) :
   Remote URL does not end in known extension. Please download the file 
manually.


Any suggestions are appreciated.

Bob


Hello,

The following code downloads all files in the posted link.



suppressPackageStartupMessages({
  library(rvest)
})

# destination directory, change this at will
dest_dir <- "~/Temp"

# first get the two subfolders from the Data webpage
link <- "http://home.brisnet.org.au/~bgreen/Data/"
page <- read_html(link)
page %>%
  html_elements("a") %>%
  html_text() %>%
  grep("/$", ., value = TRUE) -> sub_folder

# create relevant disk sub-directories, if
# they do not exist yet
for(subf in sub_folder) {
  d <- file.path(dest_dir, subf)
  if(!dir.exists(d)) {
    success <- dir.create(d)
    msg <- paste("created directory", d, "-", success)
    message(msg)
  }
}

# prepare to download the files
dest_dir <- file.path(dest_dir, sub_folder)
source_url <- paste0(link, sub_folder)

success <- mapply(\(src, dest) {
  # read each Data subfolder
  # and get the file names therein
  # then lapply 'download.file' to each filename
  pg <- read_html(src)
  pg %>%
    html_elements("a") %>%
    html_text() %>%
    grep("\\.txt$", ., value = TRUE) %>%
    lapply(\(x) {
      s <- paste0(src, x)
      d <- file.path(dest, x)
      tryCatch(
        download.file(url = s, destfile = d),
        warning = function(w) w,
        error = function(e) e
      )
    })
}, source_url, dest_dir)

lengths(success)
# http://home.brisnet.org.au/~bgreen/Data/Hanson1/
#   84
# http://home.brisnet.org.au/~bgreen/Data/Hanson2/
#  749

# matches the question's number
sum(lengths(success))
# [1] 833
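
A possible variation on the directory-creation step, as a sketch only: dir.create() does not create missing parent directories by default, so the loop above assumes that the top-level destination (here ~/Temp) already exists. Passing recursive = TRUE creates the parents as well. The values of dest_dir and sub_folder below are placeholders chosen so the snippet runs without network access; they are not the scraped values.

```r
# Placeholder values (assumptions), standing in for the names used above
dest_dir <- file.path(tempdir(), "Temp")
sub_folder <- c("Hanson1", "Hanson2")

for (subf in sub_folder) {
  d <- file.path(dest_dir, subf)
  if (!dir.exists(d)) {
    # recursive = TRUE also creates 'dest_dir' itself if it is missing
    success <- dir.create(d, recursive = TRUE)
    message("created directory ", d, " - ", success)
  }
}
```

With this change the top-level folder would not need to be created by hand before running the download step.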



Hope this helps,

Rui Barradas

__
R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
https://stat.ethz.ch/mailman/listinfo/r-help
PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
and provide commented, minimal, self-contained, reproducible code.


Re: [R] Downloading a directory of text files into R

2023-07-25 Thread Jeff Newmiller
You cannot read remote files using name patterns. You can use list.files with patterns 
on your local filesystems, and you can use RCurl or httr contributed packages 
to parse out the web listing of files returned by the web server.  See the 
example in ?RCurl. Then you can download the individual files in an lapply or 
for loop.
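
The local half of that advice might look like the sketch below. It uses only base R and made-up files in a temporary directory (the file names and contents are assumptions, not the data from the thread). Note that the pattern argument of list.files() is a regular expression rather than a shell wildcard.

```r
# Made-up example files in a temporary directory (assumption, not the real data)
root <- file.path(tempdir(), "Data")
dir.create(file.path(root, "Hanson1"), recursive = TRUE, showWarnings = FALSE)
dir.create(file.path(root, "Hanson2"), recursive = TRUE, showWarnings = FALSE)
writeLines("first speech",    file.path(root, "Hanson1", "a.txt"))
writeLines("second speech",   file.path(root, "Hanson2", "b.txt"))
writeLines("not a text file", file.path(root, "Hanson2", "notes.csv"))

# 'pattern' is a regular expression; recursive = TRUE descends into subfolders
txt_files <- list.files(root, pattern = "\\.txt$",
                        recursive = TRUE, full.names = TRUE)

# Read each file into a single string, named by its file name
texts <- lapply(txt_files, function(f) paste(readLines(f), collapse = "\n"))
names(texts) <- basename(txt_files)
```

Once the remote files have been downloaded, the same list.files() call can be pointed at the download directory.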


-- 
Sent from my phone. Please excuse my brevity.



Re: [R] Downloading a directory of text files into R

2023-07-25 Thread Ben Bolker

 Where is readtext() from?

  Some combination of scraping

http://home.brisnet.org.au/~bgreen/Data/Hanson1/

and

http://home.brisnet.org.au/~bgreen/Data/Hanson2/


to recover the required file names:

library(rvest)
read_html("http://home.brisnet.org.au/~bgreen/Data/Hanson1/") |>
  html_element("body") |> html_element("table") |> html_table()


will get you most of the way there ...

then an lapply() or for loop to download all the bits ...?
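
The download loop could be sketched roughly as follows. download_all() is a hypothetical helper, not something from rvest or base R, and the vector of file names is assumed to have been extracted already from the scraped listing. It relies only on download.file(), which returns 0 on success.

```r
# Hypothetical helper (an assumption, not part of any package):
# download each named file from 'base_url' into 'dest_dir' and
# return a named logical vector saying which downloads succeeded.
download_all <- function(base_url, file_names, dest_dir) {
  dir.create(dest_dir, recursive = TRUE, showWarnings = FALSE)
  vapply(file_names, function(f) {
    res <- tryCatch(
      download.file(paste0(base_url, f), file.path(dest_dir, f),
                    quiet = TRUE, mode = "wb"),
      warning = function(w) 1L,  # treat a warning as a failed download
      error   = function(e) 1L   # likewise for an error
    )
    identical(as.integer(res), 0L)  # download.file() returns 0 on success
  }, logical(1))
}

# Usage (needs network access; 'file_names' would come from the scraped table):
# ok <- download_all("http://home.brisnet.org.au/~bgreen/Data/Hanson1/",
#                    file_names, "~/Temp/Hanson1")
# table(ok)
```

Wrapping each call in tryCatch() keeps one failed file from stopping the whole loop, and the returned vector shows which files to retry.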





--
Dr. Benjamin Bolker
Professor, Mathematics & Statistics and Biology, McMaster University
Director, School of Computational Science and Engineering
(Acting) Graduate chair, Mathematics & Statistics
> E-mail is sent at my convenience; I don't expect replies outside of 
working hours.

