Hi Roy (& others)

Many thanks for the advice - well taken. Thanks also to the others who have responded so quickly - I thought I might have to wait days!! :-)

I'm on a Linux (Mint) machine. Below, I document three attempts, two using officer and the last now using textreadr.

My attempts so far using 'officer':

##################
(1) First Attempt:

# Load libraries
library(tcltk)
library(tidyverse)
library(officer)

setwd(tk_choose.dir())

doc_path <- list.files(getwd(), pattern = ".docx", full.names = TRUE)
files <- list.files(getwd(), ".docx")
files
length(files)

## This works to here - obtain a list of docx files in directory 'TEST with 9 files'.

However, the next line

doc_in <- read_docx(files)

results in this error:

Error in filetype %in% c("docx") && grepl("^([fh]ttp)", file) :
  'length = 9' in coercion to 'logical(1)'

No idea how to debug that.

Even when trying Calum's suggestion with officer:

content <- officer::docx_summary("Now they want us to charge our electric cars from litter bins.docx") # A title of one of the articles

The error returned is:

Error in x$doc_obj : $ operator is invalid for atomic vectors

##################
(2) Second Attempt:

# Load libraries
library(tcltk)
library(tidyverse)
library(officer)

setwd(tk_choose.dir())

doc_path <- list.files(getwd(), pattern = ".docx", full.names = TRUE)
files <- list.files(getwd(), ".docx")
files
length(files)

docx_summary(doc_path, preserve = FALSE)

## At this point, the error is:

Error in x$doc_obj : $ operator is invalid for atomic vectors

So, I am not sure how I am passing an atomic vector, or whether there is something I am supposed to set to make this something else.
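Looking at these two errors again, my tentative reading is that read_docx() takes a single file path, and that docx_summary() takes the object returned by read_docx() rather than a file name. If that is right, something like the sketch below is what I would expect to need. The helper names (list_docx, summarise_docx, summarise_folder) are made up for illustration. I call officer::read_docx() explicitly because textreadr also exports a read_docx(), and the two could mask each other if both packages are loaded.

```r
library(officer)

# Hypothetical helper: full paths of the .docx files in a folder.
# The pattern is anchored so that only names ending in ".docx" match.
list_docx <- function(dir) {
  list.files(dir, pattern = "\\.docx$", full.names = TRUE)
}

# Hypothetical helper: read ONE file and summarise it.
# read_docx() gets a single path and returns an rdocx object;
# docx_summary() gets that object and returns a data frame.
summarise_docx <- function(path) {
  doc <- officer::read_docx(path)
  officer::docx_summary(doc)
}

# Hypothetical helper: apply the above to every file, one at a time.
summarise_folder <- function(dir) {
  paths <- list_docx(dir)
  out <- lapply(paths, summarise_docx)
  names(out) <- basename(paths)
  out
}
```

With this, summaries <- summarise_folder(getwd()) should give a named list with one data frame per article, and summaries[[1]]$text should hold the paragraphs of the first one.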
##################
(3) Third attempt - now trying with textreadr (Thanks for the help on installing this, Calum):

# Load libraries
library(tcltk)
library(tidyverse)
library(textreadr)

folder <- setwd(tk_choose.dir())
files <- list.files(folder, ".docx")
files
length(files)

doc <- read_docx("Now they want us to charge our electric cars from litter bins.docx") # One of the 9 files in the folder

read_docx(doc, skip = 0, remove.empty = TRUE, trim = TRUE) # To test against one file

## The last line returns the following error:

Error in filetype %in% c("docx") && grepl("^([fh]ttp)", file) :
  'length = 38' in coercion to 'logical(1)'

##################

And so I am going around in circles and not at all clear on how I can make progress.

I am sure that there must be a way, but the suggestions on-line each lead to the above errors.

Thanks for any further help.

Best wishes, and thanks
Andy

On 29/12/2023 18:25, Roy Mendelssohn - NOAA Federal wrote:
> Hi Andy:
>
> I don’t have an answer but I do have what I hope is some friendly advice.
> Generally the more information you can provide, the more likely you will get
> help that is useful. In your case you say that you tried several packages
> and they didn’t do what you wanted. Providing that code, as well as why
> they didn’t do what you wanted (be specific) would greatly facilitate things.
>
> Happy new year,
>
> -Roy
>
>
>> On Dec 29, 2023, at 10:14 AM, Andy<phaedr...@gmail.com> wrote:
>>
>> Hello
>>
>> I am trying to work through a problem, but feel like I've gone down a rabbit
>> hole. I'd very much appreciate any help.
>>
>> The task: I have several directories of multiple (some directories, up to
>> 2,500+) *.docx files (newspaper articles downloaded from Lexis+) that I want
>> to iterate through to append to a spreadsheet only those articles that
>> satisfy a condition (i.e., a specific keyword is present for >= 50% coverage
>> of the subject matter). Lexis+ has a very specific structure and keywords
>> are given in the row "Subject".
>>
>> I'd like to be able to accomplish the following:
>>
>> (1) Append the title, the month, the author, the number of words, and page
>> number(s) to a spreadsheet
>>
>> (2) Read each article and extract keywords (in the docs, these are listed in
>> 'Subject' section as a list of keywords with a percentage showing the extent
>> to which the keyword features in the article (e.g., FAST FASHION (72%)) and
>> to append the keyword and the % coverage to the same row in the spreadsheet.
>> However, I want to ensure that the keyword coverage meets the threshold of
>> >= 50%; if not, then pass onto the next article in the directory. Rinse and
>> repeat for the entire directory.
>>
>> So far, I've tried working through some Stack Overflow-based solutions, but
>> most seem to use the textreadr package, which is now deprecated; others use
>> either the officer or the officedown packages. However, these packages don't
>> appear to do what I want the program to do, at least not in any of the
>> examples I have found, nor in the vignettes and relevant package manuals
>> I've looked at.
>>
>> The first point is, is what I am intending to do even possible using R? If
>> it is, then where do I start with this? If these docx files were converted
>> to UTF-8 plain text, would that make the task easier?
>>
>> I am not a confident coder, and am really only just getting my head around R
>> so appreciate a steep learning curve ahead, but of course, I don't know what
>> I don't know, so any pointers in the right direction would be a big help.
>>
>> Many thanks in anticipation
>>
>> Andy
>>
>> ______________________________________________
>> R-help@r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
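To make the filtering step in (2) of my original message concrete, here is a minimal base-R sketch of the behaviour I am after. It assumes the 'Subject' row has already been pulled out of the article as a single string, and that the keywords in it are separated by semicolons (that separator is my assumption about the Lexis+ layout, not something I have verified across all files). parse_subject() and keep_keywords() are hypothetical names, not functions from any package.

```r
# Hypothetical helper: turn a 'Subject' string such as
#   "Subject: FAST FASHION (72%); RETAILERS (45%)"
# into a data frame with one row per keyword.
parse_subject <- function(subject) {
  # Drop a leading "Subject:" label if one is present.
  subject <- sub("^[[:space:]]*Subject:[[:space:]]*", "", subject,
                 ignore.case = TRUE)
  # Each match is a keyword followed by "(NN%)".
  hits <- regmatches(subject,
                     gregexpr("[^;()]+\\([0-9]+%\\)", subject))[[1]]
  data.frame(
    keyword  = trimws(sub("[[:space:]]*\\([0-9]+%\\)$", "", hits)),
    coverage = as.numeric(sub("^.*\\(([0-9]+)%\\)$", "\\1", hits)),
    stringsAsFactors = FALSE
  )
}

# Hypothetical helper: keep only keywords at or above the threshold.
keep_keywords <- function(subject, threshold = 50) {
  kw <- parse_subject(subject)
  kw[kw$coverage >= threshold, , drop = FALSE]
}
```

An article would then be skipped whenever nrow(keep_keywords(subject)) is 0; otherwise the surviving keyword and coverage values are what I would want to append to that article's row in the spreadsheet.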