Nancy,

I have access to our staging database, but not production. I'm not sure our 
sysadmins will let me work in the prod database, though they may be able to 
give me read-only access. Pulling the file_uri values from file_version would 
be much more efficient. However, I'm not just looking to check digital object 
links, but also any links found within collection- and archival-object-level 
notes, whether pasted straight into the text of the notes or linked using the 
<extref> tag. I could probably query the database for that info too.
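
As a rough illustration of the two kinds of note links mentioned above, the sketch below pulls both <extref> targets and bare URLs out of a string of note text. It assumes the note text has already been retrieved somehow (from the database or an export); the function name and the sample text are made up for the example, and the regular expressions are deliberately simple rather than a full URL grammar.

```python
import re

# Matches href="..." or xlink:href="..." inside an <extref> tag.
EXTREF_HREF = re.compile(
    r'<extref\b[^>]*?\bhref\s*=\s*["\']([^"\']+)["\']', re.IGNORECASE
)
# Matches bare http(s) URLs pasted straight into the note text.
BARE_URL = re.compile(r'https?://[^\s<>"\']+')


def extract_urls(note_text):
    """Return the unique links in note_text: extref targets first, then bare URLs."""
    found = EXTREF_HREF.findall(note_text) + BARE_URL.findall(note_text)
    unique = []
    for url in found:
        url = url.rstrip('.,;:)')  # drop trailing sentence punctuation
        if url not in unique:
            unique.append(url)
    return unique


# Hypothetical note text, for illustration only.
sample_note = (
    'See <extref xlink:href="https://example.org/a">the guide</extref> '
    'or visit http://example.com/b.'
)
sample_urls = extract_urls(sample_note)
```

Because an <extref> target that starts with http also matches the bare-URL pattern, the function removes duplicates before returning.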

Corey
________________________________
From: archivesspace_users_group-boun...@lyralists.lyrasis.org 
<archivesspace_users_group-boun...@lyralists.lyrasis.org> on behalf of Kennedy, 
Nancy <kenne...@si.edu>
Sent: Wednesday, February 10, 2021 9:18 AM
To: Archivesspace Users Group <archivesspace_users_group@lyralists.lyrasis.org>
Subject: Re: [Archivesspace_Users_Group] Checking for Broken URLs in Resources

Hi Corey –

Do you have access to query the database directly, as a starting point, instead 
of working from EAD? We were able to pull the file_uri values from the 
file_version table in the database. Our sysadmin then checked the response 
codes for that list of URIs, and we referred issues out to staff working on 
those collections. Staff can make some corrections directly, or for long lists, 
you could include the digital_object id and post updates that way.
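
A minimal sketch of that workflow, under several assumptions: the query selects file_uri from file_version as described, but the digital_object_id column name is a guess at the schema, and the helper names are invented for the example. The status check uses the standard library's urllib. The demo at the bottom uses an in-memory SQLite table and canned status codes as stand-ins for the real database and live requests, so it runs without touching either.

```python
import sqlite3
import urllib.error
import urllib.request

# Only file_uri and file_version come from the thread; digital_object_id is an
# assumed column name.
QUERY = """
    SELECT digital_object_id, file_uri
    FROM file_version
    WHERE file_uri LIKE 'http%'
    ORDER BY digital_object_id
"""


def fetch_file_uris(conn):
    """Run QUERY on a DB-API connection and return (digital_object_id, file_uri) rows."""
    cur = conn.cursor()
    cur.execute(QUERY)
    return cur.fetchall()


def status_of(url, timeout=10):
    """Return the HTTP status code for url, or the error text if no response came back."""
    req = urllib.request.Request(url, method="HEAD")
    try:
        with urllib.request.urlopen(req, timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        return err.code  # the server answered with an error status, e.g. 404
    except (urllib.error.URLError, OSError, ValueError) as err:
        return str(err)  # DNS failure, timeout, refused connection, bad URL


def broken_links(rows, check=status_of):
    """Keep only the rows whose check result is not a 2xx/3xx status code."""
    report = []
    for object_id, uri in rows:
        status = check(uri)
        if not (isinstance(status, int) and 200 <= status < 400):
            report.append((object_id, uri, status))
    return report


# Demo with stand-ins: an in-memory table and a dictionary of canned statuses.
demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE file_version (digital_object_id INTEGER, file_uri TEXT)")
demo.executemany(
    "INSERT INTO file_version VALUES (?, ?)",
    [(1, "https://example.org/ok"), (2, "https://example.org/gone"), (3, "local/scan.tif")],
)
canned = {"https://example.org/ok": 200, "https://example.org/gone": 404}
report = broken_links(fetch_file_uris(demo), check=canned.get)
```

Keeping the digital object id next to each URI means the report can be handed to staff, or used for batch corrections, without a second lookup.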



Nancy





From: archivesspace_users_group-boun...@lyralists.lyrasis.org 
<archivesspace_users_group-boun...@lyralists.lyrasis.org> On Behalf Of Corey 
Schmidt
Sent: Wednesday, February 10, 2021 8:45 AM
To: archivesspace_users_group@lyralists.lyrasis.org
Subject: [Archivesspace_Users_Group] Checking for Broken URLs in Resources

Dear all,

Hello, this is Corey Schmidt, ArchivesSpace PM at the University of Georgia. I 
hope everyone is doing well and staying safe and healthy.

Would anyone know of any script, plugin, or tool to check for invalid URLs 
within resources? We are investigating how to grab URLs from exported EAD.xml 
files and check whether they return any sort of error (404s mostly, but also 
any others). My thinking is to build a small app that will export EAD.xml 
files from ArchivesSpace, then sift through the raw XML using Python's lxml 
package and catch any URLs with a regex. After capturing each URL, it would use 
the requests library to check the URL's status code and, if it returns an 
error, log that error in a .csv output file to act as a "report" of all the 
broken links within that resource.
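
The plan above could be sketched roughly as follows. This is an illustration under stated assumptions, not the app itself: it runs the regex over the raw EAD string directly rather than going through lxml, the function names and the sample XML are made up, and the status check is passed in as a callable so the example can run with a stub instead of live requests.

```python
import csv
import io
import re

# A deliberately simple pattern for http(s) URLs in raw XML text.
URL_PATTERN = re.compile(r'https?://[^\s<>"\']+')


def find_urls(ead_xml):
    """Return the distinct URLs in a raw EAD string, sorted for stable output."""
    return sorted({m.rstrip('.,;:)') for m in URL_PATTERN.findall(ead_xml)})


def write_report(resource_id, ead_xml, check, out):
    """Write a header plus one CSV row per URL whose status is not 2xx/3xx.

    check is any callable that takes a URL and returns a status code
    (or an error string). Returns the number of broken links written.
    """
    writer = csv.writer(out)
    writer.writerow(["resource", "url", "status"])
    broken = 0
    for url in find_urls(ead_xml):
        status = check(url)
        if not (isinstance(status, int) and 200 <= status < 400):
            writer.writerow([resource_id, url, status])
            broken += 1
    return broken


# Example with a stubbed checker and an in-memory file; nothing here is real data.
sample_ead = (
    '<ead><p>See https://example.org/a.</p>'
    '<extref href="http://example.org/b">link</extref></ead>'
)
stub_statuses = {"https://example.org/a": 404, "http://example.org/b": 200}
buffer = io.StringIO()
broken_count = write_report("example-resource", sample_ead, stub_statuses.get, buffer)
```

With the requests library, the check callable could be something like `lambda url: requests.head(url, timeout=10, allow_redirects=True).status_code`, wrapped in a try/except for `requests.RequestException` so that unreachable hosts are logged rather than stopping the run.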

The problems with this method are: 1. Exporting thousands of resources takes a 
lot of time and some processing power, as well as a moderate amount of local 
storage space. 2. Even checking the raw XML files takes a considerable amount 
of time. The app I'm working on takes overnight to export and check all the 
XML files. I considered querying the API for different parts of a resource, but 
I figured that would take as much time as just exporting an EAD.xml and would 
be even more complex to write. I've checked Awesome ArchivesSpace, this 
listserv, and a few script libraries from institutions, but haven't found 
exactly what I am looking for.
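
One possible way to cut the checking time (this does nothing for the export time) is to check each distinct URL only once and to run the checks in parallel. The sketch below assumes most of the time is spent waiting on network responses, which is a guess rather than something measured here; the function name, the worker count, and the sample data are all illustrative.

```python
from concurrent.futures import ThreadPoolExecutor


def check_all(urls_by_resource, check, max_workers=16):
    """Check each distinct URL once, in parallel, and map results back to resources.

    urls_by_resource: dict of resource id -> list of URLs found in that resource
    check: callable that takes a URL and returns a status
    Returns a dict of resource id -> {url: status}.
    """
    distinct = sorted({url for urls in urls_by_resource.values() for url in urls})
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map yields results in the same order as its input.
        statuses = dict(zip(distinct, pool.map(check, distinct)))
    return {
        resource: {url: statuses[url] for url in urls}
        for resource, urls in urls_by_resource.items()
    }


# Example with a stubbed checker that records which URLs it was asked about.
calls = []


def fake_check(url):
    calls.append(url)
    return 404 if url.endswith("/gone") else 200


results = check_all(
    {
        "res1": ["https://example.org/a", "https://example.org/gone"],
        "res2": ["https://example.org/a"],
    },
    fake_check,
    max_workers=4,
)
```

In the example, the URL shared by both resources is checked a single time and its status is reported under each resource. A modest worker count is probably kinder to the servers being checked than a very high one.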

Any info or advice would be greatly appreciated! Thanks!

Sincerely,

Corey



Corey Schmidt

ArchivesSpace Project Manager

University of Georgia Special Collections Libraries

Email: corey.schm...@uga.edu