Exactly my thoughts, as well, Buddy.

I was going to recommend BaseX, at least as a first step to investigate the 
full corpus (https://docs.basex.org/wiki/Getting_Started).  It indexes XML 
documents very quickly (and in their entirety, which is important whether you 
want to use those documents as is or transform them to something else).  You 
can do that from the command line or, without even taking the time to learn its 
commands, you can use the GUI to 1) index and 2) get an overview of your new 
database properties, including an exhaustive summary of attributes, elements, 
path structure, etc.
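
If you want a rough idea of what that kind of overview looks like before 
installing anything, here is a minimal Python sketch that tallies element and 
attribute paths across a set of XML files using only the standard library.  To 
be clear, this is not BaseX and not how BaseX does it; the function names 
(summarize_paths, summarize_directory) are just ones I made up for 
illustration, and it assumes the files are well-formed XML.

```python
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path


def summarize_paths(xml_sources):
    """Count element paths and attribute paths across XML documents.

    xml_sources: an iterable of filenames or binary file objects.
    (Hypothetical helper for illustration; not part of BaseX.)
    """
    counts = Counter()
    for source in xml_sources:
        stack = []  # tags of the elements that are currently open
        for event, elem in ET.iterparse(source, events=("start", "end")):
            if event == "start":
                stack.append(elem.tag)
                path = "/" + "/".join(stack)
                counts[path] += 1
                # Attributes are already available at the "start" event.
                for attr in elem.attrib:
                    counts[path + "/@" + attr] += 1
            else:
                stack.pop()
                elem.clear()  # drop the finished element's text and children
    return counts


def summarize_directory(directory):
    """Summarize every *.xml file found under a directory (recursively)."""
    return summarize_paths(sorted(Path(directory).rglob("*.xml")))
```

Something like summarize_directory("path/to/xml").most_common() would then 
list the most frequent paths first (the directory name is a placeholder).  A 
real XML database gives you this plus indexes and a query language, so treat 
the sketch as a quick sanity check only.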

That said, I'm now curious about the NYT XML dataset in general.  Can you 
provide a link to more info about it?  

I just did a very quick bit of searching and found this interesting blog post, 
https://open.blogs.nytimes.com/2016/07/26/the-future-of-the-past-modernizing-the-new-york-times-archive,
which I believe describes how that dataset (or a similar one) was handled 
internally: the XML, plus HTML documents standing in for any missing XML docs, 
was converted into JSON.  Because of that, I expect it's not the type of XML 
that you'll need to retain as XML, but getting a full view of the entire forest 
is probably the only way to know for sure.

Mark



-----Original Message-----
From: Code for Libraries [mailto:CODE4LIB@LISTS.CLIR.ORG] On Behalf Of 
Pennington, Buddy D.
Sent: Thursday, 17 December, 2020 3:31 PM
To: CODE4LIB@LISTS.CLIR.ORG
Subject: Re: [CODE4LIB] Web app to search XML files

Yes, lots of excellent suggestions from folks. I was actually looking at BaseX 
earlier today as a tool to review the XML once we have it. 

Thanks!

Buddy Pennington
Head of Electronic Resources & Systems
University Libraries
University of Missouri - Kansas City
(he/him/his)

-----Original Message-----
From: Code for Libraries <CODE4LIB@LISTS.CLIR.ORG> On Behalf Of David Mayo
Sent: Thursday, December 17, 2020 2:15 PM
To: CODE4LIB@LISTS.CLIR.ORG
Subject: Re: [CODE4LIB] Web app to search XML files

A lot of good suggestions; if you're looking for fast turnaround without having 
to decompose and shift the data, it might be worth looking at dedicated XML 
databases like eXist-db and BaseX.

http://exist-db.org/exist/apps/homepage/index.html
https://basex.org/

IIRC, eXist-db has built-in functionality for building applications; even if 
you don't go that way, I've found both of these very useful for analyzing XML 
corpora before running other software to transform them.

- Dave Mayo (He/Him)
Software Dev @ Harvard LTS


On Thu, Dec 17, 2020 at 2:53 PM Stuart A. Yeates <syea...@gmail.com> wrote:

> There's XML and XML.
>
> I suggest that you enquire about the exact format that you're going to 
> be receiving and ask around for systems that support it out of the 
> box.
>
> cheers
> stuart
>
>
> --
> ...let us be heard from red core to black sky
>
> On Fri, 18 Dec 2020 at 07:37, Pennington, Buddy D. 
> <penningt...@umkc.edu>
> wrote:
> >
> > Hi all,
> >
> > We're purchasing an XML dataset for the historical NY Times and I am
> > curious about any suggestions to quickly build a web app to search and
> > display those records for end users.
> >
> > Buddy Pennington
> > Head of Electronic Resources & Systems
> > University Libraries
> > University of Missouri - Kansas City
> > (he/him/his)
>
