Re: [CODE4LIB] Web app to search XML files

Majewski, Steven Dennis (sdm7g) Thu, 17 Dec 2020 13:10:29 -0800

+1 for BaseX. (I believe you can do pretty much the same with eXist.)
For a web app, look at RESTXQ <https://docs.basex.org/wiki/RESTXQ>, which uses 
annotations to map HTTP routes to XQuery functions and their arguments.
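
To make the annotation idea a bit more concrete, here is a rough analogy in 
Python. It is not RESTXQ (which is written in XQuery and runs inside BaseX or 
eXist); it only mimics the pattern of declaring a route and a query parameter 
next to the function that handles them. The sample records, the element names 
(article, title, date), and the helpers route(), search() and dispatch() are 
all made up for illustration; the real ProQuest schema is not known here.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qs

# Hypothetical sample data: the real ProQuest element names are an assumption.
SAMPLE = """<articles>
  <article id="a1"><title>Opening of the Brooklyn Bridge</title><date>1883-05-25</date></article>
  <article id="a2"><title>Election Returns</title><date>1896-11-04</date></article>
</articles>"""

# Registry filled in by the decorator: path -> (handler, query parameter name).
ROUTES = {}


def route(path, query_param):
    """Register a function for a URL path and bind one query parameter to its argument."""
    def decorator(func):
        ROUTES[path] = (func, query_param)
        return func
    return decorator


@route("/search", query_param="q")
def search(q):
    """Return the ids of sample articles whose title contains q (case-insensitive)."""
    root = ET.fromstring(SAMPLE)
    needle = q.lower()
    return [
        article.get("id")
        for article in root.iter("article")
        if needle in (article.findtext("title") or "").lower()
    ]


def dispatch(url):
    """Look up the handler for the URL's path and call it with the bound query parameter."""
    parts = urlsplit(url)
    func, param = ROUTES[parts.path]
    value = parse_qs(parts.query).get(param, [""])[0]
    return func(value)


print(dispatch("/search?q=bridge"))
```

In RESTXQ the equivalent declarations are annotations such as %rest:path and 
%rest:query-param on an XQuery function, and the database does the dispatching 
for you, so there is no registry or dispatch function to write.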


— Steve M. 

> On Dec 17, 2020, at 3:57 PM, Pennington, Buddy D. <[email protected]> 
> wrote:
> 
> Interesting. Our XML is being provided by ProQuest. We're cancelling our 
> subscription to their Historical NYT database and they are offering up XML of 
> the articles for 1851-1938. 
> 
> Buddy Pennington
> Head of Electronic Resources & Systems
> University Libraries
> University of Missouri - Kansas City
> (he/him/his)
> 
> 
> -----Original Message-----
> From: Code for Libraries <[email protected]> On Behalf Of Custer, Mark
> Sent: Thursday, December 17, 2020 2:46 PM
> To: [email protected]
> Subject: Re: [CODE4LIB] Web app to search XML files
> 
> Exactly my thoughts, as well, Buddy.
> 
> I was going to recommend BaseX, at least as a first step to investigate the 
> full corpus (https://docs.basex.org/wiki/Getting_Started).  It indexes XML 
> documents very quickly (and in their entirety, which is important whether you 
> want to use those documents as is or transform them to something else).  You 
> can do that from the command line or, without even taking the time to learn 
> its commands, you can use the GUI to 1) index and 2) get an overview of your 
> new database properties, including an exhaustive summary of attributes, 
> elements, path structure, etc.
> 
> That said, I'm now curious about the NYT XML dataset in general.  Can you 
> provide a link to more info about it?  
> 
> I just did a very quick bit of searching, and found this interesting blog 
> post, 
> https://open.blogs.nytimes.com/2016/07/26/the-future-of-the-past-modernizing-the-new-york-times-archive,
>  which I believe describes how that dataset (or a similar one) was handled 
> internally, converting it and HTML documents for missing XML docs into JSON.  
> Due to that, I expect it's not the type of XML that you'll need to retain as 
> XML, but getting a full view of the entire forest is probably the only way to 
> know for sure.
> 
> Mark
> 
> 
> 
> -----Original Message-----
> From: Code for Libraries [mailto:[email protected]] On Behalf Of 
> Pennington, Buddy D.
> Sent: Thursday, 17 December, 2020 3:31 PM
> To: [email protected]
> Subject: Re: [CODE4LIB] Web app to search XML files
> 
> Yes, lots of excellent suggestions from folks. I was actually looking at 
> BaseX earlier today as a tool to review the XML once we have it. 
> 
> Thanks!
> 
> Buddy Pennington
> Head of Electronic Resources & Systems
> University Libraries
> University of Missouri - Kansas City
> (he/him/his)
> 
> -----Original Message-----
> From: Code for Libraries <[email protected]> On Behalf Of David Mayo
> Sent: Thursday, December 17, 2020 2:15 PM
> To: [email protected]
> Subject: Re: [CODE4LIB] Web app to search XML files
> 
> A lot of good suggestions; if you're looking for fast turnaround without 
> having to decompose and shift the data, it might be worth looking at 
> dedicated XML databases like eXist-db and BaseX:
> 
> http://exist-db.org/exist/apps/homepage/index.html
> https://basex.org/
> 
> IIRC, eXist-db has built-in functionality for building applications; even 
> if you don't go that way, I've found these databases very useful for 
> analyzing XML corpora before running other software to transform them.
> 
> - Dave Mayo (He/Him)
> Software Dev @ Harvard LTS
> 
> 
> On Thu, Dec 17, 2020 at 2:53 PM Stuart A. Yeates <[email protected]> wrote:
> 
>> There's XML and XML.
>> 
>> I suggest that you enquire about the exact format that you're going to 
>> be receiving and ask around for systems that support it out of the 
>> box.
>> 
>> cheers
>> stuart
>> 
>> 
>> --
>> ...let us be heard from red core to black sky
>> 
>> On Fri, 18 Dec 2020 at 07:37, Pennington, Buddy D. 
>> <[email protected]>
>> wrote:
>>> 
>>> Hi all,
>>> 
>>> We're purchasing an XML dataset for the historical NY Times and I am
>>> curious about any suggestions to quickly build a web app to search and
>>> display those records for end users.
>>> 
>>> Buddy Pennington
>>> Head of Electronic Resources & Systems
>>> University Libraries
>>> University of Missouri - Kansas City
>>> (he/him/his)
>>
