Hello!

In brief, I’m looking for any advice folks have on accepting XML of various 
encodings into a single database. Are there any considerations I should take 
into account with, say, RESTXQ form parameters? Or with storage, indexing, 
querying, etc.?

For context, I’m currently (re)building a RESTXQ API for a service that 
publishes TEI. The site, TAPAS, is open to anyone with TEI that might need a 
place to store it and show it off online. We do some minimal testing on 
uploaded files [1], but up to this point we haven’t applied any limitations on 
the character encoding that folks use. While I expect that many people use 
UTF-8 encoding for TEI, I would like to try to ensure that folks using other 
encodings can use the service as well. The previous version of the TAPAS-xq API 
was not tested against non-UTF-8 input, among other gaps, and I’m trying to do 
better in this new version.

One problem that I’m running into is simply retrieving UTF-16 XML from a 
multipart form parameter.[2] I think BaseX is serializing the file as a string, 
but when I try to parse it as XML, the file is flagged as ill-formed (“Content 
is not allowed in prolog”). Is there a way to be flexible about uploaded XML 
while giving BaseX any parsing/serialization hints it might need?
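
To illustrate what may be going on, here is a minimal sketch in Python rather than XQuery. It assumes the cause is that the uploaded UTF-16 bytes are decoded as UTF-8 before parsing, which is a guess, not a confirmed diagnosis of BaseX's behavior. The helper names (sniff_encoding, is_well_formed) and the sample document are hypothetical.

```python
import codecs
import xml.etree.ElementTree as ET


def sniff_encoding(data: bytes) -> str:
    """Guess a Python codec name from a leading byte order mark (BOM).

    Only UTF-8 and UTF-16 BOMs are handled here; anything without a BOM
    is assumed to be UTF-8. A real service would need more cases.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"
    return "utf-8"


def is_well_formed(source) -> bool:
    """Return True if the bytes or string can be parsed as XML."""
    try:
        ET.fromstring(source)
        return True
    except ET.ParseError:
        return False


# Hypothetical sample: a tiny TEI document stored as UTF-16 with a BOM.
SAMPLE = (
    '<?xml version="1.0" encoding="UTF-16"?>'
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"/>'
)
raw = SAMPLE.encode("utf-16")

# Simulate a receiver that wrongly assumes UTF-8: the BOM bytes become
# replacement characters and every other character is a NUL, so there is
# junk in front of the XML declaration.
misread = raw.decode("utf-8", errors="replace")

print("detected codec:", sniff_encoding(raw))
print("raw bytes parse:", is_well_formed(raw))
print("misread string parses:", is_well_formed(misread))
```

In this sketch the raw bytes parse because the parser can see the BOM and detect the encoding itself, while the mis-decoded string fails before the first "<". If the same thing is happening in the API, keeping the upload as binary until it reaches the XML parser may be worth trying.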

I’m about at the limit of my ability to figure out how different encodings 
might be working within BaseX; I appreciate any and all information or advice!

Warmly,
Ash

[1]: Our minimal testing: Is this well-formed XML? Is it in the TEI namespace? 
Is the outermost element <TEI> rather than <teiCorpus> or anything else? We 
also try to make sure the XML doesn’t contain JavaScript that might make it 
into a reader’s browser.
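
The first three checks above can be sketched as follows, in Python for illustration only. This is not the TAPAS-xq implementation; the function name check_upload and the wording of its messages are hypothetical, and the JavaScript check is left out.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def check_upload(data: bytes) -> list[str]:
    """Return a list of problems; an empty list means the file passed."""
    try:
        # Parsing the raw bytes lets the parser detect the encoding
        # from the BOM or the XML declaration.
        root = ET.fromstring(data)
    except ET.ParseError as err:
        return [f"not well-formed: {err}"]

    problems = []
    # ElementTree spells a namespaced tag as "{namespace}local-name".
    if not root.tag.startswith("{" + TEI_NS + "}"):
        problems.append("outermost element is not in the TEI namespace")
    if root.tag.rpartition("}")[2] != "TEI":
        problems.append("outermost element is not <TEI>")
    return problems


if __name__ == "__main__":
    corpus = b'<teiCorpus xmlns="http://www.tei-c.org/ns/1.0"/>'
    print(check_upload(corpus))
```

Because the function takes bytes, the same check can be run on UTF-8 and UTF-16 uploads without decoding them first.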

[2]: I first discovered this might be a problem when creating a unit test 
<https://github.com/NEU-DSG/tapas-xq/blob/d66066e65a661a4b909c21ae852d6479f0ae6274/modules/test-suite.xql#L120-L155> 
for this function 
<https://github.com/NEU-DSG/tapas-xq/blob/d66066e65a661a4b909c21ae852d6479f0ae6274/modules/tapas-api.xql#L225-L273>: 
the UTF-16 file-as-string could not be parsed as XML. I then tested a UTF-16 
file against the RESTXQ endpoint with curl and found the same problem (though 
I do not know whether the string was read as UTF-8 or UTF-16). In contrast, 
when I tried to create a standalone test module 
<https://gist.github.com/amclark42/9f8d8135a30e4659774a673627a263ac>, a 
UTF-16 file-as-string could be parsed as XML, but not after being encoded as 
binary and then decoded again.


Ash Clark (my pronouns are e/em/eir)
XML Applications Developer
Digital Scholarship Group
Northeastern University Libraries
as.cl...@northeastern.edu
(617) 373-5983
