Hello!

In brief, I’m looking for any advice folks have on accepting XML of various encodings into a single database. Are there any considerations I should take into account with, say, RESTXQ form parameters? Or with storage, indexing, querying, etc.?

For context, I’m currently (re)building a RESTXQ API for a service that publishes TEI. The site, TAPAS, is open to anyone with TEI who might need a place to store it and show it off online. We do some minimal testing on uploaded files [1], but up to this point we haven’t applied any limitations on the character encoding that folks use. While I expect that many people use UTF-8 encoding for TEI, I would like to try to ensure that folks using other encodings can use the service as well. The previous version of the TAPAS-xq API was not tested in this regard (and in other ways as well). I’m trying to do better in writing this new version.

One problem that I’m running into is simply retrieving UTF-16 XML from a multipart form parameter [2]. I think BaseX is serializing the file as a string, but when I try to parse it as XML, the file is flagged as ill-formed (“Content is not allowed in prolog”). Is there a way to be flexible about uploaded XML while giving BaseX any parsing/serialization hints it might need?

I’m about at the limit of my ability to figure out how different encodings might be working within BaseX; I appreciate any and all information or advice!

Warmly,
Ash

[1]: Our minimal testing: Is this well-formed XML? Is it in the TEI namespace? Is the outermost element <TEI> rather than <teiCorpus> or anything else? We also try to make sure the XML doesn’t have JavaScript that might make it into a reader’s browser.

[2]: I first discovered this might be a problem when creating a unit test <https://github.com/NEU-DSG/tapas-xq/blob/d66066e65a661a4b909c21ae852d6479f0ae6274/modules/test-suite.xql#L120-L155> for this function <https://github.com/NEU-DSG/tapas-xq/blob/d66066e65a661a4b909c21ae852d6479f0ae6274/modules/tapas-api.xql#L225-L273>: the UTF-16 file-as-string could not be parsed as XML. I then tested a UTF-16 file against the RESTXQ endpoint with curl, and found the same problem (though I do not know if the string has been read as UTF-8 or UTF-16). In contrast, when I tried to create a standalone test module <https://gist.github.com/amclark42/9f8d8135a30e4659774a673627a263ac>, a UTF-16 file-as-string could be parsed as XML, but not after it was encoded as binary and then decoded again.

Ash Clark (my pronouns are e/em/eir)
XML Applications Developer
Digital Scholarship Group
Northeastern University Libraries
as.cl...@northeastern.edu
(617) 373-5983