#984: BibRecord/BatchUploader/BibUpload/..: investigate improved handling of
input XML encoding
------------------------+---------------------------
Reporter: jcaffaro | Owner:
Type: enhancement | Status: new
Priority: major | Component: BibUpload
Version: | Keywords: BatchUploader
------------------------+---------------------------
The BibRecord library is currently more or less encoding-agnostic*: it
does not load the files to parse itself (the ''client'' therefore has to
deal with file encoding when using BibRecord) and delivers the data "AS IS"
to its underlying parsers. The underlying parsers behave differently with
respect to the encoding of the input string. For example, ''minidom'' would
try to determine the encoding from the XML prologue:
{{{
<?xml version="1.0" encoding="iso-8859-1" ?>
}}}
and would automatically decode the input accordingly, while ''PyRXP''
would simply assume UTF-8 input, possibly leading silently to an incorrect
representation of the data. For example, the following Latin-1 encoded XML
would be parsed correctly by minidom (either when loaded with
{{{xml.dom.minidom.parseString(...)}}} or with
{{{xml.dom.minidom.parse(...)}}}), but not by PyRXP, which would assume
UTF-8 encoding in any case:
{{{
<?xml version="1.0" encoding="iso-8859-1" ?>
<collection>
<record>
<controlfield tag="001">62602</controlfield>
<datafield tag="245" ind1=" " ind2=" ">
<subfield code="a">Ullmann Günther</subfield>
[...]
}}}
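The minidom side of this can be reproduced with a minimal sketch like the following. It assumes Python 3 and a shortened, closed version of the sample record (the elided part of the sample is not reconstructed); PyRXP is not exercised here.

```python
from xml.dom import minidom

# Latin-1 bytes: "ü" is the single byte 0xFC, as in the sample record.
# The record is shortened and closed here so that it is well-formed.
latin1_xml = (
    b'<?xml version="1.0" encoding="iso-8859-1" ?>\n'
    b'<collection><record>\n'
    b'<datafield tag="245" ind1=" " ind2=" ">\n'
    b'<subfield code="a">Ullmann G\xfcnther</subfield>\n'
    b'</datafield></record></collection>\n'
)

doc = minidom.parseString(latin1_xml)
subfield = doc.getElementsByTagName("subfield")[0]
# The text node is decoded according to the encoding named in the prologue.
name = subfield.firstChild.data
```

Here `name` holds the correctly decoded Unicode string, because the parser honoured the declared encoding.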
If the XML prologue is removed from the above sample XML, minidom will
assume UTF-8 encoding too (even when given the path to the file, i.e. when
some heuristic could be used to guess the file encoding), unless the string
has been encoded to 'Latin_1' using Python.
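A small sketch of the prologue-less case, assuming Python 3, where the expat-based parser rejects bytes that are not valid UTF-8. The helper name `parses_cleanly` is hypothetical and the record is shortened:

```python
from xml.dom import minidom
from xml.parsers.expat import ExpatError

# Same Latin-1 bytes as the sample, but with the XML prologue removed.
no_prologue = (
    b'<collection><record>'
    b'<subfield code="a">Ullmann G\xfcnther</subfield>'
    b'</record></collection>'
)

def parses_cleanly(data):
    """Return True if minidom accepts the bytes, False on a parse error."""
    try:
        minidom.parseString(data)
        return True
    except ExpatError:
        return False

# Without a declaration the parser assumes UTF-8 and chokes on byte 0xFC.
rejected = not parses_cleanly(no_prologue)
# Transcoding the same content to UTF-8 first makes it acceptable.
accepted = parses_cleanly(no_prologue.decode("latin-1").encode("utf-8"))
```

Other Python versions or parsers may behave differently, e.g. by decoding silently to wrong characters instead of raising.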
Since BibUpload will use BibRecord to insert records into the database, it
is important to check/fix the input encoding somewhere and/or instruct
clients to use a given encoding (UTF-8). Some handling of the encoding
could be done in:
* the parsers: that could be the ideal solution, but as described above
each parser has a slightly different behaviour. It seems furthermore that
some heuristic would be necessary to guess file encoding, which might be
unsafe here.
* BibRecord: it would be beneficial for all clients to delegate encoding
matters to BibRecord, but since BibRecord is file and source agnostic, it
seems safer for it to remain encoding agnostic too.
* the clients: knowing the input source and having access to the file
seems to make them the natural place to deal with encoding. For example,
BibUpload might simply rely on the admin being responsible enough to check
the encoding before uploading, while BatchUploader would enforce stronger
restrictions with respect to the encoding (e.g. require an XML prologue
that specifies the encoding and check that it matches the guessed
encoding, for instance by running xmllint). The ''xmlmarclint'' tool could
also be improved to check encoding, by optionally running ''xmllint'' (or
using the same technique for discovering mismatches, if easily
reproducible), provided that it does not trigger false positives (BibRecord
fixes the XML if necessary before passing it to the parser, e.g. by adding
'<collection>' and '</collection>' tags).
{{{
$ xmllint batchupload__20120120130121_E4r3SA_without_encoding
batchupload__20120120130121_E4r3SA_without_encoding:5: parser error :
Input is not proper UTF-8, indicate encoding !
Bytes: 0xFC 0x6E 0x74 0x68
<subfield code="a">Ullmann G�nther</subfield>
$ xmlmarclint batchupload__20120120130121_E4r3SA_without_encoding
(no output)
}}}
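A client-side check in the spirit of the xmllint run above could look like the following sketch (Python 3). The helper `find_encoding_problem` is hypothetical and not part of BibRecord, BatchUploader or xmlmarclint. It only verifies that the bytes decode with the declared encoding, falling back to UTF-8 when none is declared:

```python
import re

# Hypothetical helper: matches an XML declaration carrying an encoding.
_DECL_RE = re.compile(
    br'^\s*<\?xml[^>]*?encoding\s*=\s*["\']([A-Za-z0-9._-]+)["\']'
)

def find_encoding_problem(data):
    """Return None if the bytes decode with their declared encoding
    (UTF-8 when none is declared), else a short description."""
    match = _DECL_RE.match(data)
    encoding = match.group(1).decode("ascii") if match else "utf-8"
    try:
        data.decode(encoding)
    except LookupError:
        return "unknown encoding %r" % encoding
    except UnicodeDecodeError as err:
        # Report the line and the first offending byte, like xmllint does.
        line = data.count(b"\n", 0, err.start) + 1
        return "line %d: byte 0x%02X is not valid %s" % (
            line, data[err.start], encoding)
    return None
```

Note the limit of this approach: any byte sequence decodes as Latin-1, so a file wrongly declared as iso-8859-1 would pass. Detecting that kind of mismatch would require a heuristic, with the risks mentioned above.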
Given all the above considerations, it would seem reasonable to leave
encoding handling entirely to the ''client'' and have it provide UTF-8
input. If so, this should be formalized.
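If clients are made responsible for providing UTF-8, a client that knows its source encoding could normalize the input before handing it to BibRecord. The following is only a sketch under that assumption (Python 3). The function `to_utf8` and its `source_encoding` parameter are hypothetical, and it assumes an ASCII-compatible source encoding:

```python
import re

# Matches an existing XML declaration at the very start of the document.
_XML_DECL_RE = re.compile(r'^<\?xml[^>]*\?>\s*')

def to_utf8(data, source_encoding):
    """Decode bytes from a known encoding and return UTF-8 bytes whose
    XML declaration states UTF-8 (hypothetical client-side helper)."""
    text = data.decode(source_encoding)
    # Drop any old declaration so it cannot contradict the new encoding.
    text = _XML_DECL_RE.sub("", text, count=1)
    return ('<?xml version="1.0" encoding="UTF-8"?>\n' + text).encode("utf-8")
```

Rewriting the declaration together with the bytes matters: leaving an iso-8859-1 prologue on UTF-8 data would mislead parsers that honour the prologue, such as minidom.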
There might also be a better, or safe enough, way to deal with encoding
directly within BibRecord or the supported parsers. Such alternatives
should be investigated.
(Since encoding errors are more likely for input coming through
BatchUploader, the ticket is currently linked to the BibUpload component.)
*Still, there are some assumptions here and there that the data is UTF-8
encoded, and the resulting parsed tree is expected to be UTF-8 as well
(grep for 'utf-8' in {{{bibrecord.py}}}).
--
Ticket URL: <http://invenio-software.org/ticket/984>
Invenio <http://invenio-software.org>