#984: BibRecord/BatchUploader/BibUpload/..: investigate improved handling of
input XML encoding
------------------------+---------------------------
Reporter:  jcaffaro     |      Owner:
    Type:  enhancement  |     Status:  new
Priority:  major        |  Component:  BibUpload
 Version:               |   Keywords:  BatchUploader
------------------------+---------------------------
 The BibRecord library is currently more or less encoding-agnostic*: it does
 not need to load the files it parses (the ''client'' therefore has to deal
 with file encoding when using BibRecord) and delivers the data "AS IS" to
 its underlying parsers. The underlying parsers behave differently with
 respect to the encoding of the input string. For example, ''minidom'' would
 try to guess the encoding using the XML prologue:
 {{{
 <?xml version="1.0" encoding="iso-8859-1" ?>
 }}}
 and would automatically decode/encode to the correct representation, while
 ''pyRXP'' would simply assume a UTF-8 input, possibly silently leading to
 an incorrect representation of the data. For example, the following Latin-1
 encoded XML would be correctly parsed by minidom (either when loaded with
 {{{xml.dom.minidom.parseString(...)}}} or with
 {{{xml.dom.minidom.parse(...)}}}), but not by pyRXP, which would assume
 a UTF-8 encoding in any case:
 {{{
 <?xml version="1.0" encoding="iso-8859-1" ?>
 <collection>
 <record>
    <controlfield tag="001">62602</controlfield>
    <datafield tag="245" ind1=" " ind2=" ">
       <subfield code="a">Ullmann Günther</subfield>
 [...]
 }}}
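
 As a minimal sketch of the minidom behaviour described above (assuming a
 current Python 3 standard library, which may differ in detail from the
 version this ticket was written against), the sample can be fed to
 {{{xml.dom.minidom.parseString(...)}}} as raw Latin-1 bytes. The record is
 cut down to the fields shown and closed so that it is well-formed; the
 helper name {{{first_subfield_text}}} is hypothetical.

```python
import xml.dom.minidom

# Same sample as above, stored as Latin-1 bytes (u-umlaut is the byte 0xFC).
# The elided part of the record is simply closed to keep the XML well-formed.
LATIN1_XML = (
    '<?xml version="1.0" encoding="iso-8859-1" ?>\n'
    '<collection>\n'
    '<record>\n'
    '   <controlfield tag="001">62602</controlfield>\n'
    '   <datafield tag="245" ind1=" " ind2=" ">\n'
    '      <subfield code="a">Ullmann G\u00fcnther</subfield>\n'
    '   </datafield>\n'
    '</record>\n'
    '</collection>\n'
).encode("iso-8859-1")


def first_subfield_text(xml_bytes):
    """Parse raw bytes with minidom and return the first subfield's text."""
    dom = xml.dom.minidom.parseString(xml_bytes)
    subfield = dom.getElementsByTagName("subfield")[0]
    return subfield.firstChild.data


# The parser reads the declared encoding, so the text comes back decoded.
title = first_subfield_text(LATIN1_XML)
```

 Here {{{title}}} is a decoded text string, because the parser took the
 encoding from the prologue rather than assuming UTF-8.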

 If the XML prologue is removed from the above sample XML, minidom will
 assume UTF-8 encoding too (even when given the path to the file, i.e. when
 some heuristic could be used to guess the file encoding), unless the string
 has been encoded to 'Latin_1' using Python.
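
 The missing-prologue case can be sketched in the same way. This assumes
 Python 3, where minidom's expat backend rejects the byte 0xFC as invalid
 UTF-8 instead of accepting it silently, and where decoding explicitly from
 Latin-1 before parsing avoids the problem. The record and the helper
 {{{parse_text}}} are hypothetical, for illustration only.

```python
import xml.dom.minidom
from xml.parsers.expat import ExpatError

# Latin-1 bytes with NO XML prologue (hypothetical minimal record).
NO_PROLOGUE = (
    '<record><subfield code="a">Ullmann G\u00fcnther</subfield></record>'
).encode("iso-8859-1")


def parse_text(xml_input):
    """Return the subfield text, or None if the parser rejects the input."""
    try:
        dom = xml.dom.minidom.parseString(xml_input)
    except ExpatError:
        return None
    return dom.getElementsByTagName("subfield")[0].firstChild.data


# Without a prologue the bytes are read as UTF-8, where 0xFC is not valid.
as_bytes = parse_text(NO_PROLOGUE)

# Decoding explicitly beforehand removes the guesswork for the parser.
as_text = parse_text(NO_PROLOGUE.decode("iso-8859-1"))
```

 In this sketch {{{as_bytes}}} is {{{None}}} (the parse fails), while
 {{{as_text}}} holds the correctly decoded name.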

 Since BibUpload will use BibRecord to insert records into the database, it
 is important to check/fix somewhere the input encoding and/or instruct
 clients to use a given encoding (UTF-8). Some handling of the encoding
 could be done in:

  * the parsers: that could be the ideal solution, but as described above
 each parser has a slightly different behaviour. Furthermore, it seems that
 some heuristic would be necessary to guess the file encoding, which might
 be unsafe here.
  * BibRecord: it would be beneficial for all clients to delegate encoding
 matters to BibRecord, but since it is file and source agnostic, it seems
 safer for it to remain encoding agnostic too.
  * the clients: knowing the input source and having access to the file
 seems to make them the best placed to deal with encoding. For example,
 BibUpload might simply rely on the admin being responsible enough to check
 the encoding before uploading, while BatchUploader would enforce stronger
 restrictions with respect to the encoding (e.g. require an XML prologue
 that specifies the encoding and check that it matches the guessed
 encoding, for instance by running xmllint). The ''xmlmarclint'' tool could
 also be improved to check the encoding, by optionally running ''xmllint''
 (or by using the same technique for discovering mismatches, if easily
 reproducible), provided that it does not trigger false positives
 (BibRecord fixes the XML if necessary before passing it to the parser,
 e.g. by adding '<collection>' and '</collection>' tags).
 {{{
 $ xmllint batchupload__20120120130121_E4r3SA_without_encoding
 batchupload__20120120130121_E4r3SA_without_encoding:5: parser error :
 Input is not proper UTF-8, indicate encoding !
 Bytes: 0xFC 0x6E 0x74 0x68
       <subfield code="a">Ullmann G�nther</subfield>
 $ xmlmarclint batchupload__20120120130121_E4r3SA_without_encoding
 (no output)
 }}}
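
 A comparable check is easy to reproduce in Python for input that declares
 no encoding (or declares UTF-8). The following is only a sketch of such a
 check, not a description of how xmllint works internally; the function
 name {{{find_invalid_utf8}}} and the sample bytes are hypothetical.

```python
def find_invalid_utf8(data):
    """Return (line_number, offending_bytes) for the first invalid UTF-8
    sequence in data, or None if the bytes decode cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # Count the newlines before the failing offset to get a line number.
        line_number = data.count(b"\n", 0, err.start) + 1
        return line_number, data[err.start:err.start + 4]
    return None


# Latin-1 encoded sample without a prologue (hypothetical, shortened).
sample = (
    b"<collection>\n"
    b"<record>\n"
    b'  <subfield code="a">Ullmann G\xfcnther</subfield>\n'
    b"</record>\n"
    b"</collection>\n"
)
problem = find_invalid_utf8(sample)
```

 For this sample, {{{problem}}} points at line 3 and at the bytes
 0xFC 0x6E 0x74 0x68, the same byte sequence xmllint reports above. Note
 that the opposite check is much weaker: any byte sequence decodes as
 Latin-1 without error, so a wrongly declared Latin-1 prologue cannot be
 detected this way.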

 Given all the above considerations, it would seem reasonable to leave it
 entirely to the ''client'' to deal with encodings and to provide UTF-8
 input. If so, this should be formalized.
 There might also be some better, or safe-enough, alternative for dealing
 with encoding right from within BibRecord or the supported parsers. These
 alternatives should be investigated.
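
 If the client-side approach were chosen, a client could normalize its
 input to UTF-8 before handing it to BibRecord. The sketch below is one
 possible shape for such a step, not existing Invenio code: the helper
 {{{to_utf8}}} and its {{{default_encoding}}} parameter are hypothetical,
 and it assumes an ASCII-compatible input encoding (so it would not handle
 UTF-16, for instance).

```python
import re

# Matches the encoding pseudo-attribute of a leading XML declaration.
_DECL_ENCODING = re.compile(
    br'^(\s*<\?xml[^>]*?encoding\s*=\s*["\'])([A-Za-z][A-Za-z0-9._-]*)(["\'])'
)


def to_utf8(data, default_encoding="utf-8"):
    """Transcode XML bytes to UTF-8, trusting the declared encoding if any.

    Raises UnicodeDecodeError if the bytes do not match that encoding,
    and LookupError if the declared encoding is unknown to Python.
    """
    match = _DECL_ENCODING.match(data)
    declared = match.group(2).decode("ascii") if match else default_encoding
    utf8 = data.decode(declared).encode("utf-8")
    # Rewrite the prologue so that it stays truthful after transcoding.
    return _DECL_ENCODING.sub(br"\g<1>UTF-8\g<3>", utf8, count=1)
```

 Input without a prologue is assumed to be UTF-8 already, so a Latin-1
 file lacking its prologue is rejected with an error rather than silently
 misread.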

 (Encoding errors being more likely for input coming through
 BatchUploader, the ticket is currently linked to the BibUpload component.)

 *Still, there are some assumptions here and there that the data is
 UTF-8 encoded, and the resulting parsed tree is expected to be UTF-8 as
 well (grep for 'utf-8' in {{{bibrecord.py}}}).

-- 
Ticket URL: <http://invenio-software.org/ticket/984>
Invenio <http://invenio-software.org>
