#748: wash_for_utf8 breaks sometimes when fed an already-unicode string
-----------------------+------------------------------
  Reporter:  jblayloc  |      Owner:  jblayloc
      Type:  defect    |     Status:  assigned
  Priority:  blocker   |  Milestone:
 Component:  MiscUtil  |    Version:
Resolution:            |   Keywords:  INSPIRE DEPLOYED
-----------------------+------------------------------

Comment (by jblayloc):

 Replying to [comment:6 simko]:
 > Actually, it is not correct to amend the docstring of
 > `wash_for_utf8()` to say that it returns UTF-8 encoded string with
 > incorrect characters stripped, while at the same time to amend it to
 > return Unicode strings unchanged.  Looking at the function, it assumes
 > that its input parameter `text` is UTF-8 binary string already, so it
 > simply checks for compliance and removes any extra binary characters.
 > The function would fail should it receive UTF-16 encoded binary
 > string, for example.  So we should amend either the docstring or the
 > behaviour of the function to cover these non-UTF-8 input cases better.

 Ok, I think amending the doc string is the right thing to do.  I haven't
 yet seen a case where we've had UTF-16 in the wild, so I'm not convinced
 we need to worry about them too much.  In my view, the present better is
 still far closer to correct than the previous behavior.

 > For example, we could amend it to accept either UTF-8 binary strings
 > or Unicode strings (so UTF-16 would still lead to error), while it
 > would return UTF-8 binary strings in both input cases.  But then it
 > would not be only "washing", it would be also "converting"...

 Yes; if I recall correctly, the intention of this method (Piotr?  Help me
 out?) was that it would take strings is arbitrary encoding and return the
 parts of those strings which happen to have an ASCII representation when
 the string is interpreted as UTF-8.  Which I view as a kind of converting;
 I think the choice of names may have been unfortunate.  What's really
 desired, I think, is an actual anything-to-UTF8 converter.  But of course
 we don't actually store the encoding of strings in the database, so we
 can't be sure what we're converting from, which I believe limits our
 options.

 > Moreover, calling this function for a Unicode input string, while the
 > function usually expects UTF-8 binary string, may lead to masking
 > troubles with encode/decode in our code base.  (Kind of similar to
 > calling `cgi.escape(x)` for an already escaped argument.)  One can
 > argue that since this function is used to wash UTF-8 binary strings,
 > why should it be called for Unicode strings in the first place?

 I don't understand your question.

 As I understand it, all UTF8 strings are Unicode strings, but not vice-
 versa.  Further, I believe we have no way to query a string and find out
 what its current encoding is.  However, it's a fact that we have strings
 stored in our database already (in the Author indexes) which are already
 UTF8.  And possibly other, unknown, encodings.  This function, as it was
 written before, threw an exception when called with a UTF8 string.  This
 stopped Henning and Sam from being able to run BAI.  Simply not molesting
 input seemed like the least invasive change.

 The remainder of the patch after the first two lines is (strictly
 speaking) unnecessary optimization because Piotr was offended by the
 inefficiency of what he'd written and so sat with me and rewrote the
 function before he'd let me deploy it anywhere.

 > It is therefore possible that the problem at hand is better tackled at
 > the source place somewhere else, not at the level of this UTF-8
 > washing.  Why exactly was this change needed?
 >
 > (Putting the ticket status back to "assigned", and raising its
 > priority to "blocker", since this patch was already deployed on
 > INSPIRE.)

 What do you mean by "the source place"?

 I really really wish I could add Cc:s to tickets, but the Modify thing
 below only lets me add myself.  This is irritating because I want Piotr
 Henning and Sam to read and comment on this.

-- 
Ticket URL: <http://invenio-software.org/ticket/748#comment:8>
Invenio <http://invenio-software.org>

Reply via email to