#748: wash_for_utf8 breaks sometimes when fed an already-unicode string
-----------------------+------------------------------
Reporter: jblayloc | Owner: jblayloc
Type: defect | Status: assigned
Priority: blocker | Milestone:
Component: MiscUtil | Version:
Resolution: | Keywords: INSPIRE DEPLOYED
-----------------------+------------------------------
Comment (by jblayloc):
Replying to [comment:6 simko]:
> Actually, it is not correct to amend the docstring of
> `wash_for_utf8()` to say that it returns UTF-8 encoded string with
> incorrect characters stripped, while at the same time to amend it to
> return Unicode strings unchanged. Looking at the function, it assumes
> that its input parameter `text` is UTF-8 binary string already, so it
> simply checks for compliance and removes any extra binary characters.
> The function would fail should it receive UTF-16 encoded binary
> string, for example. So we should amend either the docstring or the
> behaviour of the function to cover these non-UTF-8 input cases better.
Ok, I think amending the docstring is the right thing to do. I haven't
yet seen a case where we've had UTF-16 in the wild, so I'm not convinced
we need to worry about it too much. In my view, the present behavior is
still far closer to correct than the previous behavior.
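For concreteness, here is a minimal sketch (Python 2, as Invenio uses)
of the behaviour the amended docstring would describe. This is an
illustration only, not the actual `wash_for_utf8()` code; the function
name and docstring here are hypothetical:
{{{
def wash_for_utf8_sketch(text):
    """Return `text` with bytes that are not valid UTF-8 stripped.

    If `text` is already a unicode object, return it unchanged.
    """
    if isinstance(text, unicode):
        # Already decoded: do not touch it (the "least invasive" patch).
        return text
    # Decode while silently dropping invalid byte sequences, then
    # re-encode so the caller still gets a UTF-8 byte string back.
    return text.decode('utf-8', 'ignore').encode('utf-8')
}}}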
> For example, we could amend it to accept either UTF-8 binary strings
> or Unicode strings (so UTF-16 would still lead to error), while it
> would return UTF-8 binary strings in both input cases. But then it
> would not be only "washing", it would be also "converting"...
Yes; if I recall correctly, the intention of this method (Piotr? Help me
out?) was that it would take strings in arbitrary encodings and return the
parts of those strings which happen to have an ASCII representation when
the string is interpreted as UTF-8. Which I view as a kind of converting;
I think the choice of names may have been unfortunate. What's really
desired, I think, is an actual anything-to-UTF8 converter. But of course
we don't actually store the encoding of strings in the database, so we
can't be sure what we're converting from, which I believe limits our
options.
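To make that concrete, here is a sketch of the kind of anything-to-UTF8
converter I mean, assuming Python 2 and that guessing the source
encoding is acceptable; the function name and the fallback list are
illustrative, not a project decision:
{{{
def guess_and_convert_to_utf8(raw):
    """Best-effort conversion of `raw` to a UTF-8 byte string."""
    if isinstance(raw, unicode):
        return raw.encode('utf-8')
    # We do not store the source encoding, so try a few guesses in order.
    for encoding in ('utf-8', 'utf-16'):
        try:
            return raw.decode(encoding).encode('utf-8')
        except UnicodeError:
            pass
    # latin-1 accepts every byte value, so it cannot fail; it is the
    # catch-all guess when nothing better matched.
    return raw.decode('latin-1').encode('utf-8')
}}}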
> Moreover, calling this function for a Unicode input string, while the
> function usually expects UTF-8 binary string, may lead to masking
> troubles with encode/decode in our code base. (Kind of similar to
> calling `cgi.escape(x)` for an already escaped argument.) One can
> argue that since this function is used to wash UTF-8 binary strings,
> why should it be called for Unicode strings in the first place?
I don't understand your question.
As I understand it, all UTF8 strings are Unicode strings, but not vice-
versa. Further, I believe we have no way to query a string and find out
what its current encoding is. However, it is a fact that we have strings
stored in our database (in the Author indexes) which are already UTF8,
and possibly strings in other, unknown, encodings. This function, as it
was written before, threw an exception when called with an
already-decoded Unicode string. This stopped Henning and Sam from being
able to run BAI. Simply not molesting the input seemed like the least
invasive change.
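For reference, this is one classic Python 2 failure mode that matches
the symptom; a hypothetical reproduction, not the actual code path or
the BAI traceback:
{{{
# Calling .decode('utf-8') on an object that is already unicode first
# re-encodes it with the default ASCII codec, which raises as soon as a
# non-ASCII character appears.
name = u'Schr\xf6dinger'   # an already-decoded author name
name.decode('utf-8')       # raises UnicodeEncodeError, despite "decode"
}}}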
The remainder of the patch, after the first two lines, is (strictly
speaking) unnecessary optimization: Piotr was offended by the
inefficiency of what he'd written, so he sat with me and rewrote the
function before he'd let me deploy it anywhere.
> It is therefore possible that the problem at hand is better tackled at
> the source place somewhere else, not at the level of this UTF-8
> washing. Why exactly was this change needed?
>
> (Putting the ticket status back to "assigned", and raising its
> priority to "blocker", since this patch was already deployed on
> INSPIRE.)
What do you mean by "the source place"?
I really, really wish I could add Cc:s to tickets, but the Modify thing
below only lets me add myself. This is irritating because I want Piotr,
Henning, and Sam to read and comment on this.
--
Ticket URL: <http://invenio-software.org/ticket/748#comment:8>
Invenio <http://invenio-software.org>