> Let me elaborate then. In Python 2, there are two distinct types of
> strings, so to speak, Unicode (character) strings and binary (byte)
> strings. In Invenio, we are using typically UTF-8 binary strings in
> I/O tiers: they get stored into DB, they get read from or written into
> the web interface, etc. When necessary, we transform them into
> Unicode strings inside the application business logic for
> manipulations that require Unicode string objects instead of binary
> string objects. For an example, see `wash_index_term()`.
>
> Now the reason for the existence of `wash_for_utf8()` is that
> sometimes the UTF-8 binary strings at the I/O times may be corrupted.
> For example, maybe our call to `pdftotext` did not return fully UTF-8
> valid output and there was some gibberish. In these cases, we may try
> to do what we can with these ill-defined input strings to "wash away"
> problematic bytes from UTF-8 binary string, while retaining the good
> ones, so that the incoming string would be as untouched as possible
> while being fully UTF-8 valid. This is what `wash_for_utf8()`
> function attempts to do.
Isn't this what unicode(str, 'utf8', errors='ignore') does?
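For reference, here is what that decode-and-ignore plus re-encode round
trip does in Python 2 (a minimal sketch; whether wash_for_utf8() is
byte-for-byte equivalent to this is exactly the question):
{{{
# Python 2 sketch: strip bytes that are not valid UTF-8 by decoding
# with errors='ignore' and encoding back to a UTF-8 binary string.
raw = 'caf\xc3\xa9 \xff\xfe broken'      # valid UTF-8 plus two stray bytes
washed = unicode(raw, 'utf-8', 'ignore').encode('utf-8')
print repr(washed)                       # -> 'caf\xc3\xa9  broken'
}}}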
> From this point of view, it is not necessary to amend this function so
> that it would work on Unicode strings too -- simply because Unicode
> strings can be converted into valid UTF-8 binary strings anytime by
> `encode('utf-8')`. It should not be needed to wash anything in these
> cases, since our business logic works with valid Unicode strings
> already.
>
> Hence my "masking worries", notably that some component of the Invenio
> business logic probably calls UTF-8 washing function in an
> inappropriate way, akin to escaping an already escaped argument, so to
> speak.
The place this was being called from, though, was
perform_request_search. Sam and Henning don't call the washer
directly; they take strings out of the database and hand them
to p_r_s without any intermediation. So it's not that they're
calling the washer on something that shouldn't be washable; it's that
the assumption that what's in the database is utf8-encoded binary
strings is false. Or at least, not always true. And as far as I can
tell, we can't really count on things becoming any cleaner in the
foreseeable future.
So if something is washing inappropriately, we have two problems:
finding it and dealing with it. But the problem there isn't with
Henning and Sam's stuff, so the most likely culprit is bibindex, which
calls wash_for_utf8 more than anyone else.
My feeling though is that there will always be bad data sneaking into
the system, and core components like search_engine should be as robust
to this as possible. So if the right place to force coercion to unicode
isn't wash_for_utf8, surely it's in wash_pattern? Or should we wrap
every call to wash_pattern to check types?
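(By "wrap every call", I mean something along these lines -- a
hypothetical helper, not existing code, and the import path is from
memory:)
{{{
# Hypothetical wrapper, not existing code: coerce unicode arguments to
# UTF-8 binary strings before handing them to wash_pattern(), so that
# callers don't have to care which string type they are holding.
from invenio.search_engine import wash_pattern   # path assumed

def wash_pattern_any(p):
    if isinstance(p, unicode):
        p = p.encode('utf-8')
    return wash_pattern(p)
}}}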
In a perfect world everything would be Unicode strings only, and all of
our outputs (and our source files (and our default encoding in
sitecustomize.py) (and our database schemata)) would all be UTF-8 (not
to mention all of our imports and exports having explicit content-type
and encoding tags), and we'd be more-or-less saved from thinking about
this stuff. But right now we're somewhere in the messy in-between.
So how do we move from here, closer to there?
Couldn't we just change the name of wash_for_utf8() to give_me_utf8()
and change the
{{{
return text
}}}
on line 352 of my branch to
{{{
return text.encode('utf8')
}}}
?
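Roughly, the shape would then be something like this (a sketch of the
idea only, not the actual code sitting on my branch):
{{{
# Sketch of the idea only -- not the real code from my branch.  When
# the caller hands us a unicode object, return a valid UTF-8 binary
# string instead of returning the unicode object untouched.
def give_me_utf8(text):
    if isinstance(text, unicode):
        return text.encode('utf8')   # previously: return text
    # ... byte-level washing of str input continues as before ...
}}}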
As an aside, wash_for_utf8 is called in a bunch of methods in
bibindex_engine that are called sequentially, so washing is being done
potentially many times for one string.
> If this part of the Invenio application already transformed
> UTF-8 binary string into Unicode string, then why is there a need to
> call `wash_for_utf8()`? And if this part of the application tries to
> perform some I/O washing upon Unicode strings, then shouldn't these
> have been converted into UTF-8 binary strings previously? Trying to
> amend `wash_for_utf8()` to accept also Unicode string arguments could
> "mask" problems of the kind that some code base component expects
> UTF-8 binary string as its input, but some other component passes it
> Unicode strings instead. We should avoid cases like that.
What if we have wash_for_utf8 log a warning when it does this (rather
than an error)?
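Something like this, say (sketch only, signature simplified):
{{{
# Sketch only: warn (rather than raise) when wash_for_utf8() is handed
# a unicode object, so the offending caller can be tracked down without
# breaking anything in the meantime.
import warnings

def wash_for_utf8(text):
    if isinstance(text, unicode):
        warnings.warn("wash_for_utf8() called with a unicode object",
                      stacklevel=2)
        return text.encode('utf8')
    # ... existing washing of potentially bad UTF-8 binary strings ...
}}}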
> (Alternatively, and perhaps ideally, we could work everywhere in our
> application logic code base with Unicode strings only, and transform
> them to UTF-8 binary strings at the I/O times only, but this is not
> the state of the things right now.)
I think this is the only reasonable end goal.
> So this is why I wanted to understand why this change was needed and
> whether we shouldn't rather try to isolate the part of the code that
> was apparently calling for this amendment of the UTF-8 washing
> function, in order to check whether we shouldn't solve the issue "at
> the source component" level and not inside `wash_for_utf8()` level.
> The `wash_for_utf8()` function was designed primarily to help with
> washing of potentially bad UTF-8 input/output values; it should not be
> needed to work with Unicode string objects.
Well, as far as I can tell, all of the callers are doing things that
seem reasonable; it's just that our data is dirty.
> > The remainder of the patch after the first two lines is (strictly
> > speaking) unnecessary optimization because Piotr was offended by the
> > inefficiency of what he'd written and so sat with me and rewrote the
> > function before he'd let me deploy it anywhere.
>
> We can commit this part independently as an optimisation :)
Re-re-reading it, I'm less certain than I was a month ago, but it looks
like it may also be technically more correct.
Joe