On Mon, Jun 9, 2014 at 8:00 PM, Sean Pringle <sprin...@wikimedia.org> wrote:

> On Tue, Jun 10, 2014 at 1:04 AM, Nuria Ruiz <nu...@wikimedia.org> wrote:
>
>> > Just to narrow this down a little further from the DB server-side: the
>> > eventlogging tables do use utf-8, so the fix probably doesn't require
>> > laborious schema changes (if that's what you meant by changing database
>> > types).
>>
>> To follow the structure used in MediaWiki, I think the easiest approach
>> is to change db types from varchar to varbinary where utf-8 is being
>> used. Please let us know if you do not think it is appropriate.
>>
>
> Ah, so long-term ecosystem consistency is also an aim. Sounds wise. I was
> only commenting in case it could make the current python encoding fix
> easier and faster.
>
> Were it a new system without ties to MW I'd push for solving character set
> issues properly with something like utf8mb4, depending on how you want to
> read/sort the data, but without that luxury varbinary is fine.
>


commit 9cff78b7c6a9516611cfd055906fd0707c4d5b88
Author: Ori Livneh <o...@wikimedia.org>
Date:   Sun Apr 28 14:46:28 2013 -0700

    Default MariaDB character encoding for EL data: utf8 -> utf8mb4

    This change sets the default character encoding for MySQL / MariaDB
    EventLogging data to 'utf8mb4' (was: 'utf8'), adding support for
    characters above the Base Multilingual Plane. Deployment will require
    manual migration of existing data in the database.

    One of the consequences of this migration is that the previous default
    size for string columns is no longer appropriate, since the columns it
    generates are not indexable by InnoDB, which will not index columns
    beyond 767 bytes. This change therefore amends the default size to be
    191, which is the maximum size a utf8mb4 string column can be and still
    remain indexable.

    Finally, as a way of not being blocked on deployment of I8fdcc046d,
    this change adds a live hack that substitutes 'utf8mb4' for 'utf8' in
    database connection strings. The hack can be removed once I8fdcc046d
    is deployed.

    FIXME: Database setup instructions and minimum requirements should be
    documented.

    Change-Id: Ia94f2c2155de5fb9031a8164306720e06455cced
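
For anyone wondering where the 191 in the commit above comes from, here is a
rough sketch of the arithmetic. It assumes the 767-byte InnoDB index limit
quoted in the commit message and that MySQL reserves the worst-case number of
bytes per character (3 for 'utf8', 4 for 'utf8mb4'). The helper names are made
up for illustration and are not part of EventLogging.

```python
# InnoDB index limit cited in the commit message, in bytes.
INNODB_INDEX_LIMIT_BYTES = 767

# Assumed worst-case bytes reserved per character for each MySQL charset.
MAX_BYTES_PER_CHAR = {"utf8": 3, "utf8mb4": 4}


def max_indexable_chars(charset):
    """Largest string column length (in characters) that fits the limit."""
    return INNODB_INDEX_LIMIT_BYTES // MAX_BYTES_PER_CHAR[charset]


def utf8_len(text):
    """Number of bytes the text occupies when encoded as UTF-8."""
    return len(text.encode("utf-8"))


print(max_indexable_chars("utf8"))     # 255
print(max_indexable_chars("utf8mb4"))  # 191
print(utf8_len("\u20ac"))              # 3, euro sign, inside the BMP
print(utf8_len("\U0001F600"))          # 4, emoji, above the BMP
```

The last two lines show why the wider charset is needed at all: a character
above the Base Multilingual Plane takes four bytes in UTF-8, which does not
fit in the three bytes per character that 'utf8' allows.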

commit 041cb2c34c540dfea05886368edc5d6209102aed
Author: Ori Livneh <o...@wikimedia.org>
Date:   Sun Apr 28 15:13:26 2013 -0700

    ...and back to utf8 as default charset

    The version of MySQLdb that is packaged for Precise does not know about
    utf8mb4. I (inexcusably) tested against the dev branch of MySQLdb.

    Keeping the 191 limit to ease migration in the future.

    Change-Id: I807e1d3a6f192b13e36811af376806d6a92e122d
_______________________________________________
Analytics mailing list
Analytics@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/analytics
