I thought MediaWiki, by default, stored data as binary blobs, rather than
something of a particular encoding?

On May 2, 2017 at 10:11:38 AM, Mark Clements (HappyDog) (gm...@kennel17.co.uk) wrote:

Hi all,

I seem to recall that a long, long time ago MediaWiki was using UTF-8
internally but storing the data in 'latin1' fields in MySQL.
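
[For context, the reason this trick can work at all: latin1 assigns a character to every one of the 256 byte values, so raw UTF-8 bytes pass through a latin1-declared column unchanged, provided the connection charset also claims latin1 and MySQL never transcodes. A minimal Python sketch of that round trip - the variable names are mine, not from any MediaWiki code:]

```python
# UTF-8 text containing non-ASCII characters
s = "naïve café"

# The application sends raw UTF-8 bytes to the database.
utf8_bytes = s.encode("utf-8")

# The latin1 column "sees" each byte as a latin1 character (mojibake),
# but every byte value is representable, so nothing is lost.
as_stored = utf8_bytes.decode("latin-1")

# Reading back through the same latin1 connection returns the original
# bytes, which the application decodes as UTF-8 again.
recovered = as_stored.encode("latin-1").decode("utf-8")
assert recovered == s
```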

I notice that there is now the option to use either 'utf8' or 'binary'
columns (via the $wgDBmysql5 setting), and the default appears to be
'binary'.[1]

I've come across an old project which followed MediaWiki's lead (literally -
it cites MediaWiki as the reason) and stores its UTF-8 data in latin1 tables.
I need to upgrade it to a more modern data infrastructure, but I'm hesitant
to simply switch to 'utf8' without understanding the reasons for this
initial implementation decision.
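
[One concrete migration hazard worth noting here: if the columns are simply converted to a utf8 character set in place, MySQL transcodes the stored bytes as if they really were latin1, double-encoding the data. A sketch of that failure mode, assuming the stored bytes are genuine UTF-8:]

```python
s = "café"
raw = s.encode("utf-8")  # the bytes actually sitting in the latin1 column

# An in-place charset conversion treats those bytes as latin1 text and
# re-encodes them to UTF-8, producing double-encoded mojibake:
double_encoded = raw.decode("latin-1").encode("utf-8")
assert double_encoded.decode("utf-8") == "cafÃ©"  # corrupted

# The commonly recommended route is to convert the column to a binary type
# first and then to the target charset, so the bytes are reinterpreted
# rather than transcoded.
```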

Can anyone confirm that MediaWiki used to behave in this manner, and if so
why?

If it was due to MySQL bugs, does anyone know in what version these were
fixed?

Finally, is current best practice to use 'binary' or 'utf8' columns for
this? Why does MediaWiki make this configurable?

I have a very good understanding of character encodings and have no problems
performing whatever migrations are necessary - and the code itself is fully
UTF-8 compliant except for the database layer - but I'm just trying to
understand the design choices (or technical limitations) that resulted in
MediaWiki handling character encodings in this manner.

- Mark Clements (HappyDog)

[1] https://www.mediawiki.org/wiki/Manual:$wgDBmysql5



_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l