On Tue, May 2, 2017 at 8:05 PM, Jaime Crespo <jcre...@wikimedia.org> wrote:
> On Tue, May 2, 2017 at 9:24 PM, Brian Wolff <bawo...@gmail.com> wrote:
>
>> .
>> >
>> > On the latest discussions, there are proposals to increase the minimum
>> > mediawiki requirements to MySQL/MariaDB 5.5 and allow binary or utf8mb4
>> > (not utf8, 3 byte utf8), https://phabricator.wikimedia.org/T161232.
>> Utf8mb4
>> > should be enough for most uses (utf8 will not allow for emojis, for
>> > example), although I am not up to date with the latest unicode standard
>> > changes and MySQL features supporting them.
>> >
>>
>> I dont know about mysql, but in unicode emojis are like any other astral
>> character, and utf-8 can encode them in 4 bytes*.
>>
>
> I am sorry I wasn't clear before, MySQL's utf8 IS NOT international
> standard generally known as UTF-8, it is a bastardization of 3-byte max
> UTF-8. MySQL's utf8mb4 is UTF-8:
>
> Proof:
>
> ```
> mysql> use test
> Database changed
> mysql> CREATE TABLE test (a char(1) CHARSET utf8, b char(1) CHARSET
> utf8mb4, c binary(4));
> Query OK, 0 rows affected (0.02 sec)
>
> mysql> SET NAMES utf8mb4;
> Query OK, 0 rows affected (0.00 sec)
>
> mysql> insert into test VALUES ('\U+1F4A9', '\U+1F4A9', '\U+1F4A9');
> Query OK, 1 row affected, 1 warning (0.00 sec)
>
> mysql> SHOW WARNINGS;
> +---------+------+--------------------------------------------------------------------+
> | Level   | Code | Message
>            |
> +---------+------+--------------------------------------------------------------------+
> | Warning | 1366 | Incorrect string value: '\xF0\x9F\x92\xA9' for column
> 'a' at row 1 |
> +---------+------+--------------------------------------------------------------------+
> 1 row in set (0.01 sec)
>
> mysql> SELECT * FROM test;
> +------+------+------+
> | a    | b    | c    |
> +------+------+------+
> | ?    | 💩     | 💩     | -- you will need an emoji-compatible font here
> +------+------+------+
> 1 row in set (0.00 sec)
>
> mysql> SELECT hex(a), hex(b), hex(c) FROM test;
> +--------+----------+----------+
> | hex(a) | hex(b)   | hex(c)   |
> +--------+----------+----------+
> | 3F     | F09F92A9 | F09F92A9 |
> +--------+----------+----------+
> 1 row in set (0.00 sec)
> ```
>
> To avoid truncations:
>
> ```
> mysql> set sql_mode='TRADITIONAL'; --
> https://phabricator.wikimedia.org/T108255
> Query OK, 0 rows affected (0.00 sec)
>
> mysql> insert into test VALUES ('\U+1F4A9', '\U+1F4A9', '\U+1F4A9');
> ERROR 1366 (22007): Incorrect string value: '\xF0\x9F\x92\xA9' for column
> 'a' at row 1
> ```
>
> More info at:
> https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8.html vs.
> https://dev.mysql.com/doc/refman/5.7/en/charset-unicode-utf8mb4.html
>
>
> --
> Jaime Crespo
> <http://wikimedia.org>
> _______________________________________________
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Oh my bad, I had misread your previous email. I thought you were
talking about emoiji's in utf8mb4.

--
Brian

_______________________________________________
Wikitech-l mailing list
Wikitech-l@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikitech-l

Reply via email to