[Wikidata-bugs] [Maniphest] [Created] T199833: wb_terms contains invalid UTF-8 data

Smalyshev Tue, 17 Jul 2018 12:52:10 -0700

Smalyshev created this task.
Smalyshev added a project: Wikidata.
Herald added a subscriber: Aklapper.

TASK DESCRIPTION

The wb_terms table contains terms (e.g. labels, descriptions, etc.) for Wikidata items. The length of these terms is limited by the table definition:

| term_text           | varbinary(255)      | NO   | MUL | NULL    |                |

However, nothing ensures that when longer data is cut off at this limit, the result is valid UTF-8. For example, running:

select * from wb_terms where term_full_entity_id='Q1102' and term_language='kn' and term_type='description';

returns this:

|  2349274243 |              0 | Q1102               | item             | kn            | description | ಪ್ಲುಟೋನಿಯಮ್ ಎಂಬುದು ಪೂ ಮತ್ತು ಅಣುಗಳ ಸಂಖ್ಯೆ 94 ಅನ್ನು ಹೊಂದಿರುವ ಟ್ರಾನ್ಸ್ಯುರಾನಿಕ್ ವಿಕಿರಣಶೀಲ ರಾಸಾಯನಿ?                                                                                                                                                                   |                 |           0 |

Note the ? at the end of the text: it is there because the text was cut off in the middle of a character, so the trailing bytes now form an invalid UTF-8 sequence. I think Wikibase should cut terms in a way that does not produce invalid sequences; otherwise, other code that uses this table may get all kinds of weird errors from functions that assume valid UTF-8.
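
To illustrate the kind of truncation I have in mind, here is a minimal sketch in Python (not Wikibase's actual code, which is PHP). The function name truncate_utf8 and the default limit of 255 bytes are assumptions taken from the varbinary(255) column above. It backs the cut point up until it no longer falls inside a multi-byte character:

```python
def truncate_utf8(text: str, max_bytes: int = 255) -> bytes:
    """Encode text as UTF-8 and cut it to at most max_bytes
    without splitting a multi-byte character.

    Hypothetical helper for illustration only; 255 mirrors the
    varbinary(255) definition of term_text.
    """
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return data

    cut = max_bytes
    # UTF-8 continuation bytes look like 0b10xxxxxx. If the first byte
    # we are about to drop is a continuation byte, the cut falls inside
    # a character, so move back until the cut sits on a lead byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


if __name__ == "__main__":
    # One ASCII byte followed by 3-byte Kannada letters, so that a
    # naive cut at 255 bytes lands in the middle of a character.
    sample = "a" + "ಪ" * 100
    safe = truncate_utf8(sample)
    print(len(safe), safe.decode("utf-8") == sample[:85])
```

This only guarantees that the stored bytes are valid UTF-8. It may still cut between a base letter and a following combining sign, which matters for scripts like Kannada; handling that would need grapheme-aware truncation.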

TASK DETAIL

https://phabricator.wikimedia.org/T199833

EMAIL PREFERENCES

https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: Smalyshev
Cc: Aklapper, Smalyshev, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, Wikidata-bugs, aude, Mbch331

_______________________________________________
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
