Smalyshev created this task.
Smalyshev added a project: Wikidata.
Herald added a subscriber: Aklapper.

TASK DESCRIPTION

The wb_terms table contains terms (labels, descriptions, etc.) for Wikidata items. The length of these terms is limited by the table definition:

| term_text           | varbinary(255)      | NO   | MUL | NULL    |                |

However, nothing ensures that longer data is cut off at a character boundary, so the stored result may not be valid UTF-8. For example, running:

select * from wb_terms where term_full_entity_id='Q1102' and term_language='kn' and term_type='description';

returns this:

|  2349274243 |              0 | Q1102               | item             | kn            | description | ಪ್ಲುಟೋನಿಯಮ್ ಎಂಬುದು ಪೂ ಮತ್ತು ಅಣುಗಳ ಸಂಖ್ಯೆ 94 ಅನ್ನು ಹೊಂದಿರುವ ಟ್ರಾನ್ಸ್ಯುರಾನಿಕ್ ವಿಕಿರಣಶೀಲ ರಾಸಾಯನಿ?                                                                                                                                                                   |                 |           0 |

Note the ? at the end of the text: the value was cut off in the middle of a character, so it now ends in an invalid UTF-8 sequence. I think Wikibase should truncate terms in a way that does not produce invalid sequences; otherwise, other code that uses this table may get all kinds of weird errors from functions that assume valid UTF-8.
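
To illustrate the kind of truncation meant here, below is a minimal sketch in Python (not Wikibase's actual PHP code). It assumes the limit is the 255 bytes of the varbinary(255) column. The function names truncate_utf8 and is_valid_utf8 are hypothetical and used only for this example. The idea is to step the cut point back while the first dropped byte is a UTF-8 continuation byte (bit pattern 10xxxxxx), so the cut always lands on a character boundary.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes cleanly as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def truncate_utf8(text: str, max_bytes: int = 255) -> bytes:
    """Encode text as UTF-8, cut to at most max_bytes, never splitting a character.

    max_bytes defaults to 255 on the assumption that the target column
    is varbinary(255), as in the table definition quoted above.
    """
    if max_bytes < 0:
        raise ValueError("max_bytes must be non-negative")
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return data
    cut = max_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte (0b10xxxxxx), the cut falls inside a multi-byte
    # character, so move back to the start of that character.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut]


# 84 Kannada letters (3 bytes each) + 2 ASCII bytes + 1 more Kannada letter
# = 257 bytes, so a plain cut at byte 255 lands inside the last letter.
sample = "\u0c95" * 84 + "ab" + "\u0c95"
naive = sample.encode("utf-8")[:255]
safe = truncate_utf8(sample, 255)
print(is_valid_utf8(naive), is_valid_utf8(safe), len(safe))
```

This only guarantees valid UTF-8. It cuts on code point boundaries, so it may still separate a base letter from a following combining sign, which matters for scripts such as Kannada if the visible text should also end cleanly.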


TASK DETAIL
https://phabricator.wikimedia.org/T199833

To: Smalyshev
Cc: Aklapper, Smalyshev, Lahi, Gq86, GoranSMilovanovic, QZanden, LawExplorer, Wikidata-bugs, aude, Mbch331