dcausse added a comment.
Ꜵ being conflated with 🇦🇴 is a bug in the version of ICU4J we use; switching to ICU 68.1 (we currently use 4.8, from 2011) solves the problem. Other issues involving similar characters (⑫ vs ⓬) do indeed require switching the collation strength to //identical//, which will increase the key sizes by ~80%. It is hard to tell what the actual impact on the Blazegraph journal size will be. As discussed in https://github.com/blazegraph/database/issues/93, it does seem that query performance should not be affected too much.

The user impact is hard to evaluate as well: while it's clearly wrong and confusing when two terms are conflated, we have no idea how useful the conflation can be when the terms are not ambiguous. Some queries may be relying on it to find results. In P13502 <https://phabricator.wikimedia.org/P13502> I've listed (via brute-force search) the characters that would no longer be conflated using //identical//. This sadly does not take into account sequences (like emojis and the Angola flag), for which I don't have great ideas on how to evaluate the impact; this particular problem could well be very isolated.

Concerning the version of ICU we currently use, I believe that using //identical// will solve most of these problems, but we may well be affected by other bugs, especially when sorting. This probably deserves its own ticket and is more related to Blazegraph's tech debt.

To move this ticket forward, it seems clear that we can't enable this option on production machines without prior testing of sizes and of user impact. We don't have enough machines to run multiple tests at the same time, so we might have to either:
- wait for the planned tests (blank node removal with the streaming updater) to finish
- or do it at the same time.
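To illustrate what changing the collation strength means in practice, here is a minimal sketch. It uses the JDK's built-in java.text.Collator as a stand-in, not ICU4J and not Blazegraph's actual configuration, so both the conflation results and the key sizes it prints will differ from what Blazegraph produces. The class name CollationStrengthDemo and its helpers are hypothetical. Whether ⑫ and ⓬ compare equal at tertiary strength depends on the collation data in use, so the sketch only prints that result.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationStrengthDemo {

    /** Builds a root-locale collator with the given strength and no decomposition. */
    static Collator collator(int strength) {
        Collator c = Collator.getInstance(Locale.ROOT);
        c.setStrength(strength);
        // With NO_DECOMPOSITION, IDENTICAL strength falls back to a plain
        // code unit comparison when all weaker levels are equal.
        c.setDecomposition(Collator.NO_DECOMPOSITION);
        return c;
    }

    /** True when the two strings are treated as the same term at this strength. */
    static boolean conflated(String a, String b, int strength) {
        return collator(strength).compare(a, b) == 0;
    }

    /** Size in bytes of the sort key built for s at this strength. */
    static int keySize(String s, int strength) {
        return collator(strength).getCollationKey(s).toByteArray().length;
    }

    public static void main(String[] args) {
        String a = "\u246B"; // ⑫ CIRCLED NUMBER TWELVE
        String b = "\u24EC"; // ⓬ NEGATIVE CIRCLED NUMBER TWELVE

        System.out.println("tertiary conflates:  " + conflated(a, b, Collator.TERTIARY));
        System.out.println("identical conflates: " + conflated(a, b, Collator.IDENTICAL));

        // Key sizes grow with strength; the exact ratio is implementation-specific.
        String label = "Wikidata Query Service";
        System.out.println("tertiary key bytes:  " + keySize(label, Collator.TERTIARY));
        System.out.println("identical key bytes: " + keySize(label, Collator.IDENTICAL));
    }
}
```

Running the same comparison over a sample of real labels would give a rough idea of how many terms stop being conflated, although, as noted above, multi-character sequences need separate handling.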
I'll prepare some puppet patches in the meantime.

TASK DETAIL
https://phabricator.wikimedia.org/T233204