dcausse added a comment.

  Ꜵ being conflated with 🇦🇴 is a bug in the version of ICU4J we use; switching 
to ICU 68.1 (we currently use 4.8, from 2011) solves the problem.
  
  Other issues related to similar chars (⑫ vs ⓬) do indeed require switching 
the collation strength to //identical//, which will increase the key sizes by 
~80%. It is hard to tell what the actual impact on the blazegraph journal size 
will be. As discussed in https://github.com/blazegraph/database/issues/93 it 
does seem that query perf should not be affected too much.
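
  To make the strength discussion concrete, here is a minimal sketch that 
compares two strings and measures collation key length at a given strength. It 
is not what blazegraph runs: it uses the JDK's `java.text.Collator` with the 
root locale as a dependency-free stand-in for ICU4J, and the class and method 
names are made up for illustration. Results at //tertiary// depend on the 
collator's rule set, so the JDK output may well differ from ICU 4.8 or 68.1.

```java
import java.text.Collator;
import java.util.Locale;

public class CollationStrengthDemo {

    // Build a collator at the requested strength.
    // Assumption: JDK collator + root locale, standing in for ICU4J.
    static Collator collator(int strength) {
        Collator c = Collator.getInstance(Locale.ROOT);
        c.setStrength(strength);
        return c;
    }

    // Two terms are "conflated" when the collator treats them as equal.
    static boolean conflated(int strength, String a, String b) {
        return collator(strength).compare(a, b) == 0;
    }

    // Length in bytes of the collation key produced for s.
    static int keyLength(int strength, String s) {
        return collator(strength).getCollationKey(s).toByteArray().length;
    }

    public static void main(String[] args) {
        String circled = "\u246B";  // CIRCLED NUMBER TWELVE
        String negative = "\u24EC"; // NEGATIVE CIRCLED NUMBER TWELVE

        System.out.println("conflated at tertiary:  "
                + conflated(Collator.TERTIARY, circled, negative));
        System.out.println("conflated at identical: "
                + conflated(Collator.IDENTICAL, circled, negative));

        String sample = "Wikidata Query Service";
        System.out.println("key bytes at tertiary:  "
                + keyLength(Collator.TERTIARY, sample));
        System.out.println("key bytes at identical: "
                + keyLength(Collator.IDENTICAL, sample));
    }
}
```

  Running the same comparison against the ICU4J version actually deployed 
would be needed to get numbers that are relevant for the journal size.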
  The user impact is hard to evaluate as well. While it's clearly wrong and 
confusing when two terms are conflated, we have no idea how useful the 
conflation is when the terms are not ambiguous. There are perhaps queries that 
rely on it to find results.
  In P13502 <https://phabricator.wikimedia.org/P13502> I've listed (via 
brute-force search) the characters that would no longer be conflated using 
//identical//. This sadly does not take into account sequences (like emojis and 
the Angola flag), for which I don't have great ideas on how to evaluate the 
impact; this particular problem could well be very isolated.
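
  For reference, a brute-force scan of this kind can be sketched as follows. 
This is not the code behind P13502, only an illustration under assumptions: it 
uses the JDK's `java.text.Collator` (root locale) rather than ICU4J, the names 
`ConflationScan` and `conflatedGroups` are hypothetical, and it only looks at 
single code points in a given range, so sequences are still not covered. 
Characters are grouped by their collation key; any group with more than one 
member is conflated at that strength.

```java
import java.nio.ByteBuffer;
import java.text.Collator;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

public class ConflationScan {

    // Group the code points in [from, to] by collation key and return only
    // the groups with at least two members, i.e. the conflated characters.
    static List<List<Integer>> conflatedGroups(int strength, int from, int to) {
        // Assumption: JDK collator + root locale, standing in for ICU4J.
        Collator collator = Collator.getInstance(Locale.ROOT);
        collator.setStrength(strength);

        Map<ByteBuffer, List<Integer>> byKey = new LinkedHashMap<>();
        for (int cp = from; cp <= to; cp++) {
            // Skip unassigned code points and lone surrogates.
            if (!Character.isDefined(cp) || (cp >= 0xD800 && cp <= 0xDFFF)) {
                continue;
            }
            String s = new String(Character.toChars(cp));
            byte[] key = collator.getCollationKey(s).toByteArray();
            // ByteBuffer compares by content, so it is usable as a map key.
            byKey.computeIfAbsent(ByteBuffer.wrap(key), k -> new ArrayList<>()).add(cp);
        }

        List<List<Integer>> groups = new ArrayList<>();
        for (List<Integer> group : byKey.values()) {
            if (group.size() > 1) {
                groups.add(group);
            }
        }
        return groups;
    }

    public static void main(String[] args) {
        // Enclosed Alphanumerics block, which contains the circled numbers.
        for (List<Integer> group : conflatedGroups(Collator.TERTIARY, 0x2460, 0x24FF)) {
            StringBuilder line = new StringBuilder();
            for (int cp : group) {
                line.append(String.format("U+%04X ", cp));
            }
            System.out.println(line.toString().trim());
        }
    }
}
```

  Comparing the groups found at //tertiary// with those found at 
//identical// gives the set of characters that would stop being conflated.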
  
  Concerning the version of ICU we currently use, I believe that using 
//identical// will solve most of these problems, but we will probably still be 
affected by other bugs, especially when sorting. This probably deserves its own 
ticket and is more related to blazegraph's tech debt.
  
  To move this ticket forward, it seems clear that we can't enable this option 
on production machines without first testing both the size increase and the 
user impact.
  We don't have enough machines to run multiple tests at the same time and we 
might have to either:
  
  - wait for the planned tests (blank node removal with the streaming updater) 
to finish
  - or do it at the same time.
  
  I'll prepare some puppet patches in the meantime.

TASK DETAIL
  https://phabricator.wikimedia.org/T233204

To: dcausse
Cc: Unjoanqualsevol, Nikki, CamelCaseNick, Smalyshev, Aklapper, 
Lucas_Werkmeister_WMDE, Igorkim78, Gehel, Lea_Lacroix_WMDE, CBogen, Akuckartz, 
Nandana, Namenlos314, Lahi, Gq86, GoranSMilovanovic, QZanden, EBjune, merbst, 
LawExplorer, _jensen, rosalieper, Scott_WUaS, Jonas, Xmlizer, jkroll, 
Wikidata-bugs, Jdouglas, aude, Tobias1984, Manybubbles, Mbch331