[Wikidata-bugs] [Maniphest] [Changed Subscribers] T150891: Find a good way to represent multi-lingual text fields in Elastic

2016-11-19 Thread daniel
daniel added subscribers: Jan_Dittrich, Lydia_Pintscher.
daniel added a comment.

In T150891#2802755, @dcausse wrote:

In T150891#2802255, @daniel wrote:
@dcausse I added use cases to the ticket description



Autocomplete: Looking at the current behavior, it seems that you display exact matches first and then prefix matches.



We actually do up to four queries at the moment, until we have found enough matches to fill the desired limit:


1. full length case insensitive match, user language only
2. full length case insensitive match, fallback languages
3. prefix match, user language only
4. prefix match, fallback languages
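
A minimal sketch of the cascade described above, not the actual implementation. The in-memory LABELS store, the entity ids, and the names find() and search() are all hypothetical; a real setup would run each stage as a separate query against the search backend.

```python
from typing import Dict, List, Tuple

# Hypothetical in-memory label store: entity id -> {language code: label}.
LABELS: Dict[str, Dict[str, str]] = {
    "Q1": {"en": "Lima", "fr": "Lima"},
    "Q2": {"en": "Lime", "fr": "Citron vert"},
    "Q3": {"en": "Book", "fr": "Livre"},
    "Q4": {"de": "Lima"},
}


def find(text: str, languages: List[str], prefix: bool) -> List[str]:
    """Return ids whose label in one of `languages` matches `text`.

    Matching is case insensitive; `prefix` switches between a prefix
    match and a full length match.
    """
    needle = text.casefold()
    hits = []
    for entity_id, labels in LABELS.items():
        for lang in languages:
            if lang not in labels:
                continue
            label = labels[lang].casefold()
            matched = label.startswith(needle) if prefix else label == needle
            if matched:
                hits.append(entity_id)
                break
    return hits


def search(text: str, user_lang: str, fallbacks: List[str], limit: int) -> List[str]:
    """Run up to four stages, stopping once `limit` matches are found."""
    stages: List[Tuple[List[str], bool]] = [
        ([user_lang], False),  # full length match, user language only
        (fallbacks, False),    # full length match, fallback languages
        ([user_lang], True),   # prefix match, user language only
        (fallbacks, True),     # prefix match, fallback languages
    ]
    results: List[str] = []
    for languages, prefix in stages:
        for entity_id in find(text, languages, prefix):
            if entity_id not in results:
                results.append(entity_id)
        if len(results) >= limit:
            break
    return results[:limit]
```

With this sample data, search("li", "fr", ["en"], 2) stops after the third stage, because the French prefix matches alone already fill the limit.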


We currently rank by a crude heuristic score: max( |sitelinks|, |labels| ).
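
As a rough illustration of that heuristic, assuming the sitelink and label counts are already available per candidate (the Candidate type and the function names are made up for this sketch):

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical search candidate with precomputed counts."""
    entity_id: str
    sitelink_count: int  # |sitelinks|
    label_count: int     # |labels|


def heuristic_score(candidate: Candidate) -> int:
    """Crude score: max( |sitelinks|, |labels| )."""
    return max(candidate.sitelink_count, candidate.label_count)


def rank(candidates: List[Candidate]) -> List[Candidate]:
    """Order candidates by descending score; ties keep their input order."""
    return sorted(candidates, key=heuristic_score, reverse=True)
```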

In addition to a prefix field you need an untokenized field in order to promote exact matches first.

Doesn't prefix also require untokenized?

Since prefix and fullmatch fields do not require fancy language features (no tokenization required), do you think it's still important to break by language?

Yes, we want to ignore, or at least strongly demote, languages that the user is not known to speak.

Breaking by language would only be needed for ranking: when two entities are ambiguous, always prefer the match that comes from a language field close to the user language.

Indeed
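
One way to picture "close to the user language" is the position of the matched language in the user's fallback chain. This is only a sketch under that assumption; the Match type and both function names are hypothetical.

```python
from typing import List, NamedTuple


class Match(NamedTuple):
    """Hypothetical match: which entity matched, and in which language."""
    entity_id: str
    language: str


def language_distance(language: str, user_lang: str, fallbacks: List[str]) -> int:
    """0 for the user language, 1..n for fallbacks, n+1 for anything else."""
    chain = [user_lang] + [code for code in fallbacks if code != user_lang]
    return chain.index(language) if language in chain else len(chain)


def prefer_close_languages(
    matches: List[Match], user_lang: str, fallbacks: List[str]
) -> List[Match]:
    """Stable sort: closer languages first, unknown languages demoted last."""
    return sorted(
        matches,
        key=lambda m: language_distance(m.language, user_lang, fallbacks),
    )
```

Languages outside the chain all share the worst distance, so they are demoted rather than dropped.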

It can become rather complex since we have two competing matches: assuming I'm French, would I prefer an exact match in English or a prefix match in French?

See the algorithm described above.

Do we have enough ambiguities to really care about that? Would a simple solution where we merge all languages into the same field be sufficient?

I do not think it would be sufficient. I think that the result would often get swamped with results that are irrelevant for the user, and worse, impossible to read and interpret, especially for short prefixes like "li".

However, I have no research to support this, and I don't know how we would conduct such research. It boils down to a product level UX choice, so this is something to ask @Lydia_Pintscher and @Jan_Dittrich about.

TASK DETAIL
https://phabricator.wikimedia.org/T150891

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: aude, daniel
Cc: Lydia_Pintscher, Jan_Dittrich, EBernhardson, dcausse, hoo, Ricordisamoa, aude, Deskana, StudiesWorld, Aklapper, Smalyshev, Tobi_WMDE_SW, thiemowmde, JanZerebecki, gerritbot, Jonas, daniel, EBjune, mschwarzer, Avner, debt, Gehel, D3r1ck01, FloNight, Izno, Wikidata-bugs, jayvdb, Mbch331, jeremyb

Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Changed Subscribers] T150891: Find a good way to represent multi-lingual text fields in Elastic

2016-11-16 Thread Smalyshev
Smalyshev added subscribers: dcausse, EBernhardson.
Smalyshev added a comment.

I'd ask @dcausse and @EBernhardson to weigh in on this.

TASK DETAIL
https://phabricator.wikimedia.org/T150891

EMAIL PREFERENCES
https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: aude, Smalyshev
Cc: EBernhardson, dcausse, hoo, Ricordisamoa, aude, Deskana, StudiesWorld, Aklapper, Smalyshev, Tobi_WMDE_SW, thiemowmde, JanZerebecki, gerritbot, Jonas, daniel, EBjune, mschwarzer, Avner, debt, Gehel, D3r1ck01, FloNight, Izno, Wikidata-bugs, jayvdb, Mbch331, jeremyb