dcausse added a subscriber: dcausse.
dcausse added a comment.

//First of all, sorry for all the low-level details in this comment, but 
relevance issues like this one are always complex to tackle.//

I assume that `life` is the query.

Wikidata already uses `incoming_link` to boost the top-N results (8196 docs per 
shard).

The way cirrus scores documents for wikidata is:

1. The lucene score (applied to all docs). NOTE: when I talk about top-N docs 
below, this is according to this ranking.
2. The phrase rescore: if the query has more than 1 word, the doc is heavily 
boosted if it contains the same sequence of adjacent words. Only the top-N 
docs are analyzed (N=512 per shard here because it's very costly). This does 
not apply here because the query is one word.
3. Special:Search on wikidata is configured to query 2 namespaces (0 and 120). 
The boost for ns 0 is 0.05 and for ns 120 is 0.2 (top-8196 docs per shard 
analyzed). I assume this is not related to our problem because there are only 
10 properties 
<https://www.wikidata.org/w/index.php?title=Special%3ASearch&profile=advanced&search=life&fulltext=Search&ns120=1&profile=advanced>
 related to //life//.
4. The number of incoming links (top-8196 docs per shard analyzed).

A small note on the lucene score:
Lucene scores docs using a tf.idf formula, which also includes a normalization 
based on document size. Large documents tend to be ranked lower. This is 
understandable: large docs may have higher term frequencies and thus higher 
raw tf.idf scores, and normalizing on size helps to mitigate this problem.
Why does it affect wikidata?
Because we flatten all the data into the same field, a wikibase entity with a 
lot of labels in many different languages (likely for high-profile items) will 
be larger than //less important items// and thus have a lower lucene score.
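
To make the length norm effect concrete, here is a minimal sketch loosely modelled on Lucene's classic tf.idf similarity (tf = sqrt(freq), idf = 1 + ln(N / (df + 1)), norm = 1 / sqrt(field length)). It is a simplification: query normalization, coordination and norm encoding are left out, and the function name and all the numbers are made up for illustration, they are not taken from the wikidata index.

```python
import math

def simplified_tfidf(term_freq, field_length, num_docs, doc_freq):
    """Simplified single-term score in the style of classic tf.idf.

    Hypothetical helper: it ignores query normalization, coordination
    and the lossy norm encoding used by a real index.
    """
    tf = math.sqrt(term_freq)
    idf = 1.0 + math.log(num_docs / (doc_freq + 1.0))
    length_norm = 1.0 / math.sqrt(field_length)
    return tf * idf * idf * length_norm

# Illustrative numbers only: same term frequency, but the second doc
# is 10 times larger (e.g. labels in many languages in a single field).
small_doc = simplified_tfidf(term_freq=10, field_length=200,
                             num_docs=1_000_000, doc_freq=5_000)
large_doc = simplified_tfidf(term_freq=10, field_length=2_000,
                             num_docs=1_000_000, doc_freq=5_000)

print(small_doc > large_doc)  # the larger doc gets the lower score
```

With everything else equal, multiplying the field length by 10 divides the score by sqrt(10) in this sketch.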

Because of the current cirrus<->wikidata mapping problems we're trying to 
address (everything is in the same field, so no boosts on title/redirects can 
be applied), it's very likely that the incoming_link boost will take precedence 
over the lucene score. From what I see, life has a low number of incoming_link 
<https://www.wikidata.org/wiki/Q3?action=cirrusDump> (53) compared to 
Encyclopedia of Life <https://www.wikidata.org/wiki/Q82486?action=cirrusDump>, 
which has //1 081 079// incoming links.
On the other hand, the third result has only 32 incoming_links 
<https://www.wikidata.org/wiki/Q752241?action=cirrusDump>.

Why does Q3 have a bad lucene score?
Let's compare Q3 (ranked ~700) and Q752241 (ranked 4):

- Q3 lucene score is 0.5476983
- Q752241 lucene score is 0.85728467

This is because there are only 10 occurrences of the word life in the content 
for Q3 versus 64 for Q752241, and Q3 is larger (length norm effect).

The boost on incoming links is:

- Q3: should be something like log(2+53) but it's 0.69897 <- **completely 
wrong**
  - it looks like it's log(2+3)
- Q279744: should be something like log(2+32) and it's 1.5314789, which is 
correct.
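
The numbers above can be checked with a few lines of Python. This is only a sketch: the function name is hypothetical, and the base-10 logarithm is an assumption inferred from the two boost values quoted above (they match log10 exactly), not something read from the cirrus code.

```python
import math

def incoming_links_boost(incoming_links):
    """Hypothetical helper: boost = log10(2 + incoming_links).

    Base 10 is an assumption inferred from the values quoted above.
    """
    return math.log10(2 + incoming_links)

print(round(incoming_links_boost(3), 5))   # value actually observed for Q3
print(round(incoming_links_boost(32), 7))  # value observed for the 32-link doc
print(round(incoming_links_boost(53), 5))  # what Q3 would get with 53 links
```

Under that assumption Q3 would get a boost of about 1.74 with 53 incoming links instead of 0.69897.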

So it looks like the problem is that the number of incoming links stored in 
elasticsearch does not reflect the actual number.
This is normal in certain conditions: we have an optimization to not update 
docs too frequently, so if the number of incoming links does not change by 
more than 20% we ignore the update.
But here it's way more than 20%: it's a 1700% difference...
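
As a rough sketch of that optimization (the function name, the signature and the choice of measuring the change relative to the stored value are all assumptions of mine, not the actual cirrus code):

```python
def should_update_link_count(stored, actual, threshold=0.2):
    """Hypothetical check: update only if the count moved by more than
    `threshold`, measured relative to the stored value."""
    if stored == 0:
        # Avoid dividing by zero: any new link is a change worth storing.
        return actual != 0
    relative_change = abs(actual - stored) / stored
    return relative_change > threshold

print(should_update_link_count(100, 110))  # 10% change: update skipped
print(should_update_link_count(3, 53))     # ~1667% change: update expected
```

With 3 stored and 53 actual links the relative change is 50/3, roughly 1700%, so under this reading the update should not have been skipped.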

I'm not sure what happened here...

Would it be possible to update Q3 to force a re-index of this entity and see if 
it fixes the issue?
If yes, then we will certainly have to write a maintenance script to check 
incoming_link consistency.

Side note: as you can see, the lucene score is rather bad for Q3, so scoring is 
very fragile on wikidata. This cannot be addressed without all the work planned 
for a better cirrus<->wikidata integration.


TASK DETAIL
  https://phabricator.wikimedia.org/T110648


To: dcausse
Cc: dcausse, Deskana, daniel, Mbch331, Aklapper, Lydia_Pintscher, 
Wikidata-bugs, aude, Gryllida, jeremyb


