[Wikidata-bugs] [Maniphest] [Commented On] T119066: Add sitelink count to search index for Wikidata
gerritbot added a comment. Change 256023 merged by jenkins-bot: Introduce hook handlers for CirrusSearch https://gerrit.wikimedia.org/r/256023 TASK DETAIL https://phabricator.wikimedia.org/T119066 EMAIL PREFERENCES https://phabricator.wikimedia.org/settings/panel/emailpreferences/ To: aude, gerritbot Cc: gerritbot, thiemowmde, daniel, aude, Aklapper, Wikidata-bugs, Mbch331 ___ Wikidata-bugs mailing list Wikidata-bugs@lists.wikimedia.org https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs
[Wikidata-bugs] [Maniphest] [Commented On] T119066: Add sitelink count to search index for Wikidata
aude added a comment. @thiemowmde sorry, the patch was not linked to the task. now it is linked
[Wikidata-bugs] [Maniphest] [Commented On] T119066: Add sitelink count to search index for Wikidata
gerritbot added a subscriber: gerritbot. gerritbot added a comment. Change 256023 had a related patch set uploaded (by Aude): Introduce hook handlers for CirrusSearch https://gerrit.wikimedia.org/r/256023
[Wikidata-bugs] [Maniphest] [Commented On] T119066: Add sitelink count to search index for Wikidata
thiemowmde added a subscriber: thiemowmde. thiemowmde added a comment. Why is this on review? What should we review here?
[Wikidata-bugs] [Maniphest] [Commented On] T119066: Add sitelink count to search index for Wikidata
daniel added a comment.

In https://phabricator.wikimedia.org/T119066#1824919, @aude wrote:

> @daniel if you would like "encyclopedia of life" to be the first result for
> searching "life", then incoming links alone might be good for scoring
>
> life (Q3) has 56 incoming links
>
> encyclopedia of life (Q82486) has 1365362 incoming links

Ah, right... we'd want to consider only links from main snaks, not from references (not sure about qualifiers). That would need some work...

> I'm not sure that *not* doing tf/idf is the solution, but we can investigate.

Term frequency doesn't seem to be a good indicator in our use case.

> The way we munge all the different terms in all the languages together in one
> field is probably not ideal for tf/idf. "life" is probably translated
> differently in most languages whereas "Half Life" (Q752241) is generally not
> translated yet has labels in lots of languages, so "life" is especially
> frequent. If we could consider just english when searching in english, then
> "Half Life" probably is not boosted as much compared to "life".

Yes, this should be per language.

> As well, things like exact title matches don't really work currently for
> Wikidata. Ideally, we would consider exact label matches in the search
> language and exact matches would get a boost.

Indeed.

> I think considering other attributes (e.g. # of site links, # of statements,
> etc) of the document to boost scoring could help. This would not replace
> considering incoming links but just be additional consideration in scoring.
> It already works okayish enough in the entity selector. Once we put these in,
> then we can try different rescorings to see what works well. If this turns
> out to be a bad idea, then we can remove the custom rescoring config for
> wikidata and do as we do now.

Number of sitelinks or statements can help. I'd like to avoid getting too many parameters into the mix, though. If we can, let's find one or two indicators that work well. If there are too many factors, things tend to become unpredictable.

My objection to sitelinks was based on the assumption that we already have something better (incoming links), so why invest time into the sitelinks stuff. But as you point out, the raw number of incoming links includes links from references, and can thus be misleading.
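The idea of counting only links from main snaks can be illustrated with a minimal Python sketch. It assumes a simplified entity layout loosely modelled on the Wikibase JSON format ("claims", "mainsnak", "references"); the helper names and the example entities X1 and X2 are hypothetical, and this is not how CirrusSearch or Wikibase actually compute incoming links.

```python
from collections import Counter


def item_snak(item_id):
    """Build a simplified value snak pointing at another item (hypothetical helper)."""
    return {
        "snaktype": "value",
        "datavalue": {"type": "wikibase-entityid", "value": {"id": item_id}},
    }


def main_snak_targets(entity):
    """Yield the item IDs linked from the main snaks of an entity's statements.

    Qualifiers and references are deliberately ignored.
    """
    for statements in entity.get("claims", {}).values():
        for statement in statements:
            snak = statement.get("mainsnak", {})
            if snak.get("snaktype") != "value":
                continue  # "somevalue" / "novalue" snaks link to nothing
            datavalue = snak.get("datavalue", {})
            if datavalue.get("type") == "wikibase-entityid":
                yield datavalue["value"]["id"]


def count_main_snak_links(entities):
    """Count, per target item, how many entities link to it from a main snak."""
    counts = Counter()
    for entity in entities:
        # A linking entity is counted once per target, however many statements it has.
        for target in set(main_snak_targets(entity)):
            counts[target] += 1
    return counts


def example_entity(entity_id):
    """Made-up entity: links to Q3 in a main snak, cites Q82486 in a reference."""
    return {
        "id": entity_id,
        "claims": {
            "P31": [
                {
                    "mainsnak": item_snak("Q3"),
                    "references": [{"snaks": {"P248": [item_snak("Q82486")]}}],
                }
            ]
        },
    }


entities = [example_entity("X1"), example_entity("X2")]
counts = count_main_snak_links(entities)
print(counts["Q3"], counts["Q82486"])
```

In this toy data the item that is only cited in references (Q82486) receives no incoming links, while the item referenced from main snaks (Q3) is counted once per linking entity.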
[Wikidata-bugs] [Maniphest] [Commented On] T119066: Add sitelink count to search index for Wikidata
aude added a comment.

@daniel if you would like "encyclopedia of life" to be the first result for searching "life", then incoming links alone might be good for scoring

life (Q3) has 56 incoming links

encyclopedia of life (Q82486) has 1365362 incoming links

I'm not sure that *not* doing tf/idf is the solution, but we can investigate.

The way we munge all the different terms in all the languages together in one field is probably not ideal for tf/idf. "life" is probably translated differently in most languages whereas "Half Life" (Q752241) is generally not translated yet has labels in lots of languages, so "life" is especially frequent. If we could consider just english when searching in english, then "Half Life" probably is not boosted as much compared to "life".

I think considering other attributes (e.g. # of site links, # of statements, etc) of the document to boost scoring could help. It already works okayish enough in the entity selector. Once we put these in, then we can try different rescorings to see what works well.
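Trying "different rescorings" could look roughly like the following Python sketch. The incoming-link counts are the ones quoted in the comment above; the sitelink counts are made-up placeholder values, not real Wikidata figures, and the `rescore` formula, its weights, and the equal text score are assumptions for illustration rather than the CirrusSearch rescoring configuration.

```python
import math


def rescore(text_score, incoming_links, sitelinks, w_links, w_sitelinks):
    """Multiply a text score by a boost built from log-dampened counts.

    log1p keeps very large counts (such as 1365362 incoming links) from
    completely dominating the result.
    """
    boost = (
        1.0
        + w_links * math.log1p(incoming_links)
        + w_sitelinks * math.log1p(sitelinks)
    )
    return text_score * boost


candidates = {
    # incoming_links: from the thread; sitelinks: HYPOTHETICAL placeholders
    "Q3": {"incoming_links": 56, "sitelinks": 150},
    "Q82486": {"incoming_links": 1365362, "sitelinks": 20},
}


def rank(w_links, w_sitelinks, text_score=1.0):
    """Return candidate IDs ordered by rescored value, best first."""
    scored = {
        qid: rescore(
            text_score, c["incoming_links"], c["sitelinks"], w_links, w_sitelinks
        )
        for qid, c in candidates.items()
    }
    return sorted(scored, key=scored.get, reverse=True)


links_only = rank(w_links=1.0, w_sitelinks=0.0)
sitelinks_only = rank(w_links=0.0, w_sitelinks=1.0)
print(links_only)
print(sitelinks_only)
```

With incoming links as the only signal, "encyclopedia of life" (Q82486) ranks first; with the placeholder sitelink counts as the only signal, "life" (Q3) does. Intermediate weights can be tried the same way.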
[Wikidata-bugs] [Maniphest] [Commented On] T119066: Add sitelink count to search index for Wikidata
daniel added a subscriber: daniel. daniel added a comment.

Using the sitelink count for scoring was intended to be a workaround. Cirrus already has the number of incoming links ("in-degree") for each item, which it uses for scoring by default. Why is that not good enough for our case?

The main problem with the current scoring seems to be that Cirrus uses tf/idf scoring. The "tf" bit ("term frequency", the number of times the search term occurs in the document) should not be used for Wikidata items; it's not a good indicator of relevance. The "idf" bit is intended to reduce the impact of irrelevant (too common) terms in the search string, which is useless for single-word (or prefix) searches.

If we want to improve scoring, we should make sure that in-degree is used, and tf/idf is not used.
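The tf/idf concern can be illustrated with a small Python sketch. It uses a simplified textbook tf·idf (raw term count times a smoothed inverse document frequency), not the actual Lucene or CirrusSearch formula, and the label lists are toy data standing in for "all labels in all languages in one field".

```python
import math

# Toy documents: every label of an item, in all languages, in one bag of tokens.
docs = {
    "Q3": ["life", "vie", "leben", "vida"],  # label translated per language
    "Q752241": ["half", "life", "half", "life", "half", "life"],  # untranslated
}


def tf(term, tokens):
    """Term frequency: how often the term occurs in the document."""
    return tokens.count(term)


def idf(term, corpus):
    """Smoothed inverse document frequency; one value per term, not per document."""
    df = sum(1 for tokens in corpus.values() if term in tokens)
    return math.log(len(corpus) / df) + 1.0 if df else 0.0


def tfidf_ranking(term, corpus):
    """Rank the documents containing the term by tf * idf, best first."""
    term_idf = idf(term, corpus)  # identical factor for every matching document
    scores = {
        doc_id: tf(term, tokens) * term_idf
        for doc_id, tokens in corpus.items()
        if term in tokens
    }
    return sorted(scores, key=scores.get, reverse=True)


ranking = tfidf_ranking("life", docs)
print(ranking)
```

For a single-term query the idf factor is the same for every matching document, so it cannot change the order; the ranking is decided by tf alone, which here puts the item whose untranslated label repeats "life" ahead of the item "life" itself.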