[Wikidata-bugs] [Maniphest] [Commented On] T119066: Add sitelink count to search index for Wikidata

2015-12-17 Thread gerritbot
gerritbot added a comment.

Change 256023 merged by jenkins-bot:
Introduce hook handlers for CirrusSearch

https://gerrit.wikimedia.org/r/256023


TASK DETAIL
  https://phabricator.wikimedia.org/T119066

EMAIL PREFERENCES
  https://phabricator.wikimedia.org/settings/panel/emailpreferences/

To: aude, gerritbot
Cc: gerritbot, thiemowmde, daniel, aude, Aklapper, Wikidata-bugs, Mbch331



___
Wikidata-bugs mailing list
Wikidata-bugs@lists.wikimedia.org
https://lists.wikimedia.org/mailman/listinfo/wikidata-bugs


[Wikidata-bugs] [Maniphest] [Commented On] T119066: Add sitelink count to search index for Wikidata

2015-12-08 Thread aude
aude added a comment.

@thiemowmde sorry, the patch was not linked to the task. Now it is linked.


To: aude
Cc: gerritbot, thiemowmde, daniel, aude, Aklapper, Wikidata-bugs, Mbch331





[Wikidata-bugs] [Maniphest] [Commented On] T119066: Add sitelink count to search index for Wikidata

2015-12-08 Thread gerritbot
gerritbot added a subscriber: gerritbot.
gerritbot added a comment.

Change 256023 had a related patch set uploaded (by Aude):
Introduce hook handlers for CirrusSearch

https://gerrit.wikimedia.org/r/256023


To: aude, gerritbot
Cc: gerritbot, thiemowmde, daniel, aude, Aklapper, Wikidata-bugs, Mbch331





[Wikidata-bugs] [Maniphest] [Commented On] T119066: Add sitelink count to search index for Wikidata

2015-12-08 Thread thiemowmde
thiemowmde added a subscriber: thiemowmde.
thiemowmde added a comment.

Why is this on review? What should we review here?


To: aude, thiemowmde
Cc: thiemowmde, daniel, aude, Aklapper, Wikidata-bugs, Mbch331





[Wikidata-bugs] [Maniphest] [Commented On] T119066: Add sitelink count to search index for Wikidata

2015-11-25 Thread daniel
daniel added a comment.

In https://phabricator.wikimedia.org/T119066#1824919, @aude wrote:

> @daniel if you would like "encyclopedia of life" to be the first result for 
> searching "life", then incoming links alone might be good for scoring
>
> life (Q3) has 56 incoming links
>
> encyclopedia of life (Q82486) has 1365362 incoming links


Ah, right... we'd want to consider only links from main snaks, not from 
references (not sure about qualifiers). That would need some work...
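
A minimal sketch of what "only links from main snaks" could mean in practice. The dictionaries below loosely follow the shape of Wikidata's JSON entity format (claims, mainsnak, references), but they are simplified, and `item_snak`, `linked_items` and `count_incoming_links` are hypothetical helpers, not CirrusSearch or Wikibase code. The two sample items are made up. Qualifiers are left out because the thread leaves that question open.

```python
from collections import Counter


def item_snak(property_id, item_id):
    """Build a minimal value snak pointing at an item (toy helper)."""
    return {
        "snaktype": "value",
        "property": property_id,
        "datavalue": {"type": "wikibase-entityid", "value": {"id": item_id}},
    }


def linked_items(snak):
    """Return the item IDs a snak links to (empty for non-item snaks)."""
    datavalue = snak.get("datavalue", {})
    if snak.get("snaktype") != "value" or datavalue.get("type") != "wikibase-entityid":
        return []
    return [datavalue["value"]["id"]]


def count_incoming_links(entities, include_references=False):
    """Count links per target item, optionally including reference snaks."""
    counts = Counter()
    for entity in entities:
        for statements in entity.get("claims", {}).values():
            for statement in statements:
                counts.update(linked_items(statement["mainsnak"]))
                if not include_references:
                    continue
                for reference in statement.get("references", []):
                    for snaks in reference.get("snaks", {}).values():
                        for snak in snaks:
                            counts.update(linked_items(snak))
    return counts


# Made-up items: each links to Q3 in a main snak and cites Q82486 in a reference.
entities = [
    {"claims": {"P31": [{
        "mainsnak": item_snak("P31", "Q3"),
        "references": [{"snaks": {"P248": [item_snak("P248", "Q82486")]}}],
    }]}}
    for _ in range(2)
]

main_only = count_incoming_links(entities)
with_references = count_incoming_links(entities, include_references=True)
print(main_only["Q3"], main_only["Q82486"])
print(with_references["Q3"], with_references["Q82486"])
```

With references excluded, the item that is only ever cited as a source gets no incoming links in this toy data, while the item linked from main snaks keeps its count.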

> I'm not sure that *not* doing tf/idf is the solution, but we can investigate.

Term frequency doesn't seem to be a good indicator in our use case.

> The way we munge all the different terms in all the languages together in one 
> field is probably not ideal for tf/idf.  "life" is probably translated 
> differently in most languages whereas "Half Life" (Q752241) is generally not 
> translated yet has labels in lots of languages, so "life" is especially 
> frequent.  If we could consider just english when searching in english, then 
> "Half Life" probably is not boosted as much compared to "life".

Yes, this should be per language.

> As well, things like exact title matches don't really work currently for 
> Wikidata. Ideally, we would consider exact label matches in the search 
> language and exact matches would get a boost.

Indeed.

> I think considering other attributes (e.g. # of site links, # of statements, 
> etc) of the document to boost scoring could help. This would not replace 
> considering incoming links but just be additional consideration in scoring. 
> It already works okayish enough in the entity selector. Once we put these in, 
> then we can try different rescorings to see what works well.  If this turns 
> out to be a bad idea, then we can remove the custom rescoring config for 
> wikidata and do as we do now.

Number of sitelinks or statements can help. I'd like to avoid getting too many 
parameters into the mix, though. If we can, let's find one or two indicators 
that work well. If there are too many factors, things tend to become 
unpredictable.

My objection to sitelinks was based on the assumption that we already have 
something better (incoming links), so why invest time into the sitelinks stuff. 
But as you point out, the raw number of incoming links includes links from 
references, and can thus be misleading.


To: aude, daniel
Cc: daniel, aude, Aklapper, Wikidata-bugs, Mbch331





[Wikidata-bugs] [Maniphest] [Commented On] T119066: Add sitelink count to search index for Wikidata

2015-11-23 Thread aude
aude added a comment.

@daniel if you would like "encyclopedia of life" to be the first result for 
searching "life", then incoming links alone might be good for scoring

life (Q3) has 56 incoming links

encyclopedia of life (Q82486) has 1365362 incoming links

I'm not sure that *not* doing tf/idf is the solution, but we can investigate. 
The way we munge all the different terms in all the languages together in one 
field is probably not ideal for tf/idf.  "life" is probably translated 
differently in most languages whereas "Half Life" (Q752241) is generally not 
translated yet has labels in lots of languages, so "life" is especially 
frequent.  If we could consider just english when searching in english, then 
"Half Life" probably is not boosted as much compared to "life".
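
To illustrate the point about munging languages together, here is a rough sketch that counts how often "life" occurs when all labels of an item are treated as one field, compared with looking at the English label only. The label sets are illustrative (four languages, not the real label lists), and `munged_tf` and `per_language_tf` are hypothetical names, not part of Cirrus.

```python
def term_frequency(term, labels):
    """Count whole-word occurrences of term across a list of labels."""
    term = term.lower()
    return sum(label.lower().split().count(term) for label in labels)


def munged_tf(term, labels_by_language):
    """Term frequency when every language's label lands in one field."""
    return term_frequency(term, labels_by_language.values())


def per_language_tf(term, labels_by_language, language):
    """Term frequency when only the search language's label is considered."""
    label = labels_by_language.get(language)
    return term_frequency(term, [label] if label else [])


# Illustrative labels: "life" is translated, "Half Life" is not.
life_labels = {"en": "life", "de": "Leben", "fr": "vie", "es": "vida"}
half_life_labels = {lang: "Half Life" for lang in ("en", "de", "fr", "es")}

print(munged_tf("life", life_labels), munged_tf("life", half_life_labels))
print(per_language_tf("life", life_labels, "en"),
      per_language_tf("life", half_life_labels, "en"))
```

In the munged field the untranslated title repeats the term once per language, so its term frequency grows with the number of labels; restricted to English, both items match once.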

I think considering other attributes (e.g. # of site links, # of statements, 
etc) of the document to boost scoring could help. It already works okayish 
enough in the entity selector. Once we put these in, then we can try different 
rescorings to see what works well.
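
One possible shape for such a rescoring, as a sketch only: multiply the text score by a boost that grows logarithmically with the sitelink and statement counts. The formula, the weights, the `rescore` function and all numbers below are assumptions made up for illustration; none of them come from the CirrusSearch configuration or from real item data.

```python
import math


def rescore(text_score, sitelink_count, statement_count,
            sitelink_weight=1.0, statement_weight=0.5):
    """Boost a text score by log-scaled sitelink and statement counts."""
    boost = (1.0
             + sitelink_weight * math.log1p(sitelink_count)
             + statement_weight * math.log1p(statement_count))
    return text_score * boost


# Made-up candidates for the query "life"; scores and counts are not real data.
candidates = [
    {"id": "Q752241", "text_score": 1.2, "sitelinks": 40, "statements": 30},
    {"id": "Q3", "text_score": 1.0, "sitelinks": 200, "statements": 40},
]

by_text = sorted(candidates, key=lambda c: c["text_score"], reverse=True)
by_rescore = sorted(
    candidates,
    key=lambda c: rescore(c["text_score"], c["sitelinks"], c["statements"]),
    reverse=True,
)
print([c["id"] for c in by_text])
print([c["id"] for c in by_rescore])
```

The logarithm keeps an item with a very large count from swamping the text score entirely, and setting both weights to zero falls back to the plain text score, which makes it easy to try different rescorings.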


To: aude
Cc: daniel, aude, Aklapper, Wikidata-bugs, Mbch331





[Wikidata-bugs] [Maniphest] [Commented On] T119066: Add sitelink count to search index for Wikidata

2015-11-23 Thread daniel
daniel added a subscriber: daniel.
daniel added a comment.

Using the sitelink count for scoring was intended to be a workaround. Cirrus 
already has the number of incoming links ("in-degree") for each item, which it 
uses for scoring per default. Why is that not good enough for our case?

The main problem with the current scoring seems to be that Cirrus uses tf/idf 
scoring. The "tf" bit ("term frequency", the number of times the search term 
occurs in the document) should not be used for Wikidata items; it's not a good 
indicator of relevance. The "idf" bit is intended to reduce the impact of 
irrelevant (too common) terms in the search string, which is useless for 
single word (or prefix) searches.
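
A small sketch of why idf adds nothing for a single-term query: idf depends only on the term and the corpus, so every matching document is multiplied by the same factor and the ranking is decided by tf alone. This uses a textbook tf-idf formula, not the Lucene similarity that Cirrus actually runs, and the token lists are made up.

```python
import math


def idf(term, documents):
    """Textbook inverse document frequency: log(N / number of matching docs)."""
    matching = sum(1 for doc in documents if term in doc)
    return math.log(len(documents) / matching) if matching else 0.0


def tf_idf_scores(term, documents):
    """Score each document for a single-term query."""
    weight = idf(term, documents)  # identical for every document
    return [doc.count(term) * weight for doc in documents]


def ranking(scores):
    """Document indices ordered by descending score (stable for ties)."""
    return sorted(range(len(scores)), key=lambda i: -scores[i])


# Made-up documents, each a list of tokens.
documents = [
    ["life"],
    ["half", "life", "half", "life", "half", "life"],
    ["encyclopedia", "of", "life"],
    ["universe"],
]

tf_only = [doc.count("life") for doc in documents]
tf_idf = tf_idf_scores("life", documents)
print(ranking(tf_only))
print(ranking(tf_idf))
```

Both rankings come out the same, and the document that merely repeats the term most often ends up on top.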

If we want to improve scoring, we should make sure that in-degree is used, and 
tf/idf is not used.


To: aude, daniel
Cc: daniel, aude, Aklapper, Wikidata-bugs, Mbch331


