Re: [Wikitech-l] Indexing structures for Wikidata

2013-03-08 Thread bawolff
On Thu, Mar 7, 2013 at 12:50 PM, Denny Vrandečić wrote:
> As you probably know, the search in Wikidata sucks big time.
>
> Until we have created a proper Solr-based search and deployed on that
> infrastructure, we would like to implement and set up a reasonable stopgap
> solution.
>
> The simplest and most obvious approach to sorting the items would be to
> 1) make a prefix search
> 2) weight each result by the number of Wikipedias it links to
>
> This should usually provide the item you are looking for. Currently, the
> search order is random. Good luck with finding items like California,
> Wellington, or Berlin.
>
> Now, what I want to ask is: what would be the appropriate index structure
> for that table? The data is saved in the wb_terms table, which would need
> to be extended by a "weight" field. There is already a suggestion (based on
> discussions between Tim and Daniel K, if I understood correctly) to change
> the wb_terms table index structure (see
> https://bugzilla.wikimedia.org/show_bug.cgi?id=45529), but since we are
> changing the index structure anyway, it would be great to get it right this
> time.
>
> Anyone who can jump in? (Looking especially at Asher and Tim)
>
> Any help would be appreciated.
>
> Cheers,
> Denny
>
> --
> Project director Wikidata
> Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
> Tel. +49-30-219 158 26-0 | http://wikimedia.de
>
> Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
> Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
> der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
> Körperschaften I Berlin, Steuernummer 27/681/51985.
> ___
> Wikitech-l mailing list
> Wikitech-l@lists.wikimedia.org
> https://lists.wikimedia.org/mailman/listinfo/wikitech-l

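AFAIK SQL isn't particularly good at indexing that type of query: an
index on the term can narrow things down to a prefix, but the matching
rows come back in term order, so ordering them by weight still means
sorting every match.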

You could maybe have a bunch of indexes for the first couple of letters
of a term, and then after some point hope that things are narrowed
down enough that just doing a prefix search is acceptable. For
example, you might have indexes on (wb_term(1), wb_weight),
(wb_term(2), wb_weight), ..., (wb_term(7), wb_weight) and one on just
wb_term. That way (I believe) you would be able to do efficient
searches for a prefix ordered by weight, provided the prefix is at
most 7 characters long. (7 was chosen arbitrarily out of a hat. From
what I understand, write performance goes down as you add more
indexes, and I'm not sure how far you could take this scheme before
that becomes an issue. You could maybe enhance this by only updating
the search suggestions for every 2 characters the user enters, or
something.)
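
A minimal sketch of the scheme above, untested against MySQL. It uses
Python's sqlite3 module, and since SQLite has no wb_term(n) prefix-index
syntax, each (wb_term(n), wb_weight) index is approximated by an
expression index on substr(wb_term, 1, n). The two-column table layout,
the helper names build_db and search, and the sample weights are all
assumptions made for illustration, not the real wb_terms schema.

```python
import sqlite3

MAX_PREFIX = 7  # same arbitrary cut-off as in the text above

# Invented sample data: (term, weight). The weights are made up.
SAMPLE_ROWS = [
    ("Berlin", 250),
    ("Berlin, New Hampshire", 12),
    ("Bern", 180),
    ("Berkeley", 90),
    ("California", 240),
]


def build_db(rows):
    """Create an in-memory table with one (prefix, weight) index per prefix length."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE wb_terms (wb_term TEXT NOT NULL, wb_weight INTEGER NOT NULL)"
    )
    for n in range(1, MAX_PREFIX + 1):
        # Expression index standing in for MySQL's (wb_term(n), wb_weight).
        # Needs SQLite 3.9.0 or newer.
        conn.execute(
            f"CREATE INDEX terms_prefix_{n} "
            f"ON wb_terms (substr(wb_term, 1, {n}), wb_weight)"
        )
    # Plain index on the full term, used for prefixes longer than MAX_PREFIX.
    conn.execute("CREATE INDEX terms_full ON wb_terms (wb_term)")
    conn.executemany("INSERT INTO wb_terms VALUES (?, ?)", rows)
    return conn


def search(conn, prefix, limit=10):
    """Return up to `limit` (term, weight) rows matching `prefix`, heaviest first."""
    n = len(prefix)
    if n == 0:
        return []
    if n <= MAX_PREFIX:
        # Equality on the indexed expression, so rows can be read in weight order.
        sql = (
            "SELECT wb_term, wb_weight FROM wb_terms "
            f"WHERE substr(wb_term, 1, {n}) = ? "
            "ORDER BY wb_weight DESC LIMIT ?"
        )
        return conn.execute(sql, (prefix, limit)).fetchall()
    # Longer prefix: range scan on the plain index, then sort whatever matched.
    sql = (
        "SELECT wb_term, wb_weight FROM wb_terms "
        "WHERE wb_term >= ? AND wb_term < ? "
        "ORDER BY wb_weight DESC LIMIT ?"
    )
    return conn.execute(sql, (prefix, prefix + "\U0010ffff", limit)).fetchall()
```

With the sample rows, search(build_db(SAMPLE_ROWS), "Ber") lists Berlin
first, then Bern, Berkeley and Berlin, New Hampshire. Whether MySQL would
actually use a prefix index for the ORDER BY is a separate question that
this sketch does not answer.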

--bawolff

P.S. I have not tested this, and I'm talking a bit outside my knowledge
area, so YMMV.


[Wikitech-l] Indexing structures for Wikidata

2013-03-07 Thread Denny Vrandečić
As you probably know, the search in Wikidata sucks big time.

Until we have created a proper Solr-based search and deployed on that
infrastructure, we would like to implement and set up a reasonable stopgap
solution.

The simplest and most obvious approach to sorting the items would be to
1) make a prefix search
2) weight each result by the number of Wikipedias it links to

This should usually provide the item you are looking for. Currently, the
search order is random. Good luck with finding items like California,
Wellington, or Berlin.
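
To make the proposed ordering concrete, here is a small Python sketch,
assuming the weight is simply the count of distinct Wikipedias an item
links to. The function name rank_by_sitelinks, the dict layout, and the
sample site lists are invented for illustration (real labels are not
unique keys); nothing here reflects the actual Wikibase code.

```python
from typing import Dict, List, Tuple

# Invented sample data: label -> Wikipedias the item links to.
SAMPLE_ITEMS: Dict[str, List[str]] = {
    "Berlin": ["dewiki", "enwiki", "frwiki"],
    "Berlin, New Hampshire": ["enwiki"],
    "Bern": ["dewiki", "enwiki"],
    "California": ["enwiki", "eswiki"],
}


def rank_by_sitelinks(
    items: Dict[str, List[str]], prefix: str, limit: int = 10
) -> List[Tuple[str, int]]:
    """Return (label, weight) pairs whose label starts with `prefix`, heaviest first."""
    wanted = prefix.lower()
    matches = [
        (label, len(set(wikis)))  # weight = number of distinct Wikipedias linked
        for label, wikis in items.items()
        if label.lower().startswith(wanted)  # step 1: case-insensitive prefix match
    ]
    # Step 2: descending weight; ties are broken alphabetically.
    matches.sort(key=lambda pair: (-pair[1], pair[0]))
    return matches[:limit]
```

For the prefix "ber" this puts Berlin ahead of Bern and of Berlin, New
Hampshire, which is the behaviour the weighting is meant to produce.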

Now, what I want to ask is: what would be the appropriate index structure
for that table? The data is saved in the wb_terms table, which would need
to be extended by a "weight" field. There is already a suggestion (based on
discussions between Tim and Daniel K, if I understood correctly) to change
the wb_terms table index structure (see
https://bugzilla.wikimedia.org/show_bug.cgi?id=45529), but since we are
changing the index structure anyway, it would be great to get it right this
time.

Anyone who can jump in? (Looking especially at Asher and Tim)

Any help would be appreciated.

Cheers,
Denny

-- 
Project director Wikidata
Wikimedia Deutschland e.V. | Obentrautstr. 72 | 10963 Berlin
Tel. +49-30-219 158 26-0 | http://wikimedia.de

Wikimedia Deutschland - Gesellschaft zur Förderung Freien Wissens e.V.
Eingetragen im Vereinsregister des Amtsgerichts Berlin-Charlottenburg unter
der Nummer 23855 B. Als gemeinnützig anerkannt durch das Finanzamt für
Körperschaften I Berlin, Steuernummer 27/681/51985.