[Wikitech-l] Re: Word embeddings / vector search

Isaac Johnson Tue, 09 May 2023 13:10:10 -0700

+1 to the suggestion to connect with the Search team. Also a few more
thoughts about vector / natural-language search and its relevance to
Wikimedia from my perspective in Research:


   - The common critique of lexical / keyword-based search, and the reason
   folks point to vector / embedding-based search, is that it handles
   natural-language queries poorly (e.g., "What are the different objectives
   of the United Nations Sustainable Development Goals?" vs. "UN SDG"). The
   former has a lot of words in it that lead to keyword overlap with
   less-relevant pages, so keyword-based search doesn't do as well. The
   latter is much more direct and even matches an existing redirect on
   Wikipedia to the article on UN Sustainable Development Goals, so our
   existing keyword-based search handles it very well.
   - Most existing users of Wikimedia's search are probably doing something
   closer to the latter above -- i.e. using fairly exact keywords to navigate
   to a specific page (or check that it exists). This is backed up by the
   data: 80% of searches on Wikipedia are auto-completed directly to article
   pages
   <https://upload.wikimedia.org/wikipedia/commons/8/87/Understanding_Search_Behavior_in_Wikipedia_-_Report_-_Bruno_Scarone.pdf#page=7>.
   In that sense, the system is working quite well! The Search team has also
   added quite a bit of normalization into the pipeline (see
   https://diff.wikimedia.org/2023/04/28/language-harmony-and-unpacking-a-year-in-the-life-of-a-search-nerd/
   for a fun overview). As for more complicated natural-language queries, my
   sense is that folks are probably running those in external search
   engines, which have huge teams/infrastructure to support this, and then
   clicking through to Wikipedia.
   - That said, there are probably use-cases where natural-language search
   would be more valuable: for example, within new interaction domains such
   as chat-bots, or for new editors / developers who don't yet know the exact
   terminology to search for but want to do generic things like get access to
   Toolforge or find out how to add a link to a page. I've been putting
   together an example of this for Wikitech for the upcoming Hackathon
   (details <https://phabricator.wikimedia.org/T333853>), and others have
   proposed something similar for Project pages to help editors find answers
   to questions about editing (details
   <https://phabricator.wikimedia.org/T335013>).
   - Finally, there's a second, related aspect to this which is the size
   and diversity of a given document. Within the Wikipedia article namespace,
   documents are generally about a single, constrained topic. So the fact that
   lexical search systems like Elasticsearch operate at the document-level is
   a very good fit -- i.e. index all the keywords for a given article
   together. When thinking about other namespaces like Project/Help pages or
   Wikitech documentation, a single page can be much larger and cover far
   more diverse topics. This makes it harder to find good keyword overlap,
   because the search would often ideally find one very specific paragraph
   in a much larger document about many other things. Vector search doesn't
   directly solve this, but in practice folks tend to learn embeddings for
   passages smaller than an entire doc -- e.g., sections or even paragraphs
   within a section. For that reason alone, I suspect vector
   search will do better for namespaces outside of the article namespace on
   Wikipedia. Whether it's worth the cost is a separate question as it also
   introduces substantial new challenges in keeping the embeddings up-to-date
   :)
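
A minimal sketch of the passage-level idea in the last bullet, under loud
assumptions: the page text, the query, and every function name here are
hypothetical, and a bag-of-words count vector stands in for a learned
embedding (so it will not capture synonyms the way a real embedding model
might). It only illustrates why scoring paragraphs separately can surface a
specific passage that gets diluted when the whole page is scored as one
document.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def split_passages(page_text):
    """Split a page into paragraph-level passages on blank lines."""
    return [p.strip() for p in re.split(r"\n\s*\n", page_text) if p.strip()]


def toy_vector(text):
    """Stand-in for an embedding: a sparse bag-of-words count vector."""
    return Counter(tokenize(text))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(count * b[token] for token, count in a.items() if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def best_passage(query, page_text):
    """Return (score, index, passage) for the passage most similar to the query."""
    query_vec = toy_vector(query)
    scored = [
        (cosine(query_vec, toy_vector(passage)), index, passage)
        for index, passage in enumerate(split_passages(page_text))
    ]
    return max(scored, key=lambda item: item[0])


# Hypothetical documentation page covering several unrelated topics.
PAGE = """Deployment trains run weekly and are coordinated by release engineering.

To get access to Toolforge, create a developer account and request membership.

Logs for production services are stored centrally and rotated every ninety days."""

QUERY = "how do I get access to Toolforge"

passage_score, passage_index, passage_text = best_passage(QUERY, PAGE)
page_score = cosine(toy_vector(QUERY), toy_vector(PAGE))
```

Here the Toolforge paragraph scores higher on its own than the page does as
a whole, because the unrelated paragraphs add length without adding overlap.
In a real system each passage vector would also have to be recomputed when
its section is edited, which is the maintenance cost mentioned above.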

Hope that helps.

Best,
Isaac

On Tue, May 9, 2023 at 2:10 PM Dan Andreescu <dandree...@wikimedia.org>
wrote:

> I encourage you to reach out to the search team, they're lovely folks and
> even better engineers.
>
> On Tue, May 9, 2023 at 1:53 PM Lars Aronsson <l...@aronsson.se> wrote:
>
>> On 2023-05-09 09:27, Thiemo Kreuz wrote:
>> > I'm curious what the actual question is. The basic concepts have been
>> > studied for about 60 years, and have been in use for about 20 to 30 years.
>>
>> Sorry to hear that you're so negative. It's quite obvious that this is not
>> currently used in Wikipedia, but is presented everywhere as a novelty
>> that has not been around for 20 or 30 years.
>>
>> > https://www.elastic.co/de/blog/introducing-approximate-nearest-neighbor-search-in-elasticsearch-8-0
>> > https://en.wikipedia.org/wiki/Special:Version
>> > https://meta.wikimedia.org/wiki/Wikimedia_Foundation_Annual_Plan/2023-2024/Draft/Product_%26_Technology#Objectives
>> > https://wikitech.wikimedia.org/wiki/Search_Platform/Contact#Office_Hours
>>
>>
>> Thanks! This answers my question. It's particularly interesting to read
>> the talk page to the plan. Part of the problem is that "word embedding"
>> and "vector search" are not mentioned there, but a vector search could
>> have found the "ML-enabled natural language search" that is mentioned.
>> If and when this is tried, we will need to evaluate how well it works for
>> various languages.
>>
>>
>> --
>>    Lars Aronsson (l...@aronsson.se, user:LA2)
>>    Linköping, Sweden
>>
>> _______________________________________________
>> Wikitech-l mailing list -- wikitech-l@lists.wikimedia.org
>> To unsubscribe send an email to wikitech-l-le...@lists.wikimedia.org
>>
>> https://lists.wikimedia.org/postorius/lists/wikitech-l.lists.wikimedia.org/



-- 
Isaac Johnson (he/him/his) -- Senior Research Scientist -- Wikimedia
Foundation

