>
> Do you have examples of projects using NLP in Wikimedia communities?
>

I do! Defining NLP is something of a moving target. The most common
definition I learned when I worked in industry is that "NLP" is often just
a buzzword for "any language processing you do that your competitors
don't". Setting profit-driven buzzwords aside, I use a pretty generous
definition of NLP: any software that improves language-based interactions
between people and computers.

Guillaume mentioned CirrusSearch in general, but there are lots of specific
NLP-flavored parts within search. I work on a lot of NLP-type stuff for
search, and I write a lot of documentation on MediaWiki, so this list is
biased towards things I have worked on or know about.

Language analysis is the general process of converting text (say, the text
of Wikipedia articles) into tokens (approximately "words" in English) to be
stored in the search index. The language analysis can be done at many
different levels of complexity. We currently use Elasticsearch, which
provides a lot of language-specific analysis tools (Elastic language
analyzers
<https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-lang-analyzer.html>)
that we customize and build on.
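
For a quick feel for what an analyzer does, you can ask Elasticsearch to
run one over a sample string with the _analyze API. Here's a minimal
sketch, assuming a 7.x cluster at a placeholder localhost URL and the
elasticsearch-py 7.x client; it exercises the stock "english" analyzer
rather than our customized chain:

from elasticsearch import Elasticsearch

# Placeholder host; point this at any Elasticsearch 7.x instance.
es = Elasticsearch("http://localhost:9200")

resp = es.indices.analyze(
    body={"analyzer": "english", "text": "Hope is the thing with feathers"}
)
for tok in resp["tokens"]:
    print(tok["position"], tok["token"])
# Prints roughly: hope, thing, feather (lowercased, stemmed, stop words
# dropped, with gaps in the positions where the stop words used to be).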

Here is part of the analysis config for English, reordered to be
chronological rather than alphabetical, and annotated:

"text": {
    "type": "custom",
    "char_filter": [
        "word_break_helper", — break_up.words:with(uncommon)separators
        "kana_map" — map Japanese Hiragana to Katakana (notes
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Hiragana_to_Katakana_Mapping_for_English_and_Japanese>
)
    ],
    "tokenizer": "standard" — break text into tokens/words; not trivial for
English, very hard for other languages (blog post
<https://wikimediafoundation.org/news/2018/08/07/anatomy-search-token-affection/>
)
    "filter": [
        "aggressive_splitting", —splitting of more likely *multi-part*
*ComplexTokens*
        "homoglyph_norm", —correct typos/vandalization which mix Latin and
Cyrillic letters (notes
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Homoglyphs>)
        "possessive_english", —special processing for *English's*
possessive forms
        "icu_normalizer", —normalization of text (blog post
<https://wikimediafoundation.org/news/2018/09/13/anatomy-search-variation-under-nature/>
)
        "stop", —removal of stop words (blog post
<https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/>,
section "To be or not to be indexed")
        "icu_folding", —more aggressive normalization
        "remove_empty", —misc bookkeeping
        "kstem", —stemming (blog post
<https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/>
)
        "custom_stem" —more stemming
    ],
},
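
To make the order of operations concrete, here is a toy, pure-Python
imitation of the stages above (char filters, then tokenizer, then token
filters) run over a made-up phrase. The stop list and the fake stemmer are
drastically simplified stand-ins; the real work happens inside
Elasticsearch/Lucene:

import re
import unicodedata

STOP_WORDS = {"the", "of", "and", "a", "an", "is", "in", "to"}  # tiny stand-in

def toy_analyze(text: str) -> list[str]:
    # char filter stage: treat some uncommon separators as spaces
    # (in the spirit of word_break_helper)
    text = text.replace("_", " ").replace(".", " ")
    # tokenizer stage: split into word-ish chunks (the real "standard"
    # tokenizer is much smarter than this regex)
    tokens = re.findall(r"\w+(?:'\w+)?", text)
    out = []
    for tok in tokens:
        # possessive_english stand-in: strip a trailing 's
        tok = re.sub(r"'s$", "", tok)
        # icu_normalizer / icu_folding stand-in: lowercase, strip diacritics
        tok = unicodedata.normalize("NFKD", tok.lower())
        tok = "".join(c for c in tok if not unicodedata.combining(c))
        # stop word removal
        if tok in STOP_WORDS:
            continue
        # kstem stand-in: chop a plural -s (real stemmers are far more careful)
        if tok.endswith("s") and len(tok) > 3:
            tok = tok[:-1]
        if tok:
            out.append(tok)
    return out

print(toy_analyze("Amélie's encyclopedia_of the World's Cities"))
# ['amelie', 'encyclopedia', 'world', 'citie']  (note how crude the fake
# stemmer is; kstem would give "city")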

Tokenization, normalization, and stemming can vary wildly between
languages. Some other elements we use (from Elasticsearch or custom-built
by us):

   - Stemmers and stop words for specific languages, including some
   open-source ones that we ported, and some developed with community help.
   - Elision processing (*l'homme* == *homme*)
   - Normalization for digits (١ ٢ ٣ / १ २ ३ / ①②③ / 123); this and elision
   are sketched in the example after this list.
   - Custom lowercasing: Greek, Irish, and Turkish have special processing
   (notes
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language-Specific_Lowercasing_and_ICU_Normalization>)
   - Normalization of written Khmer (blog post
<https://techblog.wikimedia.org/2020/06/02/permuting-khmer-restructuring-khmer-syllables-for-search/>)
   - Notes on lots more
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Elasticsearch_Analysis_Chain_Analysis>
   ...
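
Here is a minimal sketch, in plain Python, of two of the items above:
French-style elision stripping and mapping other digit systems to ASCII
digits. The real versions are Elasticsearch token filters (e.g. "elision"
and "decimal_digit", plus custom mappings); this is just to make the
behavior concrete:

import unicodedata

ELISION_PREFIXES = ("l'", "d'", "j'", "qu'", "n'", "s'", "t'", "m'", "c'")

def strip_elision(token: str) -> str:
    # l'homme -> homme, qu'elle -> elle
    lowered = token.lower()
    for prefix in ELISION_PREFIXES:
        if lowered.startswith(prefix):
            return token[len(prefix):]
    return token

def normalize_digits(text: str) -> str:
    # Map any character Unicode considers a digit to its plain ASCII value:
    # ١٢٣ (Arabic-Indic) and १२३ (Devanagari) become 123; circled digits
    # like ① carry a digit value too, so they map over as well.
    return "".join(
        str(unicodedata.digit(ch)) if ch.isdigit() else ch for ch in text
    )

print(strip_elision("l'homme"))             # homme
print(normalize_digits("١٢٣ / १२३ / ①②③"))  # 123 / 123 / 123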

We also did some work improving "Did you mean" suggestions, which currently
come from two sources: the built-in suggestions from Elasticsearch (not
always great, but there are lots of them) and newer suggestions from a
module we call "Glent"
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Glent_%22Did_You_Mean%22_Suggestions>
(much better, but not as many suggestions).
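
One way to picture combining two sources like that is to let the
higher-precision source win whenever it has something to offer. This is a
hypothetical sketch, not the actual CirrusSearch logic, and the function
and parameter names are made up:

from typing import Optional

def did_you_mean(glent_suggestion: Optional[str],
                 es_suggestion: Optional[str]) -> Optional[str]:
    # Glent produces fewer but better suggestions, so prefer it when it
    # has one; otherwise fall back to the built-in suggester.
    return glent_suggestion or es_suggestion

print(did_you_mean("harry potter", "mary potter"))  # harry potter
print(did_you_mean(None, "beyonce"))                # beyonce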

We have some custom language detection available on some Wikipedias, so
that if you don't get very many results and your query looks like it is in
another language, we also show results from that other language. For
example, searching for Том Хэнкс on English Wikipedia
<https://en.wikipedia.org/w/index.php?search=%D0%A2%D0%BE%D0%BC+%D0%A5%D1%8D%D0%BD%D0%BA%D1%81&ns0=1>
will show results from Russian Wikipedia. (too many notes
<https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#TextCat,_Language_ID,_Etc.>)
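
The language identification there is based on TextCat, a character n-gram
approach (see the notes link above). Here is a toy guesser in the same
spirit; the two "profiles" below are tiny hand-made stand-ins for models
trained on real data, and the scoring is simplified to n-gram overlap:

from collections import Counter

def ngrams(text: str, n: int = 3) -> Counter:
    text = f" {text.lower()} "
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))

# Tiny samples standing in for real per-language training profiles.
PROFILES = {
    "en": ngrams("tom hanks is an american actor and filmmaker"),
    "ru": ngrams("том хэнкс американский актёр и продюсер"),
}

def guess_language(query: str) -> str:
    q = ngrams(query)
    # Score each language by how much its profile overlaps the query's
    # n-gram counts; the best-matching language wins.
    scores = {
        lang: sum(min(count, profile[gram]) for gram, count in q.items())
        for lang, profile in PROFILES.items()
    }
    return max(scores, key=scores.get)

print(guess_language("Том Хэнкс"))  # ru
print(guess_language("Tom Hanks"))  # en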

Outside of our search work, there are lots more projects. Some that come to mind:

   - Language Converter supports languages with multiple writing systems,
   which is sometimes easy and sometimes really hard; see the sketch after
   this list. (blog post
<https://diff.wikimedia.org/2018/03/12/supporting-languages-multiple-writing-systems/>)
   - There's a Wikidata gadget on French Wikipedia and others that appends
   results from Wikidata and generates descriptions in various languages
   based on the Wikidata information. For example, searching for Molenstraat
   Vught on French Wikipedia
<https://fr.wikipedia.org/w/index.php?search=Molenstraat+Vught&title=Sp%C3%A9cial:Recherche&profile=advanced&fulltext=1&ns0=1>
   gives no local results, but shows two "Results from Wikidata" /
   "Résultats sur Wikidata" entries (if you are logged in you get results
   in your preferred language, if possible; otherwise the language of the
   project):
      - Molenstraat ; hameau de la commune de Vught ["hamlet in the
      municipality of Vught"] (in French, when I'm not logged in)
      - Molenstraat ; street in Vught, the Netherlands (fallback to English
      for some reason)
   - The whole giant Content Translation project, which uses machine
   translation to help translate articles across wikis. (blog post
<https://wikimediafoundation.org/news/2019/01/09/you-can-now-use-google-translate-to-translate-articles-on-wikipedia/>)
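
For the "sometimes easy" end of Language Converter, here is a toy,
one-direction Serbian Cyrillic-to-Latin converter in the spirit of its rule
tables. (The real converter handles both directions, capitalization, and
plenty of exceptions, and variants like Chinese are far harder than this
nearly one-to-one case.) Illustrative only:

# Lowercase Serbian Cyrillic letters and their Latin equivalents.
CYR_TO_LAT = {
    "а": "a", "б": "b", "в": "v", "г": "g", "д": "d", "ђ": "đ", "е": "e",
    "ж": "ž", "з": "z", "и": "i", "ј": "j", "к": "k", "л": "l", "љ": "lj",
    "м": "m", "н": "n", "њ": "nj", "о": "o", "п": "p", "р": "r", "с": "s",
    "т": "t", "ћ": "ć", "у": "u", "ф": "f", "х": "h", "ц": "c", "ч": "č",
    "џ": "dž", "ш": "š",
}

def cyrillic_to_latin(text: str) -> str:
    # Character-by-character replacement; Serbian needs no context rules,
    # which is what makes it one of the easier cases.
    return "".join(CYR_TO_LAT.get(ch, ch) for ch in text.lower())

print(cyrillic_to_latin("Википедија"))  # vikipedija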

There's lots more out there, I'm sure—but I gotta run!
—Trey

Trey Jones
Staff Computational Linguist, Search Platform
Wikimedia Foundation
UTC–4 / EDT
