Hello Ilario,

You might find this blog post I wrote a while back interesting
https://blog.kensho.com/announcing-the-kensho-derived-wikimedia-dataset-5d1197d72bcf

In it you can find a brief (and definitely not comprehensive) review of NLP
with Wiki* along with links to an open-source Kaggle dataset I built
connecting the plain text of Wikipedia, the anchor links between pages, and
the links to Wikidata. There are a few notebooks that demonstrate its use
... my favorites are probably:

* Pointwise Mutual Information embeddings (see the sketch after this list)
https://www.kaggle.com/code/kenshoresearch/kdwd-pmi-word-vectors
* Analyzing the "subclass of" graph from Wikidata
https://www.kaggle.com/code/gabrielaltay/kdwd-subclass-path-ner
* Explicit topic modeling
https://www.kaggle.com/code/kenshoresearch/kdwd-explicit-topic-models
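
If the PMI notebook sounds abstract, the core idea fits in a few lines of
numpy. Here's a toy sketch (not the notebook's actual code, and on a made-up
two-sentence corpus): count co-occurrences in a small window, convert the
counts to positive PMI, and take an SVD to get small dense word vectors.

# Toy PMI-embedding sketch (illustration only, not the Kaggle notebook's code).
import numpy as np
from collections import Counter

corpus = [
    ["anarchism", "is", "a", "political", "philosophy"],
    ["philosophy", "is", "an", "academic", "discipline"],
]

window = 2
pair_counts, word_counts = Counter(), Counter()
for tokens in corpus:
    for i, w in enumerate(tokens):
        word_counts[w] += 1
        for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
            if j != i:
                pair_counts[(w, tokens[j])] += 1

vocab = sorted(word_counts)
idx = {w: k for k, w in enumerate(vocab)}
n_pairs = sum(pair_counts.values())
n_words = sum(word_counts.values())

# positive PMI: max(0, log[ P(w1, w2) / (P(w1) * P(w2)) ])
ppmi = np.zeros((len(vocab), len(vocab)))
for (w1, w2), c in pair_counts.items():
    pmi = np.log((c / n_pairs) /
                 ((word_counts[w1] / n_words) * (word_counts[w2] / n_words)))
    ppmi[idx[w1], idx[w2]] = max(0.0, pmi)

# dense word vectors = truncated SVD of the PPMI matrix
u, s, _ = np.linalg.svd(ppmi)
dim = 3  # tiny here; the notebook works through the same recipe at Wikipedia scale
word_vectors = u[:, :dim] * s[:dim]
print(word_vectors.shape)  # (8, 3) -- one small vector per vocabulary word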

and if you are still looking for more after that, this is the query that
gives more every time you use it :)

https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=wikipedia&terms-0-field=abstract&terms-1-operator=OR&terms-1-term=wikidata&terms-1-field=abstract&classification-computer_science=y&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first

best,
-G



On Thu, Jun 23, 2022 at 4:17 PM Isaac Johnson <is...@wikimedia.org> wrote:

> Chiming in as a member of the Wikimedia Foundation Research team
> <https://research.wikimedia.org/> (which likely biases the examples I'm
> aware of). I'd say that the most common type of NLP that shows up in our
> applications is tokenization / language analysis -- i.e., splitting
> wikitext into words/sentences. As Trey said, this tokenization is
> non-trivial for English and gets much harder in other languages that have
> more complex constructions / don't use spaces to delimit words
> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Spaceless_Writing_Systems_and_Wiki-Projects>.
> These tokens often then become inputs into other types of models that
> aren't necessarily NLP. There are a number of more complex NLP technologies
> too that don't just identify words but try to identify similarities between
> them, translate them, etc.
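>
> To make that concrete: naive whitespace splitting, which is roughly what
> folks imagine tokenization to be, already stumbles on English punctuation
> and gives you nothing at all for spaceless scripts. A made-up snippet in
> Python (not code from any of the tools below):
>
> # Whitespace splitting looks fine for simple English...
> print("The quick brown fox jumps.".split())
> # ['The', 'quick', 'brown', 'fox', 'jumps.']  <- punctuation sticks to the word
> # ...and does nothing useful for spaceless writing systems like Japanese:
> print("ウィキペディアは百科事典です".split())
> # ['ウィキペディアは百科事典です']  <- the whole sentence comes back as one token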
>
> Some examples below. Additionally, I indicated whether each application
> was rule-based (follows a series of deterministic heuristics) or ML
> (learned, probabilistic model) in case that's of interest:
>
>    - Copyedit
>    
> <https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task>:
>    identifying potential grammar/spelling issues in articles (rule-based). I
>    believe there are a number of volunteer-run bots on the wikis as well as
>    the under-development tool I linked to, which is a collaboration between
>    the Wikimedia Foundation Research team
>    <https://research.wikimedia.org/> and Growth team
>    <https://www.mediawiki.org/wiki/Growth> that builds on an open-source
>    tool
>    
> <https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task/LanguageTool>
>    .
>    - Link recommendation
>    
> <https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Structured_tasks/Add_a_link#Link_recommendation_algorithm>:
>    detecting links that could be added to Wikipedia articles. The NLP aspect
>    mainly involves accurately parsing wikitext into sentences/words
>    (rule-based) and comparing the similarity of the source article and pages
>    that are potential target links (ML). Also a collaboration between the
>    Research team and the Growth team.
>    - Content similarity: various tools such as SuggestBot
>    <https://en.wikipedia.org/wiki/User:SuggestBot>, RelatedArticles
>    Extension <https://www.mediawiki.org/wiki/Extension:RelatedArticles>,
>    or GapFinder <https://www.mediawiki.org/wiki/GapFinder> use the morelike
>    functionality of the CirrusSearch
>    <https://www.mediawiki.org/wiki/Help:CirrusSearch#Page_weighting>
>    backend maintained by the Search team to find Wikipedia articles with
>    similar topics -- this is largely finding keyword overlap between content
>    with clever pre-processing/weighting as described by Trey.
>    - Readability
>    
> <https://meta.wikimedia.org/wiki/Research:Multilingual_Readability_Research>:
>    scoring content based on its readability. Under development by Research team.
>    - Topic classification: predicting what high-level topics are associated
>    with Wikipedia articles. The current model for English Wikipedia
>    <https://www.mediawiki.org/wiki/ORES#Topic_routing> uses word
>    embeddings from the article to make predictions (ML) and a proposed
>    model
>    
> <https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language_agnostic_link-based_article_topic_model_card>
>    from the Research team will use NLP models but with article links instead
>    of article text, to support more (all) language editions.
>    - Citation needed <https://meta.wikimedia.org/wiki/Citation_Detective>:
>    detecting sentences in need of citations (ML). Prototype developed by
>    Research team.
>    - Edit Types
>    <https://meta.wikimedia.org/wiki/Research:Wikipedia_Edit_Types>:
>    summarizing how much text changed between two revisions of a Wikipedia
>    article -- e.g., how many words/sentences changed (rule-based). Prototype
>    developed by Research team.
>    - Vandalism detection: a number of different approaches are in use on the
>    wikis. Most have some form of "bad word" list (generally a mix of
>    automatically and manually generated entries), extract words from new
>    edits, compare those words to the bad-word list, and use the result to
>    help judge how likely the edit is to be vandalism (see the sketch after
>    this list). Examples include many filters in AbuseFilter
>    <https://www.mediawiki.org/wiki/Extension:AbuseFilter>, volunteer-led
>    efforts such as ClueBot NG
>    <https://en.wikipedia.org/wiki/User:ClueBot_NG#Bayesian_Classifiers>
>    (English Wikipedia) and Salebot
>    <https://fr.wikipedia.org/wiki/Utilisateur:Salebot> (French Wikipedia)
>    as well as the Wikimedia Foundation ORES edit quality models
>    <https://www.mediawiki.org/wiki/ORES/BWDS_review> (many wikis).
>    - Sockpuppet detection
>    <https://www.mediawiki.org/wiki/User:Ladsgroup/masz>: finding editors
>    who have similar stylistic patterns in their comments (volunteer tool).
>    - Content Translation was mentioned -- there are numerous potential
>    translation models available
>    
> <https://www.mediawiki.org/wiki/Content_translation/Machine_Translation/MT_Clients#Machine_translation_clients>,
>    of which some are rule-based and some are ML. Tool maintained by Wikimedia
>    Foundation Language team
>    <https://www.mediawiki.org/wiki/Wikimedia_Language_engineering> but
>    depends on several external APIs.
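>
> To make the word-list pattern in the vandalism item concrete, here is a
> deliberately simplified sketch in Python (hypothetical word list and scoring;
> not the actual logic of AbuseFilter, ClueBot NG, Salebot, or ORES):
>
> # Toy "bad word" scoring of an edit (illustration only; real tools combine
> # many more signals, per-wiki word lists, and learned weights).
> import re
>
> BAD_WORDS = {"stupid", "poop", "hahaha"}  # hypothetical list
>
> def newly_added_words(old_wikitext, new_wikitext):
>     # crude stand-in for a real diff: words that appear only in the new revision
>     old = set(re.findall(r"\w+", old_wikitext.lower()))
>     return [w for w in re.findall(r"\w+", new_wikitext.lower()) if w not in old]
>
> def vandalism_score(old_wikitext, new_wikitext):
>     words = newly_added_words(old_wikitext, new_wikitext)
>     if not words:
>         return 0.0
>     return sum(w in BAD_WORDS for w in words) / len(words)
>
> print(vandalism_score("The cat sat.", "The stupid cat sat. hahaha"))  # 1.0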
>
> I've also done some thinking (which might be of interest) about what a
> natural language modeling strategy looks like for Wikimedia that balances
> effectiveness of models with equity/sustainability of supporting so many
> different language communities:
> https://meta.wikimedia.org/wiki/User:Isaac_(WMF)/Language_modeling
>
> Hope that helps.
>
> Best,
> Isaac
>
>
> On Wed, Jun 22, 2022, 10:43 Trey Jones <tjo...@wikimedia.org> wrote:
>
>> Do you have examples of projects using NLP in Wikimedia communities?
>>>
>>
>> I do! Defining NLP is something of a moving target, and the most common
>> definition, which I learned when I worked in industry, is that "NLP" has
>> often been used as a buzzword that means "any language processing you do
>> that your competitors don't". Getting away from profit-driven buzzwords, I
>> have a pretty generous definition of NLP, as any software that improves
>> language-based interactions between people and computers.
>>
>> Guillaume mentioned CirrusSearch in general, but there are lots of
>> specific parts within search. I work on a lot of NLP-type stuff for search,
>> and I write a lot of documentation on MediaWiki, so this is biased towards
>> stuff I have worked on or know about.
>>
>> Language analysis is the general process of converting text (say, of
>> Wikipedia articles) into tokens (approximately "words" in English) to be
>> stored in the search index. There are lots of different levels of
>> complexity in the language analysis. We currently use Elasticsearch, and
>> they provide a lot of language-specific analysis tools (link to Elastic
>> language analyzers
>> <https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-lang-analyzer.html>),
>> which we customize and build on.
>>
>> Here is part of the config for English, reordered into processing order
>> rather than alphabetical order, and annotated:
>>
>> "text": {
>>     "type": "custom",
>>     "char_filter": [
>>         "word_break_helper", — break_up.words:with(uncommon)separators
>>         "kana_map" — map Japanese Hiragana to Katakana (notes
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Hiragana_to_Katakana_Mapping_for_English_and_Japanese>
>> )
>>     ],
>>     "tokenizer": "standard", — break text into tokens/words; not trivial
>> for English, very hard for other languages (blog post
>> <https://wikimediafoundation.org/news/2018/08/07/anatomy-search-token-affection/>
>> )
>>     "filter": [
>>         "aggressive_splitting", —splitting of more likely *multi-part*
>> *ComplexTokens*
>>         "homoglyph_norm", —correct typos/vandalism that mix Latin
>> and Cyrillic letters (notes
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Homoglyphs>)
>>         "possessive_english", —special processing for *English's*
>> possessive forms
>>         "icu_normalizer", —normalization of text (blog post
>> <https://wikimediafoundation.org/news/2018/09/13/anatomy-search-variation-under-nature/>
>> )
>>         "stop", —removal of stop words (blog post
>> <https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/>,
>> section "To be or not to be indexed")
>>         "icu_folding", —more aggressive normalization
>>         "remove_empty", —misc bookkeeping
>>         "kstem", —stemming (blog post
>> <https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/>
>> )
>>         "custom_stem" —more stemming
>>     ],
>> },
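>>
>> If you want to see what a chain like that actually does to a piece of text,
>> Elasticsearch's _analyze API runs any named analyzer and returns the
>> resulting tokens. A rough sketch in Python (the host and index name are
>> placeholders, not our production cluster):
>>
>> # Run a sample string through the "text" analyzer above and print the tokens.
>> # Host and index name are placeholders; point this at a test cluster.
>> import requests
>>
>> resp = requests.post(
>>     "http://localhost:9200/my_test_index/_analyze",
>>     json={"analyzer": "text", "text": "The quick foxes jumped"},
>> )
>> for tok in resp.json()["tokens"]:
>>     print(tok["position"], tok["token"])
>>
>> (Expect something like quick / fox / jump, with "the" dropped as a stop word.)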
>>
>> Tokenization, normalization, and stemming can vary wildly between
>> languages. Some other elements (from Elasticsearch or custom-built by us):
>>
>>    - Stemmers and stop words for specific languages, including some
>>    open-source ones that we ported, and some developed with community help.
>>    - Elision processing (*l'homme* == *homme*)
>>    - Normalization for digits (١ ٢ ٣ / १ २ ३ / ①②③ / 123)
>>    - Custom lowercasing—Greek, Irish, and Turkish have special
>>    processing (notes
>>    
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language-Specific_Lowercasing_and_ICU_Normalization>
>>    )
>>    - Normalization of written Khmer (blog post
>>    
>> <https://techblog.wikimedia.org/2020/06/02/permuting-khmer-restructuring-khmer-syllables-for-search/>
>>    )
>>    - Notes on lots more
>>    
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Elasticsearch_Analysis_Chain_Analysis>
>>    ...
>>
>> We also did some work improving "Did you mean" suggestions, which
>> currently uses both the built-in suggestions from Elasticsearch (not always
>> great, but there are lots of them) and new suggestions from a module we
>> called "Glent
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Glent_%22Did_You_Mean%22_Suggestions>"
>> (much better, but not as many suggestions).
>>
>> We have some custom language detection available on some Wikipedias, so
>> that if you don't get very many results and your query looks like it is
>> another language, we show results from that other language. For example,
>> searching for Том Хэнкс on English Wikipedia
>> <https://en.wikipedia.org/w/index.php?search=%D0%A2%D0%BE%D0%BC+%D0%A5%D1%8D%D0%BD%D0%BA%D1%81&ns0=1>
>> will show results from Russian Wikipedia. (too many notes
>> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#TextCat,_Language_ID,_Etc.>
>> )
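>>
>> (The rough idea behind that detection is classic TextCat-style character
>> n-gram comparison: build per-language n-gram frequency profiles from known
>> text and see which profile a query most resembles. A heavily simplified toy
>> version in Python, not the code we actually run:)
>>
>> # Toy character-trigram language guesser (illustration only).
>> from collections import Counter
>>
>> def trigrams(text):
>>     text = "  " + text.lower() + "  "
>>     return Counter(text[i:i + 3] for i in range(len(text) - 2))
>>
>> # tiny hand-made "profiles"; real profiles come from large per-language corpora
>> profiles = {
>>     "en": trigrams("the quick brown fox jumps over the lazy dog"),
>>     "ru": trigrams("съешь же ещё этих мягких французских булок"),
>> }
>>
>> def guess_language(query):
>>     q = trigrams(query)
>>     # pick the language whose profile shares the most trigrams with the query
>>     return max(profiles, key=lambda lang: sum((q & profiles[lang]).values()))
>>
>> print(guess_language("мягкие французские булки"))  # -> ru
>> print(guess_language("the lazy brown dog"))        # -> en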
>>
>> Outside of our search work, there are lots more. Some that come to mind:
>>
>>    - Language Converter supports languages with multiple writing
>>    systems, which is sometimes easy and sometimes really hard. (blog post
>>    
>> <https://diff.wikimedia.org/2018/03/12/supporting-languages-multiple-writing-systems/>
>>    )
>>    - There's a Wikidata gadget on French Wikipedia and others that
>>    appends results from Wikidata and generates descriptions in various
>>    languages based on the Wikidata information. For example, searching for
>>    Molenstraat Vught on French Wikipedia
>> <https://fr.wikipedia.org/w/index.php?search=Molenstraat+Vught&title=Sp%C3%A9cial:Recherche&profile=advanced&fulltext=1&ns0=1>,
>>    gives no local results, but shows two "Results from Wikidata" / "Résultats
>>    sur Wikidata" (if you are logged in you get results in your preferred
>>    language, if possible, otherwise the language of the project):
>>       - Molenstraat ; hameau de la commune de Vught (in French, when I'm
>>       not logged in)
>>       - Molenstraat ; street in Vught, the Netherlands (fallback to
>>       English for some reason)
>>    - The whole giant Content Translation project that uses machine
>>    translation to assist translating articles across wikis. (blog post
>>    
>> <https://wikimediafoundation.org/news/2019/01/09/you-can-now-use-google-translate-to-translate-articles-on-wikipedia/>
>>    )
>>
>> There's lots more out there, I'm sure—but I gotta run!
>> —Trey
>>
>> Trey Jones
>> Staff Computational Linguist, Search Platform
>> Wikimedia Foundation
>> UTC–4 / EDT
>>