Hello Ilario, You might find this blog post I wrote a while back interesting: https://blog.kensho.com/announcing-the-kensho-derived-wikimedia-dataset-5d1197d72bcf
In it you can find a brief (and definitely not comprehensive) review of NLP with Wiki* along with links to an open-source Kaggle dataset I built connecting the plain text of Wikipedia, the anchor links between pages, and the links to Wikidata. There are a few notebooks that demonstrate its use ... my favorites are probably: * Pointwise Mutual Information embeddings https://www.kaggle.com/code/kenshoresearch/kdwd-pmi-word-vectors * Analyzing the "subclass of" graph from Wikidata https://www.kaggle.com/code/gabrielaltay/kdwd-subclass-path-ner * Explicit topic modeling https://www.kaggle.com/code/kenshoresearch/kdwd-explicit-topic-models and if you are still looking for more after that, this is the query that gives more every time you use it :) https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=wikipedia&terms-0-field=abstract&terms-1-operator=OR&terms-1-term=wikidata&terms-1-field=abstract&classification-computer_science=y&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first best, -G On Thu, Jun 23, 2022 at 4:17 PM Isaac Johnson <is...@wikimedia.org> wrote: > Chiming in as a member of the Wikimedia Foundation Research team > <https://research.wikimedia.org/> (so you'll see that this likely biases the > examples I'm aware of). I'd say that the most common type of NLP that shows > up in our applications is tokenization / language analysis -- i.e., splitting > wikitext into words/sentences. As Trey said, this tokenization is > non-trivial for English and gets much harder in other languages that have > more complex constructions / don't use spaces to delimit words > <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Spaceless_Writing_Systems_and_Wiki-Projects>. > These tokens often then become inputs into other types of models that > aren't necessarily NLP. There are a number of more complex NLP technologies > too that don't just identify words but try to identify similarities between > them, translate them, etc. > > Some examples below. Additionally, I indicated whether each application > was rule-based (follows a series of deterministic heuristics) or ML > (learned, probabilistic model) in case that's of interest: > > - Copyedit > <https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task>: > identifying potential grammar/spelling issues in articles (rule-based). I > believe there are a number of volunteer-run bots on the wikis as well as > the under-development tool I linked to, which is a collaboration between > the Wikimedia Foundation Research team > <https://research.wikimedia.org/> and the Growth team > <https://www.mediawiki.org/wiki/Growth> that builds on an open-source tool > <https://meta.wikimedia.org/wiki/Research:Copyediting_as_a_structured_task/LanguageTool>. > - Link recommendation > <https://www.mediawiki.org/wiki/Growth/Personalized_first_day/Structured_tasks/Add_a_link#Link_recommendation_algorithm>: > detecting links that could be added to Wikipedia articles. The NLP aspect > mainly involves accurately parsing wikitext into sentences/words > (rule-based) and comparing the similarity of the source article and pages > that are potential target links (ML). Also a collaboration between the Research > team and the Growth team. 
> - Content similarity: various tools such as SuggestBot > <https://en.wikipedia.org/wiki/User:SuggestBot>, RelatedArticles > Extension <https://www.mediawiki.org/wiki/Extension:RelatedArticles>, > or GapFinder <https://www.mediawiki.org/wiki/GapFinder> use the morelike > functionality of the CirrusSearch > <https://www.mediawiki.org/wiki/Help:CirrusSearch#Page_weighting> > backend maintained by the Search team to find Wikipedia articles with > similar topics -- this is largely finding keyword overlap between content > with clever pre-processing/weighting as described by Trey. > - Readability > <https://meta.wikimedia.org/wiki/Research:Multilingual_Readability_Research>: > scoring content based on its readability. Under development by the Research team. > - Topic classification: predicting what high-level topics are associated > with Wikipedia articles. The current model for English Wikipedia > <https://www.mediawiki.org/wiki/ORES#Topic_routing> uses word > embeddings from the article to make predictions (ML), and a proposed > model > <https://meta.wikimedia.org/wiki/Machine_learning_models/Proposed/Language_agnostic_link-based_article_topic_model_card> > from the Research team will use similar NLP models but with article links instead, > to support more (ideally all) language editions. > - Citation needed <https://meta.wikimedia.org/wiki/Citation_Detective>: > detecting sentences in need of citations (ML). Prototype developed by the > Research team. > - Edit Types > <https://meta.wikimedia.org/wiki/Research:Wikipedia_Edit_Types>: > summarizing how much text changed between two revisions of a Wikipedia > article -- e.g., how many words/sentences changed (rule-based). Prototype > developed by the Research team. > - Vandalism detection: a number of different approaches in use on the > wikis generally have some form of a "bad word" list (usually a mix of > automatically and manually generated entries); they extract words from new edits, compare these > words to the bad word list, and then use this to help judge how likely the > edit is to be vandalism (a toy code sketch of this idea appears at the end of this message). Examples include many filters in AbuseFilter > <https://www.mediawiki.org/wiki/Extension:AbuseFilter>, volunteer-led > efforts such as ClueBot NG > <https://en.wikipedia.org/wiki/User:ClueBot_NG#Bayesian_Classifiers> > (English Wikipedia) and Salebot > <https://fr.wikipedia.org/wiki/Utilisateur:Salebot> (French Wikipedia), > as well as the Wikimedia Foundation ORES edit quality models > <https://www.mediawiki.org/wiki/ORES/BWDS_review> (many wikis). > - Sockpuppet detection > <https://www.mediawiki.org/wiki/User:Ladsgroup/masz>: finding editors > who have similar stylistic patterns in their comments (volunteer tool). > - Content Translation was mentioned -- there are numerous potential > translation models available > <https://www.mediawiki.org/wiki/Content_translation/Machine_Translation/MT_Clients#Machine_translation_clients>, > of which some are rule-based and some are ML. The tool is maintained by the Wikimedia > Foundation Language team > <https://www.mediawiki.org/wiki/Wikimedia_Language_engineering> but > depends on several external APIs. > > I've also done some thinking, which might be of interest, about what a > natural-language modeling strategy looks like for Wikimedia that balances > the effectiveness of models with the equity/sustainability of supporting so many > different language communities: > https://meta.wikimedia.org/wiki/User:Isaac_(WMF)/Language_modeling > > Hope that helps. 
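> To make the "bad word" approach above a bit more concrete, here is a toy Python sketch. The word list, tokenization, and scoring are invented purely for illustration; the real tools use much longer lists, richer features, and, in the ML cases, trained classifiers.
>
> import re
>
> BAD_WORDS = {"idiot", "stupid", "loser"}  # illustrative only
>
> def added_words(old_text, new_text):
>     """Very rough proxy for 'words added by the edit'."""
>     old_tokens = set(re.findall(r"\w+", old_text.lower()))
>     return [w for w in re.findall(r"\w+", new_text.lower()) if w not in old_tokens]
>
> def vandalism_score(old_text, new_text):
>     """Fraction of newly added words that appear on the bad-word list."""
>     added = added_words(old_text, new_text)
>     if not added:
>         return 0.0
>     return sum(w in BAD_WORDS for w in added) / len(added)
>
> # Example: the edit adds four new words, one of which is on the list.
> print(vandalism_score("The cat sat.", "The cat sat. You are an idiot."))  # 0.25
>
> A rule-based filter might simply apply a threshold to a score like this; the ML systems (e.g., ClueBot NG's Bayesian classifiers, ORES) instead learn from many such signals.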
> > Best, > Isaac > > > On Wed, Jun 22, 2022, 10:43 Trey Jones <tjo...@wikimedia.org> wrote: > >> Do you have examples of projects using NLP in Wikimedia communities? >>> >> >> I do! Defining NLP is something of a moving target, and the most common >> definition, which I learned when I worked in industry, is that "NLP" has >> often been used as a buzzword that means "any language processing you do >> that your competitors don't". Getting away from profit-driven buzzwords, I >> have a pretty generous definition of NLP: any software that improves >> language-based interactions between people and computers. >> >> Guillaume mentioned CirrusSearch in general, but there are lots of >> specific parts within search. I work on a lot of NLP-type stuff for search, >> and I write a lot of documentation on MediaWiki, so this is biased towards >> stuff I have worked on or know about. >> >> Language analysis is the general process of converting text (say, of >> Wikipedia articles) into tokens (approximately "words" in English) to be >> stored in the search index. There are lots of different levels of >> complexity in the language analysis. We currently use Elasticsearch, and >> they provide a lot of language-specific analysis tools (link to Elastic >> language analyzers >> <https://www.elastic.co/guide/en/elasticsearch/reference/7.10/analysis-lang-analyzer.html>), >> which we customize and build on. >> >> Here is part of the config for English, reordered to be chronological >> rather than alphabetical, and annotated (there is a toy end-to-end sketch of what a chain like this does near the end of this message): >> >> "text": { >> "type": "custom", >> "char_filter": [ >> "word_break_helper", — break_up.words:with(uncommon)separators >> "kana_map" — map Japanese Hiragana to Katakana (notes >> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Hiragana_to_Katakana_Mapping_for_English_and_Japanese> >> ) >> ], >> "tokenizer": "standard", — break text into tokens/words; not trivial >> for English, very hard for other languages (blog post >> <https://wikimediafoundation.org/news/2018/08/07/anatomy-search-token-affection/> >> ) >> "filter": [ >> "aggressive_splitting", — splitting of more likely *multi-part* >> *ComplexTokens* >> "homoglyph_norm", — correct typos/vandalism that mix Latin >> and Cyrillic letters (notes >> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Homoglyphs>) >> "possessive_english", — special processing for *English's* >> possessive forms >> "icu_normalizer", — normalization of text (blog post >> <https://wikimediafoundation.org/news/2018/09/13/anatomy-search-variation-under-nature/> >> ) >> "stop", — removal of stop words (blog post >> <https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/>, >> section "To be or not to be indexed") >> "icu_folding", — more aggressive normalization >> "remove_empty", — misc bookkeeping >> "kstem", — stemming (blog post >> <https://wikimediafoundation.org/news/2018/11/28/anatomy-search-the-root-of-the-problem/> >> ) >> "custom_stem" — more stemming >> ], >> }, >> >> Tokenization, normalization, and stemming can vary wildly between >> languages. Some other elements (from Elasticsearch or custom-built by us): >> >> - Stemmers and stop words for specific languages, including some >> open-source ones that we ported, and some developed with community help. 
>> - Elision processing (*l'homme* == *homme*) >> - Normalization for digits (١ ٢ ٣ / १ २ ३ / ①②③ / 123) >> - Custom lowercasing—Greek, Irish, and Turkish have special >> processing (notes >> >> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes/Language-Specific_Lowercasing_and_ICU_Normalization> >> ) >> - Normalization of written Khmer (blog post >> >> <https://techblog.wikimedia.org/2020/06/02/permuting-khmer-restructuring-khmer-syllables-for-search/> >> ) >> - Notes on lots more >> >> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Elasticsearch_Analysis_Chain_Analysis> >> ... >> >> We also did some work improving "Did you mean" suggestions, which >> currently uses both the built-in suggestions from Elasticsearch (not always >> great, but there are lots of them) and new suggestions from a module we >> called "Glent >> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#Glent_%22Did_You_Mean%22_Suggestions>" >> (much better, but not as many suggestions). >> >> We have some custom language detection available on some Wikipedias, so >> that if you don't get very many results and your query looks like it is >> another language, we show results from that other language. Example, >> searching >> for Том Хэнкс on English Wikipedia >> <https://en.wikipedia.org/w/index.php?search=%D0%A2%D0%BE%D0%BC+%D0%A5%D1%8D%D0%BD%D0%BA%D1%81&ns0=1> >> will >> show results from Russian Wikipedia. (too many notes >> <https://www.mediawiki.org/wiki/User:TJones_(WMF)/Notes#TextCat,_Language_ID,_Etc.> >> ) >> >> Outside of our search work, there are lots more. Some that come to mind: >> >> - Language Converter supports languages with multiple writing >> systems, which is sometimes easy and sometimes really hard. (blog post >> >> <https://diff.wikimedia.org/2018/03/12/supporting-languages-multiple-writing-systems/> >> ) >> - There's a Wikidata gadget on French Wikipedia and others that >> appends results from Wikidata and generates descriptions in various >> languages based on the Wikidata information. For example, searching for >> Molenstraat >> Vught on French Wikipedia >> >> <https://fr.wikipedia.org/w/index.php?search=Molenstraat+Vught&title=Sp%C3%A9cial:Recherche&profile=advanced&fulltext=1&ns0=1>, >> gives no local results, but shows two "Results from Wikidata" / "Résultats >> sur Wikidata" (if you are logged in you get results in your preferred >> language, if possible, otherwise the language of the project): >> - Molenstraat ; hameau de la commune de Vught (in French, when I'm >> not logged in) >> - Molenstraat ; street in Vught, the Netherlands (fallback to >> English for some reason) >> - The whole giant Content Translation project that uses machine >> translation to assist translating articles across wikis. (blog post >> >> <https://wikimediafoundation.org/news/2019/01/09/you-can-now-use-google-translate-to-translate-articles-on-wikipedia/> >> ) >> >> There's lots more out there, I'm sure—but I gotta run! 
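>> One quick illustration before I go: strung together, a chain like the one above behaves roughly like this toy Python sketch. Each function is a drastically simplified stand-in for the real Elasticsearch filter named in its comment; the stop list, regex, and "stemmer" are invented for illustration and are not the actual implementation.
>>
>> import re
>> import unicodedata
>>
>> STOP_WORDS = {"the", "of", "to", "and", "a", "in", "is"}  # tiny toy list
>>
>> def tokenize(text):            # stand-in for the "standard" tokenizer
>>     return re.findall(r"\w+(?:'\w+)?", text)
>>
>> def strip_possessive(token):   # stand-in for possessive_english
>>     return token[:-2] if token.lower().endswith("'s") else token
>>
>> def normalize(token):          # stand-in for icu_normalizer / icu_folding
>>     nfkd = unicodedata.normalize("NFKD", token.lower())
>>     return "".join(c for c in nfkd if not unicodedata.combining(c))
>>
>> def toy_stem(token):           # crude stand-in for kstem
>>     for suffix in ("ing", "es", "s"):
>>         if token.endswith(suffix) and len(token) > len(suffix) + 2:
>>             return token[: -len(suffix)]
>>     return token
>>
>> def analyze(text):
>>     tokens = (strip_possessive(t) for t in tokenize(text))
>>     tokens = (normalize(t) for t in tokens)
>>     tokens = (t for t in tokens if t not in STOP_WORDS)
>>     return [toy_stem(t) for t in tokens]
>>
>> print(analyze("The Encyclopédie's editors were searching the archives"))
>> # ['encyclopedie', 'editor', 'were', 'search', 'archiv']
>>
>> The real chain also does things the sketch ignores entirely (word_break_helper, kana_map, aggressive_splitting, homoglyph_norm, remove_empty, custom_stem), plus all the per-language variation listed above.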
>> —Trey >> >> Trey Jones >> Staff Computational Linguist, Search Platform >> Wikimedia Foundation >> UTC–4 / EDT