WSDM Cup 2026 - MULTILINGUAL RETRIEVAL
https://wsdmcup-2026.github.io

*** CHALLENGE FEATURES ***
Why you should care about this Cup!

* Multilingual: Searching when queries and documents are in different 
languages is difficult.
* For Better RAG: Incorporating information in different languages is critical 
for effective RAG.
* Informational Queries: We provide rich informational queries rather than 
factoid QA questions.

*** CALL FOR PARTICIPATION ***
Retrieval-Augmented Generation (RAG) systems provide an opportunity to expand 
the scope of available information to users, since they are able to retrieve 
and synthesize information from documents in languages that the user does not 
necessarily understand. The ability to retrieve documents only based on their 
relevance, regardless of language, is crucial for modern retrieval models to 
support better coverage of perspectives from different parts of the world. 
Thus, WSDM Cup 2026 features a multilingual retrieval task.

The participants will develop systems that receive English queries and search a 
collection of about 10 million documents 
(https://huggingface.co/datasets/neuclir/neuclir1) in Chinese (3.1M), Persian 
(2.2M), and Russian (4.6M). For each query, the system must produce a list of 
1,000 documents selected from the entire multilingual collection, ranked by 
likelihood of relevance to the topic. All systems must operate automatically, 
without human intervention. Submissions must be in the TREC run file format. 
Each team may make up to 5 submissions, which will be evaluated using 
nDCG@20.
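Since runs are scored with nDCG@20, a minimal sketch of that metric may help with local validation against the development labels. This uses a common formulation with a log2 rank discount; the organizers' exact evaluation tooling is not specified here, so treat this as an approximation rather than the official scorer:

```python
import math

def dcg(gains):
    # Discounted cumulative gain: gain at rank i (1-based) is
    # discounted by log2(i + 1), i.e. log2(2) at rank 1.
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains))

def ndcg_at_k(run_gains, judged_gains, k=20):
    # run_gains: relevance grades of the retrieved documents, in rank order.
    # judged_gains: grades of all judged documents for the query, used to
    # build the ideal ranking.
    ideal = dcg(sorted(judged_gains, reverse=True)[:k])
    return dcg(run_gains[:k]) / ideal if ideal > 0 else 0.0
```

A run that places all highly graded documents first scores 1.0; swapping a relevant document below a non-relevant one lowers the score.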

*** PARTICIPATION AND SUBMISSION INSTRUCTIONS ***
There are 41 development queries that participants can use in system 
development. Participants are free to use other data, such as MIRACL. However, 
use of TREC NeuCLIR Track data beyond the development queries provided by the 
organizers is strictly prohibited; in particular, participants must not use 
the publicly available relevance assessments for this collection (NeuCLIR) 
besides the development queries and labels provided by the Cup. Participants 
will be asked to provide the code for their training and inference processes, 
either as a tarball submission or via a publicly available repository.

All participants are expected to submit a short write-up about their 
submissions (similar to the TREC system paper). Selected teams (including the 
winner of the Cup) will receive a slot at WSDM for oral presentation.

Run submissions will also be made through the submission Google Form. Every 
test query should have at least 20 retrieved documents in each submission 
file. The following is an example of the submission format (the TREC run 
format).
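As a hedged illustration (the query ID, document IDs, scores, and run tag below are hypothetical, not drawn from the Cup's data), a TREC run file contains one whitespace-separated line per retrieved document with six columns: query ID, the literal string `Q0`, document ID, rank, score, and run tag. A small helper to emit that format might look like:

```python
def write_trec_run(rankings, run_tag, path):
    # rankings: {query_id: [(doc_id, score), ...]}, each list already
    # sorted by descending score.
    # Writes the standard six-column TREC run format, one line per document:
    #   query_id Q0 doc_id rank score run_tag
    with open(path, "w", encoding="utf-8") as f:
        for qid, docs in rankings.items():
            for rank, (doc_id, score) in enumerate(docs, start=1):
                f.write(f"{qid} Q0 {doc_id} {rank} {score} {run_tag}\n")
```

For example, `write_trec_run({"q1": [("doc-a", 14.2), ("doc-b", 13.5)]}, "myrun", "run.txt")` would produce lines such as `q1 Q0 doc-a 1 14.2 myrun`.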

*** IMPORTANT DATES ***
* November 17, 2025: Document collection, development/test queries, and the 
submission portal are available.
* December 1, 2025: Online Q&A session, if needed.
* February 2, 2026: Submission due
* February 22-26, 2026: WSDM Conference; winner and evaluation result 
announcement. An overview technical report will be released along with the 
final results.

*** ORGANIZING COMMITTEE ***
* Dawn Lawrie (HLTCOE, Johns Hopkins University)
* Sean MacAvaney (University of Glasgow)
* James Mayfield (HLTCOE, Johns Hopkins University)
* Luca Soldaini (Allen Institute for AI)
* Eugene Yang (HLTCOE, Johns Hopkins University)
* Andrew Yates (HLTCOE, Johns Hopkins University)
_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]