[Corpora-List] Release of the massive HPLT v3.0 multilingual dataset

Andrey Kutuzov via Corpora Sat, 18 Oct 2025 09:11:53 -0700

October is back and so are HPLT datasets (we've been doing this for three consecutive years now!) This time, we announce the release of the massive HPLT v3.0 multilingual dataset which can be considered a major upgrade for large-scale multilingual corpora.

Accounting for 29 billion documents, 198 language-script combinations and 112 trillion characters, v3.0 shows significant gains over v2, driven by several improvements, including a new global deduplication process:


- Unique content boosted from 52% to 73% on average.

- Data substance and robustness remain high with better extraction and improved language identification.

- Shows increased variety and better representativeness of natural web content.

This release provides a cleaner, more robust dataset for building powerful LLMs and machine translation systems, including a myriad of low- to medium-resourced languages. And we have not said our last word: wait for more data soon because we are already working on it.

Special thanks to all the collaborators and funding bodies, including the European Union's Horizon Europe program and UK Research and Innovation.

Explore the data and see the full analysis and evaluation highlights on our website:

https://hplt-project.org/datasets/v3.0


--
Andrey
Language Technology Group (LTG)
University of Oslo

_______________________________________________
Corpora mailing list -- [email protected]
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to [email protected]
