October is back, and so are the HPLT datasets (we've been doing this
for three consecutive years now!).
This time, we announce the release of the massive HPLT v3.0 multilingual
dataset, a major upgrade for large-scale multilingual corpora.
With 29 billion documents, 198 language-script combinations
and 112 trillion characters, v3.0 shows significant gains over v2,
driven by several improvements, including a new global deduplication
process:
- Unique content rose from 52% to 73% on average.
- Data substance and robustness remain high, with better extraction and
improved language identification.
- The data shows increased variety and better represents natural web
content.
This release provides a cleaner, more robust dataset for building
powerful LLMs and machine translation systems, including for a myriad of
low- to medium-resource languages. And this is not our last word:
more data is coming soon, and we are already working on it.
Special thanks to all the collaborators and funding bodies, including
the European Union's Horizon Europe program and UK Research and Innovation.
Explore the data and see the full analysis and evaluation highlights on
our website:
https://hplt-project.org/datasets/v3.0
--
Andrey
Language Technology Group (LTG)
University of Oslo