[Corpora-List] October 2025 Newsletter - LDC

Penn LDC via Corpora Sat, 18 Oct 2025 05:55:53 -0700

In this newsletter:
Membership year 2026 publication preview
Fall 2025 data scholarship recipients


New publications:
KAIROS Phase 2 Quizlet<https://catalog.ldc.upenn.edu/LDC2025T15>
BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Audio<https://catalog.ldc.upenn.edu/LDC2025S09>
BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and Translations<https://catalog.ldc.upenn.edu/LDC2025T14>

________________________________
Membership year 2026 publication preview
The 2026 membership year is approaching and plans for next year's publications 
are in progress. Among the expected releases are:

  *   2012 NIST Speaker Recognition Evaluation Test Set: 10,000+ hours of 
English conversational telephone speech following the Mixer collection 
protocol, used in NIST's 2012 speaker recognition evaluation
  *   KAIROS schema learning corpus background data and Phase 1 evaluation 
datasets: multimodal English and Spanish source data and annotations for 
reasoning about complex real-world events
  *   CALL MY NET 2: 800+ hours of Tunisian Arabic conversational telephone 
speech from over 400 speakers to support text-independent speaker recognition, 
used in the 2018 NIST Speaker Recognition Evaluation
  *   Multi-language conversational telephone speech: multiple releases, 
hundreds of hours of speech from speakers of confusable linguistic varieties 
(Arabic, Chinese, English, French, Slavic, Spanish) to support language 
identification
  *   CALLHOME Omnibus releases: combined speech and transcript datasets with 
updated directory structure, file formats and documentation, and lexicons 
(Chinese, English, German, Japanese, Spanish)
  *   IARPA MATERIAL language packs: conversational telephone speech, 
transcripts, English translations, annotations, and queries in multiple 
languages (e.g., Lithuanian, Pashto, Swahili, Tagalog)

Check your inbox for more information about membership renewal.

Fall 2025 data scholarship recipients
Congratulations to the recipients of LDC's Fall 2025 data scholarships:

Lasidu Dilshan: University of Moratuwa (Sri Lanka): BSc, Electronic and 
Telecommunication Engineering. Lasidu is awarded a copy of Asian Elephant 
Vocalizations LDC2010S05 for his work in elephant voice enhancement and 
classification.

Máté Gedeon: Budapest University of Technology and Economics (Hungary): PhD 
candidate, Department of Telecommunications and Artificial Intelligence. Máté 
is awarded a copy of Switchboard-1 Release 2 LDC97S62 for his work in simulated 
conversation generation.

Ping He: Northeastern University (USA): Student, Khoury College of Computer 
Sciences. Ping is awarded a copy of ETS Corpus of Non-Native Written English 
LDC2014T06 for their work in native language identification.

Thiyazen Iskander: Maulana Azad College of Arts, Science & Commerce (India), 
affiliated with Babasaheb Ambedkar Technological University (India): PhD 
candidate, Linguistics, Department of English. Thiyazen is awarded copies of 
Arabic Morphological Analyzer (SAMA) Version 3.1 LDC2010L01 and Arabic Treebank 
Part 1 v. 4.1 LDC2010T13 for his work in morphosyntactic analysis of short 
passives in Standard Arabic.

Michael Mooney: University of Glasgow (United Kingdom): PhD candidate, School 
of Computing Sciences. Michael is awarded copies of Treebank-2 LDC95T7 and 
BLLIP 1987-89 WSJ Corpus Release LDC2000T43 for their work in eye-tracking for 
text-centered modeling.

Abraham Sanders: Rensselaer Polytechnic Institute (USA): PhD candidate, 
Cognitive Science. Abraham is awarded a copy of Switchboard-1 Release 2 
LDC97S62 for his work in spoken dialogue systems.

________________________________
New publications:
KAIROS Phase 2 Quizlet<https://catalog.ldc.upenn.edu/LDC2025T15> was developed 
by LDC and contains English and Spanish text, video and image data, and 
annotations used for pre-evaluation research and system development during 
Phase 2 of the DARPA KAIROS program. KAIROS Quizlets were a series of narrowly 
defined tasks designed to explore specific evaluation objectives, enabling 
KAIROS system developers to exercise individual system components on a small 
data set prior to the full program evaluation. This corpus contains the 
complete set of Quizlet data used in Phase 2, which focused on five real-world 
complex events within the Disease Outbreak scenario.

Source data was collected from the web; 66 root web pages were collected and 
processed, yielding 65 text data files, 890 image files and 10 video files. 
Annotation steps included labeling scenario-relevant events and relations for 
each document to develop a structured representation of temporally ordered 
events, relations and arguments; generating a reference knowledge graph; and 
linking labeled entries to a knowledge base derived from a Wikidata-based 
ontology.

The DARPA KAIROS (Knowledge-directed Artificial Intelligence Reasoning Over 
Schemas) program aimed to build technology capable of understanding and 
reasoning about complex real-world events in order to provide actionable 
insights to end users. KAIROS systems utilized formal event representations in 
the form of schema libraries that specified the steps, preconditions and 
constraints for an open set of complex events; schemas were then used in 
combination with event extraction to characterize and make predictions about 
real-world events in a large multilingual, multimedia corpus.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic 
Audio<https://catalog.ldc.upenn.edu/LDC2025S09> was developed by LDC and 
consists of 116 hours of speech from 274 unscripted telephone conversations 
between native speakers of the Arabic dialect spoken in Egypt. The calls were 
collected by LDC in the CALLFRIEND and CALLHOME series, in which participants 
called family members or close friends and spoke on topics of their choice. 
Around 33% of the recordings (92 calls) are publicly released for the first 
time. The remaining 182 recordings were previously published by LDC in various 
CALLFRIEND, CALLHOME, and HUB5 Arabic datasets.

The DARPA BOLT (Broad Operational Language Translation) program developed 
machine translation and information retrieval for less formal genres, focusing 
particularly on user-generated content. LDC supported the BOLT program by 
collecting informal data sources -- discussion forums, conversational telephone 
speech, text messaging, and chat -- in Chinese, Egyptian Arabic, and English. 
The material in this release represents the unannotated Egyptian Arabic source 
conversational telephone speech. The telephone data was transcribed, 
translated, and annotated for various tasks in the BOLT program including word 
alignment, treebanking, and co-reference.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic Transcripts and 
Translations<https://catalog.ldc.upenn.edu/LDC2025T14> contains transcripts and 
corresponding English translations for the conversational telephone speech in 
BOLT CTS CALLFRIEND CALLHOME Egyptian Arabic 
Audio<https://catalog.ldc.upenn.edu/LDC2025S09> and was developed by LDC to 
support the DARPA BOLT program.

Transcribers were required to produce a verbatim transcript of all speech 
within a file using the CODA<https://aclanthology.org/L12-1328/> orthographic 
approach; diacritics were not included. Some transcripts contain redactions for 
potential personally identifying information. All speech data was transcribed 
and is divided into training, development, and evaluation partitions.

The goal of the BOLT translation task was to translate the Arabic transcripts 
into fluent English while preserving the meaning present in the original Arabic 
text. Transcripts in the development and evaluation partitions received 
first-pass and gold-standard translations. In total, 99% of the transcripts 
were translated into English.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

