[Corpora-List] November 2025 Newsletter - LDC

Penn LDC via Corpora Mon, 17 Nov 2025 08:42:56 -0800

In this newsletter:
Join LDC for membership year 2026
Spring 2026 data scholarship application deadline


New publications:
AnnoDIFP CTS Audio and Transcripts<https://catalog.ldc.upenn.edu/LDC2025S10>
LORELEI Ilocano Incident Language Pack<https://catalog.ldc.upenn.edu/LDC2025T16>

________________________________
Join LDC for membership year 2026
It's time to renew your LDC membership for 2026. Any organization that joins 
the Consortium or renews its membership before March 2, 2026, will receive a 
10% discount on the membership fee.

In addition to accessing new publications, current LDC members can license 
older data from our Catalog of close to 1000 holdings at reduced fees. 
Current-year for-profit members may use most data for commercial applications.

Plans for next year's publications are in progress. Among the expected releases 
are:

  *   2012 NIST Speaker Recognition Evaluation Test Set: 10,000+ hours of 
English conversational telephone speech following the Mixer collection 
protocol, used in NIST's 2012 speaker recognition evaluation
  *   KAIROS schema learning corpus background data and Phase 1 evaluation 
datasets: multimodal English and Spanish source data and annotations for 
reasoning about complex real-world events
  *   CALL MY NET 2: 800+ hours of Tunisian-Arabic conversational telephone 
speech from over 400 speakers to support text independent speaker recognition, 
used in the 2018 NIST Speaker Recognition Evaluation
  *   Multi-language conversational telephone speech: multiple releases, 
hundreds of hours of speech from speakers of confusable linguistic varieties 
(Arabic, Chinese, English, French, Slavic, Spanish) to support language 
identification
  *   CALLHOME omnibus releases: combined speech and transcript datasets with 
updated directory structure, file formats and documentation, and lexicons 
(Chinese, English, German, Japanese, Spanish)
  *   IARPA MATERIAL language packs: conversational telephone speech, 
transcripts, English translations, annotations and queries in multiple 
languages (e.g., Lithuanian, Pashto, Swahili, Tagalog)

For full descriptions of all LDC data sets, browse our 
Catalog<https://catalog.ldc.upenn.edu/>. Visit Join 
LDC<https://www.ldc.upenn.edu/members/join-ldc> for details on membership, user 
accounts and payment.

Spring 2026 data scholarship application deadline
Applications are now being accepted through January 15, 2026, for the Spring 
2026 LDC data scholarship program, which provides university students with 
no-cost access to LDC data. Consult the LDC Data 
Scholarships<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>
 page for more information about program rules and submission requirements.
________________________________

New publications:
AnnoDIFP (Annotated Data for the Investigation of Facets of Personality) CTS 
(Conversational Telephone Speech) Audio and 
Transcripts<https://catalog.ldc.upenn.edu/LDC2025S10> was developed by LDC, the 
Florida Institute of Technology<https://www.fit.edu/> and the University of 
New Haven<https://www.newhaven.edu/index.php> to support algorithm development 
for predicting personality traits. It contains 242.52 hours of English 
telephone audio recordings and transcripts from 1,179 telephone calls involving 
327 participants paired with scores from two self-reported personality 
assessments, HEXACO Personality Inventory (Revised) (HEXACO-PI-R) and Short 
Dark Triad (SD3).

This corpus contains audio and transcripts for 277 participants and transcripts 
only for 50 participants. Telephone calls were collected using LDC's 
robot-operator 
platform<https://www.ldc.upenn.edu/about/facilities/human-subjects-collection>. 
The operator called participants every 24 hours during their indicated 
availability and paired them with another participant to speak on a prompted 
topic for 10 minutes. Transcripts were produced automatically using the 
Rev.ai<https://www.rev.ai/> speech-to-text service.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

LORELEI Ilocano Incident Language 
Pack<https://catalog.ldc.upenn.edu/LDC2025T16> was developed by LDC and 
comprises 8.9 million words of Ilocano monolingual text, 3.3 million words 
of English monolingual text, 3.2 million words of parallel Ilocano-English 
text, and 3 million words annotated for entity discovery and linking and 
situation frames. It constitutes all of the text data, annotations, 
supplemental resources, and related software tools for the Ilocano language 
used in the DARPA LORELEI / LoReHLT 2019 
Evaluation<https://www.nist.gov/itl/iad/mig/lorehlt-evaluations>.

The LORELEI (Low Resource Languages for Emergent Incidents) program was 
concerned with building human language technology for low resource languages in 
the context of emergent situations. In the evaluation scenario, an unforeseen 
event triggered a need for humanitarian and logistical support in a region 
where the incident language had received little or no attention in NLP 
research. Evaluation participants provided NLP solutions, including information 
extraction and machine translation, with limited resources and limited 
development time.

Data was collected from news, social network, weblog, newsgroup, discussion 
forum, and reference material. Entity discovery and linking annotation 
identified entities to be detected by systems for scoring purposes. Situation 
frame analysis was designed to extract basic information about needs and 
relevant issues for planning a disaster response effort.

2025 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.


Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104
