In this newsletter:
Renew your LDC membership today

New publications:
CALLHOME Japanese Second Edition<https://catalog.ldc.upenn.edu/LDC2026S02>
CALLHOME Japanese Lexicon Second Edition<https://catalog.ldc.upenn.edu/LDC2026L01>
MATERIAL Swahili-English Language Pack<https://catalog.ldc.upenn.edu/LDC2026S01>
________________________________
Renew your LDC membership today
The importance of curated resources for language-related education, research, 
and technology development drives LDC's mission to create them, to accept data 
contributions from researchers across the globe, and to broadly share such 
resources through the LDC Catalog. LDC members enjoy no-cost access to new 
corpora released annually, as well as the ability to license legacy data sets 
from among our 1000 holdings at reduced fees. Ensure that your data needs 
continue to be met by renewing your LDC membership or by joining the Consortium 
today.

Now through March 2, 2026, any organization that joins the Consortium or renews 
its membership will receive a 10% discount on the 2026 membership fee. 
Membership remains the most economical way to access current and past LDC 
releases. Consult Join LDC<https://www.ldc.upenn.edu/members/join-ldc> for more 
details on membership options and benefits.
________________________________

New publications:
CALLHOME Japanese Second Edition<https://catalog.ldc.upenn.edu/LDC2026S02> was 
developed by LDC and contains 49 hours of speech from 120 telephone 
conversations between native Japanese speakers. This publication is a 
re-release of the original CALLHOME Japanese collection, combining CALLHOME 
Japanese Speech (LDC96S37)<https://catalog.ldc.upenn.edu/LDC96S37> and CALLHOME 
Japanese Transcripts (LDC96T18)<https://catalog.ldc.upenn.edu/LDC96T18> with 
additional transcription and updated directory structure, file formats, and 
documentation.

This corpus contains the 120 calls from CALLHOME Japanese Speech, which 
represented training and development data and a subset of evaluation data. 
Participants spoke on topics of their choice in a single telephone call lasting 
up to 30 minutes. Calls were manually audited for language, recording quality, 
channel characteristics, dialect, and region. For this second edition, all 
audio was converted from SPHERE files to FLAC format, and the original 
training/development/test partitioning was removed.

This release also features revised transcripts conforming to updated LDC 
transcription guidelines that addressed normalization of annotation formats, 
standardization of speaker-produced and background noises, application of 
foreign-language marking, whitespace cleanup, and corrections and consistency 
fixes.

The CALLHOME series consists of telephone conversations and transcripts 
developed by LDC and Rutgers, The State University of New Jersey, in support of 
research in speaker identification, language identification, and related 
technologies. Languages in the series include American English, Egyptian 
Arabic, German, Japanese, Mandarin Chinese, and Spanish.

2026 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.
*
CALLHOME Japanese Lexicon Second 
Edition<https://catalog.ldc.upenn.edu/LDC2026L01> was developed by LDC and 
contains 80,688 Japanese words with morphological, phonological, and stress 
information. This second edition updates file formats, directory structure, and 
documentation. The first edition is available as CALLHOME Japanese Lexicon 
(LDC96L17)<https://catalog.ldc.upenn.edu/LDC96L17>. The words in the lexicon 
were derived from 80 transcripts representing telephone conversations between 
native Japanese speakers contained in CALLHOME Japanese Second Edition 
(LDC2026S02)<https://catalog.ldc.upenn.edu/LDC2026S02>.

The lexicon contains seven tab-separated information fields: (1) headword: 
orthographic form in kanji, katakana, or hiragana (if written only in 
hiragana); (2) hiragana: orthographic form in hiragana; (3) romanization: 
orthographic form in romaji; (4) pron: pronunciation of the headword; (5) 
morph: morphological analysis of the headword; (6) train freq: frequency of the 
headword in the transcripts; and (7) gloss: glosses of the headword. This 
release also includes a pronunciation dictionary derived from the lexicon in 
CMUdict<https://stdlib.io/docs/api/latest/@stdlib/datasets/cmudict> format and 
the grapheme-to-phoneme (G2P) tools used to automatically generate 
pronunciations for the original lexicon.
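
For readers who want to load the lexicon programmatically, the seven-field 
layout described above could be read along the following lines. This is only a 
minimal sketch: the field names follow the list above, but the LexiconEntry 
class, the parse_lexicon function, the assumption of one entry per line with no 
header row, and the assumption that the train freq field holds an integer are 
all illustrative and should be checked against the documentation shipped with 
the release.

```python
from dataclasses import dataclass
from typing import Iterable, Iterator

# Number of tab-separated fields per entry, as described in the newsletter.
FIELD_COUNT = 7


@dataclass
class LexiconEntry:
    """One lexicon entry (hypothetical container, not part of the release)."""
    headword: str
    hiragana: str
    romanization: str
    pron: str
    morph: str
    train_freq: int  # assumption: the frequency field is an integer count
    gloss: str


def parse_lexicon(lines: Iterable[str]) -> Iterator[LexiconEntry]:
    """Yield entries from lines that each hold seven tab-separated fields."""
    for number, raw in enumerate(lines, start=1):
        line = raw.rstrip("\r\n")
        if not line.strip():
            continue  # skip blank lines
        fields = line.split("\t")
        if len(fields) != FIELD_COUNT:
            raise ValueError(
                f"line {number}: expected {FIELD_COUNT} fields, got {len(fields)}"
            )
        headword, hiragana, romanization, pron, morph, freq, gloss = fields
        yield LexiconEntry(
            headword, hiragana, romanization, pron, morph, int(freq), gloss
        )


# Made-up sample line for illustration only; it is not taken from the corpus.
sample = ["猫\tねこ\tneko\tn e k o\tnoun\t3\tcat\n"]
entries = list(parse_lexicon(sample))
```

With a real file, the same function could be fed an open file handle, for 
example open(path, encoding="utf-8"), where both the path and the encoding are 
assumptions to confirm against the release documentation.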

2026 members can access this corpus through their LDC accounts provided they 
have submitted a completed copy of the special license agreement. Non-members 
may license this data for a fee.
*
MATERIAL Swahili-English Language 
Pack<https://catalog.ldc.upenn.edu/LDC2026S01> was developed by 
Appen<http://www.appen.com/> for the IARPA 
MATERIAL<https://www.iarpa.gov/index.php/research-programs/material> program 
and contains 112 hours of Swahili conversational telephone speech, transcripts, 
English translations, annotations, and queries. Calls were made using different 
telephones (e.g., mobile, landline) from a variety of environments. Transcripts 
cover approximately 30% of the speech files, 3% of which were translated into 
English. This release also includes domain annotations, English queries, and 
their relevance annotations.

The MATERIAL program focused on underserved languages, with the ultimate goal 
of building cross-language information retrieval systems that find speech and 
text content using English search queries.

2026 members can access this corpus through their LDC accounts provided they 
have submitted a completed copy of the special license agreement. Non-members 
may license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: [email protected]<mailto:[email protected]>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104





