Fwd: FW: August 2018 Newsletter - LDC

lewis john mcgibbney Thu, 16 Aug 2018 12:19:47 -0700

FYI

---------- Forwarded message ---------
From: Mcgibbney, Lewis J (398M) <lewis.j.mcgibb...@jpl.nasa.gov>
Date: Thu, Aug 16, 2018 at 12:18 PM
Subject: FW: August 2018 Newsletter - LDC
To: lewis john mcgibbney <lewi...@apache.org>







Dr. Lewis John McGibbney Ph.D., B.Sc.

Data Scientist II

Computer Science for Data Intensive Applications Group (398M)

Instrument Software and Science Data Systems Section (398)

Jet Propulsion Laboratory

California Institute of Technology

4800 Oak Grove Drive

Pasadena, California 91109-8099

Mail Stop : 158-256C

Tel:  (+1) (818)-393-7402

Cell: (+1) (626)-487-3476

Fax:  (+1) (818)-393-1190

Email: lewis.j.mcgibb...@jpl.nasa.gov

ORCID: orcid.org/0000-0003-2185-928X






 Dare Mighty Things



*From: *Ldc-customers1 <ldc-customers1-boun...@ldc.upenn.edu> on behalf of
Penn LDC <l...@ldc.upenn.edu>
*Date: *Wednesday, August 15, 2018 at 8:09 AM
*To: *Penn LDC <l...@ldc.upenn.edu>
*Subject: *August 2018 Newsletter - LDC



*In this newsletter:*

*LDC at Interspeech 2018*

*Fall 2018 LDC Data Scholarship Program*

*New Publications:*

BOLT English SMS/Chat <https://catalog.ldc.upenn.edu/LDC2018T19>

CIEMPIESS Balance <https://catalog.ldc.upenn.edu/LDC2018S11>

2011 NIST Language Recognition Evaluation Test Set
<https://catalog.ldc.upenn.edu/LDC2018S06>





*LDC at Interspeech 2018*

LDC will participate in various ways at Interspeech 2018
<http://interspeech2018.org/index.html>, held this year in Hyderabad,
India, September 2-6. It is co-organizing the special session, The First
DIHARD Speech Diarization Challenge
<https://coml.lscp.ens.fr/dihard/index.html>, on September 3 and is a
sponsor of the September 1 pre-conference workshop, Young Female
Researchers in Speech Science & Technology
<https://sites.google.com/view/yfrsw2018/home> (YFRSW). Results of recent
work will be presented during the poster session on September 3, “Global
TIMIT: Acoustic Phonetic Datasets for the World’s Languages.”

*Fall 2018 LDC Data Scholarship Program*

Students can apply for the Fall 2018 Data Scholarship Program now through
September 15, 2018. The LDC Data Scholarship program provides students with
access to LDC data at no cost. For more information on application
requirements and program rules, please visit LDC Data Scholarships
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.




*New publications:*



(1) BOLT English SMS/Chat <https://catalog.ldc.upenn.edu/LDC2018T19> was
developed by LDC and consists of naturally-occurring Short Message Service
(SMS) and Chat (CHT) data collected through data donations and live
collection from native English speakers. The corpus contains 18,429
conversations totaling 3,674,802 words across 375,967 messages.

The BOLT <https://www.ldc.upenn.edu/collaborations/current-projects/bolt>
(Broad Operational Language Translation) program developed machine
translation and information retrieval for less formal genres, focusing
particularly on user-generated content. LDC supported the BOLT program by
collecting informal data sources -- discussion forums, text messaging, and
chat -- in Chinese, Egyptian Arabic, and English. The collected data was
translated and annotated for various tasks including word alignment,
treebanking, propbanking, and co-reference.

BOLT English SMS/Chat is available via web download.



2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for US $1750.



*



(2) CIEMPIESS Balance <https://catalog.ldc.upenn.edu/LDC2018S11> (Corpus de
Investigación en Español de México del Posgrado de Ingeniería Eléctrica y
Servicio Social) was developed by the Development of Speech Technologies
program at the School of Engineering <http://www.ingenieria.unam.mx> at the
National Autonomous University of Mexico <http://www.unam.mx/> (UNAM) and consists
of approximately 18 hours of Mexican Spanish broadcast speech with
associated transcripts. The goal of this work was to create acoustic models
for automatic speech recognition. For more information and documentation
see the CIEMPIESS-UNAM Project website <http://www.CIEMPIESS.org/>.



CIEMPIESS Balance is a companion corpus to CIEMPIESS Light, released by LDC
as LDC2017S23 <https://catalog.ldc.upenn.edu/LDC2017S23>. It was developed
so that the data sets together constitute a gender-balanced corpus. The
gender breakdown in CIEMPIESS Light is approximately 75% male and 25%
female. In CIEMPIESS Balance, the gender breakdown is approximately 25%
male and 75% female.



The majority of the speech recordings were collected from Radio-IUS
<http://www.derecho.unam.mx/cultura-juridica/radio.php>, a UNAM radio
station. Other recordings were taken from IUS Canal Multimedia
<https://www.youtube.com/user/DEDUNAM/videos> and Centro Universitario de
Estudios Jurídicos
<https://www.youtube.com/channel/UCTxkzdUd0tiXT5BN5o6Xo-A/videos> (CUEJ
UNAM). These two channels feature videos with speech around legal issues
and topics related to UNAM.



CIEMPIESS Balance is available via web download.



2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data at no cost.



*



(3) 2011 NIST Language Recognition Evaluation Test Set
<https://catalog.ldc.upenn.edu/LDC2018S06> contains selected training data
and the evaluation test set for the 2011 NIST Language Recognition
Evaluation. It consists of approximately 204 hours of conversational
telephone speech and broadcast audio collected by LDC between 2009 and 2011
in the following 24 languages and dialects: Arabic (Iraqi), Arabic
(Levantine), Arabic (Maghrebi), Arabic (Standard), Bengali, Czech, Dari,
English (American), English (Indian), Farsi, Hindi, Lao, Mandarin, Punjabi,
Pashto, Polish, Russian, Slovak, Spanish, Tamil, Thai, Turkish, Ukrainian,
and Urdu.



The 2011
<https://www.nist.gov/itl/iad/mig/2011-language-recognition-evaluation>
evaluation emphasized the language pair condition and involved both
conversational telephone speech (CTS) and broadcast narrow-band speech
(BNBS).



This release includes training data for nine language varieties that had
not been represented in prior LRE cycles -- Arabic (Iraqi), Arabic
(Levantine), Arabic (Maghrebi), Arabic (Standard), Czech, Lao, Punjabi,
Polish, and Slovak -- contained in 893 audited segments of roughly 30
seconds duration and in 400 full-length CTS recordings. The evaluation test
set comprises a total of 29,511 audio files, all manually audited at LDC
for language and divided equally into three different test conditions
according to the nominal amount of speech content per segment.



LDC released the prior LREs as:



· 2003 NIST Language Recognition Evaluation (LDC2006S31
  <https://catalog.ldc.upenn.edu/LDC2006S31>)

· 2005 NIST Language Recognition Evaluation (LDC2008S05
  <https://catalog.ldc.upenn.edu/LDC2008S05>)

· 2007 NIST Language Recognition Evaluation Test Set (LDC2009S04
  <https://catalog.ldc.upenn.edu/LDC2009S04>)

· 2007 NIST Language Recognition Evaluation Supplemental Training Set
  (LDC2009S05 <https://catalog.ldc.upenn.edu/LDC2009S05>)

· 2009 NIST Language Recognition Evaluation Test Set (LDC2014S06
  <https://catalog.ldc.upenn.edu/LDC2014S06>)





2011 NIST Language Recognition Evaluation Test Set is distributed via web
download.



2018 Subscription Members will receive copies of this corpus. 2018 Standard
Members may request a copy as part of their 16 free membership corpora.
Non-members may license this data for US $2000.



*



Membership Office

Linguistic Data Consortium <http://ldc.upenn.edu>

University of Pennsylvania

T: +1-215-573-1275

E: l...@ldc.upenn.edu

M: 3600 Market St. Suite 810

      Philadelphia, PA 19104

-- 
http://home.apache.org/~lewismc/
http://people.apache.org/keys/committer/lewismc
