---------- Forwarded message ----------
From: Mcgibbney, Lewis J (398M) <lewis.j.mcgibb...@jpl.nasa.gov>
Date: Tue, Jul 19, 2016 at 8:35 PM
Subject: FW: July 2016 Newsletter – LDC
To: Lewis John McGibbney <lewis.mcgibb...@gmail.com>




From: Ldc-customers1 <ldc-customers1-boun...@ldc.upenn.edu> on behalf of
Penn LDC <l...@ldc.upenn.edu>
Date: Tuesday, July 19, 2016 at 1:41 PM
To: Penn LDC <l...@ldc.upenn.edu>
Subject: July 2016 Newsletter – LDC


*In this Newsletter:*

*Fall 2016 Data Scholarship Program*


*2015 User Survey Results*


*New Publications:*

*English Speed Networking Conversational Transcripts
<https://catalog.ldc.upenn.edu/LDC2016T16>*



*Digital Archive of Southern Speech - NLP Version (DASS)
<https://catalog.ldc.upenn.edu/LDC2016S05>*



*GALE Phase 3 and 4 Chinese Broadcast News Parallel Text
<https://catalog.ldc.upenn.edu/LDC2016T15>*



*IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c
<https://catalog.ldc.upenn.edu/LDC2016S02>*







*Fall 2016 Data Scholarship Program*

Applications are now being accepted through *Thursday, September 15, 2016*
for the Fall 2016 LDC Data Scholarship program. The LDC Data Scholarship
program provides university students with access to LDC data at no cost.

This program is open to students pursuing undergraduate or graduate
studies at an accredited college or university. LDC Data Scholarships are
not restricted to any particular field of study; however, students must
demonstrate a well-developed research agenda and a bona fide inability to
pay. The selection process is highly competitive.

The application consists of two parts:

(1) Data Use Proposal. Applicants must submit a two-page proposal
describing their intended use of the data. The proposal should state which
data the student plans to use, how the data will benefit their research
project, the proposed methodology or algorithm which will be used and how
success will be measured.

Applicants should consult the Catalog <https://catalog.ldc.upenn.edu/> for
a complete list of data distributed by LDC. Due to certain restrictions, a
handful of LDC corpora are available only to members of the Consortium.
Applicants are advised to select a maximum of one to two databases.

(2) Letter of Support. Applicants must submit one letter of support from
their thesis adviser or department chair. The letter must be signed and
printed on letterhead, describe the student and the research, evaluate the
probability of success and confirm that the department or university lacks
the funding to pay the full non-member fee for the data.

For further information on application materials and program rules, please
visit the LDC Data Scholarship page
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.



*2015 User Survey Results*

LDC conducted its fourth user survey in December 2015. This survey built on
the previous surveys conducted in 2006, 2007 and 2012 to assess user
sentiment and also asked for the evaluation of key LDC-related topics
including:

· Opinions on the new website and usability of the Catalog

· Use of and satisfaction with the enhanced user services and
e-commerce system

· LDC’s Data Management Plan capabilities

· Suggestions for future publications and preferred data delivery
methods

· Use of web services for data access and processing

Overall, survey respondents were satisfied with LDC’s data, membership
options, website, Catalog and enhanced user services. Participants cited
the top five most useful corpora received between 2012 and 2015 as *OntoNotes
Release 5.0*, *TIMIT*, *TAC KBP Reference Knowledge Base*, *Penn Discourse
Treebank V 2.0*, and *Multi-Channel WSJ Audio*. Three-fourths of
respondents prefer digital delivery of data, and the top three languages for
current research demands were identified as English, Chinese and Spanish.

We thank everyone who participated in this survey. Responses will benefit
the future of the Consortium and will help LDC to better meet the needs of
our members and data licensees.





*New Publications*



(1) *English Speed Networking Conversational Transcripts*
<https://catalog.ldc.upenn.edu/LDC2016T16> was developed at the University
of the West of England <http://www.uwe.ac.uk/> and contains 388 transcripts
of English face-to-face and instant messaging conversations about business
ideas, collected in 2014 and 2015 from participants (undergraduate students)
playing different power roles.



This corpus was created to examine communication accommodation,
specifically, the ways in which an individual's linguistic style is
affected by social power and personality. The data was collected in two
studies. In the first study, 40 participants had a series of paired
five-minute face-to-face conversations, each playing a high, low or neutral
power role. The same procedure was followed in the second study except that
participants discussed business ideas via instant messaging.



The face-to-face conversations were audio-recorded and transcribed verbatim.



All transcripts are presented as UTF-8 plain text files.



English Speed Networking Conversational Transcripts is distributed via web
download.

2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $400.00.



(2) *Digital Archive of Southern Speech - NLP Version (DASS-NLP)*
<https://catalog.ldc.upenn.edu/LDC2016S05> was developed by LDC as an
alternate version of Digital Archive of Southern Speech (DASS) (LDC2012S03)
suitable for natural language processing and human language technology
applications. Specifically, the original audio files have been converted to
16 kHz, 16-bit FLAC-compressed WAV, and file names have been normalized to
facilitate automatic processing.



DASS was developed by the University of Georgia <http://www.uga.edu/>. It
is a subset of the Linguistic Atlas of the Gulf States (LAGS), which is in
turn part of the Linguistic Atlas Project (LAP). DASS-NLP contains
approximately 366 hours of English speech data from 30 female speakers and
34 male speakers, along with associated metadata about the speakers and the
recordings, and maps in .jpeg format relating to the recording locations.



LAP consists of a set of survey research projects about the words and
pronunciation of everyday American English, the largest project of its kind
in the United States. Interviews with thousands of native speakers across
the country have been carried out since 1929. LAGS surveyed the everyday
speech of Georgia, Tennessee, Florida, Alabama, Mississippi, Arkansas,
Louisiana, and Texas in a series of 914 audio-taped interviews conducted
from 1968 to 1983.



The speakers' average age is 61 years; there are 30 women and 34 men from
the Gulf States region represented in this release. The interviews cover
common topics such as family, the weather, household articles and
activities, agriculture and social conditions.



Digital Archive of Southern Speech - NLP Version is distributed via web
download.



2016 Not-for-Profit Subscription Members will automatically receive two
copies of this corpus. 2016 For-Profit Subscription Members will receive
two copies provided they have submitted a completed copy of the For-Profit
Member User License Agreement for Digital Archive of Southern Speech – NLP
Version (LDC2016S05). 2016 Standard Members may request a copy as part of
their 16 free membership corpora. This data is being made available at
no cost for non-member organizations under a research license.
<https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/dass-nlp-fp-agreement.pdf>





(3) *GALE Phase 3 and 4 Chinese Broadcast News Parallel Text*
<https://catalog.ldc.upenn.edu/LDC2016T15> was developed by LDC. Along with
other corpora, the parallel text in this release comprised training data
for Phases 3 and 4 of the DARPA GALE (Global Autonomous Language
Exploitation) Program. This corpus contains Chinese source text and
corresponding English translations selected from broadcast news data
collected by LDC between 2006 and 2008 and transcribed and translated by
LDC or under its direction.



GALE Phase 3 and 4 Chinese Broadcast News Parallel Text includes 76
source-translation document pairs, comprising 614,608 tokens of Chinese
source text and its English translation. Data is drawn from 16 distinct
Chinese programs broadcast between 2006 and 2008 by China Central TV, a
national and international broadcaster in Mainland China, and Phoenix TV, a
Hong Kong-based satellite television station. The programs in this release
are news programs covering current events.



The files in this release were transcribed by LDC staff and/or
transcription vendors under contract to LDC in accordance with the Quick
Rich Transcription guidelines developed by LDC.



Source data and translations are distributed in TDF format. All data are
encoded in UTF-8.



GALE Phase 3 and 4 Chinese Broadcast News Parallel Text is distributed via
web download.



2016 Subscription Members will automatically receive two copies of this
corpus. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $1750.00.





(4) *IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c*
<https://catalog.ldc.upenn.edu/LDC2016S02> was developed by Appen for the
IARPA (Intelligence Advanced Research Projects Activity) Babel program. It
contains approximately 215 hours of Cantonese conversational and scripted
telephone speech collected in 2011 along with corresponding transcripts.



The Babel program focuses on underserved languages and seeks to develop
speech recognition technology that can be rapidly applied to any human
language to support keyword search performance over large amounts of
recorded speech.





The Cantonese speech in this release represents that spoken in the Chinese
provinces of Guangdong and Guangxi, and within those provinces, among five
dialect groups. The gender distribution among speakers is approximately
even; speakers' ages range from 16 years to 67 years. Calls were made using
different telephones (e.g., mobile, landline) from a variety of
environments including the street, a home or office, a public place, and
inside a vehicle.



All audio data is presented as 8 kHz, 8-bit a-law encoded audio in SPHERE
format. Transcripts are available in two versions: simplified Chinese
characters and a romanization scheme based on the Yale system, both encoded
in UTF-8.
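
For readers unfamiliar with the encoding, 8-bit a-law samples can be expanded
to 16-bit linear PCM with the standard G.711 decoding rule. The following is a
minimal Python sketch, not part of the LDC release: the function names
`alaw_to_linear` and `decode_alaw` are hypothetical, and it assumes the raw
sample bytes have already been separated from the SPHERE file header, which
this sketch does not parse.

```python
from typing import List


def alaw_to_linear(byte: int) -> int:
    """Expand one 8-bit a-law code (G.711) to a signed 16-bit linear sample."""
    a = byte ^ 0x55              # undo the even-bit inversion applied by the encoder
    t = (a & 0x0F) << 4          # 4-bit mantissa, shifted into position
    seg = (a & 0x70) >> 4        # 3-bit segment (exponent)
    if seg == 0:
        t += 8                   # lowest segment: add half a quantization step
    elif seg == 1:
        t += 0x108
    else:
        t = (t + 0x108) << (seg - 1)
    # In a-law, a set sign bit marks a positive sample.
    return t if a & 0x80 else -t


def decode_alaw(raw: bytes) -> List[int]:
    """Decode a buffer of raw a-law bytes (header already removed, by assumption)."""
    return [alaw_to_linear(b) for b in raw]
```

Calling `decode_alaw` on a buffer of sample bytes returns a list of integers
that can be written out as 16-bit PCM. In practice an established audio tool
or library would handle both the SPHERE header and the expansion; the sketch
only illustrates what "8-bit a-law" means for each sample.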



IARPA Babel Cantonese Language Pack IARPA-babel101b-v0.4c is distributed via
web download.



2016 Subscription Members will receive two copies of this corpus
provided they have submitted a completed copy of the IARPA User Agreement
for Not-for-Profit Members or the IARPA User Agreement for For-Profit
Members. 2016 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $25.00
under a research license
<https://www.ldc.upenn.edu/sites/www.ldc.upenn.edu/files/iarpa-babel-cantonese-nm-user-agreement.pdf>.







*Membership Office*
*Linguistic Data Consortium*
University of Pennsylvania
3600 Market St. Suite 810
Philadelphia, PA 19130
Tel: 215-573-1275
email:l...@ldc.upenn.edu



