Fwd: FW: July 2017 Newsletter -- LDC

lewis john mcgibbney Wed, 19 Jul 2017 17:01:49 -0700

FYI folks

---------- Forwarded message ---------
From: Mcgibbney, Lewis J (398M) <lewis.j.mcgibb...@jpl.nasa.gov>
Date: Wed, Jul 19, 2017 at 10:14 AM
Subject: FW: July 2017 Newsletter -- LDC
To: lewis john mcgibbney <lewi...@apache.org>







Dr. Lewis John McGibbney Ph.D., B.Sc.

Data Scientist II

Computer Science for Data Intensive Applications Group (398M)

Instrument Software and Science Data Systems Section (398)

Jet Propulsion Laboratory

California Institute of Technology

4800 Oak Grove Drive

Pasadena, California 91109-8099

Mail Stop : 158-256C

Tel:  (+1) (818)-393-7402

Cell: (+1) (626)-487-3476

Fax:  (+1) (818)-393-1190

Email: lewis.j.mcgibb...@jpl.nasa.gov







 Dare Mighty Things



*From: *Ldc-customers1 <ldc-customers1-boun...@ldc.upenn.edu> on behalf of
Penn LDC <l...@ldc.upenn.edu>
*Date: *Tuesday, July 18, 2017 at 8:51 AM
*To: *Penn LDC <l...@ldc.upenn.edu>
*Subject: *July 2017 Newsletter -- LDC



*In this newsletter*



*LDC at ACL 2017*



*Fall 2017 Data Scholarship Program*



*New corpora: *



BOLT English Discussion Forums <https://catalog.ldc.upenn.edu/LDC2017T11>
IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b
<https://catalog.ldc.upenn.edu/LDC2017S13>
KSUEmotions <https://catalog.ldc.upenn.edu/LDC2017S12>
Metalogue Multi-Issue Bargaining Dialogue
<https://catalog.ldc.upenn.edu/LDC2017S11>

_______________________________________________________________________________

*LDC at ACL 2017: July 31-August 2, Vancouver, Canada*

ACL has returned to North America and LDC is taking this opportunity to
interact with top HLT researchers gathering in Vancouver, Canada.  Stop by
our exhibition table to learn more about recent developments at the
Consortium and new publications.



*Fall 2017 Data Scholarship Program*

Student applications for the Fall 2017 LDC Data Scholarship program are
being accepted now through Friday, September 15, 2017, 11:59PM EST. The LDC
Data Scholarship program provides university students with access to LDC
data at no cost. Students must complete an application that consists of a
data use proposal and a letter of support from their advisor.

For more information on application requirements and program rules, please
visit the LDC Data Scholarship page
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.

Applicants can email their materials to the LDC Data Scholarship program
<datascholarsh...@ldc.upenn.edu>.



*New corpora*

(1) BOLT English Discussion Forums
<https://catalog.ldc.upenn.edu/LDC2017T11> was developed by LDC and
consists of 830,440 discussion forum threads in English harvested from the
Internet using a combination of manual and automatic processes.

The BOLT <https://www.ldc.upenn.edu/collaborations/current-projects/bolt>
(Broad Operational Language Translation) program developed machine
translation and information retrieval for less formal genres, focusing
particularly on
user-generated content in Chinese, Egyptian Arabic and English. The
collected data was translated and annotated for various tasks including
word alignment, treebanking, propbanking and co-reference.

The material in this release represents the unannotated English source data
in the discussion forum genre. Collection was seeded based on the results
of manual data scouting by native speaker annotators. When multiple threads
from a forum were submitted, the entire forum was automatically harvested
and added to the collection. Only a small portion of the threads included
in this release were manually reviewed, and it is expected that there may
be some offensive or otherwise undesired content as well as some threads
that contain a large amount of non-English content. Language identification
was performed on all threads in this corpus (using CLD2
<https://github.com/CLD2Owners/cld2>).

BOLT English Discussion Forums is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $3500.

*

(2) IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b
<https://catalog.ldc.upenn.edu/LDC2017S13> was developed by Appen for the
IARPA (Intelligence Advanced Research Projects Activity) Babel program. It
contains 200 hours of Tamil conversational and scripted telephone speech
collected in 2012 and 2013 along with corresponding transcripts.

The Babel program focuses on underserved languages and seeks to develop
speech recognition technology that can be rapidly applied to any human
language to support keyword search performance over large amounts of
recorded speech.

The Tamil speech in this release represents that spoken in the Northern,
Central, Southern and Western dialect regions of the Indian state of Tamil
Nadu. The gender distribution among speakers is approximately equal;
speakers' ages range from 16 years to 65 years. Calls were made using
different telephones (e.g., mobile, landline) from a variety of
environments including the street, a home or office, a public place, and
inside a vehicle.

IARPA Babel Tamil Language Pack IARPA-babel204b-v1.1b is distributed via
web download.



2017 Subscription Members will receive copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2017
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for US $25.

*

(3) KSUEmotions <https://catalog.ldc.upenn.edu/LDC2017S12> was developed by
King Saud University <http://ksu.edu.sa/en/> (KSU) and contains approximately
five hours of emotional Modern Standard Arabic (MSA) speech from 23
subjects. Speakers were from three countries: Yemen, Saudi Arabia and Syria.

Subjects read MSA sentences from newswire text in the following emotions:
neutral, anger, sadness, happiness, surprise, and interrogative (asking a
question). Human reviewers then listened to the recordings to identify the
emotion they heard. Audio was recorded in each participant's home.

KSUEmotions is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $1000.

*

(4) Metalogue Multi-Issue Bargaining Dialogue
<https://catalog.ldc.upenn.edu/LDC2017S11> was developed by the Metalogue
Consortium <http://www.metalogue.eu/consortium/> under the European
Community's Seventh Framework Programme for Research and Technological
Development <https://ec.europa.eu/research/fp7/index_en.cfm>. This release
consists of approximately 2.5 hours of semantically annotated English
dialogue data that includes speech and transcripts.

The goal of the Metalogue project was to develop a dialogue system with
flexible dialogue management that enables the system to set goals, choose
strategies and monitor various processes. Six unique
subjects (undergraduates between 19 and 25 years of age) were involved in a
multi-issue bargaining scenario in which a representative of a city council
and a representative of small business owners negotiated the implementation
of new anti-smoking regulations. The negotiation involved four issues, each
with four or five options. Participants received a preference profile for
each scenario and negotiated for an agreement with the highest value based
on their preference information. Negotiators were not allowed to accept an
agreement with a negative value or to share their preference profiles with
other participants.
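
As a rough illustration of the scoring rule described above, the sketch
below computes the value of an agreement from one negotiator's preference
profile. The issue names, option labels and numbers are hypothetical, and
it assumes an agreement's value is the sum of the values of the options
chosen on each issue; the corpus documentation may define this differently.

```python
# Hypothetical preference profile for one negotiator: issue -> {option: value}.
# Issue names, option labels and values are invented for illustration only.
PROFILE = {
    "scope":       {"A": 4, "B": 2, "C": -1, "D": -3},
    "taxes":       {"A": -4, "B": -2, "C": 1, "D": 3, "E": 5},
    "campaign":    {"A": 2, "B": 0, "C": -2, "D": -4},
    "enforcement": {"A": 3, "B": 1, "C": -1, "D": -3},
}


def agreement_value(profile, agreement):
    """Sum the value of the option chosen on each issue (assumed additive)."""
    return sum(profile[issue][option] for issue, option in agreement.items())


def is_acceptable(profile, agreement):
    """Negotiators may not accept an agreement with a negative value."""
    return agreement_value(profile, agreement) >= 0


deal = {"scope": "B", "taxes": "B", "campaign": "A", "enforcement": "C"}
bad_deal = {"scope": "D", "taxes": "A", "campaign": "B", "enforcement": "A"}

print(agreement_value(PROFILE, deal), is_acceptable(PROFILE, deal))
print(agreement_value(PROFILE, bad_deal), is_acceptable(PROFILE, bad_deal))
```

Under these assumptions each negotiator tries to maximize agreement_value
against their own profile without revealing the profile itself.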

The dialogue speech was captured with two headset microphones and saved in
16kHz, 16-bit mono linear PCM FLAC format. Transcripts were produced
semi-automatically, using an automatic speech recognizer followed by manual
correction.

Metalogue Multi-Issue Bargaining Dialogue is distributed via web download.

2017 Subscription Members will automatically receive copies of this corpus.
2017 Standard Members may request a copy as part of their 16 free
membership corpora. Non-members may license this data for US $300.



Membership Office

Linguistic Data Consortium <http://ldc.upenn.edu>

University of Pennsylvania

T: +1-215-573-1275

E: l...@ldc.upenn.edu

M: 3600 Market St. Suite 810

      Philadelphia, PA 19104




-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
