Hi Folks,
For anyone with access to LDC, it looks like there could be a really cool
Chinese --> English parallel sentence dataset.
Once I've finished my current batch of work (Russian) I'm going to have a
look at the dataset.
Lewis

---------- Forwarded message ----------
From: Mcgibbney, Lewis J (398M) <lewis.j.mcgibb...@jpl.nasa.gov>
Date: Wed, Oct 19, 2016 at 2:15 PM
Subject: FW: October 2016 Newsletter -- LDC
To: "lewis.mcgibb...@gmail.com" <lewis.mcgibb...@gmail.com>









*From: *Ldc-customers1 <ldc-customers1-boun...@ldc.upenn.edu> on behalf of
Penn LDC <l...@ldc.upenn.edu>
*Date: *Wednesday, October 19, 2016 at 6:32 AM
*To: *Penn LDC <l...@ldc.upenn.edu>
*Subject: *October 2016 Newsletter -- LDC



October 2016 Newsletter – LDC

*In this newsletter:*

*Fall 2016 LDC Data Scholarship recipients*

*Chilin HK and LDC partner on distribution of parallel patent data*

*New publications:*

IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5
<https://catalog.ldc.upenn.edu/LDC2016S10>



KAFD: Arabic Font Database <https://catalog.ldc.upenn.edu/LDC2016T21>



Richer Event Description <https://catalog.ldc.upenn.edu/LDC2016T23>





*Fall 2016 LDC Data Scholarship recipients*

Congratulations to the recipients of LDC's Fall 2016 data scholarships:

Tiba Zaki Abdulhameed: Western Michigan University (USA); PhD Candidate,
Computer Science. Tiba is awarded copies of GALE Phase 2 Arabic Broadcast
Conversation Speech and Transcripts for her research in dialectal ASR.

Abhishek Abhishek: Indian Institute of Technology Guwahati (India); PhD
Candidate, Computer Science and Engineering. Abhishek is awarded copies
of ACE 2004 Multilingual Training Corpus and The New York Times Annotated
Corpus for his research in coreference resolution and relation extraction.

Sara Ebrahim: Ain Shams University (Egypt); MSc, Computer Science. Sara is
awarded copies of LDC Standard Arabic Morphological Analyzer and NIST
OpenMT 2008 Evaluation Selected References and System Translations for her
work in machine translation.

Katherine Metcalf: Indiana University (USA); PhD Candidate, Computer
Science. Katherine is awarded a copy of Emotional Prosody Speech and
Transcripts for her research in acoustic/prosodic approaches to classifying
emotional states.

Mousmita Sarma: Gauhati University (India); Post-Masters Research,
Electronics and Communications Technology. Mousmita is awarded copies of
Switchboard 1-Release 2 and IARPA Babel Assamese Language Pack for her
research in Assamese dialect identification.

For program information visit the Data Scholarship page
<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.



*Chilin HK and LDC partner on distribution of parallel patent data*

Chilin HK Limited (Chilin) and LDC are pleased to announce that the
parallel data source developed by Chilin, A Corpus of Chinese-English
Parallel Sentences Extracted from Patents, is now available through the LDC
Catalog. This is a special release in addition to the LDC scheduled corpora
for membership year 2016, available under separate terms.

The Chilin Corpus derives primarily from the training corpus and test sets
developed specifically for the Tokyo-based NTCIR 2009 & 2010 competitions
on Patent MT (machine translation), which drew more than 30 international
teams:

NTCIR-9: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings9/NTCIR/01-NTCIR9-PATENTMT-GotoI.pdf

NTCIR-10: http://research.nii.ac.jp/ntcir/workshop/OnlineProceedings10/pdf/NTCIR/OVERVIEW/01-NTCIR10-PATENTMT-GotoI.pdf

The training corpus is drawn from a much larger curated corpus of parallel
Chinese-English sentences and sentence fragments. That curated corpus was in
turn winnowed from an even larger corpus of more than 300k parallel
Chinese-English patents in different fields, initially at the Research
Centre on Language Information Sciences, City University of Hong Kong
(authors: Benjamin Tsou, Bin Lu, and Kapo Chow). This data set is available
from LDC under the following reference:

LDC2016T22   A Corpus of Chinese-English Parallel Sentences Extracted from
Patents <https://catalog.ldc.upenn.edu/LDC2016T22>

Not-for-profit organizations may license this data set for US$25.00 under
the LDC Not-for-Profit Membership Agreement or under the LDC User Agreement
for Non-Members for use in linguistic research, education and
non-commercial technology development. For-profit organizations may license
this data for US$5000, discounted to US$4000 for LDC for-profit members,
under a commercial license.

*New Corpora*

(1) IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5
<https://catalog.ldc.upenn.edu/LDC2016S10> was developed by Appen for the
IARPA (Intelligence Advanced Research Projects Activity) Babel program. It
contains approximately 213 hours of Turkish conversational and scripted
telephone speech collected in 2012 along with corresponding transcripts.



The Babel program focuses on underserved languages and seeks to develop
speech recognition technology that can be rapidly applied to any human
language to support keyword search performance over large amounts of
recorded speech.



The Turkish speech in this release represents that spoken in seven dialect
regions in Turkey. The gender distribution among speakers is approximately
equal; speakers' ages range from 16 years to 70 years. Calls were made
using different telephones (e.g., mobile, landline) from a variety of
environments including the street, a home or office, a public place, and
inside a vehicle.



Transcripts are encoded in UTF-8.



IARPA Babel Turkish Language Pack IARPA-babel105b-v0.5 is distributed via
web download.



2016 Subscription Members will receive two copies of this corpus provided they
have submitted a completed copy of the special license agreement. 2016
Standard Members may request a copy as part of their 16 free membership
corpora. Non-members may license this data for US $25.00.

(2) KAFD: Arabic Font Database <https://catalog.ldc.upenn.edu/LDC2016T21>
was developed by King Fahd University of Petroleum & Minerals and Qassim
University. It comprises approximately 2.5 million scanned Arabic
printed pages in a variety of fonts, sizes and resolutions along with
corresponding transcripts. KAFD was designed for research in Arabic text
recognition.

The scanned Arabic texts were collected from publications covering various
subjects such as religion, medicine, science and history. Texts were
printed in 40 different fonts, 10 sizes and four styles. Scans were made at
100, 200, 300 and 600 dpi (dots per inch).

The database is available in two formats: at the page level and at the line
level. Images are presented as TIFF images and transcripts are in plain
text format. Individual font folders are compressed into RAR archives.

The data is divided into training, validation and test sets.

2016 Subscription Members will automatically receive two copies of this
corpus.  2016 Standard Members may request a copy as part of their 16 free
membership corpora.  Non-members may license this data for US $250.

(3) Richer Event Description <https://catalog.ldc.upenn.edu/LDC2016T23> was
developed by the University of Colorado Boulder-CLEAR (Computational
Language and Education Research), Carnegie Mellon University and LDC. It
consists of coreference, bridging and event-event relations (temporal,
causal, subevent and reporting relations) annotations over 95 English
newswire, discussion forum and narrative text documents, covering all
events, times and non-eventive entities within each document.

RED annotation is intended to join different annotation layers and to
provide a rich representation of event phenomena.

Documents were annotated twice -- in a markable pass and in an event
annotation phase. Annotation and source documents are divided into three
partitions: (1) 20 newswire summarization documents, (2) 20 discussion
forum documents and newswire annotations used in the original RED pilot
annotations, and (3) 55 documents annotated by a range of DEFT (Deep
Exploration and Filtering of Text) annotation formats. Data is presented as
UTF-8 encoded XML and plain text.

2016 Subscription Members will automatically receive two copies of this
corpus.  2016 Standard Members may request a copy as part of their 16 free
membership corpora.  Non-members may license this data for US $1750.



Membership Office

Linguistic Data Consortium

University of Pennsylvania

T: +1-215-573-1275

E: l...@ldc.upenn.edu

M: 3600 Market St. Suite 810

    Philadelphia, PA 19104



-- 
http://home.apache.org/~lewismc/
@hectorMcSpector
http://www.linkedin.com/in/lmcgibbney
