[Corpora-List] October 2023 Newsletter - LDC

Penn LDC via Corpora Mon, 16 Oct 2023 09:35:59 -0700

In this newsletter:
Membership Year 2024 publication preview
Fall 2023 data scholarship recipients


New publications:
AIDA Scenario 1 Practice Topic Source 
Data<https://catalog.ldc.upenn.edu/LDC2023T11>
AIDA Scenario 1 and 2 Reference Knowledge 
Base<https://catalog.ldc.upenn.edu/LDC2023T10>

________________________________
Membership Year 2024 publication preview
The 2024 membership year is approaching and plans for next year's publications 
are in progress. Among the expected releases are:

  *   KASET: 147 hours of Sorani Kurdish and Kurmanji Kurdish conversational 
telephone speech and web broadcasts, 65 hours transcribed
  *   AIDA topic source data and annotations: multimodal source data and 
annotations in multiple languages (Russian, Ukrainian, English, Spanish) for 
information and entity extraction
  *   RATS Low Speech Density Data: 87 hours of Levantine Arabic, English, 
Persian, Pushto, and Urdu audio files selected from RATS speech activity 
detection and keyword spotting data sets, also including communications systems 
sounds and silence
  *   Call My Net 1: 364 hours of conversational telephone speech recordings in 
Tagalog, Cebuano, Cantonese, and Mandarin from speakers in the Philippines and 
China using various handsets under diverse noise conditions
  *   Ravnursson Faroese Speech and Transcripts: 109 hours of read speech from 
433 native speakers with transcripts
  *   Diaspora Tibetan Speech: elicited, read, and spontaneous speech from 73 
native Tibetan speakers in Kathmandu's diaspora Tibetan community, some 
recordings transcribed

  *   IARPA MATERIAL language packs: conversational telephone speech, 
transcripts, English translations, annotations, and queries in multiple 
languages (e.g., Bulgarian, Somali, Georgian)
  *   LORELEI: representative and incident language packs containing 
monolingual text, bi-text, translations, annotations, supplemental resources, 
and related tools in various languages (e.g., Farsi, Hungarian, Hindi, Amharic)
Check your inbox in the coming weeks for more information about membership 
renewal.

Fall 2023 data scholarship recipients
Congratulations to the recipients of LDC's Fall 2023 data scholarships:

Nessma Diab: Ain-Shams University (Egypt): Pre-PhD student, Linguistics. Nessma 
is awarded copies of CALLHOME Egyptian Arabic Speech LDC97S45 and CALLHOME 
Egyptian Arabic Transcripts LDC97T10 for her work in machine translation.
Soheir Elssakkout: Ain-Shams University (Egypt): PhD candidate. Soheir is 
awarded copies of Turkish Broadcast News and Transcripts LDC2012S06 and Middle 
East Technical University Turkish Microphone Speech v 1.0 LDC2006S33 for her 
work in speech recognition.
Matheus Franco: Witten/Herdecke University (Germany): Post-doctoral scholar, 
Faculty of Management, Economics and Society. Matheus is awarded a copy of 
Avocado Research Email Collection LDC2015T03 for his work on the emotional 
foundations of dynamic capabilities.
Kamal Jarrar: Birzeit University (Palestine): Master's student, Applied 
Statistics and Data Science Program. Kamal is awarded copies of Arabic Gigaword 
Fifth Edition LDC2011T11 and BOLT Arabic Discussion Forums LDC2018T10 for his 
work in part-of-speech tagging for dialectal Arabic.
Minkyoung Kim: Yonsei University (Korea): PhD candidate, Graduate School of 
Information. Minkyoung is awarded a copy of The New York Times Annotated Corpus 
LDC2018T19 for her work in event extraction and semantic event annotation.
Humaira Mehmood: Fatima Jinnah Women University (Pakistan): Master's student, 
Computer Sciences. Humaira is awarded a copy of ARL Urdu Speech Database, 
Training Data LDC2007S03 for her work in machine translation.
Diyam Mousa: Birzeit University (Palestine): PhD candidate, Computer Science 
Department. Diyam is awarded copies of Arabic Treebank: Part 3 v. 3.2 
LDC2010T08 and BOLT Egyptian Arabic Treebank - Discussion Forum LDC2018T23 for 
her work in morphological tagging for dialectal Arabic.

For information about the program, visit the Data Scholarships 
page<https://www.ldc.upenn.edu/language-resources/data/data-scholarships>.
________________________________
New publications:
AIDA Scenario 1 Practice Topic Source 
Data<https://catalog.ldc.upenn.edu/LDC2023T11> was developed by LDC and 
consists of 1511 files (text, image, and video) from English, Russian, and 
Ukrainian web sources. Each phase of the AIDA program centered on a specific 
scenario, or broad topic area, with related subtopics designated as either 
practice subtopics or evaluation subtopics. The Phase 1 scenario focused on 
political relations between Russia and Ukraine in the 2010s. This corpus 
constitutes the full set of topic-focused documents for Phase 1 practice 
subtopics. Data was collected from web sources by a combination of automatic 
and manual processes.

The DARPA AIDA (Active Interpretation of Disparate Alternatives) program aimed 
to develop a multi-hypothesis semantic engine to generate explicit alternative 
interpretations of events, situations, and trends from a variety of 
unstructured sources. LDC supported AIDA by collecting, creating and annotating 
multimodal linguistic resources in multiple languages.

The knowledge base for entity detection and linking annotation for all AIDA 
Scenario 1 and 2 corpora is available separately as AIDA Scenario 1 and 2 
Reference Knowledge Base (LDC2023T10)<https://catalog.ldc.upenn.edu/LDC2023T10>.

2023 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

*

AIDA Scenario 1 and 2 Reference Knowledge 
Base<https://catalog.ldc.upenn.edu/LDC2023T10> contains the English knowledge 
base (KB) used for all AIDA entity linking annotation in Scenario 1 
(Russia-Ukraine Relations) and Scenario 2 (Crisis in Venezuela). The KB content 
was drawn from GeoNames, the CIA World Leaders List, and the CIA World Factbook 
and was supplemented with manually created KB entries developed by LDC 
specifically for AIDA data.

This knowledge base supported the AIDA entity detection and linking task for 
13 entity types: GPE (Geo-Political Entity), LOC (Location), PER (Person), ORG 
(Organization), FAC (Facility), MHI (Medical/Health Issue), WEA (Weapon), SID 
(Side), COM (Commodity), CRM (Crime), LAW (Law), VEH (Vehicle), and BAL 
(Ballot).

2023 members can access this corpus through their LDC accounts. Non-members may 
license this data for a fee.

To unsubscribe from this newsletter, log in to your LDC 
account<https://catalog.ldc.upenn.edu/login> and uncheck the box next to 
"Receive Newsletter" under Account Options or contact LDC for assistance.

Membership Coordinator
Linguistic Data Consortium<ldc.upenn.edu>
University of Pennsylvania
T: +1-215-573-1275
E: l...@ldc.upenn.edu<mailto:l...@ldc.upenn.edu>
M: 3600 Market St. Suite 810
      Philadelphia, PA 19104

_______________________________________________
Corpora mailing list -- corpora@list.elra.info
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to corpora-le...@list.elra.info
