[Corpora-List] ELRA Catalogue of Language Resources - Update

Hélène Mazo via Corpora Thu, 27 Jul 2023 07:34:02 -0700

[Apologies for multiple postings]

We are happy to announce that 66 new monolingual lexicons and 1 speech resource are now available in our catalogue. Moreover, 4 speech resources are now available at reduced fees.


*1) New Language Resources:*

*Bitext Lexical Datasets* <http://catalog.elra.info/en-us/repository/search/?q=Bitext+Lexical+Dataset>

The series of *Bitext Lexical Datasets* for the generic vocabulary includes Lemmas, POS tagging, Frequency, Named Entities and Offensive features. Depending on the dataset and language, other syntactic and morphological features are also provided. The following 15 languages are available:

As a complement to the datasets mentioned above, 11 datasets of *Language Variants* can also be obtained:



1. Arabic (MSA)
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0136/> dataset
   and Arabic Language Variants
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0151/> dataset
   consisting of Arabic Gulf, Arabic Najdi, Arabic Egypt and Arabic MSA
   variants,
2. Chinese (Simplified)
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0137/> dataset,
   Chinese (Traditional)
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0138/> dataset,
   and Chinese Language Variants
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0152/> dataset
   (Simplified + Traditional),
3. Dutch
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0139/> dataset
   and Dutch Language Variants
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0153/> dataset
   consisting of Netherlands and Belgium variants,
4. English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0140/> dataset
   and English Language Variants
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0154/> dataset
   consisting of United States, United Kingdom and India variants,
5. Finnish
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0141/> dataset
   and Finnish Language Variants
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0155/> dataset
   consisting of Standard and Colloquial Finnish variants,
6. French
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0142/> dataset
   and French Language Variants
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0156/> dataset
   consisting of France, Canada and Switzerland variants,
7. German
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0143/> dataset
   and German Language Variants
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0157/> dataset
   consisting of Germany and Switzerland variants,
8. Indonesian
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0144/> dataset,
9. Italian
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0145/> dataset
   and Italian Language Variants
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0158/> dataset
   consisting of Italy and Switzerland variants,
10. Malay
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0146/> dataset,
11. Norwegian (Bokmal)
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0147/> dataset
   and Norwegian Language Variants
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0159/> dataset
   consisting of Bokmal and Nynorsk variants,
12. Portuguese
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0148/> dataset
   and Portuguese Language Variants
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0160/> dataset
   consisting of Portugal and Brazil variants,
13. Spanish
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0149/> dataset
   and Spanish Language Variants
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0161/> dataset
   consisting of Spain, North America, Central America, Andes and
   Southern Cone variants,

*Bitext Synthetic Data* <http://catalog.elra.info/en-us/repository/search/?q=Bitext+Synthetic+Data>

The Bitext Synthetic Data consist of pre-built training data for intent detection and are provided for 20 verticals for English and Spanish languages. They cover the most common intents for each vertical and include a large number of example utterances for each intent, with optional entity/slot annotations for each utterance. Data is distributed as models or open text files.


For each language, the following verticals are available:

1. Automotive: 52 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0162/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0182/>)
2. Retail banking: 26 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0163/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0183/>)
3. Education: 37 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0164/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0184/>)
4. Event and ticketing: 25 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0165/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0185/>)
5. Field Service: 27 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0166/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0186/>)
6. Healthcare: 40 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0167/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0187/>)
7. Hospitality: 24 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0168/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0188/>)
8. Insurance: 38 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0169/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0189/>)
9. Legal: 29 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0170/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0190/>)
10. Manufacturing: 34 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0171/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0191/>)
11. Media Streaming: 24 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0172/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0192/>)
12. Mortgage and loans: 39 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0173/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0193/>)
13. Moving and storage: 29 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0174/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0194/>)
14. Real estate and construction: 28 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0175/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0195/>)
15. Restaurant/bar chains: 30 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0176/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0196/>)
16. Retail Ecomm: 34 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0177/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0197/>)
17. Telecommunication: 26 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0178/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0198/>)
18. Travel: 33 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0179/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0199/>)
19. Utilities: 21 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0180/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0200/>)
20. Wealth management: 24 intents (English
   <http://catalog.elra.info/en-us/repository/browse/ELRA-L0181/>,
   Spanish <http://catalog.elra.info/en-us/repository/browse/ELRA-L0201/>)

*Persian Kids’ Speech Corpus* <http://catalog.elra.info/en-us/repository/browse/ELRA-S0487/>

The Persian Kids’ Speech Corpus consists of speech signals recorded by 286 children (141 girls, 145 boys), from 6 to 9 years old, through an Andreas Mic Anti-Noise microphone and a Premium Speechmike headphone. This recorded data was manually checked and labeled. Finally, a corpus containing 162,395 samples with a duration of 33 hours and 44 minutes was created. The samples are distributed as follows:


1. 29,057 Words (478 minutes),
2. 17,429 SubWords (260 minutes),
3. 43,838 Syllables (485 minutes),
4. 70,078 Phonemes (765 minutes),
5. 1,993 Extra Vocabulary (36 minutes).

The prepared speech corpus comprehensively contains all the 29 Persian phonemes, 118 syllables, 56 sub-words, and 711 words and is particularly applicable to speech recognition and linguistics studies.
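
The per-category figures listed above can be cross-checked against the stated totals with a few lines of arithmetic. The sketch below is only an illustration: the numbers are copied from this announcement, while the dictionary layout and variable names are assumptions made for the example and do not reflect the corpus's actual file format.

```python
# Figures copied from the announcement: (number of samples, duration in minutes).
# The dictionary layout is an assumption made for this illustration only.
distribution = {
    "Words": (29_057, 478),
    "SubWords": (17_429, 260),
    "Syllables": (43_838, 485),
    "Phonemes": (70_078, 765),
    "Extra Vocabulary": (1_993, 36),
}

# Sum samples and minutes across all categories.
total_samples = sum(samples for samples, _ in distribution.values())
total_minutes = sum(minutes for _, minutes in distribution.values())

# Convert the total duration to hours and remaining minutes.
hours, minutes = divmod(total_minutes, 60)

print(f"{total_samples:,} samples")   # 162,395 samples
print(f"{hours} h {minutes} min")     # 33 h 44 min
```

Both results agree with the totals given in the corpus description (162,395 samples, 33 hours and 44 minutes).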


*2) Reduced fees for the following speech resources:*

 * *Chinese Mandarin (South) database*
   <http://catalog.elra.info/en-us/repository/browse/ELRA-S0397/>
 * *Chinese Mandarin (North) database*
   <http://catalog.elra.info/en-us/repository/browse/ELRA-S0398/>
 * *Japanese Kids Speech database (Lower Grade)*
   <http://catalog.elra.info/en-us/repository/browse/ELRA-S0411/>
 * *Japanese Kids Speech database (Upper Grade)*
   <http://catalog.elra.info/en-us/repository/browse/ELRA-S0412/>

For more information on the catalogue or if you would like to enquire about having your resources distributed by ELRA, please *contact us* <mailto:cont...@elda.org>.

_________________________________________

Visit the *ELRA Catalogue of Language Resources* <http://catalog.elra.info>
Visit the *Universal Catalogue* <http://universal.elra.info>

*Archives* <http://www.elra.info/en/catalogues/language-resources-announcements> of ELRA Language Resources Catalogue Updates


/Our apologies if you have received multiple copies of this announcement./

_______________________________________________
Corpora mailing list -- corpora@list.elra.info
https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
To unsubscribe send an email to corpora-le...@list.elra.info
