[Apologies for multiple postings]
We are happy to announce that 1 new written corpus, 3 new monolingual
lexica and 2 new bilingual lexica are now available in our catalogue.
Learner Corpus of Portuguese L2 – COPLE2
<http://catalog.elra.info/en-us/repository/browse/ELRA-W0331/>
ISLRN: 936-320-703-366-7 <http://www.islrn.org/resources/936-320-703-366-7>
The Learner Corpus of Portuguese as Second/Foreign Language (COPLE2) is
a corpus of written and oral texts produced by students enrolled in
Portuguese as a Foreign/Second Language courses at the Instituto de
Cultura e Língua Portuguesa (Institute of Portuguese Language and
Culture) (ICLP – FLUL) and by applicants for examinations at the Centro
de Avaliação de Português Língua Estrangeira (Center for Evaluation of
Portuguese as a Foreign Language) (CAPLE – FLUL). The corpus contains texts from
learners with 15 different native languages (L1s) and proficiencies from
A1 to C1, and covers different topics and types of tasks. It is encoded
in TEI format through the TEITOK environment. The corpus currently
comprises 182,474 tokens in 978 texts, classified according to the CEFR
scales, and is annotated for part of speech, lemma and learner errors.
All encoded information is searchable through the CQP query language.
CALEM (Comprehensive Arabic LEMmas)
<http://catalog.elra.info/en-us/repository/browse/ELRA-L0133/>
ISLRN: 462-532-124-988-8 <http://www.islrn.org/resources/462-532-124-988-8>
Comprehensive Arabic LEMmas is a lexicon covering a large list of Arabic
inflected word forms (stems) and their corresponding lemmas. It is
composed of 164,272 lemmas representing 7,151,106 stems, detailed as
follows: 720 Arabic particles, 15,291 broken plurals, 2,464,239 verbs,
4,675,856 nouns. The lexicon is provided as plain text in UTF-8 encoding
and represents about 284 MB of data.
MADED (Moroccan Arabic Dialect Electronic Dictionary)
<http://catalog.elra.info/en-us/repository/browse/ELRA-L0134/>
ISLRN: 977-057-254-691-5 <http://www.islrn.org/resources/977-057-254-691-5>
The Moroccan Arabic Dialect Electronic Dictionary (MADED) is an
electronic lexicon containing almost 13,000 entries. Entries are written
in Arabic script, and each Modern Standard Arabic (MSA) lemma is paired
with its Moroccan Arabic equivalent. In addition, MADED entries are
annotated with metadata such as part of speech (POS), origin and root.
MADED is designed for practical use by the NLP community. The dictionary
is provided as a CSV file and represents about 2 MB of data.
MORV (Moroccan Morphological vocabulary)
<http://catalog.elra.info/en-us/repository/browse/ELRA-L0135/>
ISLRN: 064-194-729-767-0 <http://www.islrn.org/resources/064-194-729-767-0>
The Moroccan Morphological vocabulary is a lexicon containing more than
4.6 million entries, each describing a Moroccan Arabic word with
fourteen (14) morphological and semantic features: the orthographic form
of the word, its segmentation (prefix and suffix), part of speech (POS),
gender, number, tense and transitivity (for verbs), origin, dialectal
lemma, Arabic lemma, root, voice, state, and affirmative/negative form.
The vocabulary is provided as a CSV file and represents about 350 MB of
data.
CroaTPAS <http://catalog.elra.info/en-us/repository/browse/ELRA-M0108/>
ISLRN: 649-554-159-147-9 <http://www.islrn.org/resources/649-554-159-147-9>
CroaTPAS is a bilingual lexicon in Croatian and English. It was created
by manually annotating lines from the Croatian Web as Corpus and
creating patterns with the Skema editor on the Sketch Engine platform.
CroaTPAS is tailor-made to represent verb polysemy and currently
contains a total of 683 patterns (belonging to 180 Croatian verbs)
expressing different verb senses and 22,677 annotated corpus lines.
Moreover, the resource includes 109 metonymic subpatterns linked to
1,112 corpus lines featuring 62 different metonymic shifts.
T-PAS <http://catalog.elra.info/en-us/repository/browse/ELRA-M0109/>
ISLRN: 432-666-503-743-8 <http://www.islrn.org/resources/432-666-503-743-8>
T-PAS is a digital lexicographic resource consisting of a corpus-derived
collection of Italian verb valency structures, whose argument slots have
been manually annotated with a set of hierarchically organised semantic
labels called Semantic Types.
As of today, T-PAS contains a total of 1,164 Italian verb entries
containing 5,529 patterns expressing different verb senses, and 252,943
annotated corpus lines. Moreover, the resource includes 84 metonymic
subpatterns linked to 1,218 corpus lines featuring 37 different
metonymic shifts.
For more information on the catalogue or if you would like to enquire
about having your resources distributed by ELRA, please contact us
<mailto:[email protected]>.
_________________________________________
Visit the ELRA Catalogue of Language Resources <http://catalog.elra.info>
Visit the Universal Catalogue <http://universal.elra.info>
Archives of ELRA Language Resources Catalogue Updates
<http://www.elra.info/en/catalogues/language-resources-announcements>