Dear Johannes

Re "CreoleVal":
at this point, it is more a matter of "one shouldn't" than of whether "one
can".

The following is what I wrote to SIGTYP; I think the message would be
similar for your initiative:

"""
---------- Forwarded message ---------
From: Ada Wan <adawan...@gmail.com>
Date: Tue, Oct 31, 2023 at 6:47 PM
Subject: Re: [Corpora-List] First CFP: The 6th Workshop on Research in
Computational Linguistic Typology and Multilingual NLP (SIGTYP 2024)
To: Michael Hahn <mh...@lst.uni-saarland.de>, <sig...@gmail.com>
Cc: <corpora@list.elra.info>


Dear Michael, dear SIGTYP officers and workshop organizers

I saw this posting of yours and have some concerns re the orientation of
this workshop/event. Given the work by Mielke et al. (2019) and Wan (2022),
I am surprised to see how the workshop description seems not to have been
updated accordingly.
I have some questions:
i. would/could such events/initiatives contribute to misinforming academics,
professionals, and practitioners (including those who may be new to the
topic)?
ii. at what granularity (e.g. "word", character, or byte) will "linguistic
typology" be promoted through this workshop/event?
iii. what is/are the "discipline-specific narrative(s)" (default
expectations of a discipline), if any, that is/are supposed to hold still,
esp. after the 2 publications mentioned above?
iv. how is "language" defined for the aim(s)/purpose(s) of your workshop?
and
v. since the initiatives of the workshop are computing-related, is
character encoding (an area that has been severely overlooked in the past
in Computational Linguistics / Natural Language Processing) being
used/promoted/introduced?

One major ethical consideration in the area of "linguistic typology" is
that it could unnecessarily exacerbate differences between language
varieties, esp. if/when such differences are not observable unless one
creates them through "word" (or "word"-like) tokenization in the
preprocessing step. It would be a violation of scientific integrity if one
were to continue "word"-hacking (in another formulation: intentionally
discarding data) in the name of "linguistic typology", would you not agree?

I look forward to your replies.

Thanks and best
Ada
"""

Thanks and best
Ada


On Tue, Oct 31, 2023 at 8:59 PM Johannes Bjerva via Corpora <
corpora@list.elra.info> wrote:

> We are proud to announce the release of CreoleVal - a collection of
> benchmarks for 28 Creole languages. The collection of datasets spans tasks
> such as relation classification, machine comprehension, machine
> translation, named entity recognition, and use cases such as language
> modeling. We cover Haitian Creole, Bislama, Chavacano, Pitkern, Singlish,
> Tok Pisin, Papiamento, and others.
>
> We hope the NLP community will include this collection of datasets in
> ongoing & future evaluations of methods directed at low-resource languages.
> Not only that, we also hypothesise that CreoleVal will open the door for
> controlled experimentation with transfer learning methodology.
>
> This resource has been long in the making, and was made possible by a long
> list of collaborators.
>
> For a pre-print, see: https://arxiv.org/abs/2310.19567
>
> For code and data, see: https://github.com/hclent/CreoleVal
> (Repository under construction)
>
> _______________________________________________
> Corpora mailing list -- corpora@list.elra.info
> https://list.elra.info/mailman3/postorius/lists/corpora.list.elra.info/
> To unsubscribe send an email to corpora-le...@list.elra.info
>