We are glad to announce the first SemEval shared task on Semantic Textual
Relatedness (STR): A shared task on automatically detecting the degree of
semantic relatedness (closeness in meaning) between pairs of sentences.

The semantic relatedness of two language units has long been considered
fundamental to understanding meaning (Halliday and Hasan, 1976; Miller and
Charles, 1991), and automatically determining relatedness has many
applications such as evaluating sentence representation methods, question
answering, and summarization.

Two sentences are considered semantically similar when they have a
paraphrasal or entailment relation. On the other hand, relatedness is a
much broader concept that accounts for all the commonalities between two
sentences: whether they are on the same topic, express the same view,
originate from the same time period, one elaborates on (or follows from)
the other, etc. For instance, consider the following sentence pairs:

   Pair 1: a. There was a lemon tree next to the house. b. The boy enjoyed
   reading under the lemon tree.

   Pair 2: a. There was a lemon tree next to the house. b. The boy was an
   excellent football player.

Most people will agree that the sentences in pair 1 are more related than
the sentences in pair 2.

In this task, new textual datasets will be provided for Afrikaans, Algerian
Arabic, Amharic, English, Hausa, Hindi, Indonesian, Kinyarwanda, Marathi,
Moroccan Arabic, Modern Standard Arabic, Punjabi, Spanish, and Telugu.

Each instance in the training, development, and test sets is a sentence
pair. The instance is labeled with a score representing the degree of
semantic textual relatedness between the two sentences. The scores can
range from 0 (maximally unrelated) to 1 (maximally related). These gold
label scores have been determined through manual annotation. Specifically,
a comparative annotation approach was used to avoid known limitations and
biases of traditional rating scale annotation methods. This process led to
a high reliability of the final relatedness rankings.

Further details about the task, the method of data annotation, how STR is
different from semantic textual similarity, applications of semantic
textual relatedness, etc. can be found in this paper:


Each team can provide submissions for one, two, or all of the tracks shown
below:

Track A: Supervised

Participants are to submit systems that have been trained using the labeled
training datasets provided. Participating teams are allowed to use any
publicly available datasets (e.g., other relatedness and similarity
datasets or datasets in any other languages). However, they must report
additional data they used, and ideally report how impactful each resource
was on the final results.

Track B: Unsupervised

Participants are to submit systems that have been developed without the use
of any labeled datasets pertaining to semantic relatedness or semantic
similarity between units of text more than two words long in any language.
The use of unigram or bigram relatedness datasets (from any language) is
permitted.

Track C: Cross-lingual

Participants are to submit systems that have been developed without the use
of any labeled semantic similarity or semantic relatedness datasets in the
target language and with the use of labeled dataset(s) from at least one
other language. Note: Using labeled data from another track is mandatory
for submission to this track.

Deciding which track a submission should go to:


   If a submission uses labeled data in the target language: submit to
   Track A

   If a submission does not use labeled data in the target language but
   uses labeled data from another language: submit to Track C

   If a submission does not use labeled data in any language: submit to
   Track B

Note: Here, ‘labeled data’ refers to labeled datasets pertaining to semantic
relatedness or semantic similarity between units of text more than two
words long.
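
The decision rules above can be sketched as a small Python function. This is
only an illustration: the function name and its two boolean parameters are
hypothetical and not part of any official tooling, and "labels" is meant in
the sense just defined (relatedness or similarity datasets over units of text
more than two words long).

```python
def choose_track(uses_target_language_labels: bool,
                 uses_other_language_labels: bool) -> str:
    """Return the track letter implied by the decision rules above.

    Both parameter names are hypothetical, chosen for this illustration.
    """
    # Any labeled data in the target language means Track A (Supervised).
    if uses_target_language_labels:
        return "A"
    # No target-language labels, but labels from another language:
    # Track C (Cross-lingual).
    if uses_other_language_labels:
        return "C"
    # No labeled data in any language: Track B (Unsupervised).
    return "B"
```

For example, a system trained only on labeled data from a language other
than the target language would fall under Track C.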


The official evaluation metric for this task is the Spearman rank
correlation coefficient, which captures how well the system-predicted
rankings of test instances align with human judgments. You can find the
evaluation script for this shared task on our GitHub page.
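
To illustrate what the metric measures, here is a minimal Python sketch of
the Spearman rank correlation, computed as the Pearson correlation of the two
rank vectors, with tied scores sharing their average rank. It is not the
official evaluation script; the function names and the example scores are
assumptions made for illustration only.

```python
from math import sqrt
from typing import Sequence


def average_ranks(values: Sequence[float]) -> list[float]:
    """Rank values from 1..n; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1  # positions are 0-based, ranks 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(gold: Sequence[float], pred: Sequence[float]) -> float:
    """Spearman correlation between gold and predicted relatedness scores."""
    if len(gold) != len(pred) or len(gold) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    rg, rp = average_ranks(gold), average_ranks(pred)
    mg, mp = sum(rg) / len(rg), sum(rp) / len(rp)
    cov = sum((a - mg) * (b - mp) for a, b in zip(rg, rp))
    var_g = sum((a - mg) ** 2 for a in rg)
    var_p = sum((b - mp) ** 2 for b in rp)
    if var_g == 0 or var_p == 0:
        raise ValueError("correlation is undefined when one input is constant")
    return cov / sqrt(var_g * var_p)


# Made-up scores: the predictions differ from the gold values but order the
# four pairs identically, so the rank correlation is perfect.
gold_scores = [0.9, 0.1, 0.5, 0.7]
pred_scores = [0.8, 0.2, 0.4, 0.6]
print(spearman(gold_scores, pred_scores))
```

Because only the ordering matters, a system does not need to reproduce the
gold scores themselves; it is rewarded for ranking sentence pairs in the same
order as the human annotators.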

Important Dates


   Training data ready: 11 September 2023

   Evaluation starts: 10 January 2024

   Evaluation ends: 31 January 2024

   System description paper due: February 2024

   SemEval workshop: Summer 2024 (co-located with a major NLP conference)



Task Organizers

Nedjma Ousidhoum

Shamsuddeen Hassan Muhammad

Mohamed Abdalla

Krishnapriya Vishnubhotla

Vladimir Araujo

Meriem Beloucif

Idris Abdulmumin

Seid Muhie Yimam

Nirmal Surange

Christine De Kock

Sanchit Ahuja

Oumaima Hourrane

Manish Shrivastava

Alham Fikri Aji

Thamar Solorio

Saif M. Mohammad
