[Corpora-List] CFP | EvaHan2026 Ancient Chinese OCR Shared Tasks

李斌 via Corpora Sun, 21 Dec 2025 06:24:42 -0800

Dear editor,
    I am Bin Li, one of the organizers of EvaHan 2026. Could you please
circulate this CFP on the Corpora list? Thank you very much!




--

Best wishes!
Bin Li
Phone: (86)13813878144
Homepage:  http://cognitivebase.com/lib/
School of Chinese Language and Literature, 
Nanjing Normal University,China




CFP | EvaHan2026 Ancient Chinese OCR Shared Tasks
EvaHan 2026

https://github.com/GoThereGit/EvaHan
EvaHan 2026 is the Fifth International Evaluation of Ancient Chinese 
Information Processing, focusing on OCR tasks for multimodal large language 
models in ancient Chinese.
EvaHan 2026 is co-organized with LT4HALA 2026@LREC 2026, which will be held 
from May 11 to 16, 2026, in Mallorca, Spain.
EvaHan 2026 is organized by Dongbo Wang, Bin Li, Minxuan Feng, Chao Xu, 
Weiguang Qu, Liu Liu, and Si Shen.
Previous Tasks:
EvaHan 2022
The First Bake-off of Ancient Chinese Automatic Processing was successfully 
held in Marseille, France, in 2022, with a focus on automatic word segmentation 
and part-of-speech tagging of ancient Chinese.
EvaHan 2023
The Second Bake-off of Ancient Chinese Automatic Processing was successfully 
held in Macau, China, in 2023, with a focus on machine translation of ancient 
Chinese.
EvaHan 2024
The Third Bake-off of Ancient Chinese Automatic Processing was held in Turin, 
Italy, in 2024, with a focus on automatic sentence segmentation and punctuation 
of ancient Chinese.
EvaHan 2025
The Fourth Bake-off of Ancient Chinese Automatic Processing was held in New 
Mexico, USA, in 2025, with a focus on named entity recognition in ancient 
Chinese.
Important Dates for EvaHan 2026:
Registration deadline: January 30, 2026
Training data release: January 1, 2026
Test data release: February 1, 2026
Running results submission: February 6, 2026
Technical report submission deadline: February 28, 2026
Notification of acceptance: March 1, 2026
Camera-ready papers due: March 10, 2026
Participation
To participate in EvaHan 2026, you must complete the following steps:
Registration:
Submit a registration form to officially register your team for the task. 
Registration is open from December 1, 2025, to January 30, 2026. Only 
registered participants will gain access to the training dataset.
Accessing the Training Data:
After completing the registration process, participants will receive 
instructions for downloading the training dataset, which includes image--text 
pairs from ancient Chinese texts for OCR.
Submitting Results and Reports:
Participants must use the provided test data to generate results and submit 
their system outputs and a technical report as per the shared task schedule.
For inquiries or to request the registration form, please contact us at 
[email protected].
Data
The EvaHan 2026 data comprise three datasets of image-text pairs: printed 
plain-text pages, pages with mixed text and image layouts, and handwritten 
pages. The data were first annotated automatically and then corrected and 
refined by experts in classical Chinese language and history, to ensure the 
quality of the training materials and gold-standard texts.
● Dataset A (Printed Texts) consists of data selected from the Siku Quanshu 
(Complete Library of the Four Treasuries), including classics, history, 
philosophy, and literature, as well as various other ancient books.
● Dataset B (Mixed Layouts) contains mixed image-text data selected from the 
Siku Quanshu and other ancient books.
● Dataset C (Handwritten Texts) includes handwritten ancient books, primarily 
from the Chinese Buddhist canon, drawn from the Chinese Buddhist canon (TKH) 
dataset and the Chinese Buddhist canon (MTH) dataset.
Training Data
The training set consists of designated portions of subsets A, B, and C. All 
training samples are provided in image-text pair format, with text in 
Traditional Chinese (UTF-8), at approximately 5,000 to 10,000 image-text pairs 
per subset. Registered participants will receive the training data via email.
Test Data
The test set includes the remaining unseen portions of subsets A, B, and C to 
ensure comprehensive evaluation of all three challenge types. The data is also 
provided in image-text pair format, at approximately 200 to 500 image-text 
pairs per subset. Detailed information and a download link for the test data 
will be provided to participants before the start of the formal evaluation 
period.
Task
This section offers a detailed description of the tasks encompassed in EvaHan 
2026.
OCR
In many Chinese language processing systems, OCR is a critical task, often 
performed in parallel with other processing functions. The accuracy and speed 
of OCR directly determine the overall system's performance and user experience 
in downstream applications such as document digitization, information 
extraction, and intelligent retrieval.
Evaluation
Metrics
Each team will initially have access only to the training data. The unlabeled 
test data will be released later, and the labels for the test data will be 
released after the evaluation is complete. Tables 2, 3 and 4 provide examples 
of the scorer output. The scorer first aligns the system-generated text with 
the gold standard and then computes precision, recall, and F1 score for OCR. 
BLEU, ROUGE-1, ROUGE-2, and ROUGE-L will also be reported, so the competition 
is evaluated on multiple metrics. This edition also adds layout analysis 
metrics: mAP and IoU. The final ranking of teams will be based on the combined 
overall score.
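
As a rough illustration of the metrics named above, the following Python sketch 
computes character-level precision, recall, and F1 from a longest common 
subsequence (LCS) alignment, plus IoU for two layout boxes. It is not the 
official scorer: the alignment method, the function names (lcs_length, 
char_prf, box_iou) and the (x_min, y_min, x_max, y_max) box format are all 
assumptions made for this example.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def char_prf(predicted: str, gold: str) -> tuple[float, float, float]:
    """Character-level precision, recall and F1 (assumed LCS alignment)."""
    matched = lcs_length(predicted, gold)
    precision = matched / len(predicted) if predicted else 0.0
    recall = matched / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0, 0.0, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def box_iou(box_a, box_b) -> float:
    """Intersection over union of two (x_min, y_min, x_max, y_max) boxes."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Overlap is clamped at zero so disjoint boxes yield an empty intersection.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0
```

For example, if a system outputs 天地黃 for the gold text 天地玄黃, three of the 
four gold characters are matched, so char_prf gives a precision of 1.0 and a 
recall of 0.75. The official scorer may align and weight results differently.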
Two Modalities
Each participant can submit results for both modes. In the closed mode, 
resources are restricted: each team may use only the training data and a 
pre-trained model. This model is a word embedding pre-trained on a large 
Traditional Chinese corpus. No other resources are allowed in the closed mode.
In the open mode, there are no restrictions on resources, data, or models. 
Annotated external data, such as processed images or text, may be used. 
However, each team must disclose all resources, data, and models used in each 
system in the final report.
How to Participate
The registration period is given above. Participants will be required to submit 
their runs and to provide a technical report for the task they participated in.
Submitting Runs
Each team can submit up to two runs. The first run should be produced according 
to the closed modality, and the second according to the open modality. The 
closed run is compulsory, while the open run is optional.
Once the system has produced the results for the task over the test set, 
participants have to follow these instructions to complete their submission:
The annotated results should be submitted as three plain text files encoded in 
UTF-8 (which uses up to four bytes per character). The specific submission 
format will be released along with the pre-trained dataset.
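
Until the official format is released, the sketch below only illustrates the 
encoding requirement: writing one UTF-8 plain text file per subset. The file 
names (subset_A.txt and so on), the one-result-per-line layout, and the 
write_run helper are hypothetical and are not part of the task specification. 
The sample text includes 𠀀 (U+20000), a rare character that takes four bytes 
in UTF-8, whereas common characters such as 天 take three.

```python
import tempfile
from pathlib import Path


def write_run(out_dir, results):
    """Write one UTF-8 text file per subset; returns the written paths.

    `results` maps a subset name ("A", "B", "C") to a list of output lines.
    File naming and layout are assumptions, not the official format.
    """
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for subset, lines in results.items():
        path = out_dir / f"subset_{subset}.txt"
        # Passing the encoding explicitly avoids platform-dependent defaults.
        path.write_text("\n".join(lines) + "\n", encoding="utf-8")
        paths.append(path)
    return paths


# Round-trip demo in a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    written = write_run(tmp, {"A": ["天地玄黃"], "B": ["宇宙洪荒"], "C": ["𠀀"]})
    file_names = [p.name for p in written]
    round_trip = written[2].read_text(encoding="utf-8")
```

Reading the files back with the same explicit encoding, as in the demo, is a 
simple way to confirm that rare characters survive the round trip.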
Organizers
Dongbo Wang, College of Information Management, Nanjing Agricultural 
University, China
Bin Li, School of Chinese Language and Literature, Nanjing Normal University, 
China
Minxuan Feng, School of Chinese Language and Literature, Nanjing Normal 
University, China
Chao Xu, School of Chinese Language and Literature, Nanjing Normal University, 
China
Weiguang Qu, School of Computer and Electronic Information / School of 
Artificial Intelligence, Nanjing Normal University, China
Liu Liu, College of Information Management, Nanjing Agricultural University, 
China
Si Shen, School of Economics and Management, Nanjing University of Science and 
Technology, China
Student Members
Dongmei Zhu, College of Information Management, Nanjing Agricultural 
University, China
Jieqiong Li, College of Information Management, Nanjing Agricultural 
University, China
Ruifeng Wu, College of Information Management, Nanjing Agricultural University, 
China
Junyi Yang, College of Information Management, Nanjing Agricultural University, 
China
Zhixing Xu, School of Chinese Language and Literature, Nanjing Normal 
University, China
Junjie Li, School of Chinese Language and Literature, Nanjing Normal 
University, China
Yue Zhu, School of Chinese Language and Literature, Nanjing Normal University, 
China
Mengting Xu, School of Chinese Language and Literature, Nanjing Normal 
University, China
