New Version 0.3 of the Quranic Arabic Corpus (fwd)

Eric Atwell Sat, 12 Mar 2011 09:56:37 -0800

From: Kais Dukes <[email protected]>

Apologies for cross-posting.


The Quranic Arabic Corpus (http://corpus.quran.com) is an international 
collaborative linguistic project initiated at the University of Leeds that aims 
to bridge the gap between the traditional Arabic grammar of i'rab and 
techniques from modern computational linguistics. This open source resource 
includes word-by-word part-of-speech tagging for the Quran, morphological 
segmentation and a formal representation of Quranic Arabic syntax using 
dependency graphs. Version 0.3 of the corpus includes a number of significant 
improvements over the previous 0.2 release:

*** [Increased coverage for the syntactic treebank]. The treebank now covers 
30% of the Quran by word count (hence the version 0.3 release number). The 
syntactic treebank provides annotation using dependency grammar for chapters 
1-5 and 59-114, covering 23,292 out of 77,430 words in the Quran. The treebank 
also includes a revised set of non-terminal phrase tags for nominal sentences 
(jumlah ismiyah), verbal sentences (jumlah fi'liyah), and conditional sentences 
(jumlah shartiyah).
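
As a quick check of the coverage figure, the arithmetic can be sketched in
Python. The word counts and chapter ranges are the ones quoted above; the
variable names are only illustrative and are not part of the corpus.

```python
# Word counts quoted in the announcement.
TREEBANK_WORDS = 23_292
TOTAL_WORDS = 77_430

# Chapter ranges covered by the treebank (inclusive), as stated above.
COVERED_RANGES = [(1, 5), (59, 114)]

# Fraction of the text covered, by word count.
coverage = TREEBANK_WORDS / TOTAL_WORDS

# Number of chapters covered, out of the 114 chapters of the Quran.
covered_chapters = sum(end - start + 1 for start, end in COVERED_RANGES)

print(f"coverage by word count: {coverage:.1%}")
print(f"chapters covered: {covered_chapters} of 114")
```

The result is just over 30%, which is where the 0.3 release number comes from.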

*** [Improved accuracy for tagging and morphological analysis] covering 100% of 
the Quranic text. Following online collaboration by volunteer annotators, over 
2,000 suggestions for improved part-of-speech and morphological tagging have 
been reviewed in detail and cross-checked against traditional sources of Arabic 
grammar, resulting in further improvements to the accuracy of the annotated 
resource.

*** [More consistent morphological segmentation]. Each of the 77,430 words in 
the Quran has been morphologically segmented, resulting in 128,076 individual 
morphemes. In accordance with traditional Arabic grammar, each morpheme has 
been separately tagged for part-of-speech and multiple morphological features 
including noun case and verb mood, gender, number and person. The improved 
segmentation used in version 0.3 of the corpus is more consistent with i'rab. 
For example, the suffixed nun of emphasis (nun l-tawkid) is now correctly 
analysed as a separate morphological segment.
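
To make the idea of per-morpheme tagging concrete, here is a minimal Python
sketch of one way a segmented word could be represented. The class names, the
placeholder forms and the tag names ("V", "EMPH_NUN") are assumptions made for
illustration; they are not the corpus's actual data format or tagset.

```python
from dataclasses import dataclass, field

@dataclass
class Segment:
    form: str   # placeholder surface form, not real corpus data
    tag: str    # hypothetical part-of-speech tag for this morpheme
    features: dict = field(default_factory=dict)  # e.g. mood, gender, number

@dataclass
class Word:
    segments: list  # morphological segments in surface order

# A hypothetical verb with a suffixed nun of emphasis, analysed as two
# segments: the verb stem and the emphasis suffix as its own morpheme.
word = Word(segments=[
    Segment(form="<verb-stem>", tag="V", features={"person": 3}),
    Segment(form="<nun>", tag="EMPH_NUN"),
])

# Average morphemes per word, from the totals quoted in the announcement.
TOTAL_WORDS = 77_430
TOTAL_MORPHEMES = 128_076
avg_morphemes = TOTAL_MORPHEMES / TOTAL_WORDS

print([seg.tag for seg in word.segments])
print(f"average morphemes per word: {avg_morphemes:.2f}")
```

Keeping the suffix as its own segment lets it carry its own tag and features,
which is the point of the more consistent segmentation described above.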

*** [High-resolution vector graphics for the Quranic script] are now used to 
display Arabic words in dependency graphs, replacing the glyph-based TrueType 
font used previously, which did not always accurately represent the intricacies 
of the Quranic Uthmani script. The script is now based on electronic scans 
developed by the Quran Printing Complex. This has improved typographic accuracy 
for the Arabic words displayed in the syntactic treebank, most notably for 
ligatures, verse pause marks, and diacritic alignment.

*** [An extended tagset with finer-grained part-of-speech tags] including INT 
(particle of interpretation, harf tafsir), CIRC (the circumstantial usage of 
the particle waw, waw l-haliyah), COM (the comitative usage of the particle 
waw, waw l-ma'iyah) and RSLT (the result usage of the particle fa). In 
addition, for better consistency with traditional Arabic grammar, numerical 
words previously tagged NUM are now tagged ADJ (adjective) or N (noun), 
depending on syntactic function and context.
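
The new tags can be summarised as a small lookup table. The Python sketch
below only restates the tags named in this announcement; the dictionary and the
describe() helper are hypothetical and are not part of the corpus software.

```python
# New part-of-speech tags named in the version 0.3 announcement.
NEW_TAGS = {
    "INT": "particle of interpretation (harf tafsir)",
    "CIRC": "circumstantial usage of the particle waw (waw l-haliyah)",
    "COM": "comitative usage of the particle waw (waw l-ma'iyah)",
    "RSLT": "result usage of the particle fa",
}

# Numerical words no longer use NUM; they take one of these tags instead,
# depending on syntactic function and context.
REPLACEMENTS_FOR_NUM = ("ADJ", "N")

def describe(tag: str) -> str:
    """Return a short description for a tag, or a fallback message."""
    return NEW_TAGS.get(tag, "not one of the tags named in this announcement")

for tag in ("INT", "CIRC", "COM", "RSLT", "NUM"):
    print(f"{tag}: {describe(tag)}")
```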

*** [Better natural language generation] for automatic summaries of linguistic 
annotation. For example, when a first person object pronoun suffix is 
represented only by a terminal kasrah diacritic (instead of the more usual ya 
suffix), this is now correctly mentioned in the word-by-word annotation 
displayed online.

*** [Links to updated academic publications] on the Quranic Arabic Corpus: two 
LREC papers, an INFOS 2010 paper, a FAL book chapter, and an LRE Journal paper, 
together with a link to an online review of the Quranic Arabic Corpus at 
Examiner.com. The full versions of these papers are now available as PDF 
downloads from the Quranic Arabic Corpus website. These publications and 
articles explain in detail the original research contributions of the Quranic 
Arabic Corpus project.

*** [Improved online documentation] for the corpus, and additional sections in 
the online annotation guidelines, most notably a new detailed section on the 
different types of verb forms in Quranic Arabic morphology.

*** [Enhanced morphological search] for the Quran, including the ability to 
search on additional part-of-speech tags and linguistic features.

*** [Version 0.3 of the reviewed morphologically annotated data] is freely 
available for download from the Quranic Arabic Corpus website.

The Quranic Arabic Corpus is an open source project. Contributions or questions 
about the research are more than welcome. Please direct any correspondence to 
Kais Dukes, PhD researcher at the School of Computing, University of Leeds:

web: www.kaisdukes.com
e-mail: [email protected]
