Please respond – A better way to organize the Quranic Arabic Corpus dictionary for version 0.3?

Kais Dukes Tue, 30 Nov 2010 07:47:41 -0800

PLEASE HIT "REPLY ALL" WHEN RESPONDING TO THIS E-MAIL – THANKS!




Hello All,


Hopefully sometime over the next few weeks, there will be an updated version
of the Quranic Arabic Corpus (version 0.3 - see below). I am hoping to get
people's feedback on this upcoming release, but also on a specific idea.



My question is – do you think we can better organize the Quranic Arabic
Corpus dictionary? To be honest, this is more of a concordance. Please see:



http://corpus.quran.com/qurandictionary.jsp



At the moment, the way the dictionary page works is that you specify a
root, and then you get back a list of words. The word list for a specific
root is organized by form, then by part-of-speech (noun or verb) and then by
person, gender and number. If you click on a specific word form, you get
taken to that verse in the Quran.



Although this was a good starting point, I would be keen to reorganize
this to be more like a dictionary. How about the following suggestion? We
keep root as the top level, but make lemma the next subdivision. Under
each lemma we can then show its different inflected forms.
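
To make the proposed hierarchy concrete, here is a minimal sketch of grouping
a flat word list as root, then lemma, then inflected form. The record layout
and every value in it are hypothetical placeholders, not the corpus's actual
data format or real entries.

```python
from collections import defaultdict

# Hypothetical flat records: (root, lemma, inflected form, location).
# All values are placeholders for illustration, not real corpus data.
ENTRIES = [
    ("root-1", "lemma-A", "form-a1", "loc-1"),
    ("root-1", "lemma-A", "form-a2", "loc-2"),
    ("root-1", "lemma-B", "form-b1", "loc-3"),
    ("root-2", "lemma-C", "form-c1", "loc-4"),
    ("root-1", "lemma-A", "form-a1", "loc-5"),
]


def build_dictionary(entries):
    """Group records as root -> lemma -> inflected form -> list of locations."""
    tree = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for root, lemma, form, location in entries:
        tree[root][lemma][form].append(location)
    return tree


def render(tree):
    """Return an indented plain-text outline, sorted at every level."""
    lines = []
    for root in sorted(tree):
        lines.append(root)
        for lemma in sorted(tree[root]):
            lines.append("  " + lemma)
            for form in sorted(tree[root][lemma]):
                locations = ", ".join(tree[root][lemma][form])
                lines.append("    " + form + ": " + locations)
    return "\n".join(lines)


dictionary = build_dictionary(ENTRIES)
print(render(dictionary))
```

Each location under a form would correspond to a link to the verse, as on the
current page; only the grouping above it changes.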



Also, what about website navigation and hyperlinks for the dictionary? Any
ideas?


I’m really keen to improve the dictionary - the audience I have in mind is
everyday users of the website who are mostly people wanting to learn Arabic
specifically with the intent of understanding the original text of the
Quran.



It would also be great to get feedback on the web pages which show lists of
lemmas and verbs, e.g.



http://corpus.quran.com/verbs.jsp

http://corpus.quran.com/lemmas.jsp



Please note that I’m not looking to add any new information to the corpus at
the moment, just a reorganization of the data to make things more readable
and accessible for our average user.



PLEASE HIT "REPLY ALL" WHEN RESPONDING TO THIS E-MAIL – THANKS!



========================================


RELEASE NOTES -   Quranic Arabic Corpus version 0.3



The Quranic Arabic Corpus (http://corpus.quran.com) is an international
collaborative linguistic project initiated at the University of Leeds that
aims to bridge the gap between the traditional Arabic grammar of i'rab and
techniques from modern computational linguistics. This open source resource
includes word-by-word part-of-speech tagging for the Quran, morphological
segmentation and a formal representation of Quranic Arabic syntax using
dependency graphs. Version 0.3 of the corpus includes a number of
significant improvements over the previous 0.2 release:



Increased coverage for the syntactic treebank. The treebank now covers 30%
of the Quran by word count (hence the version 0.3 release number). The
syntactic treebank provides annotation using dependency grammar for chapters
1-5 and 59-114, covering 23,292 out of 77,430 words in the Quran. The
treebank also includes a revised set of non-terminal phrase tags for nominal
sentences (jumlah ismiyah), verbal sentences (jumlah fi'liyah), and
conditional sentences (jumlah shartiyah).
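
As a quick check of the coverage figure, the arithmetic can be reproduced from
the word counts and chapter ranges quoted above. The variable names are only
illustrative.

```python
# Word counts as quoted in the release notes.
TREEBANK_WORDS = 23_292
TOTAL_WORDS = 77_430

# Chapters covered by the treebank: 1-5 and 59-114 (inclusive ranges).
covered_chapters = list(range(1, 6)) + list(range(59, 115))

coverage = TREEBANK_WORDS / TOTAL_WORDS
print(f"Word coverage: {coverage:.1%}")
print(f"Chapters covered: {len(covered_chapters)}")
```

The word coverage comes out at roughly 30%, in line with the 0.3 version
number.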



Improved accuracy for tagging and morphological analysis covering 100% of
the Quranic text. Following online collaboration by volunteer annotators,
the part-of-speech tags and morphological analyses for over 500 words have
been reviewed in detail and cross checked against traditional sources of
Arabic grammar, resulting in further improvements to the accuracy of the
annotated resource.



More consistent morphological segmentation. Each of the 77,430 words in the
Quran has been automatically segmented, resulting in 128,068 distinct
morphemes. In accordance with traditional Arabic grammar, each morpheme has
been separately tagged for part-of-speech and multiple morphological
features including noun case and verb mood, gender, number and person. The
improved segmentation used in version 0.3 of the corpus is more consistent
with i'rab. For example, the suffixed nun of emphasis (nun l-tawkeed) is now
correctly analysed as a separate morphological segment.



High-resolution vector graphics for the Quranic script are now used to
display Arabic words in dependency graphs, replacing the previous use of
glyph-based fonts. The script is now based on electronic scans developed by
the Quran Printing Complex. This has resulted in improved typographic
accuracy for the Arabic words displayed in the syntactic treebank, most
notably for ligatures, verse pause marks, and diacritic alignment.
Previously a TrueType font was used to render Arabic words in dependency
graphs, which did not always accurately represent the intricacies of the
Quranic Uthmani script.



An extended tagset with finer grained part-of-speech tags including INT -
particle of interpretation (ḥarf tafseer), CIRC - for the circumstantial
usage of the particle waw (waw l-haliyah), COM - for the comitative usage of
the particle waw (waw l-ma'iyah) and RSLT - for the result usage of the
particle fa. In addition, for better consistency with traditional Arabic
grammar, the NUM tag has been replaced for numerical words with ADJ
(adjective) or N (noun) tags, depending on syntactic function and context.



Better natural language generation for automatic summaries of linguistic
annotation. For example, when a first person object pronoun suffix is
represented only by a terminal kasrah diacritic (instead of the more usual
ya suffix), this is now correctly mentioned in the word-by-word annotation
displayed online.



Links to updated academic publications on the Quranic Arabic Corpus: two LREC
papers, an INFOS 2010 paper, a FAL book chapter, and a submission to LRE
Journal, together with a link to an online review of the Quranic Arabic
Corpus at Examiner.com. The full versions of these papers are now available
as PDF downloads from the Quranic Arabic Corpus website. These publications
and articles explain in detail the original research contributions of the
Quranic Arabic Corpus project.



Improved online documentation for the corpus, and additional sections in the
online annotation guidelines, most notably a new detailed section on the
different types of verb forms in Quranic Arabic morphology.



Enhanced morphological search for the Quran, including the ability to search
on additional part-of-speech tags and linguistic features.



Version 0.3 of the reviewed morphologically annotated data is freely
available for download from the Quranic Arabic Corpus website.



The Quranic Arabic Corpus is an open source project. Contributions or
questions about the research are more than welcome. Please direct any
correspondence to Kais Dukes, PhD researcher at the School of Computing,
University of Leeds:



web: www.kaisdukes.com

e-mail: [email protected]


END RELEASE NOTES


========================================
