PLEASE HIT "REPLY ALL" WHEN RESPONDING TO THIS E-MAIL – THANKS!
Hello All,

Hopefully sometime over the next few weeks, there will be an updated version of the Quranic Arabic Corpus (version 0.3 - see below). I am hoping to get people's feedback on this upcoming release, but also on a specific idea.

My question is: do you think we can better organize the Quranic Arabic Corpus dictionary? To be honest, this is more of a concordance. Please see:

http://corpus.quran.com/qurandictionary.jsp

At the moment, the way the dictionary page works is that you specify a root, and then you get back a list of words. The word list for a specific root is organized by form, then by part-of-speech (noun or verb) and then by person, gender and number. If you click on a specific word form, you are taken to that verse in the Quran.

Although this was a good starting point, I would be keen to organize this better so that it reads more like a dictionary. How about the following suggestion? We still keep the root as the top level, but we then make the lemma the next subdivision. Under each lemma we can show its different inflected forms. Also, what about website navigation and hyperlinks for the dictionary? Any ideas?

I’m really keen to improve the dictionary. The audience I have in mind is everyday users of the website, who are mostly people wanting to learn Arabic specifically with the intent of understanding the original text of the Quran.

It would also be great to get feedback on the web pages which show lists of lemmas and verbs, e.g.

http://corpus.quran.com/verbs.jsp
http://corpus.quran.com/lemmas.jsp

Please note that I’m not looking to add any new information to the corpus at the moment, just a reorganization of the data to make things more readable and accessible for our average user.
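To make the suggested root, then lemma, then inflected form hierarchy more concrete, here is a minimal sketch in Python. It is only an illustration under stated assumptions: the record layout, the function names (build_dictionary, render) and the location labels are hypothetical and are not taken from the corpus data or the website code.

```python
from collections import defaultdict

# Hypothetical records: (root, lemma, inflected form, location).
# The locations are placeholder labels, not real verse references.
ENTRIES = [
    ("ktb", "kataba", "kataba", "loc-1"),
    ("ktb", "kataba", "yaktubu", "loc-2"),
    ("ktb", "kitab", "kitab", "loc-3"),
    ("ktb", "kitab", "kutub", "loc-4"),
    ("ktb", "kitab", "kitab", "loc-5"),
]


def build_dictionary(entries):
    """Group occurrences as root -> lemma -> inflected form -> locations."""
    dictionary = defaultdict(lambda: defaultdict(lambda: defaultdict(list)))
    for root, lemma, form, location in entries:
        dictionary[root][lemma][form].append(location)
    return dictionary


def render(dictionary):
    """Return the hierarchy as indented text lines, sorted at each level."""
    lines = []
    for root in sorted(dictionary):
        lines.append(f"root: {root}")
        for lemma in sorted(dictionary[root]):
            lines.append(f"  lemma: {lemma}")
            for form in sorted(dictionary[root][lemma]):
                locations = ", ".join(dictionary[root][lemma][form])
                lines.append(f"    {form}: {locations}")
    return lines


if __name__ == "__main__":
    print("\n".join(render(build_dictionary(ENTRIES))))
```

Running the script prints one heading per root, its lemmas beneath it, and under each lemma the inflected forms with the places they occur. Each location could then become a hyperlink to the verse, much as the current word list does.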
========================================
RELEASE NOTES - Quranic Arabic Corpus version 0.3

The Quranic Arabic Corpus (http://corpus.quran.com) is an international collaborative linguistic project initiated at the University of Leeds that aims to bridge the gap between the traditional Arabic grammar of i'rab and techniques from modern computational linguistics. This open source resource includes word-by-word part-of-speech tagging for the Quran, morphological segmentation and a formal representation of Quranic Arabic syntax using dependency graphs.

Version 0.3 of the corpus includes a number of significant improvements over the previous 0.2 release:

Increased coverage for the syntactic treebank. The treebank now covers 30% of the Quran by word count (hence the version 0.3 release number). The syntactic treebank provides annotation using dependency grammar for chapters 1-5 and 59-114, covering 23,292 out of 77,430 words in the Quran. The treebank also includes a revised set of non-terminal phrase tags for nominal sentences (jumlah ismiyah), verbal sentences (jumlah fi'liyah), and conditional sentences (jumlah shartiyah).

Improved accuracy for tagging and morphological analysis covering 100% of the Quranic text. Following online collaboration by volunteer annotators, the part-of-speech tags and morphological analyses for over 500 words have been reviewed in detail and cross checked against traditional sources of Arabic grammar, resulting in further improvements to the accuracy of the annotated resource.

More consistent morphological segmentation. Each of the 77,430 words in the Quran has been automatically segmented, resulting in 128,068 distinct morphemes. In accordance with traditional Arabic grammar, each morpheme has been separately tagged for part-of-speech and multiple morphological features including noun case and verb mood, gender, number and person. The improved segmentation used in version 0.3 of the corpus is more consistent with i'rab.
For example, the suffixed nun of emphasis (nun l-tawkeed) is now correctly analysed as a separate morphological segment.

High-resolution vector graphics for the Quranic script are now used to display Arabic words in dependency graphs, replacing the previous use of glyph-based fonts. The script is now based on electronic scans developed by the Quran Printing Complex. This has resulted in improved typographic accuracy for the Arabic words displayed in the syntactic treebank, most notably for ligatures, verse pause marks, and diacritic alignment. Previously a TrueType font was used to render Arabic words in dependency graphs, which did not always accurately represent the intricacies of the Quranic Uthmani script.

An extended tagset with finer-grained part-of-speech tags, including INT (particle of interpretation, ḥarf tafseer), CIRC (circumstantial usage of the particle waw, waw l-haliyah), COM (comitative usage of the particle waw, waw l-ma'iyah) and RSLT (result usage of the particle fa). In addition, for better consistency with traditional Arabic grammar, the NUM tag for numerical words has been replaced with ADJ (adjective) or N (noun) tags, depending on syntactic function and context.

Better natural language generation for automatic summaries of linguistic annotation. For example, when a first person object pronoun suffix is represented only by a terminal kasrah diacritic (instead of the more usual ya suffix), this is now correctly mentioned in the word-by-word annotation displayed online.

Links to updated academic publications on the Quranic Arabic Corpus: two LREC papers, an INFOS 2010 paper, a FAL book chapter, and a submission to LRE Journal, together with a link to an online review of the Quranic Arabic Corpus at Examiner.com. The full versions of these papers are now available as PDF downloads from the Quranic Arabic Corpus website.
These publications and articles explain in detail the original research contributions of the Quranic Arabic Corpus project.

Improved online documentation for the corpus, and additional sections in the online annotation guidelines, most notably a new detailed section on the different types of verb forms in Quranic Arabic morphology.

Enhanced morphological search for the Quran, including the ability to search on additional part-of-speech tags and linguistic features.

Version 0.3 of the reviewed morphologically annotated data is freely available for download from the Quranic Arabic Corpus website.

The Quranic Arabic Corpus is an open source project. Contributions or questions about the research are more than welcome. Please direct any correspondence to Kais Dukes, PhD researcher at the School of Computing, University of Leeds:

web: www.kaisdukes.com
e-mail: s...@leeds.ac.uk

END RELEASE NOTES
========================================