Re: Statistical Dependency Parsing of Quranic Arabic

Kais Dukes Thu, 16 Sep 2010 06:54:27 -0700

Hi Waleed,

At the moment the work I am doing on parsing the Quran statistically is very
much still in the experimental stages, although I can say that I have now
replaced the previous rule-based parser with this new statistical parser to
construct the treebank. With regard to your questions:


1) The parser is similar to MaltParser (it uses a shift/reduce stack/queue
algorithm), but has been designed with a particular grammatical formalism in
mind (see http://corpus.quran.com/documentation/dependencygraph.jsp), i.e.
it supports non-terminal phrase nodes directly.

2) Nothing is available for download right now. This is currently an
internal project to assist with construction of the Quranic Arabic Treebank,
but hopefully I will make it available for download at some stage. I later
plan to investigate applying the parsing algorithm to other datasets /
languages to see the results.

3) I would be interested to know your results from using MADA, MaltParser,
MSTParser, ... or other dependency parsers. How are your own research and
results coming along?
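
As a rough illustration of the shift/reduce stack/queue idea in point 1, here
is a minimal Python sketch of a generic arc-standard transition parser. It is
not the parser described above: it has no phrase nodes and no trained
classifier, and the names parse and make_oracle, as well as the toy head list,
are hypothetical. The choose function stands in for the statistical
classifier; in this sketch it is an oracle that reads the gold heads, and it
assumes a projective tree.

```python
from typing import Callable, List, Set, Tuple

Arc = Tuple[int, int]  # (head index, dependent index); index 0 is an artificial root
Chooser = Callable[[List[int], List[int], Set[Arc]], str]


def parse(n_words: int, choose: Chooser) -> Set[Arc]:
    """Arc-standard shift/reduce parsing over word indices 1..n_words."""
    stack: List[int] = [0]
    queue: List[int] = list(range(1, n_words + 1))
    arcs: Set[Arc] = set()
    while queue or len(stack) > 1:
        action = choose(stack, queue, arcs)
        if action == "SHIFT":    # move the next word from the queue to the stack
            stack.append(queue.pop(0))
        elif action == "LEFT":   # top of stack becomes head of the item below it
            dep = stack.pop(-2)
            arcs.add((stack[-1], dep))
        elif action == "RIGHT":  # item below becomes head of the top of stack
            dep = stack.pop()
            arcs.add((stack[-1], dep))
        else:
            raise ValueError(f"unknown action: {action}")
    return arcs


def make_oracle(heads: List[int]) -> Chooser:
    """Hypothetical stand-in for a trained classifier: picks actions from gold heads.

    heads[i] is the gold head of word i; heads[0] is unused.
    Assumes the gold tree is projective.
    """
    def choose(stack: List[int], queue: List[int], arcs: Set[Arc]) -> str:
        if len(stack) >= 2:
            top, below = stack[-1], stack[-2]
            if below != 0 and heads[below] == top:
                return "LEFT"
            # Only attach top to its head once top has collected all its dependents.
            has_all_deps = all((top, d) in arcs
                               for d in range(1, len(heads)) if heads[d] == top)
            if heads[top] == below and has_all_deps:
                return "RIGHT"
        return "SHIFT"
    return choose


# Toy three-word sentence: word 1 depends on word 2, word 2 on word 3, word 3 on the root.
toy_heads = [0, 2, 3, 0]
toy_arcs = parse(3, make_oracle(toy_heads))
print(sorted(toy_arcs))
```

In a data-driven parser of this general kind, the oracle is only used to
generate training examples; at parsing time choose would be a classifier
predicting the next action from features of the stack and queue.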

Looking forward to hearing from you.

Kind Regards,

-- Kais

On Thu, Sep 16, 2010 at 1:39 PM, Waleed Oransa <wora...@gmail.com> wrote:

> Hello Kais,
>
> This is a very good enhancement, with very good results. May I know what is
> the difference between your parser and other statistical parsers like
> MADA?
> Also, is yours available for download, or do you plan to do that in the
> future?
>
> I am working on MT and Automatic Tashkeel for Arabic and I am interested in
> using your parser in this research.
>
> Best regards,
> Waleed
>
> On Sun, Sep 12, 2010 at 12:54 PM, Kais Dukes <k...@kaisdukes.com> wrote:
>
>> Hello Eric,
>>
>>
>> Some very exciting news … well at least exciting to me :-) Please accept
>> my apologies for not being very responsive on e-mail recently, but I had
>> locked myself in my study most evenings after coming home from work to
>> concentrate on something that I have found most interesting. For the past 12
>> months, development of the Quranic Arabic Dependency Treebank (
>> http://corpus.quran.com/treebank.jsp) has been slow and has involved me going
>> through the following steps repeatedly:
>>
>>
>> 1. Use a hand-written rule-based parser to produce an initial draft
>> syntactic analysis of a verse of the Quran, e.g. see:
>> http://corpus.quran.com/treebank.jsp?chapter=67
>>
>>
>> 2. Correct the output of the parser and add the resulting proofread verse
>> to the treebank.
>>
>>
>> 3. Potentially improve the parser’s accuracy by reviewing its rules
>> against the new larger set of data in the Treebank. Improving the
>> hand-written parser has been a costly exercise, involving the addition of
>> new grammar rules and refining these many times over. However, the parser
>> had performed well. Run against the current draft treebank covering approx.
>> 20% of the Quran, the rule-based parser is 78.79% accurate in terms of its
>> automatic grammatical analysis using traditional Arabic dependency grammar:
>>
>>
>> *Rule-based parser ... F-measure 78.79%* (precision=90.13%,
>> recall=69.99%)
>>
>>
>> Over the last few weeks I have been looking into moving away from the
>> rule-based parser and starting to use a probabilistic parser, trained
>> statistically via machine learning. This new parser automatically reads the
>> existing treebank and "learns" how to perform syntactic analysis for the
>> rest of the Quran automatically. Amazingly, I am very excited to announce
>> that I have found a way to recast the problem of syntactic analysis in
>> traditional Arabic grammar as a statistical classification problem
>> (following a similar idea to Nivre’s dependency parsing algorithm). The
>> results for the new parser using machine learning are:
>>
>>
>> *Statistical parser ... F-measure 87.87%* (precision=90.02%,
>> recall=85.82%)
>>
>>
>> Not only is this a big jump in accuracy (from 79% to 88%), the parser only
>> takes 15 seconds to train on the existing Treebank, compared to many months
>> of development time for the rule-based parser, refining hand-crafted
>> constraint dependency rules. I am very excited about this! Immediately, what
>> comes to mind is:
>>
>>
>> 1) We are now using a data-driven statistical parser using
>> machine learning, with accuracy comparable to state-of-the-art statistical
>> parsers for dependency grammar.
>>
>>
>> 2) The improved accuracy of the new parser means that continuing to
>> develop the syntactic treebank will be quicker since the resulting output is
>> now much more accurate. From reviewing the new syntactic analyses, they
>> also appear to be more consistent.
>>
>>
>> 3) Completion of the treebank should also now move faster because I have
>> to spend less effort on the time-consuming task of building a rule-based
>> parser by hand, and I can spend more time on ensuring accuracy by
>> proofreading the automatic syntactic analyses.
>>
>>
>> 4) This should lead to a stronger journal paper submission on statistical
>> dependency parsing of Quranic Arabic. In fact, I am so excited about this
>> that I am keen to start working on this paper as soon as I have got the FAL
>> submission out of the way.
>>
>>
>> 5) I now intend to rework the PhD project plan to include this updated
>> information.
>>
>>
>> Looking forward to hearing from you! I hope it's okay that I have CC'd the
>> comp-quran mailing list; I would be keen to hear from others who have an
>> interest in, or experience with, statistical parsing. Any comments are most
>> welcome.
>>
>>
>> Kind Regards,
>>
>>
>> -- Kais
>>
>>
>
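
The F-measure figures quoted in the thread above can be checked against the
stated precision and recall. The short Python sketch below assumes the
figures are the balanced F-measure (F1, the harmonic mean of precision and
recall); the thread does not say so explicitly, but the numbers are
consistent with that reading. The function name f_measure is hypothetical.

```python
def f_measure(precision: float, recall: float) -> float:
    """Balanced F-measure (F1): harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Precision and recall (in percent) as quoted in the thread.
rule_based = f_measure(90.13, 69.99)
statistical = f_measure(90.02, 85.82)

print(f"Rule-based parser:  {rule_based:.2f}")   # 78.79
print(f"Statistical parser: {statistical:.2f}")  # 87.87
```

Precision is almost unchanged between the two parsers (90.13% vs. 90.02%), so
the gain in F-measure comes almost entirely from the higher recall (69.99% to
85.82%).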
