Re: [lingu-dev] lost postings[1/5]: [SoC] Grammar checker API

Matthew Strawbridge Thu, 01 Jun 2006 05:07:27 -0700

> > > I agree that determining the ends of sentences is non-trivial.
> > > However, I think that this is a good reason to do it once in OOo
> > > instead of each grammar checker having to figure it out manually. OOo
> > > already maintains a list of abbreviations (ending with .), so
> > > presumably this could be used. If the user adds custom abbreviations,
> > > these would then automatically be picked up by the sentence splitter,
> > > which wouldn't happen if each grammar checker implemented its own.
> 
> Just curious because I do not know about this:
> Where should the user add custom abbreviations in order to get
> recognized by the breakiterator?
> Is there already sth. like that or is this a kind of proposal of yours?


Tools, AutoCorrect, Exceptions already has a list of abbreviations.
If OOo had been responsible for splitting the text into sentences,
this list might have been useful. Since the consensus seems to be to
pass whole paragraphs to the grammar checker, the list probably isn't
relevant any more.
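
For the record, this is roughly what I meant by an abbreviation-aware
splitter. It is only a sketch in Python: split_sentences and the
abbreviation list passed to it are made-up names for illustration,
not anything that exists in OOo, and a real break iterator would need
to handle far more cases.

```python
import re

def split_sentences(paragraph, abbreviations):
    """Split after ., ! or ? followed by whitespace, unless the word
    ending at that point is a known abbreviation (hypothetical helper)."""
    known = {a.lower() for a in abbreviations}
    sentences = []
    start = 0
    for match in re.finditer(r"[.!?]+\s+", paragraph):
        # Last word of the candidate sentence, including its full stop.
        words = paragraph[start:match.end()].split()
        last_word = words[-1].lower() if words else ""
        if last_word in known:
            continue  # e.g. "Dr." does not end the sentence
        sentences.append(paragraph[start:match.end()].strip())
        start = match.end()
    tail = paragraph[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

text = "Dr. Smith arrived. He was late."
with_list = split_sentences(text, ["Dr."])
without_list = split_sentences(text, [])
```

With the abbreviation supplied, the text splits into two sentences;
without it, "Dr." is wrongly treated as a sentence of its own.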

> > > I haven't seen this mentioned explicitly, but I think there should be
> > > a menu option Tools, Grammar Check to launch the grammar checker to
> > > check through the whole document from beginning to end (as the spell
> > > checker can do).
> 
> Just to be a little more precise:
> It should start with the para first at the top of the current view then
> going on and wrap-around if necessary.
> If you are per chance in the mid of a large document you'd like to see
> some immediate results.

If there are to be two types of grammar checking -- real-time with 
underlining and a separate task for checking the whole document -- 
then I think it will be sufficient for the whole-document check to 
start at the beginning of the document. I suppose that there could be 
an option about whether to start at the current text insertion point 
or the beginning of the document; personally, I'd probably want to 
check the whole document once I'd finished writing it, so wouldn't 
use this extra option.
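
Either way, the two behaviours differ only in where the loop starts.
A minimal sketch in Python, assuming paragraphs are simply numbered
from 0 (check_order is a hypothetical name, not an existing
function):

```python
def check_order(paragraph_count, start=0):
    """Return paragraph indices in the order they would be checked,
    wrapping around to the top of the document after the last one."""
    return [(start + i) % paragraph_count for i in range(paragraph_count)]

from_top = check_order(5)        # whole-document check from the beginning
from_cursor = check_order(5, 3)  # start at paragraph 3, then wrap around
```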

> >> >>    8. The API should provide a paragraph (for example) to grammar
> >> >>    checker
> >> >>    and this one should return a list. If there is no mistake in this
> >> >>    paragraph, the list should be empty,  else the list should contain:
> >> >>       1. Where is the mistake in the paragraph (initial index + final
> >> >>       index).
> >> >>       2. A list of suggestions to correct that mistake (this list can
> >> >>       be empty if checker is not prepared to guess).
> >> >>       3. A comment about mistake, e.g. what a grammar book should say
> >> >>       about it.
> > > It might be a good idea to have two levels of comments -- one brief
> > > and one detailed. The view of the detailed portion could then be
> > > toggled on and off in the UI. Users could then see at a glance a
> > > single sentence describing the problem. If they needed more
> > > information, they could expand the view to show the detailed
> > > explanation, which might include a more detailed explanation of why
> > > the selected text is thought to be an error, examples of correct and
> > > incorrect use, and references for further information.
> 
> This could also be done by giving that short description as sth. like a
> header visually somewhat apart from the full description.

Agreed. However, I think it would be nice to be able to toggle the 
full descriptions on and off. If the same error is caught many times, 
the repeated long descriptions might be annoying (or, at least, users 
wouldn't re-read them each time).
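
To make the idea concrete, here is a sketch in Python of what one
entry in the returned list might hold. The class and field names
(GrammarError, brief, detail) are my own invention for illustration;
nothing here has been agreed for the API.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class GrammarError:
    start: int                      # initial index within the paragraph
    end: int                        # final index (exclusive in this sketch)
    suggestions: List[str] = field(default_factory=list)  # may be empty
    brief: str = ""                 # one-sentence summary, always shown
    detail: str = ""                # longer explanation, shown on request

def describe(error, show_detail=False):
    """Return the text the UI would display for one error."""
    text = error.brief
    if show_detail and error.detail:
        text += "\n" + error.detail
    return text

err = GrammarError(4, 6, ["is"],
                   brief="Verb does not agree with subject.",
                   detail="Use 'is' with a third-person singular subject.")
collapsed = describe(err)
expanded = describe(err, show_detail=True)
```

An empty list of GrammarError objects would mean that the paragraph
has no mistakes.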

> BTW: Do we all agree that it is sufficient to have that comment in the
> language of the sentence being checked only?
> That is if we have for example a French UI and check some English text
> the comment should be only in English. And if the next sentence would be
> German the comment will now be in German.
> 
> Otherwise it will probably get rather complicated here...

Good point, which I hadn't thought of. If someone writes, say, a
Latin grammar checker, should the comments be in Latin? I suppose
that the UI language could be passed as a parameter to the grammar
checker, which could either ignore it or return comments in that
language if it knows how.
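
In other words, something like the following sketch (Python;
pick_comment and the language-code dictionary are assumptions made up
for this example):

```python
def pick_comment(comments, text_lang, ui_lang=None):
    """comments maps a language code to the comment text in that language.
    Prefer the UI language when the checker has it; otherwise fall back
    to the language of the text being checked."""
    if ui_lang is not None and ui_lang in comments:
        return comments[ui_lang]
    return comments[text_lang]

comments = {"la": "Verbum deest.", "en": "A word is missing."}
for_english_ui = pick_comment(comments, "la", "en")
for_french_ui = pick_comment(comments, "la", "fr")   # no French available
```

A checker that only knows one language would simply ignore ui_lang.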

> >> >>  2. Grammar Checker API, future:
> >> >>
> >> >>    1. Let's suppose it's possible to manage several languages in a
> >> >>    text and there is a Language Guessing API. Then, when OpenOffice
> >> >>    discover language of a sentence, it automatically loads grammar
> >> >>    checker to correspondent language.
> > >
> > > As people have already mentioned, there may be more than one grammar
> > > checker loaded for any given language. This is particularly likely
> > > for my project: graviax. This tool is much simpler than most of the
> > > others -- it uses regular expressions and doesn't attempt to parse
> > > the sentences. However, because of this, it is easy for users to
> > > update the rules to match their own preferences (for example,
> > > publishers could create a rule set for their particular house style).
> > >
> > > Therefore, I would expect to have two English-language grammar
> > > checkers running at the same time: a heavyweight checker that can
> > > catch errors like "a red apples" (which is impossible to do reliably
> > > in graviax) and then graviax running as well (for example, to
> > > highlight cliches).
> 
> Just as info:
> Currently it is possible for the spell checker to have more than one
> implementation per language available.
> A word is considered to be Ok if it gets accepted by any of those spell
> checkers. (Also the order they get called can be defined in the UI.)
> 
> 
> > > I'm not sure how this would work in terms of squiggly lines. It would
> > > be nice if the user could set the colour of the line associated with
> > > each tool independently, but that might be overkill.
> 
> I think the idea of chaining grammar checkers should be Ok as well.
> Or am I missing something?
> Of course doing so is likely to have a much more negative impact on
> performance compared to spell checking where only a single word needs to
> be checked twice.
> 
> So I somewhat wonder if we should allow it.
> Several spell checkers won't be that bad. But if a user installs let's
> say 5 grammar checkers because he/she wants the ultimately best grammar
> checking this may result in serious performance problems when running in
> the background. Thus I'm somewhat unsure here.
> 
> Also what would be the rule to accept/reject a sentence?
> Should all grammar checkers report the sentence as correct or would it
> be sufficient if only one does so? The first is likely to be trouble for
> performance.

When people add extra spell checkers, they are trying to reduce the 
number of correct words that are incorrectly marked as mistakes, for 
example by adding a medical dictionary. So it makes sense that a word 
is OK if it is found in any of the spell checkers' dictionaries.

Conversely, I would think that people would add extra grammar 
checkers to catch more errors (much like having more than one virus 
scanner on a PC). Users should understand that doing this will affect 
performance. I think that all of the grammar checkers should be 
called, regardless of whether each finds any errors.

Take the following sentence, passed to three grammar checkers A, B
and C:

   She be a man eating fish.

A might not find anything wrong, based on its rules.
B might suggest "She" --> "He" and "be" --> "is".
C might suggest "be" --> "is" and "man eating" --> "man-eating".

I don't think that the "be" --> "is" change should be shown to the 
user twice, if we can avoid it. Perhaps the first grammar checker in 
the list should take precedence when the same error is found more 
than once (the other option would be to merge the comments they 
return, but I think this would be messy).
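
A sketch of the "first checker wins" rule in Python, using the
example above. The tuple layout (start, end, suggestion, comment) and
the name merge_results are assumptions for illustration only:

```python
def merge_results(results_per_checker):
    """results_per_checker is a list, in priority order, of the error
    lists returned by each checker. Every checker's results are used,
    but a repeated (start, end, suggestion) keeps only its first report."""
    seen = set()
    merged = []
    for results in results_per_checker:
        for start, end, suggestion, comment in results:
            key = (start, end, suggestion)
            if key in seen:
                continue  # an earlier checker already reported this
            seen.add(key)
            merged.append((start, end, suggestion, comment))
    return sorted(merged, key=lambda e: (e[0], e[1]))

sentence = "She be a man eating fish."
a = []
b = [(0, 3, "He", "B: pronoun"), (4, 6, "is", "B: verb agreement")]
c = [(4, 6, "is", "C: verb agreement"), (9, 19, "man-eating", "C: hyphen")]
merged = merge_results([a, b, c])
```

Here the user would see three errors, and the comment shown for "be"
would be the one from B, because B comes before C in the list.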

We also need to think about what happens to the underlining when 
several potential errors overlap (whether their source is one or more 
grammar checkers). For example, if a whole sentence is marked for 
some reason (for example, for being a fragment) then this will hide 
any other errors that occur within it.
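
Detecting that situation is at least straightforward. A sketch in
Python, assuming each error is reduced to a (start, end) pair;
hidden_errors is a hypothetical helper, and what the UI should then
do about the hidden errors is still an open question:

```python
def hidden_errors(spans):
    """Return the spans that lie entirely inside a different span,
    i.e. the ones whose underlining would be masked by it."""
    hidden = []
    for inner in spans:
        for outer in spans:
            if outer != inner and outer[0] <= inner[0] and inner[1] <= outer[1]:
                hidden.append(inner)
                break
    return hidden

# Whole 25-character sentence marked, plus two errors inside it.
spans = [(0, 25), (4, 6), (9, 19)]
masked = hidden_errors(spans)
```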

> > > My suggestion would be to create the absolute simplest API to start
> > > with. Use full stops to determine sentence breaks, even though some
> > > of these will be wrong. Just add a menu item and dialog box for
> > > working through the whole document -- don't do real-time checking
> > > yet. Check whole document against a single grammar checker in a
> > > single language.
> 
> I think the API should be up for the complete task and be defined in the
> following weeks if possible. Otherwise we'll have about the same
> discussion next year and maybe required to modify the API or maybe
> several grammar checkers existing by that time as well.
> 
> Of course the actual first step integration of already existing grammar
> checkers may use that assumption. It will just be a first step
> approximation on the way to the final implementation.

I take your point, but I subscribe to the 'do the simplest possible
thing that works' approach. Various complications are bound to arise,
and they will be caught earlier and be easier to fix under an
incremental approach than under a big-bang model.

I don't think we should be in too much of a rush with this. The worst 
thing we could do would be to rush out a solution that people try, 
don't like and switch off. Once you've lost people in this way, it's 
hard to win them back.

> One other question:
> I once noticed when toying with the MS grammar checker that if there are
> too many errors in a text (e.g. because the language attribute is
> inappropriate) it displays a message like "too much errors encountered.
> Maybe it is foreign text..." and stops grammar checking and turns the
> display of errors off.
> I don't know the reason for this. Is it a possible performance problem
> or do they just not want to have that much text marked as wrong?
> The question is: Do we need something like this as well?

Another interesting thought. I guess it probably comes under 
'premature optimisation', but would be worth bearing in mind for the 
future.

Best wishes
Matthew

-- 
Matthew Strawbridge   http://www.philoxenic.com   (01353) 663650
Bespoke software development and freelance technical copy editing
{ A year spent in artificial intelligence is enough to make one 
believe in God. }

