Re: [lingu-dev] [SoC] Grammar checker API

Thomas Lange Thu, 15 Jun 2006 08:53:46 -0700

Hi Bruno, :-)

>     > Another thing I'm thinking about: should automatic checking always use
>     > the sentence method? I think yes, since the user doesn't need to finish
>     > a paragraph for autochecking to start. What do you think?
> 
>     Would be acceptable to me.
>     But keep in mind that the single most important thing the user wants is
>     that the very same text parts an interactive grammar check (starting with
>     that paragraph) would find get marked by the automatic grammar checking.
>     Everything else will be quite irritating and will probably result in
>     automatic grammar checking being thought of as more or less useless.


Having thought about it again, I found that the 'sentence method' will
not be acceptable if you intend to pass only that single sentence to
the grammar checker. This is because we were told that for the Asian
languages it is essential that the grammar checker is allowed to
detect the end-of-sentence on its own. If you pass only that single
sentence (as determined by the breakiterator, for example) this may be
insufficient, since the grammar checker cannot look further and thus
is unable to move the end-of-sentence some more characters towards
the end.

Thus the only reliable thing is to use the very same units for automatic
checking and interactive checking. And as we already agreed this would
be to pass the whole paragraph and to indicate the current sentence to
be checked by indices.
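
A minimal C++ sketch of that calling convention, under stated
assumptions: SentenceBounds and adjustSentenceEnd are hypothetical
names, not an agreed API, and the "checker" is a toy that treats '.'
as the only sentence terminator. It only illustrates why the checker
needs the whole paragraph: it can move a proposed end-of-sentence
further towards the end.

```cpp
#include <cstddef>
#include <string>

// Hypothetical type, not part of any agreed API.
struct SentenceBounds {
    std::size_t start;  // index of the first character of the sentence
    std::size_t end;    // one past the last character (caller's proposal)
};

// Toy checker-side helper: the proposed end is a hint only. If it does
// not already sit just after a '.', scan forward in the paragraph to the
// next '.'. This is impossible if only the isolated sentence was passed.
SentenceBounds adjustSentenceEnd(const std::string& paragraph,
                                 SentenceBounds proposed) {
    std::size_t end = proposed.end;
    if (end > paragraph.size()) end = paragraph.size();
    if (end > proposed.start && paragraph[end - 1] == '.') {
        return {proposed.start, end};  // proposal already ends a sentence
    }
    std::size_t dot = paragraph.find('.', end);
    end = (dot == std::string::npos) ? paragraph.size() : dot + 1;
    return {proposed.start, end};
}
```

The same function would be called for automatic and interactive
checking, so both see identical units.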


> If the user asks for interactive checking in a paragraph, we could reset
> all errors in this paragraph (found by automatic checking) and recheck it
> with interactive checking (sending the whole paragraph to the grammar
> checker). What do you think?

What do you mean by 'reset all errors in this paragraph'?
Is there someone to keep track of them?

I think the grammar checker itself does not want to do this.
It will probably only care about the current paragraph, aside from
possibly keeping some state information from all the previous paragraphs...

And the document will likely at most keep track of two things:
- for optimization: whether the paragraph was already checked and
  everything was fine. This would be used to not check that paragraph
  unless it gets modified. (We still need to discuss the usefulness of
  this!)
- the positions and just maybe(!) the text parts being underlined.

I do not see the need to reset the errors right now.
If nothing was wrong, nothing needs to be done, but it should still be
fine to check that paragraph.
And if something was wrong but nothing gets edited, nothing needs to be
done either. If sth. was corrected, then this will result in the paragraph
being modified, and whatever list of errors may be kept by the document
needs to be updated in the same way it always needs to be done when the
text was edited.
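
As an illustration only, the document-side bookkeeping described above
could look like the following C++ sketch. ParagraphCheckInfo and its
members are hypothetical names; whether the 'checked and clean' flag is
worth having is still open, as said.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical per-paragraph record kept by the document.
struct ParagraphCheckInfo {
    // True if the paragraph was checked, nothing was found, and it has
    // not been modified since.
    bool checkedClean = false;
    // Underlined ranges as [start, end) character positions.
    std::vector<std::pair<std::size_t, std::size_t>> marks;

    // Store the outcome of one check run over the paragraph.
    void storeResult(std::vector<std::pair<std::size_t, std::size_t>> found) {
        marks = std::move(found);
        checkedClean = marks.empty();
    }

    // Any edit invalidates what is known about the paragraph. No
    // separate 'reset errors' step is needed beyond this.
    void onModified() {
        checkedClean = false;
        marks.clear();
    }

    // The optimization: a clean, unmodified paragraph may be skipped.
    bool canSkipCheck() const { return checkedClean; }
};
```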


I do agree though that there is the need to reset all found errors at
a different place. And that will be the grammar checker itself:
As was seen in the discussion, the grammar checker(s) should be allowed
to build state information
a) for the whole document (e.g. to check consistent use
   of e-mail vs. email)
b) for a single paragraph, since the content of previous sentences
   may affect the result for following sentences.
Thus the API must always tell the grammar checker when a new paragraph
gets processed in order to allow it to throw away the old state
information. This implies that care must be taken that if two
paragraphs get checked simultaneously (which can't be avoided
considering that everyone can use the API at any time) their state
information will not be mixed up.
A similar problem exists on the document level.

I'll come back to these problems later on.
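
Until then, one possible shape of the paragraph-level part, as a hedged
C++ sketch: the state is keyed by a paragraph id, so interleaved checks
of two paragraphs cannot mix it up. StatefulChecker, ParagraphState and
the integer ids are assumptions made for illustration; real concurrent
use would additionally need locking, which is left out here.

```cpp
#include <map>
#include <string>

// Hypothetical state a checker might build while walking one paragraph.
struct ParagraphState {
    int sentencesSeen = 0;
};

// Sketch of a checker that keeps its state per paragraph id.
class StatefulChecker {
public:
    // Called by the API when a paragraph starts being processed; throws
    // away whatever state was left from an earlier pass over that id.
    void beginParagraph(int paragraphId) {
        states_[paragraphId] = ParagraphState{};
    }

    // Stand-in for checking one sentence: only updates the state and
    // returns how many sentences of this paragraph were seen so far.
    int checkSentence(int paragraphId, const std::string& /*sentence*/) {
        return ++states_[paragraphId].sentencesSeen;
    }

    // Called when the paragraph is finished; releases its state.
    void endParagraph(int paragraphId) { states_.erase(paragraphId); }

private:
    std::map<int, ParagraphState> states_;
};
```

Document-level state could be handled the same way with a document id
as an additional key.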


> Normally autochecking is done automatically and a user starts
> interactive checking when he finishes his text, so auto-checking would
> be a text development part (since it helps a user when he is writing)
> and interactive-checking would be a text revision part (since it
> provides detailed descriptions about rules). For this reason I think
> that both of them are useful.
> Correct me if I'm wrong ok?

Correct!
Of course interactive checking may be started not only when the whole
document is finished, but likely also when a paragraph has been edited
to its supposed final state.


> It sounds better and possible now, but some points are unclear to me:
> - How will we define a language attribute of a word? Does it already
> exist in UNO?

Yes! (see com.sun.star.lang.Locale)
And as mentioned elsewhere there are three of them.
(But this distinction is at the document level, and you should have no
need at all to handle them directly.)

For use in the API you will probably only have need for
the language the text is to be checked with, and thus only need
one attribute.


> - How will we determine a "main language" of a sentence? Are we
> supposing a language guesser?

As my professors usually said:
"This is left as an exercise."
For the dummy implementation it will be fine if you either
- hardcode them for fixed sample text
- use the language of the first word
- use the language with the most words
After all, this is not the focus of interest for your SoC.
For the sample implementation you should make everything as stupid-simple
as possible to keep things easy. Otherwise I fear we may easily run
into time problems. Your main goal is the API and to show that it is
functional. Spending time on other issues is a distraction; use any simple
assumption or fixed scenario that can be used to speed things up.
If you have time left (which I guess will not be too much) you
can elaborate on those things later on.
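
To show how little is needed, here is a C++ sketch of the 'language with
the most words' option. mainLanguage is a hypothetical helper, and plain
strings stand in for com.sun.star.lang.Locale to keep the example
self-contained.

```cpp
#include <map>
#include <string>
#include <vector>

// Each entry is the language tag of one word of the sentence, in order.
// Plain strings stand in for com.sun.star.lang.Locale (assumption).
std::string mainLanguage(const std::vector<std::string>& wordLanguages) {
    std::map<std::string, int> counts;
    std::string best;
    int bestCount = 0;
    for (const std::string& lang : wordLanguages) {
        int n = ++counts[lang];
        // Strictly greater: on a tie, the language that reached the
        // count first wins.
        if (n > bestCount) {
            bestCount = n;
            best = lang;
        }
    }
    return best;  // empty string for a sentence without words
}
```

The 'language of the first word' option is even simpler: return the
first entry.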


>     Thomas->Bruno:
>     Mathias and I have talked about the model to use and some other details.
>     The results are as following:
>     - Your dummy implementation should use C or C++ to avoid the overhead
>     of involving a UNO bridge for a different language binding.
> 
>  
> Fine, It should be faster.
> 
>     - Since there was no discussion taking place about the pros and cons
>     of the actual models of iteration to use, which were
>        1) have it done by each application's core
>           similar to current spell checking
>        2) have everything done by the component that comes along
>           with the grammar checker (as currently done by CoGrOO)
>     and
>        3) having a mediating object that takes care of iterating through
>           the document, having it check the text by the grammar checker,
>           raising a dialog to edit the text if necessary and writing the
>           modified text back to the document.
>     we took this in our hands.
>     We agreed to use the model with the separate object that calls the
>     actual grammar checker and obtains the paragraphs to be checked
>     from the document. Also for the dialog to modify the text: it should
>     be a different implementation with an API of its own in order
>     to have the UI properly separated from the grammar checker and
>     iteration object.
> 
>  
> YES! Great, we (me and menezes) were discussing this before; we divided
> the whole process into 3 components:
>  
> - API that provides text blocks and does the user interface.
> - Grammar checker that receives pure strings, checks them and returns an
> object.
> - A middleware we called "driver", similar to what's done with databases.
> Its function will be connecting the API to the grammar checker (or grammar
> checkers). This is especially useful since each developer can create its
> own driver to work with its own grammar checker, without needing to
> rewrite the grammar checker. What do you think about it?

What do you mean by 'each developer can create its own driver'?
From my point of view a main goal of that Driver/Iterator was to unify
the behaviour of iterating and accessing/modifying the text and using
the UI in order to show the same behaviour for all grammar checkers.
And we also wanted to spare the developers the need to implement that
thing on their own.

So I'm wondering why you now like to point out that this should be possible...



>     We think the initial sequence should be something like this:
>     -- The document should notify the iterating object that there is
>         sth. to be done by providing the paragraph to check (e.g. via
>         an XTextRange or some other UNO interface that allows text access).
>         Having that very paragraph processed by means of calling the
>         grammar checker API and maybe the dialog. The iterating object
>         (let's call it for short Iterator from now on) should ask the
>         document for the next paragraph and so on until the document is
>         processed.
>         (The intra-paragraph sentence iteration should be done by the
>         Iterator, of course having the respective grammar checker
>         determine the end of sentence wherever possible).
>         By having the Iterator asking for the next paragraph instead of
>         the document pushing all the paragraphs to the Iterator we limit
>         the possibility of piling up paragraphs to be checked and possibly
>         being already deleted/moved etc. when their turn comes up.
>         Of course the problem is potentially still there, but it
>         should be a bit less likely to actually happen.
> 
>  
> You mean create a spool of paragraphs to be checked?

If you mean there will be a list, queue, array or whatever kind of set
of paragraphs to be checked, kept by whomever, the answer is: No!
The idea was to only provide the initial paragraph to start with and,
when that is processed, to have the Iterator/Driver ask for the next
single paragraph to check.
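
A rough C++ sketch of this pull model, with everything in it being an
assumption for illustration: Document is a stand-in backed by a vector
of strings (real code would go through UNO text interfaces), and
processFrom plays the role of the Iterator/Driver. The point is that
the Iterator holds only its current position, never a queue.

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the document.
class Document {
public:
    explicit Document(std::vector<std::string> paragraphs)
        : paragraphs_(std::move(paragraphs)) {}

    // Reports the paragraph following `index`, or returns false if there
    // is none. The lookup happens at call time, so paragraphs deleted in
    // the meantime are never handed out.
    bool nextParagraph(std::size_t index, std::size_t& next) const {
        if (index + 1 >= paragraphs_.size()) return false;
        next = index + 1;
        return true;
    }

    // Throws std::out_of_range if the paragraph does not exist.
    const std::string& text(std::size_t index) const {
        return paragraphs_.at(index);
    }

    void removeLast() {
        if (!paragraphs_.empty()) paragraphs_.pop_back();
    }

private:
    std::vector<std::string> paragraphs_;
};

// Iterator side: start with one paragraph, then ask for the next single
// one each time. Returns the number of paragraphs processed.
std::size_t processFrom(const Document& doc, std::size_t start) {
    std::size_t processed = 0;
    std::size_t current = start;
    for (;;) {
        (void)doc.text(current);  // placeholder for sentence-wise checking
        ++processed;
        std::size_t next = 0;
        if (!doc.nextParagraph(current, next)) break;
        current = next;
    }
    return processed;
}
```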

BTW: We still need to discuss in the ML whether, e.g. for the sake of
document-level state information, previously checked paragraphs
that have no errors need to be checked as well when the whole document
is checked.


> It sounds ok, it
> should be possible since we are dealing with interactive and we can
> manage the actual paragraph completely, it is a state machine in my
> point of view. Let me complement with this pseudo algorithm:
>  
> send a paragraph for checking and wait a while for results (we have to
> manage exceptions here).

I'm missing the sentence-wise iteration here as the top level loop...

> create a list of error objects (if it finds errors)
>  
> while (error)
> {
>     show the wrong sentence to user;

The above implies that you already have all the errors of the whole
paragraph. I thought we agreed not to do so because the API will get
more complicated. And you may also be required to call different grammar
checkers.
Please correct me where I'm wrong.
Maybe I'm just missing an important point right now...
(Sometimes that happens. ^^°)

I would have said the process to be like this:

while (paragraph not finished)
{
        aBounds = getBoundsForNextSentence
        aErrorList = check( paragraph, aBounds )
        if (aErrorList not empty)
        {
                startDialog( paragraph, aBounds, aErrorList )
        }
}
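
The same loop as a compilable C++ sketch. Bounds, GrammarError, Hooks
and checkParagraph are hypothetical names; the three callbacks are
placeholders for the sentence iteration, the grammar checker and the
dialog, none of which is designed yet. The sketch also ignores that the
dialog may modify the paragraph text.

```cpp
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

struct Bounds {
    std::size_t start;
    std::size_t end;  // one past the last character of the sentence
};

struct GrammarError {
    std::size_t start;
    std::size_t end;
    std::string comment;
};

// Hypothetical placeholders for components still to be designed.
struct Hooks {
    // Returns the bounds of the sentence starting at the given position.
    std::function<Bounds(const std::string&, std::size_t)> nextSentence;
    // Checks one sentence, but receives the whole paragraph as context.
    std::function<std::vector<GrammarError>(const std::string&, Bounds)> check;
    // Raises the dialog for the errors of one sentence.
    std::function<void(const std::string&, Bounds,
                       const std::vector<GrammarError>&)> startDialog;
};

// Sentence-wise top level loop; returns the number of sentences visited.
std::size_t checkParagraph(const std::string& paragraph, const Hooks& hooks) {
    std::size_t pos = 0;
    std::size_t sentences = 0;
    while (pos < paragraph.size()) {
        Bounds b = hooks.nextSentence(paragraph, pos);
        if (b.end <= pos) break;  // guard against a non-advancing iterator
        std::vector<GrammarError> errors = hooks.check(paragraph, b);
        if (!errors.empty()) hooks.startDialog(paragraph, b, errors);
        pos = b.end;
        ++sentences;
    }
    return sentences;
}
```

Because only one sentence is checked per call, each call could go to a
different grammar checker.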


And most of the following would be part of the dialog, which
should be a separate component with an API of its own:

>     show a guessing if its possible;
>     show a detailed rule comment if its possible;
>  
>     ask the user about correct or ignore;
>    
>     if user change
>     {
>         if guessing exists
>         { replace sentence with the guessing }

The user is also allowed to do manual changes and not use
any of the suggestions even if there are suggestions!

Please compare with the spellcheck dialog:
"Next" advances within the sentence to the next error.
If we advance past the end-of-sentence and nothing was changed
we continue with checking the next sentence.
If anything was changed manually it should only advance to the next
sentence if the "Change" button was pressed.
Well, that is of course not all in detail but the behaviour should
be sth like that.


>         else (if guessing not exists)
>         { ask the user for a new sentence and replace the previous sentence
> with it }
>     }

As explained above the user must always be allowed to modify the sentence.

>     else (if user chooses to ignore)
>     {
>         // we somehow have to flag this sentence in case of rechecking,
>         // it would probably bore the user if he asks to ignore and later
>         // he rechecks it again and it pops up again... my idea is creating
> a list of

From my point of view ignore (be it "ignore once" or "ignore all", and
the same with "change all") will always be word based only! I don't see
any point in ignoring or changing complete sentences automatically.
It is also not likely to have the very same sentence more than once
in a document.
Thus I see no need or advantage in flagging a sentence.
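
For "ignore all", a word-based list is all that would need to be kept,
roughly like this C++ sketch (IgnoreList is a hypothetical name; "ignore
once" needs no storage at all):

```cpp
#include <set>
#include <string>

// Hypothetical word-based ignore list for one checking session.
class IgnoreList {
public:
    // "Ignore all": remember the word for the rest of the session.
    void ignoreAll(const std::string& word) { words_.insert(word); }

    // True if errors reported for this exact word should be skipped.
    bool isIgnored(const std::string& word) const {
        return words_.count(word) != 0;
    }

private:
    std::set<std::string> words_;
};
```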



Mathias and I discussed the model somewhat more today, and how the
need to allow for preserving paragraph and document state information
will be reflected in the API.

I will write a longer posting about this tomorrow.
This will give you a more detailed idea of how we think the interaction
between the objects must work and thus a better base to design the API.


Regards,
Thomas

