[lingu-dev] SoC grammar checking model

Bruno Sant'Anna Mon, 19 Jun 2006 03:36:36 -0700

Hi Bruno and all,

I'm going to describe how Mathias and I think the various objects should
interact in the course of grammar checking.

The objects available are:
- several documents
- various grammar checker implementations
- an iterating object (the Iterator) that mediates between the documents
to be checked and the actual grammar checkers. It may also trigger a
dialog to show/edit the current sentence to be corrected.
- some clients that want to have a specific document or part of a document
checked. The definition of a client is as simple as: an instance that
wants something to get grammar checked.
- The UI to edit the text should be a component with its own API so that
it can easily be replaced with a different implementation.
->Bruno: For your dummy implementation this need not actually be a
separate component. But be sure that it has a separate API that allows
for such an implementation.

The problem we wanted to solve was specifically how to allow each grammar
checker to maintain state information about the document and the current
paragraph to be checked. (We guessed that no one wants to go through the
trouble of keeping state information for an unknown number of paragraphs.)
Also, the state information must not get mixed up or broken because the
same document or paragraph gets checked by different clients at the same
time, or because various paragraphs of the same document are checked at
the same time (be it by a single client or by different clients).

We also wanted to allow for the later possibility of implementing the
actual grammar checking API in C only, because this would make it platform,
application and compiler independent. Thus at some later point we can
introduce such a C API, and a grammar checker implementing it could be used
by OpenOffice and other applications as well. Of course the specific
application is likely to be required to wrap that C API.
->Bruno: That is nothing you need to care about now at all.
The only meaning it will have for you is that aside from UNO strings
(which can't be helped) you must only use basic UNO types or structs in
the API definition for the grammar checkers. In particular you must not
pass any UNO reference as argument of a function call to the grammar
checker.

->Bruno:
There must be only a single instance of the Iterator object at all. Thus
you should make it a 'one instance' service (the component equivalent of a
singleton). This is necessary because it is the instance that controls the
whole process of grammar checking, and also because it is required to
convert some UNO interface references to a unique(!) ID to be used when
calling the grammar checker API.
And thus there must not be more than one instance of this object!

And all clients that want to have a text checked should make use of that
Iterator.

In order for the actual grammar checker to know about the need for, and the
actual lifetime of, state information we need to have 4 functions in its
API to indicate the start and end of it.
It could probably be as simple as:
- startOfDocumentStateInfo( nDocId )
- endOfDocumentStateInfo( nDocId )
- startOfParaStateInfo( nDocId )
- endOfParaStateInfo( nDocId )
where nDocId will be one of those unique IDs generated by the Iterator.
Since we want to keep state information for at most one paragraph per
document, it follows that startOfParaStateInfo must not be called twice
with the same ID without calling endOfParaStateInfo first.
->Bruno: I have not yet thought about possible pros and cons of having the
start... functions only and implying that a start... always induces the
discarding of the current respective state information...

Of course if a grammar checker does not need to keep state information
those functions will have an empty implementation for that checker.
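
To make the lifetime rules a bit more concrete, here is a minimal sketch in
Python (used only for brevity; the real API would be UNO or, later, C). Only
the four function names come from the proposal above. The class names, the
dictionaries and the decision to raise an error on a double start are
assumptions made for illustration.

```python
class StatefulChecker:
    """Hypothetical checker that keeps document and paragraph level state."""

    def __init__(self):
        self.doc_state = {}   # nDocId -> document level state
        self.para_state = {}  # nDocId -> state of the one paragraph in progress

    def startOfDocumentStateInfo(self, nDocId):
        self.doc_state[nDocId] = {}

    def endOfDocumentStateInfo(self, nDocId):
        # Closing the document also invalidates any paragraph state for it.
        self.doc_state.pop(nDocId, None)
        self.para_state.pop(nDocId, None)

    def startOfParaStateInfo(self, nDocId):
        # Assumption: a second start without an end is treated as an error.
        if nDocId in self.para_state:
            raise RuntimeError("paragraph state already active for this ID")
        self.para_state[nDocId] = {}

    def endOfParaStateInfo(self, nDocId):
        self.para_state.pop(nDocId, None)


class StatelessChecker:
    """Checker without state: all four functions are empty."""

    def startOfDocumentStateInfo(self, nDocId):
        pass

    def endOfDocumentStateInfo(self, nDocId):
        pass

    def startOfParaStateInfo(self, nDocId):
        pass

    def endOfParaStateInfo(self, nDocId):
        pass
```

With the "start... functions only" variant mentioned above, the RuntimeError
branch would instead silently replace the existing entry.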

It is the Iterator's responsibility to call those functions in line with
how it iterates through the document, and thus to ensure that the state
information is always valid.

A client will start grammar checking for a document by calling the Iterator
with a call like:
Iterator->startChecking( XInterface xC, XTextDocument xDoc, XText xPara, ...)
where xC denotes an interface reference of the client that can later on be
used to identify the client to the Iterator (e.g. by passing it as argument
whenever it needs to be clear which client is calling the Iterator).
xDoc is used for a similar purpose. It also gives access to the document to
be checked. The Iterator is required to keep a map of all the documents in
use and must map this reference to the above mentioned nDocId that is used
to identify a specific document when the grammar checker is called.
And xPara will be a reference to the XText interface (or a similar
interface that can be used) to access the initial paragraph to be checked.

In return (in the execution scope of the very same startChecking function)
the Iterator will register itself as listener to the document to get
notified when it is closed, thus allowing the Iterator to call the
endOfDocumentStateInfo function for the grammar checkers. Thus the document
level state info can be maintained across several startChecking calls by
the respective grammar checker.
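
The bookkeeping described here could look roughly like the following Python
sketch. FakeDocument and its add_close_listener method are stand-ins for the
UNO document reference and its listener mechanism, and RecordingChecker only
logs calls; none of these names are part of the proposal. Calling
startOfDocumentStateInfo on the first startChecking for a document is also
an assumption, since the text does not say when it happens.

```python
import itertools


class FakeDocument:
    """Stand-in for a document that can notify listeners when it closes."""

    def __init__(self):
        self._close_listeners = []

    def add_close_listener(self, callback):
        self._close_listeners.append(callback)

    def close(self):
        for callback in list(self._close_listeners):
            callback(self)


class RecordingChecker:
    """Stand-in checker that records the document level calls it receives."""

    def __init__(self):
        self.calls = []

    def startOfDocumentStateInfo(self, nDocId):
        self.calls.append(("start", nDocId))

    def endOfDocumentStateInfo(self, nDocId):
        self.calls.append(("end", nDocId))


class Iterator:
    def __init__(self, checkers):
        self._checkers = checkers
        self._next_id = itertools.count(1)
        self._doc_ids = {}  # document reference -> unique nDocId

    def startChecking(self, xC, xDoc, xPara):
        nDocId = self._doc_ids.get(xDoc)
        if nDocId is None:
            # First request for this document: assign an ID, listen for
            # close, and let the checkers set up document level state.
            nDocId = next(self._next_id)
            self._doc_ids[xDoc] = nDocId
            xDoc.add_close_listener(self._on_document_closed)
            for checker in self._checkers:
                checker.startOfDocumentStateInfo(nDocId)
        return nDocId  # paragraph processing itself is omitted here

    def _on_document_closed(self, xDoc):
        nDocId = self._doc_ids.pop(xDoc)
        for checker in self._checkers:
            checker.endOfDocumentStateInfo(nDocId)
```

Two startChecking calls for the same document get the same nDocId, so the
checker sees one start/end pair per document rather than one per request.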

After that the Iterator will process the initial paragraph by calling the
grammar checker API and the API for the UI when needed.
Once the paragraph has been checked, it will call the respective document
via the xDoc reference it kept and ask for the next paragraph to be
checked. That paragraph may be returned e.g. as another XText reference,
which will be empty if everything that needed to be checked has been
checked. (Thus aside from the initial paragraph, the order in which
paragraphs are checked, and whether a specific paragraph gets checked at
all, will be up to the document.)
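
As a rough illustration of that loop (Python, with made-up names): the
document hands out paragraphs in whatever order it likes and signals the end
with an empty reference, modelled here as None. FakeDocument, next_paragraph
and check_document are hypothetical and not part of any existing API.

```python
from collections import deque


class FakeDocument:
    """Stand-in document that decides which paragraphs come next."""

    def __init__(self, remaining_paragraphs):
        self._remaining = deque(remaining_paragraphs)

    def next_paragraph(self):
        # An empty reference (None) means nothing is left to check.
        if self._remaining:
            return self._remaining.popleft()
        return None


def check_document(xDoc, xPara, check_paragraph):
    """Check the initial paragraph, then whatever the document provides."""
    paragraph = xPara
    while paragraph is not None:
        check_paragraph(paragraph)
        paragraph = xDoc.next_paragraph()
```

For example, check_document(FakeDocument(["second", "third"]), "first", f)
calls f on "first", "second" and "third" in that order.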

Sidenote:
However, the Iterator may need to ask for a specific paragraph, e.g. the
one before or following the current one. This needs to be discussed
further.
Also for the sake of document level state information we may still need to
discuss whether it would be OK to skip paragraphs that were checked before
and where no errors were found.
The same question may arise for sentences within a paragraph.
For example if a paragraph consists of an English, a French and another
English sentence: do both grammar checkers need to be called with all the
sentences just for the sake of paragraph level state information?
It would be nice if that is not required...

Now back to the main topic...

The Iterator calls the respective grammar checkers(!) (plural in the case
of chained grammar checkers) for the current sentence by using a function
like
checkText( nDocId, String aText, ...)
Providing nDocId allows each grammar checker implementation to identify
the actual state information to be used.
If the grammar checker uses/builds state information at all there will be
at most two such data sets per nDocId the checker gets called with:
one for the document level state info and one for the paragraph state info.
The lifetime of the data sets (or at least the validity of the stored data)
is determined by the function calls mentioned at the beginning.
Please note that each grammar checker implementation may have its own
data set for the state information of the very same document!
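
A small Python sketch of chained checkers, each with its own per-nDocId data
set, might look like this. The two checkers and their rules (repeated
sentence opener, double space) are invented purely to have one stateful and
one stateless example; checkText returning a list of messages is likewise an
assumption, since the return type is not specified above.

```python
class RepeatedOpenerChecker:
    """Stateful: remembers the first word of the previous sentence."""

    def __init__(self):
        self._last_opener = {}  # nDocId -> opener of the previous sentence

    def startOfParaStateInfo(self, nDocId):
        self._last_opener[nDocId] = None

    def endOfParaStateInfo(self, nDocId):
        self._last_opener.pop(nDocId, None)

    def checkText(self, nDocId, aText):
        words = aText.split()
        opener = words[0].lower() if words else None
        previous = self._last_opener.get(nDocId)
        self._last_opener[nDocId] = opener
        if opener is not None and opener == previous:
            return ["repeated opener: " + aText]
        return []


class DoubleSpaceChecker:
    """Stateless: the state functions have empty implementations."""

    def startOfParaStateInfo(self, nDocId):
        pass

    def endOfParaStateInfo(self, nDocId):
        pass

    def checkText(self, nDocId, aText):
        return ["double space: " + aText] if "  " in aText else []


def check_paragraph(checkers, nDocId, sentences):
    """Run every chained checker over every sentence of one paragraph."""
    errors = []
    for checker in checkers:
        checker.startOfParaStateInfo(nDocId)
    try:
        for sentence in sentences:
            for checker in checkers:
                errors.extend(checker.checkText(nDocId, sentence))
    finally:
        # Paragraph state ends even if a checker raised an exception.
        for checker in checkers:
            checker.endOfParaStateInfo(nDocId)
    return errors
```

Because every checker keys its data by nDocId, two documents checked one
after the other never see each other's paragraph state.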

Now in order for the state information not to get mixed up or broken
because of different clients wanting to check the same document, or because
more than one paragraph is checked at the same time, the Iterator must
simply enforce that
- if a single paragraph of the document is currently being checked, no
other checking takes place in this document until that paragraph has been
checked.
(Remember that the unit for processing is a paragraph and that therefore
another client will be blocked at most for the duration of checking the
current paragraph.)
For a basically UI driven application that is also not particularly good at
multi-threading this does not seem to be much of a limitation.
What we also gain by this is that the state information of a specific
grammar checker can be shared between different clients, because there
won't be parallel access to it.

If a request for a new text part to be checked (be it by the same client or
a different one) is made to the Iterator while a paragraph of the document
is being checked, we have to either
- reject this call, e.g. by means of an exception, so that the client has
to try again later, or
- queue up that request for later processing.
We have to discuss which seems more appropriate.
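
To help compare the two options, here is a minimal Python sketch of the
"one paragraph per document at a time" rule with both policies behind a
flag. ParagraphGate, DocumentBusyError, begin and finish are hypothetical
names, and the sketch ignores threading entirely (it assumes all calls come
from one thread, as in a UI driven application).

```python
from collections import deque


class DocumentBusyError(Exception):
    """Raised under the reject policy while a paragraph is being checked."""


class ParagraphGate:
    def __init__(self, queue_requests):
        self._queue_requests = queue_requests
        self._busy = set()   # nDocIds with a paragraph check in progress
        self._pending = {}   # nDocId -> deque of waiting requests

    def begin(self, nDocId, request):
        """Return True if the request may run now, False if it was queued."""
        if nDocId not in self._busy:
            self._busy.add(nDocId)
            return True
        if self._queue_requests:
            self._pending.setdefault(nDocId, deque()).append(request)
            return False
        raise DocumentBusyError("document %r is being checked" % (nDocId,))

    def finish(self, nDocId):
        """End the current paragraph; return the next queued request or None."""
        waiting = self._pending.get(nDocId)
        if waiting:
            # The document stays busy on behalf of the dequeued request.
            return waiting.popleft()
        self._busy.discard(nDocId)
        return None
```

With ParagraphGate(queue_requests=False) a second begin for a busy document
raises DocumentBusyError and the client retries later; with
queue_requests=True the request is stored and handed back by finish once the
current paragraph is done. Other documents are never blocked in either case.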

Note: If we were to require that paragraph level state information can be
kept for multiple paragraphs at the same time, the Iterator would only need
to ensure that the very same paragraph in the document is not checked
simultaneously.
This seems to be more flexible, but I would think that the stricter and
easier to implement approach will make no big difference, since after all
there can be at most one interactive checking of the document, and whether
it gets checked in the background simultaneously or not will make no
visible difference to the user.

Well, this being a rather large posting, I'm sure to have missed something
I should have pointed out, or should have been more specific in some places.
We'll see when the questions and comments come in.

-> Please comment and ask where things are not clear!

Kind regards,
Thomas
