On Fri, 2006-05-26 at 15:23 +0200, Thomas Lange wrote:

...

> But I somewhat doubt the ability of a grammar to identify the end of
> sentence in a mixed language text. For example if an English grammar
> checker encounters the upside-down question-mark following the Spanish
> word at the end. Thus I'm wondering if the API should allow for a
> suggested-end-of-sentence when calling the grammar checker. Thus if the
> implementation encounters unknown characters it has at least a hint.
> 
> BTW: The I18N break-iterator is not that bad with abbreviations. I think
> it has a list of those. But citations and similar things might pose a
> huge problem to it.
> 

Forgive me if this is a stupid line of reasoning (a misstep on this list
seems to have dire consequences), but doesn't all text have an
associated language attribute? Isn't this how the spell checker knows
which dictionary to use? Surely we would use the same mechanism for
grammar checking? If English and Spanish sentences are mixed, the user
must indicate which is which (unless we have language guessing, which
would benefit spell checking and grammar checking equally).

Now, if we pass text to the grammar checker paragraph by paragraph, the
grammar checker needs to find the sentence boundaries itself. But if the
paragraph is not homogeneous in terms of language, we cannot pass it to
a single language's grammar checker anyway. Surely we will split the
paragraph according to the languages specified.

Suppose a paragraph such as:
    Oh dear! I dropped the bottle! Wat gaan my ma sê? 

The third sentence is not English. The user already needs to mark it as
Afrikaans for the sake of spell checking. OOo can therefore break this
paragraph as follows:
    (English, "Oh dear! I dropped the bottle!")
    (Afrikaans, "Wat gaan my ma sê?")
and dispatch each part to its own grammar checker (if both are present,
of course). In effect we therefore have two paragraphs that are checked
independently of each other.
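
To make the splitting step concrete, here is a rough sketch in Python.
It is only an illustration: the (language, text) runs and the name
split_by_language are made up for this example and are not OOo API. It
merges adjacent runs that carry the same language attribute, so that
each merged chunk could be handed to that language's checker.

```python
from itertools import groupby

# Hypothetical input: (language, text) runs, roughly what a document
# model with per-run language attributes might expose.
runs = [
    ("English", "Oh dear! "),
    ("English", "I dropped the bottle! "),
    ("Afrikaans", "Wat gaan my ma sê?"),
]

def split_by_language(runs):
    """Merge adjacent runs with the same language into one chunk."""
    chunks = []
    for lang, group in groupby(runs, key=lambda run: run[0]):
        text = "".join(part for _, part in group).strip()
        if text:  # skip chunks that contain only whitespace
            chunks.append((lang, text))
    return chunks

chunks = split_by_language(runs)
```

Each resulting chunk would then be dispatched on its own, exactly as if
it were a separate paragraph.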

Now, suppose we have a single word in a foreign language inside a
sentence (probably quite common in many situations - don't know if users
will necessarily always mark the word as such). Suppose we have a simple
paragraph such as:
    "What is an inyanga?" 
With "inyanga" marked as Zulu. 

Now we get (English, "What is an"), (Zulu, "inyanga"). The Zulu part
probably passes (trivial single word sentence). The English part
probably fails, since "What is an" is not a complete sentence on its
own, and this will be a harder problem to think about. It might mean
that OOo will have to do basic sentence division itself, simply to see
whether the sentence boundaries roughly coincide with the language
boundaries.
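
A crude version of that check might look like the Python sketch below.
Everything in it is an assumption for illustration: the function names
are invented, the runs are hypothetical (I include the trailing "?" as
its own English run), and the "sentence division" is just a search for
., ! and ?, which will of course trip over abbreviations and citations.
The idea is only to test whether every change of language falls right
after a sentence terminator.

```python
import re

def sentence_ends(text):
    """Offsets just past each run of naive sentence terminators."""
    return {m.end() for m in re.finditer(r"[.!?]+", text)}

def language_boundaries(runs):
    """Offsets in the joined text where the language changes."""
    offsets, pos = [], 0
    for (lang, text), (next_lang, _) in zip(runs, runs[1:]):
        pos += len(text)
        if lang != next_lang:
            offsets.append(pos)
    return offsets

def boundaries_align(runs):
    """True if every language change directly follows a sentence end."""
    text = "".join(part for _, part in runs)
    ends = sentence_ends(text)
    # Ignore whitespace between the terminator and the language change.
    return all(len(text[:b].rstrip()) in ends
               for b in language_boundaries(runs))

mixed = [("English", "Oh dear! I dropped the bottle! "),
         ("Afrikaans", "Wat gaan my ma sê?")]
embedded = [("English", "What is an "), ("Zulu", "inyanga"),
            ("English", "?")]

mixed_ok = boundaries_align(mixed)        # language changes at a sentence end
embedded_ok = boundaries_align(embedded)  # language changes mid-sentence
```

When the boundaries line up, the paragraph can safely be split per
language; when they don't, the foreign word sits inside a sentence and
would need different handling.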

Making any sense?


Friedel
