Hi Friedel,

>> But I somewhat doubt the ability of a grammar to identify the end of
>> sentence in a mixed language text. For example if an English grammar
>> checker encounters the upside-down question-mark following the Spanish
>> word at the end. Thus I'm wondering if the API should allow for a
>> suggested-end-of-sentence when calling the grammar checker. Thus if the
>> implementation encounters unknown characters it has at least a hint.
>>
>> BTW: The I18N break-iterator is not that bad with abbreviations. I think
>> it has a list of those. But citations and similar things might pose a
>> huge problem to it.
>>
>
> Forgive me if this is a stupid line of reasoning (a misstep on this list
> seems to have dire consequences), but doesn't all text have an
> associated language attribute? Isn't this how the spell checker knows
> which checker it should use?

Sure!

> Surely then we'll use the same way for
> grammar checking, not so? If English and Spanish sentences are mixed,
> the user must indicate which is which (unless we have language guessing
> - which would benefit both spell checking and grammar checking
> equally).

As far as I can see, this would either require a new text attribute like
"language of sentence" (which is still unlikely to help in identifying
the end of the sentence), or the user needs to be asked interactively
which language the current sentence is in. And no one wants to do this
for every sentence being checked.

The problem is that a word has only a single language (well, actually
each character may have a different one, but we only use the one from
the first char), whereas a single sentence can already contain words of
different languages. And just assuming that the first word in the
sentence has the correct language set is likely to be too simple an
algorithm. For example:

"'Alea iacta est' said Caesar when he crossed the Rubicon."

This is an English sentence starting with a Latin word. (Maybe it is not
grammatically correct in English, but at least in German, sentences
arranged that way are OK.)

> Now, if we pass paragraph per paragraph to the grammar checker, the
> grammar checker needs to find sentence boundaries, but if the paragraph
> is not homogeneous (in terms of language) we cannot pass it to a single
> language's grammar checker anyway. Surely we will break the paragraph
> according to the languages specified.
>
> Suppose a paragraph such as:
> Oh dear! I dropped the bottle! Wat gaan my ma sê?

Well, aside from identifying the actual sentences, that's the rather
easy case ^_^ because each sentence itself uses only a single language.
It would be nice if we only had to take care of this setting.

> The third sentence is not English. The user already needs to mark it as
> Afrikaans for the sake of spell checking. OOo can therefore break this
> paragraph as follows:
> (English, "Oh dear! I dropped the bottle!")
> (Afrikaans, "Wat gaan my ma sê?")
> and dispatch it to the two different grammar checkers (if both are
> present, of course). In reality we therefore have two paragraphs that
> are checked independently of each other.

This would be OK in the above scenario.

> Now, suppose we have a single word in a foreign language inside a
> sentence (probably quite common in many situations - don't know if users
> will necessarily always mark the word as such). Suppose we have a simple
> paragraph such as:
> "What is an inyanga?"
> With "inyanga" marked as Zulu.
>
> Now we get (English, "What is an"), (Zulu, "inyanga"). The Zulu part
> probably passes (trivial single word sentence). The English part
> probably fails, and this will be a harder problem to think about. It
> might mean that OOo will have to do basic sentence division, simply to
> see if it roughly correlates with the language boundaries.
>
> Making any sense?

I got the idea. But my thought would be that the chances of correct
grammar checking improve a lot if the sentence is not broken up. Since
I'm no linguist and have not written any grammar checker, this question
would be better answered by one of the developers of a grammar checker.

But consider this example:

"An inyanga is a traditional healer."

If the attributes were set correctly and the sentence were broken up
accordingly, we would have:
a) "An"
b) "inyanga"
c) "is a traditional healer."

Since none of those pieces is a complete sentence, the grammar checker
would have to mark all of them as *grammatically* wrong and would have a
hard time giving suggestions. But if the whole sentence is passed on to
an English grammar checker, it could do something like this: 'Well, it
is wrong because I do not know the word inyanga, but if it were a noun
everything would be fine', and thus only report a spelling error for
inyanga and not a grammatical error for any part of the sentence.
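
To make the difference concrete, here is a minimal sketch of the two options in Python. All names are made up for illustration and nothing here is the actual OOo API; the rule of picking the language that covers the most characters is only an assumption of mine, not something proposed in this thread.

```python
from collections import Counter
from itertools import groupby

# Hypothetical model: a sentence is a list of (language, text) runs,
# mirroring the per-word language attribute. Not the OOo API.

def split_by_language(runs):
    """Option 1: merge adjacent runs of the same language, one chunk per run."""
    return [(lang, "".join(text for _, text in group))
            for lang, group in groupby(runs, key=lambda run: run[0])]

def whole_sentence(runs):
    """Option 2: keep the sentence intact and choose one checker for it.

    Assumed heuristic: use the language that covers the most characters.
    """
    weight = Counter()
    for lang, text in runs:
        weight[lang] += len(text)
    dominant = weight.most_common(1)[0][0]
    return dominant, "".join(text for _, text in runs)

sentence = [("en", "An "), ("zu", "inyanga"), ("en", " is a traditional healer.")]

# Option 1 yields three fragments, none of which is a complete sentence.
fragments = split_by_language(sentence)

# Option 2 yields one complete sentence for a single (English) checker.
language, text = whole_sentence(sentence)
```

With option 1 each checker only ever sees a fragment; with option 2 the English checker sees the full sentence and merely has to cope with one unknown word.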

And if you think about it, a grammar checker may already have this kind
of heuristic (treating unknown words as a noun, verb, ...) implemented,
because it already has to deal with this problem when someone makes a
typo and it cannot properly be determined which word was meant. And it
is surely a requirement for a grammar checker not to give up on grammar
checking just because it does not know a specific word. Otherwise a text
would have to be free of spelling errors before it could be grammar
checked.

Thomas
