Hi all,

As some of you probably know, LanguageTool doesn't work nicely with sentence tokenization in OpenOffice.org. We get whole paragraphs through the doProofreading() API and we return the info for whole paragraphs. This is of course wrong for multilingual paragraphs... Well, the reason is that I found some small problems that I don't really know how to solve. There is at least one, IMHO very useful, rule that checks whether brackets, quotation marks etc. come in pairs in the text. Obviously, you want to check this over a whole paragraph, as quotations often span many sentences. Now, the problem is that if I tokenize the text at the sentence level, I get the next piece of the paragraph text with every call, and that makes it very hard to track the number of unmatched quotation marks. Let me explain:

"Blah blah. Blah blah".

gets 0 matches with paragraph tokenization, because I can keep track of all the matches in the paragraph in a single pass. With sentence tokenization it produces two false alarms. (The algorithm I use adds an error match to an array and then adds it to a "removed" array if a corresponding quotation mark turns up later. When I'm finished with a paragraph, I delete the matches marked as removed, and all remaining unpaired matches are displayed. This cannot work in sentence-tokenized mode, however.) There are two reasons:

(1) I have no way to remove previous matches: I can only remove a match in the current sentence, not in the previous one. So backtracking seems impossible when I get the second sentence ('Blah blah"'). I could try to store the previous matches internally along with the text, but then, it seems to me, I would have to call OOo APIs to set some errors as ignored in order to remove the blue underlining of the first quotation mark ('"Blah blah'). That seems like overkill, and it is not reliable, as the user can simply edit the text and the previous match will end up in a different position. I could try to signal "recheck" to OOo, but I don't want to recheck the whole document... There is no way to call "recheck text" on a single paragraph, which is what would be needed in such a case.

(2) Worse still, if an English paragraph contains a single French word, I would lose the rule state info as well, or I would have to keep an instance of LanguageTool in memory for every supported language, as rules and their state are implemented at the language level. That's possible, of course, as long as we have just a couple of languages in a document. But in a multilingual document (which is easily the case for any European Union leaflet with 10 languages or so) you would have all those checkers in memory... Of course, I could try to store just the state of the paragraph-level rules instead of the whole checker, but that would complicate the code a lot.
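To make the pairing problem concrete, here is a minimal Java sketch of the idea described above. It is not the actual LanguageTool rule: the class and method names are made up for illustration, it handles only the straight double quote, and it keeps provisional matches on a stack instead of in separate "match" and "removed" arrays. It only shows why the same text gives 0 matches as one paragraph and two false alarms when fed sentence by sentence.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Hypothetical illustration, not LanguageTool code.
class UnpairedQuoteSketch {

    /** Returns the offsets of double quotes that have no partner in the given text. */
    public static List<Integer> unpairedQuotes(String text) {
        Deque<Integer> pending = new ArrayDeque<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '"') {
                if (pending.isEmpty()) {
                    pending.push(i);   // provisional error match
                } else {
                    pending.pop();     // partner found: withdraw the earlier match
                }
            }
        }
        return new ArrayList<>(pending); // whatever is left is really unpaired
    }

    /** Runs the same check on each sentence in isolation, with no shared state. */
    public static int countPerSentence(String[] sentences) {
        int total = 0;
        for (String sentence : sentences) {
            total += unpairedQuotes(sentence).size();
        }
        return total;
    }

    public static void main(String[] args) {
        String paragraph = "\"Blah blah. Blah blah\".";
        String[] sentences = {"\"Blah blah. ", "Blah blah\"."};
        System.out.println(unpairedQuotes(paragraph).size()); // prints 0
        System.out.println(countPerSentence(sentences));      // prints 2
    }
}
```

The withdrawal step (the pop) is exactly what is unavailable once the first sentence's match has already been returned to OOo.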

I have been playing with different design strategies, and it seems to me that it would be easiest for me if we had two more features in the API:

(1) checking whole paragraphs, and

(2) triggering a recheck of whole paragraphs for special paragraph-level rules.

Alternatively, I could try to implement the normal sentence-level checks via doProofreading and iterate over the text manually via the paragraph-text APIs, calling those special rules on whole paragraphs; but maybe the same functionality would be needed by other checkers, and I would be duplicating code... Adding the two features above would involve another change to the APIs, which isn't the nicest thing, to say the least.
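As a rough sketch of that split, assuming nothing about the real OOo or LanguageTool interfaces: sentence-level rules are applied to each sentence, while paragraph-level rules are applied once to the full paragraph text. Every name below is hypothetical, and a "rule" is simplified to a function from text to error offsets.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Function;

// Hypothetical illustration of the sentence-level / paragraph-level split.
class HybridCheckSketch {

    /** Example sentence-level rule: offsets of doubled spaces. */
    public static List<Integer> doubleSpaces(String text) {
        List<Integer> hits = new ArrayList<>();
        int i = text.indexOf("  ");
        while (i >= 0) {
            hits.add(i);
            i = text.indexOf("  ", i + 2);
        }
        return hits;
    }

    /** Example paragraph-level rule: offsets of double quotes without a partner. */
    public static List<Integer> unpairedQuotes(String text) {
        List<Integer> pending = new ArrayList<>();
        for (int i = 0; i < text.length(); i++) {
            if (text.charAt(i) == '"') {
                if (pending.isEmpty()) {
                    pending.add(i);
                } else {
                    pending.remove(pending.size() - 1); // remove by index
                }
            }
        }
        return pending;
    }

    /** Sentence rules see one sentence at a time; paragraph rules see the whole text once. */
    public static List<Integer> check(List<String> sentences,
                                      List<Function<String, List<Integer>>> sentenceRules,
                                      List<Function<String, List<Integer>>> paragraphRules) {
        List<Integer> errors = new ArrayList<>();
        int offset = 0;
        for (String sentence : sentences) {
            for (Function<String, List<Integer>> rule : sentenceRules) {
                for (int pos : rule.apply(sentence)) {
                    errors.add(offset + pos); // translate to a paragraph offset
                }
            }
            offset += sentence.length();
        }
        String paragraph = String.join("", sentences);
        for (Function<String, List<Integer>> rule : paragraphRules) {
            errors.addAll(rule.apply(paragraph));
        }
        return errors;
    }
}
```

For the sentences '"Blah  blah. ' and 'Blah blah".', this reports only the doubled space and no quotation-mark errors. It does not address the recheck problem: when a later edit changes the paragraph-level result, something still has to trigger a new pass over that paragraph.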

What do others think? Any thoughts or advice on that?

Thanks in advance
Marcin
--
www.languagetool.org
