[lingu-dev] Proofreading: Sentence tokenization problem

Marcin Miłkowski Tue, 14 Apr 2009 06:34:25 -0700

Hi all,

As some of you probably know, LanguageTool doesn't work nicely with sentence tokenization in OpenOffice.org. We get whole paragraphs in the doProofreading() API and we return the info for whole paragraphs. This is of course wrong for multilingual paragraphs... Well, the reason is that I found some small problems that I don't really know how to solve. There is at least one, IMHO very useful, rule that checks whether brackets, quotation marks etc. come in pairs in the text. Obviously, you want to check this in a whole paragraph, as quotations often contain many sentences. Now, the problem is that if I tokenize the text on the sentence level, I get the next bit of paragraph text with every call, and that makes it very hard to track the number of unmatched quotation marks. Let me explain:


"Blah blah. Blah blah".

gets 0 matches on paragraph tokenization, as I can retain information on the number of rules matched in a single go in a paragraph. It gets two false alarms on sentence tokenization. (The algorithm that I use is to add an error match to an array, and then add it to a removed array if there's a corresponding quotation mark later on. Then, when I'm finished with a paragraph, I delete the matches that are marked as deleted, and all unpaired matches are displayed. This cannot work in sentence-tokenized mode, however.) There are two reasons:
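
To make the difference concrete, here is a minimal sketch in Python (not LanguageTool's actual Java code). The function name unpaired_quotes, the straight-quote-only pairing, and the hand-made sentence split are all assumptions for illustration. It follows the description above: every quotation mark is recorded as a candidate match, a later partner moves both marks to a "removed" set, and whatever is left at the end of the chunk is reported.

```python
QUOTE = '"'

def unpaired_quotes(text, offset=0):
    """Return offsets of quotation marks that have no partner inside `text`."""
    matches = []      # every quotation mark is first recorded as a candidate error
    removed = set()   # candidates cancelled because a partner showed up later
    open_pos = None   # offset of the quotation mark still waiting for a partner
    for i, ch in enumerate(text):
        if ch != QUOTE:
            continue
        pos = offset + i
        matches.append(pos)
        if open_pos is None:
            open_pos = pos
        else:
            removed.update((open_pos, pos))
            open_pos = None
    # At the end of the chunk, drop the cancelled candidates and report the rest.
    return [p for p in matches if p not in removed]

paragraph = '"Blah blah. Blah blah".'
# Hypothetical sentence split of the same paragraph (assumed, not OOo's tokenizer).
sentences = ['"Blah blah. ', 'Blah blah".']

# Paragraph mode: the whole text is seen in one go.
paragraph_alarms = unpaired_quotes(paragraph)

# Sentence mode: each call sees only its own sentence and starts from scratch.
sentence_alarms = []
offset = 0
for s in sentences:
    sentence_alarms.extend(unpaired_quotes(s, offset))
    offset += len(s)

print(paragraph_alarms, sentence_alarms)
```

In paragraph mode the two marks cancel each other; in sentence mode each call sees only one of them, so both are reported.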

(1) I don't get the possibility to remove previous matches - I can only remove the match in the current sentence, not in the previous one. So backtracking seems impossible when I get the second sentence ('Blah blah"'). I could try to store the previous matches internally along with the text, but I would have to call OOo APIs to set some errors as ignored, it seems to me, to be able to remove the blue underlining of the first quotation mark ('"Blah blah'). It seems quite an overkill, and is not reliable as the user can simply edit the text and the previous match will end up in a different position. I could try to signal "recheck" to OOo, but I don't want to recheck the whole document... There is no way to call "recheck text" on a single paragraph, which would be needed in such a case.

(2) Worse still, if an English paragraph contains a single French word, I would lose the rule state info as well, or I would have to store all instances of LanguageTool per supported language in memory, as rules and their state are implemented on the language level. It's possible, of course, as long as we have just a couple of languages in a document. But in a multilingual document (which is easy for any European Union leaflet with 10 languages or something) you would have all those checkers in memory... Of course, I could try to store just the state of the paragraph-level rules instead of the whole checker, but that would complicate the code a lot.
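
As a rough illustration of that last option, the following Python sketch keeps only a small state object per language rather than a whole checker. Everything here is hypothetical: QuoteRuleState, feed, and the chunk list are invented for the example and do not correspond to LanguageTool or OOo APIs, and it assumes the caller knows when a paragraph starts and ends.

```python
from dataclasses import dataclass, field

@dataclass
class QuoteRuleState:
    """Hypothetical carry-over state for one paragraph-level rule."""
    pending: list = field(default_factory=list)  # offsets of quotes still unpaired

def feed(states, lang, text, offset):
    """Update the stored state for `lang` with one chunk of paragraph text."""
    state = states.setdefault(lang, QuoteRuleState())
    for i, ch in enumerate(text):
        if ch == '"':
            if state.pending:
                state.pending.pop()               # partner found, cancel the candidate
            else:
                state.pending.append(offset + i)  # new candidate error

# One paragraph delivered in three chunks, with a French word in the middle.
states = {}  # language code -> lightweight rule state (assumed to be reset per paragraph)
chunks = [("en", '"He said '), ("fr", "bonjour"), ("en", ' and left."')]
offset = 0
for lang, text in chunks:
    feed(states, lang, text, offset)
    offset += len(text)

# Only marks still pending at the end of the paragraph would be reported.
unpaired = sorted(p for s in states.values() for p in s.pending)
print(unpaired)
```

The English state survives the French chunk, so the closing mark in the third chunk still finds its partner. The cost is the extra bookkeeping: the state has to be kept in sync with the paragraph it belongs to.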

I was playing with different design strategies, and it seems to me that it would be easiest for me if we had two more features in the API:


(1) Checking the whole paragraphs and

(2) triggering a recheck of the whole paragraphs for special paragraph-level rules.

I could try to implement normal sentence-level checks via doProofreading and iterate the text manually via the paragraph-text APIs, where those special rules would be called on whole paragraphs; but maybe the same functionality would be needed for other checkers, and I would be duplicating code... This would involve another change of APIs, which isn't the nicest thing, to say the least.
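
The split described above could look roughly like the Python sketch below. It is an illustration under assumptions, not a proposal for the actual API: the rule functions and check_paragraph are made-up names, the double-space rule stands in for any sentence-level rule, and the paragraph text is simply the concatenated sentences rather than something fetched through OOo's paragraph-text APIs.

```python
def double_space_rule(sentence, offset):
    """Hypothetical sentence-level rule: flag two consecutive spaces."""
    return [(offset + i, "double space")
            for i in range(len(sentence) - 1)
            if sentence[i:i + 2] == "  "]

def unpaired_quote_rule(paragraph):
    """Hypothetical paragraph-level rule: flag quotation marks with no partner."""
    pending = []
    for i, ch in enumerate(paragraph):
        if ch == '"':
            if pending:
                pending.pop()
            else:
                pending.append(i)
    return [(i, "unpaired quotation mark") for i in pending]

def check_paragraph(sentences):
    """Run sentence-level rules per sentence, paragraph-level rules on the whole text."""
    matches = []
    offset = 0
    for s in sentences:
        matches.extend(double_space_rule(s, offset))
        offset += len(s)
    # The paragraph-level rule sees the full text, so quotes can pair across sentences.
    matches.extend(unpaired_quote_rule("".join(sentences)))
    return sorted(matches)

result = check_paragraph(['"Blah  blah. ', 'Blah blah".'])
print(result)
```

Here the quotation marks pair up across the sentence boundary, while the sentence-level rule still reports its match with a paragraph-relative offset.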


What do others think? Any thoughts or advice on that?

Thanks in advance
Marcin
--
www.languagetool.org

