Hi all,
As some of you probably know, LanguageTool doesn't handle sentence
tokenization in OpenOffice.org well. We get whole paragraphs in the
doProofreading() API and we return the results for whole paragraphs. This
is of course wrong for multilingual paragraphs... Well, the reason is
that I found some small problems that I don't really know how to solve.
There is at least one, IMHO very useful, rule that checks if brackets,
quotation marks etc. come in pairs in the text. Obviously, you want to
check this in a whole paragraph, as quotations often contain many
sentences. Now, the problem is that if I tokenize the text on the
sentence level, I get the next piece of the paragraph text with every
call, and that makes it very hard to track the number of unmatched
quotation marks. Let me explain:
"Blah blah. Blah blah".
gets 0 matches with paragraph tokenization, as I can retain information on
the matches found so far in a single pass over the paragraph. It gets two
false alarms with sentence tokenization. (The algorithm that I use is to
add an error match to an array, and then mark it as removed if
there's a corresponding quotation mark later on. Then, when I'm finished
with a paragraph, I delete the matches that are marked as removed, and
all unpaired matches are displayed.) This cannot work in
sentence-tokenized mode, however. There are two reasons:
(1) I don't get the possibility to remove previous matches - I can only
remove the match in the current sentence, not in the previous one. So
backtracking seems impossible when I get the second sentence ('Blah
blah"'). I could try to store the previous matches internally along with
the text, but I would have to call OOo APIs to set some errors as
ignored, it seems to me, to be able to remove the blue underlining of
the first quotation mark ('"Blah blah'). That seems like overkill, and it
is not reliable, as the user can simply edit the text and the previous
match will end up in a different position. I could try to signal
"recheck" to OOo, but I don't want to recheck the whole document...
There is no way to call "recheck text" on a single paragraph, which
would be needed in such a case.
(2) Worse still, if an English paragraph contains a single French word,
I would lose the rule state info as well, or I would have to keep an
instance of LanguageTool for every supported language in memory, as rules
and their state are implemented on the language level. It's possible, of
course, as long as we have just a couple of languages in a document. But
in a multilingual document (which is easily the case for any European
Union leaflet with 10 languages or so) you would have all those checkers
in memory... Of course, I could try to store just the state of the
paragraph-level rules instead of the whole checker, but that would
complicate the code a lot.
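
To make the quotation-mark example concrete, here is a minimal sketch in
Python (LanguageTool itself is Java; the function name unpaired_quotes and
the hand-made sentence split are assumptions for illustration only, not
LanguageTool code). It follows the approach described above: every
quotation mark becomes a candidate match, a candidate is marked as removed
once a later mark pairs with it, and whatever is left gets reported.

```python
def unpaired_quotes(text):
    """Return positions of straight double quotes without a partner in `text`."""
    candidates = []   # positions of every quotation mark seen so far
    removed = set()   # positions that turned out to be paired
    open_pos = None   # position of the currently unmatched opening mark
    for pos, ch in enumerate(text):
        if ch != '"':
            continue
        candidates.append(pos)
        if open_pos is None:
            open_pos = pos            # possible error, unless closed later
        else:
            removed.add(open_pos)     # the earlier mark is paired after all
            removed.add(pos)
            open_pos = None
    # Only at the end of the text do we know which candidates are real errors.
    return [p for p in candidates if p not in removed]


paragraph = '"Blah blah. Blah blah".'
# Hypothetical result of sentence tokenization of the same paragraph.
sentences = ['"Blah blah. ', 'Blah blah".']

whole = unpaired_quotes(paragraph)                      # checked in one go
per_sentence = [unpaired_quotes(s) for s in sentences]  # one call per sentence

print(whole)         # no matches for the whole paragraph
print(per_sentence)  # one false alarm in each sentence
```

Checked as a whole, the paragraph yields no matches; checked sentence by
sentence, each call reports its own quotation mark as unpaired, because the
state needed to cancel the first match is gone by the time the second
sentence arrives.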
I was playing with different design strategies, and it seems to me that
it would be easiest for me if we had two more features in the API:
(1) checking whole paragraphs, and
(2) triggering a recheck of whole paragraphs for special
paragraph-level rules.
I could try to implement normal sentence-level checks via doProofreading
and iterate the text manually via the paragraph-text APIs, where those
special rules would be called on whole paragraphs; but maybe the same
functionality would be needed for other checkers, and I would be
duplicating code... This would involve another change of APIs, which
isn't the nicest thing, to say the least.
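
A rough sketch of that manual-iteration idea, in Python for brevity: the
sentence-level rules run on each sentence and their offsets are shifted
back into paragraph coordinates, while the paragraph-level rule sees the
whole paragraph once. All names here (split_sentences, double_space_rule,
unpaired_quote_rule, check_paragraph) and the naive splitter are
hypothetical; nothing below calls the real OOo or LanguageTool APIs.

```python
import re


def split_sentences(paragraph):
    """Naive splitter returning (offset, sentence) pairs; a stand-in only."""
    spans, start = [], 0
    for m in re.finditer(r'[.!?]+["\')\]]*\s+', paragraph):
        spans.append((start, paragraph[start:m.end()]))
        start = m.end()
    if start < len(paragraph):
        spans.append((start, paragraph[start:]))
    return spans


def double_space_rule(sentence):
    """Example sentence-level rule: flag runs of two or more spaces."""
    return [(m.start(), "double space") for m in re.finditer(r" {2,}", sentence)]


def unpaired_quote_rule(paragraph):
    """Example paragraph-level rule: flag a quotation mark that is never closed."""
    open_pos = None
    for pos, ch in enumerate(paragraph):
        if ch == '"':
            open_pos = pos if open_pos is None else None
    return [] if open_pos is None else [(open_pos, "unpaired quotation mark")]


def check_paragraph(paragraph):
    matches = []
    # Sentence-level rules: run per sentence, map offsets back to the paragraph.
    for offset, sentence in split_sentences(paragraph):
        for pos, msg in double_space_rule(sentence):
            matches.append((offset + pos, msg))
    # Paragraph-level rules: run once on the full text, so state is never lost.
    matches.extend(unpaired_quote_rule(paragraph))
    return sorted(matches)


print(check_paragraph('"Blah  blah. Blah blah".'))
print(check_paragraph('"Blah blah. Blah blah.'))
```

The first call reports only the double space, since the quotation marks
pair up across the two sentences; the second reports the opening mark that
is never closed. What this sketch leaves out is exactly the hard part
discussed above: mixed languages within a paragraph and getting the host
application to recheck a single paragraph after an edit.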
What do others think? Any thoughts or advice on that?
Thanks in advance
Marcin
--
www.languagetool.org