Re: [lingu-dev] About proofreader and spell checker interaction

Thomas Lange - Sun Germany - ham02 - Hamburg Tue, 21 Apr 2009 07:28:28 -0700

The currently last (3rd) posting:



Hello Marcin,


>> > > Currently spell checkers are chained (that is up for discussion as well
>> > > though, since without chaining the route to take seems to be rather
>> > > obvious). That means if any of several spell checkers for a given
>> > > language says this text is correct then no error will be reported. That
>> > > would allow for spell checker A to check normal English text, and for
>> > > spell checker B to know only about English medical words. Those two
>> > > spell checkers can easily be chained and you will get a result that is
>> > > better than using just a single one. Without chaining you would need a
>> > > spell checker that has to take care of both tasks in one sweep.
>>     
> >
> > I'd say that chaining is OK as far as normal (non-context) spellers are 
> > concerned. For grammar checkers it should be different, as they work on 
> > a different principle (most of the time): instead of accepting a word 
> > from a finite list, they search for an error from a finite list to say 
> > that they don't accept the text. So instead of using OR (a disjunction) 
> > of results, use AND (a conjunction) here - all proof-readers should not 
> > raise any errors, but if any of them raises one, display it.
> >   
>   

If what you meant here was chaining of grammar checkers, then that
probably will never happen. Currently there is only one per language
allowed.
Originally we also had in mind that grammar checkers could be chained.
And we were also told that the relation for chaining them should be AND.
But in the end we dropped that idea for two reasons:

1) chaining grammar checkers will likely be very time consuming and
often enough the process is already somewhat slow.

2) that, however, is far outweighed by the fact that there is absolutely
NO solution to the problem of what to do if two grammar checkers have
different ideas about where the sentence end should be. And we don't
want to go with separate sentence ends for each checker. After all, the
whole process is sentence
based. Thus a disagreement about the sentence end would be a major
problem. And the only always working way to prevent such disagreements
is to have only one grammar checker per language, since the sentence end
detection must be left to the specific implementation.
Chaining of grammar checkers would only be OK if OOo did the
sentence end analysis and enforced the results. But that is not an option.
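
To illustrate the sentence end problem from point 2, here is a minimal sketch in Python. Both functions are hypothetical stand-ins for the sentence detection of two different grammar checkers; neither is taken from OOo or from any real checker.

```python
import re

def sentence_ends_naive(text):
    """Hypothetical checker A: every '.', '!' or '?' ends a sentence."""
    return [m.end() for m in re.finditer(r"[.!?]", text)]

def sentence_ends_abbrev(text, abbreviations=("Dr.",)):
    """Hypothetical checker B: like A, but skips known abbreviations."""
    ends = []
    for m in re.finditer(r"[.!?]", text):
        end = m.end()
        # Skip this period if the text up to it ends in an abbreviation.
        if any(text[:end].endswith(a) for a in abbreviations):
            continue
        ends.append(end)
    return ends

text = "Dr. Smith arrived. He left."
ends_a = sentence_ends_naive(text)
ends_b = sentence_ends_abbrev(text)
# The two checkers disagree on where the first sentence stops, so their
# errors would refer to different sentences of the same paragraph.
disagree = ends_a != ends_b
```

With chained grammar checkers there would be no obvious way to decide which of the two lists of sentence ends the iterator should follow.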


> > [...]
> >
>   
>> > > Thus the problems at hand and to be discussed are:
>> > > 
>> > > a) should we give up on chained spell checkers even though there are
>> > > good uses for them? The simple fact that vanilla OOo has only one spell
>> > > checker does not mean there aren't other spell checkers around that
>> > > already make use of that chaining... Or that someone would like to make
>> > > use of it in the future.
>>     
> >
> > The easiest solution would be to define that a proofreader that has 
> > isSpellChecker() should be chained as all checkers are. 
>   

Nope.
All other spell checkers already have the limitation that they are word
based.
Thus chaining is also only possible for word based spell checkers. After
all an easy chaining would require the same kind of API interface...
Of course the proofreader component is free to also implement a 'normal'
spell checker as well. (Actually the third party component we coded does
this.)
But you can't chain the word 'Correct' from the example to the proof
reader API on its own without the context (sentence).

Thus I believe it should be something like this:
The grammar checker or more likely the grammar checking iterator has to
make a separate run for all words of the current sentence with the
respective spell checkers. If we decide on a fixed logic of merging the
results with spell checking results from the proofreader then it can be
implemented in the gciterator, otherwise it probably needs to be
implemented by the proofreader itself. In the latter case we should
provide an API for the proofreader to make use of that. At least it
should already take care of presenting only the overall result after
chaining all independent word based spell checkers.

My preference would be to have an overall logic that can be implemented
in the gciterator since it would prevent extra burden from the
proofreader implementation.
Thus the question would be if we can decide on a fixed logic for merging
the results.
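
A rough sketch of what such a fixed logic could look like, written in Python for brevity. All names here are hypothetical and not part of the linguistic API, and the merge rule (a plain union of both result sets) is only one possible choice, assumed for the sake of the example.

```python
import re

def chained_spell_ok(word, spell_checkers):
    """OR-chaining: a word is fine if ANY word-based checker accepts it."""
    return any(checker(word) for checker in spell_checkers)

def word_spelling_errors(sentence, spell_checkers):
    """Run the chained word-based checkers over every word of a sentence.

    Returns (start, end) spans of rejected words. Using \\w+ as the word
    definition is a simplification of what a break iterator would do.
    """
    errors = []
    for m in re.finditer(r"\w+", sentence):
        if not chained_spell_ok(m.group(), spell_checkers):
            errors.append((m.start(), m.end()))
    return errors

def merge_spelling_errors(word_errors, proofreader_errors):
    """Assumed merge rule: union of both sets, sorted by position."""
    return sorted(set(word_errors) | set(proofreader_errors))

# Two toy word lists standing in for a general and a medical dictionary.
general = {"the", "patient", "has", "and"}
medical = {"tachycardia"}
checkers = [lambda w: w.lower() in general,
            lambda w: w.lower() in medical]

sentence = "The patient has tachycardia and feever"
word_errors = word_spelling_errors(sentence, checkers)
merged = merge_spelling_errors(word_errors, [(32, 38), (4, 11)])
```

Here the proofreader only ever sees the overall result of the chained word-based checkers, never the individual ones.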



> > If not, then it 
> > should be treated in the following manner: whenever a proofreader 
> > returns an error marked as spellcheck, display it in red, unless this 
> > error has been found earlier by another checker. Yet, in such a case, a 
> > comment should be in place, so only change the color, nothing else. 
> > (Even in a spellchecking dialog, the error could be reported later than 
> > normal spelling errors).
> >   
>   
The spelling errors found by a proofreader need to be reported (and taken
care of by the user) first. The reason for this is that grammar checking
requires the proofreader to properly identify/tokenize each word, and
usually that can't be done if there are spelling errors. Thus the
quality of proofreading depends on the spelling errors being resolved first.
In which order the spelling errors from different sources are displayed
does not matter much. But probably they should be sorted by their
occurrence in the sentence.
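
The ordering described above could be sketched like this in Python. The ReportedError type and its fields are made up for the example and do not correspond to any existing API structure.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ReportedError:
    start: int         # offset of the error within the sentence
    length: int
    is_spelling: bool  # True for spelling errors, False for grammar errors
    source: str        # label of the checker that reported it

def presentation_order(errors):
    """Spelling errors first, then the rest; each group by position."""
    return sorted(errors, key=lambda e: (not e.is_spelling, e.start))

errors = [
    ReportedError(10, 4, False, "proofreader"),
    ReportedError(20, 5, True, "word-based checker"),
    ReportedError(3, 2, True, "proofreader"),
]
ordered = presentation_order(errors)
```

The source of a spelling error plays no role in the sort key, only its kind and its position in the sentence.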


>> > > But even if we give up on chaining but still have a grammar checker that
>> > > is also a spell checker AND a second only spell checker, we still have
>> > > to decide if we want to make use of the second one. If we want to make
>> > > use of that one as well, how to merge the results? Should it simply be
>> > > that the grammar checker's spell checker is only allowed to mark errors
>> > > where the second one had found none? 
>>     
> >
> > That seems reasonable, otherwise multiple errors would be displayed in 
> > the same position.
> >   
>   
Yes, avoiding overlapping errors should also be done if possible. So
which error is going to win if the chained spell checkers and the proof
reader report a spelling error at overlapping but NOT identical positions?
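
Detecting that case is the easy part; a small Python sketch, assuming errors are given as half-open (start, end) spans within the sentence. Which of the two errors should win is deliberately left open here, since that is exactly the question.

```python
def overlaps(a, b):
    """True if half-open spans (start, end) share at least one character."""
    return a[0] < b[1] and b[0] < a[1]

def conflicting_pairs(spell_errors, proofreader_errors):
    """Pairs of spans that overlap but are NOT identical."""
    return [(s, p)
            for s in spell_errors
            for p in proofreader_errors
            if s != p and overlaps(s, p)]

spell_errors = [(0, 5), (10, 15)]
proofreader_errors = [(0, 5), (12, 20), (30, 35)]
conflicts = conflicting_pairs(spell_errors, proofreader_errors)
```

Identical spans such as (0, 5) are not listed as conflicts, since they could simply be reported once.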



>> > > Or should it be allowed to
>> > > overrule errors found by the second one as not-to-be-reported as well?
>>     
> >
> > That is interesting. Well, I didn't think of it as we never say "this is 
> > acceptable", we only return errors. 
>   
Sure.
I also don't expect any proofreader to implement a
'this-is-100%-correct' check function. It was probably just a useless
thought of mine, since if the spell checker can not provide some
detailed information about the type of error found, then the only choice
for overruling the spell checker results in this case would be for the
proofreader to discard all of them. Thus essentially saying: if the
proofreader returns spelling errors as well, then don't use word-only
spell checkers at all. Thus let's just forget about this thought of mine,
since providing any additional information from the word-only spell
checker will probably need a completely new dictionary implementation to
provide that kind of information.

Or can additional information be provided by Hunspell only?
And, more pressingly, what kind of information could a spell checker
return that would let a proofreader (or the gciterator) decide whether a
specific error found by e.g. Hunspell should be discarded?


> > The API has no way of overruling 
> > results. I would say an easier solution would be to explicitly say that 
> > spellcheckers should accept all words disregarding the context, so they 
> > would accept "Sri" without "Lanka" or "Burkina" without "Faso". Next, a 
> > grammar checker would see if Lanka is preceded with Sri, or Sri is 
> > followed by Lanka etc.
> >
> > Of course, this presupposes that developers of proofreaders are in touch 
> > with developers of spellchecker dictionaries so that dictionaries would 
> > be properly prepared.
> >   
>   
Why would that be the case?
The word-only spell checkers like Hunspell will be just fine if "Sri"
and "Lanka" are encountered by themselves. And later on the proofreader
can decide to raise a spelling error if it encounters "Sri" without "Lanka".
Therefore in this case I see no need for a closer collaboration
between dictionary providers and proofreader implementation.

The only thing that needs to be done is to extend the words (i.e. the
breakiterator's definition of a word) to such a level that the spell
checker will not get handed over text parts that are not acceptable as a
single word. This is to prevent it from marking text parts as wrong that
are actually correct.
(A good example for such problems is issue #64400)
The rest is left to the proofreader.
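
The division of work for the "Sri Lanka" case could look roughly like this in Python. The word list and the context rule are toy examples invented for illustration; they are neither Hunspell nor any real proofreader.

```python
import re

WORD_LIST = {"sri", "lanka", "is", "an", "island"}  # toy dictionary

def word_ok(word):
    """Word-only spell checker: no context, so 'Sri' alone is accepted."""
    return word.lower() in WORD_LIST

def context_errors(sentence):
    """Hypothetical proofreader rule: flag 'Sri' not followed by 'Lanka'.

    Returns the indices of the offending words within the sentence.
    """
    words = re.findall(r"\w+", sentence)
    errors = []
    for i, w in enumerate(words):
        nxt = words[i + 1] if i + 1 < len(words) else None
        if w == "Sri" and nxt != "Lanka":
            errors.append(i)
    return errors

good = context_errors("Sri Lanka is an island")
bad = context_errors("Sri is an island")
```

The word-only checker needs no knowledge of the rule, and the rule needs no special preparation of the dictionary.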


> > Yet, as you probably know, Laci Nemeth wants to add some limited 
> > context-check to hunspell. 
>   

Yes, I know. And I'm sorry for still not having found the time to
provide him with a rudimentary C++ implementation. :-(



> > Ps. BTW, I've heard that the comment being visible only after clicking 
> > "Explain" is definitely less usable than the previous dialog box that we 
> > had in LanguageTool. Users I talked to prefer to have the explanation 
> > displayed without clicking. I find this intuitive as well. Maybe we 
> > should ask people from the UX project to comment on this?
> >   
>   

Sure you can.
The last time I asked I was told that the dialog is already cramped but
the size of the dialog should not increase either. Thus nothing was done
to display the text from the 'Explain' button in a more directly visible
way.


Thomas
