On 20/01/2013 01:45, Dominique Pellé wrote:
> Mauro Condarelli wrote:
>
>> Currently I have multi-dictionary capability and I (slightly) modified
>> MorfologikSpellerRule to accept without further action words having POS
>> tags.
> Hi Mauro
>
> We need to be able to turn this on/off per language.
> Is this the case?
>
> What you describe will be useful in Breton at least, where the dictionary
> for POS tag has some good words which are not in Hunspell.
>
> In Esperanto, it will not work at all because the POS tagger is not
> dictionary based. Some of the words which have a POS tag can
> still be considered as a typo. It may seem strange but the Esperanto
> Hunspell has many missing words: it's hard to list all valid words
> in Esperanto because it's an agglutinative language. But because
> the language is regular, instead of using a dictionary, the Esperanto
> tagger can use an algorithm based on word endings: words ending
> in *o are nouns, *oj are plural nouns, *a are adjectives, *e are
> adverbs, etc.
>
> In French, I will also turn it off, because the POS tag dictionary
> and Hunspell are based on the same dictionary (http://www.dicollect.org),
> but they have different tokenization. Tokenization for Hunspell for
> example does not split on the apostrophe, so "l'haricot" is recognized
> as a typo. But for grammar checking, it is split on the apostrophe.
> So ignoring typos for words that have POS will ignore valid typos
> in French such as: L'haricot. There is nothing to gain with this
> change anyway for French because the Hunspell dictionary is very
> good.
>
> Regards
> Dominique
>
This needs to be discussed a bit before I proceed.
To clarify:
The current patch can't be disabled per language.
It would not be a problem to modify it so that it can be switched off
per language.
The reason for the patch is that I'm building a multi-tier tagger based
on both BaseTagger and ManualTagger.
The general idea is to have three possible tagging dictionaries:
1) The global language dictionary.
2) A user dictionary.
3) A dictionary specific to the file being processed.

This is mainly useful for proper names (including place names),
neologisms and foreign words we might want to intersperse in the
checked text, but it could also be a good way to improve the standard
dictionary if we could ask users to send over their "improvements".
It would be possible (and I plan to implement it) to fully update the
main tagger dictionary with these "user suggestions".
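
To make the three-tier idea concrete, here is a minimal sketch. It does
not use the real BaseTagger or ManualTagger classes; TagDictionary and
MultiTierTagger are hypothetical names, and "the first tier with a hit
wins" is only one possible policy (merging the tags from all tiers would
be another).

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** One tagging dictionary: maps a word form to its POS tags (hypothetical type). */
class TagDictionary {
    private final Map<String, List<String>> entries = new HashMap<>();

    void add(String word, String posTag) {
        entries.computeIfAbsent(word, k -> new ArrayList<>()).add(posTag);
    }

    /** Returns the tags for the word, or an empty list if it is unknown. */
    List<String> lookup(String word) {
        return entries.getOrDefault(word, Collections.emptyList());
    }
}

/** Queries the tiers in the order given; the first tier that knows the word wins. */
class MultiTierTagger {
    private final List<TagDictionary> tiers;

    MultiTierTagger(List<TagDictionary> tiers) {
        this.tiers = new ArrayList<>(tiers);
    }

    List<String> tag(String word) {
        for (TagDictionary tier : tiers) {
            List<String> tags = tier.lookup(word);
            if (!tags.isEmpty()) {
                return tags;
            }
        }
        return Collections.emptyList(); // no tier knows this word
    }
}
```

With the tiers passed as (file-specific, user, global), a word found
only in the user dictionary still gets its tags, and a word unknown to
all tiers comes back with an empty list.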

I plan to add some API to dynamically manage these dictionaries.
Options should be:
a) Ignore for the current session. Nothing is saved on disk.
b) Save as:
   i) Local: a word used in the document (or group of documents, or
      application using LT)
   ii) User-specific
   iii) Global: this is a real word not covered by current tagger: add it!
This means I can have up to seven tagging dictionaries currently active.
On the other hand, the current Hunspell-based strategy is static. For
this reason I need a way to know whether a word has already been found
somewhere or whether a further check is in order.
I chose to use the presence of a POS tag, hence the patch, but I'm
open to suggestions.
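
As a rough illustration of how the POS-tag test could be made
switchable per language (as Dominique asks), something like the
following might do. TaggedWordFilter and its method are hypothetical
names, not part of the current patch or of the LT API, and the language
codes are only examples taken from this thread.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Decides whether a word with POS tags may bypass the speller (hypothetical helper). */
class TaggedWordFilter {
    // Language codes for which "has a POS tag" is trusted as "correctly spelled".
    private final Set<String> enabledLanguages;

    TaggedWordFilter(Set<String> enabledLanguages) {
        this.enabledLanguages = new HashSet<>(enabledLanguages);
    }

    /** True only if the shortcut is enabled for the language AND the word has a tag. */
    boolean canSkipSpellCheck(String languageCode, List<String> posTags) {
        return enabledLanguages.contains(languageCode) && !posTags.isEmpty();
    }
}
```

With only "br" (Breton) enabled, a tagged Breton word skips the
speller, while tagged words in "eo" or "fr" are still checked as before.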

Two possible courses of action come to mind:
either add Yet Another Flag to AnalyzedTokenReadings (e.g. spellOk),
or recheck all dictionaries in MorfologikSpellerRule (this really seems
like overkill, especially for all languages where tokenizer and speller
are based on the same dictionary).
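
For the first option, a sketch of what I have in mind follows.
CheckedToken is a simplified stand-in, NOT the real
AnalyzedTokenReadings, and FlagAwareSpeller is likewise a hypothetical
name; the point is only that the tagger tier that recognizes a word sets
the flag, and the speller consults it before doing its own lookup.

```java
import java.util.HashSet;
import java.util.Set;

/** Simplified stand-in for a token; not the real AnalyzedTokenReadings. */
class CheckedToken {
    private final String text;
    private boolean spellOk; // set by whichever dictionary tier recognized the word

    CheckedToken(String text) {
        this.text = text;
    }

    String getText() { return text; }
    boolean isSpellOk() { return spellOk; }
    void setSpellOk(boolean spellOk) { this.spellOk = spellOk; }
}

/** Speller that trusts the flag and only falls back to its own word list. */
class FlagAwareSpeller {
    private final Set<String> spellerWords;

    FlagAwareSpeller(Set<String> spellerWords) {
        this.spellerWords = new HashSet<>(spellerWords);
    }

    boolean isMisspelled(CheckedToken token) {
        if (token.isSpellOk()) {
            return false; // already accepted elsewhere, no further check
        }
        return !spellerWords.contains(token.getText());
    }
}
```

Unlike the POS-tag test, the flag is set explicitly, so a language such
as Esperanto, where tags are derived from word endings, would simply
never set it.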

Please advise.
Mauro
