On 04/21/2013 03:11 AM, Jaume Ortolà i Font wrote:
2013/4/21 Andriy Rysin <[email protected]>

    1) I would like to treat several apostrophes equally (apostrophes are
    part of the word in Ukrainian). E.g. in the dictionary and rules I
    could use ' (0x27), but I would like to be able to parse text that
    has U+2019 (and potentially U+02BC) the same way. I guess I could do
    a simple replace in the word tokenizer, but I was wondering if
    there's a better way.

This is what is done in Catalan. So far I have found no problems.
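
A minimal sketch of that kind of replacement, in Java. The class and method names are hypothetical and this is not the actual Catalan tokenizer code; it only assumes the three code points mentioned above and uses a naive whitespace split in place of a real word tokenizer.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of LanguageTool.
final class ApostropheNormalizer {

    // The apostrophe form assumed to be used in the dictionary and rules (0x27).
    private static final char ASCII_APOSTROPHE = '\'';

    /** Maps U+2019 and U+02BC to the plain ASCII apostrophe. */
    static String normalize(String text) {
        return text.replace('\u2019', ASCII_APOSTROPHE)
                   .replace('\u02BC', ASCII_APOSTROPHE);
    }

    /** Naive whitespace tokenizer that normalizes apostrophes first. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : normalize(text).trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }
}
```

Each replacement swaps one char for one char, so character offsets into the original text stay the same after normalization.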

Jaume
Thanks, will try that. Another one: what's the recommended way to store knowledge about alternative spellings of a word, e.g. color vs. colour? It looks like it would make sense to encode this relation in the dictionary, so that we don't have to introduce a regex for the alternative spellings and repeat it in multiple rules. But I looked at the English module, and it looks like this relation is not present in the dictionary but is instead hardcoded in the rules.
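
To illustrate the idea of keeping the relation in one place, here is a rough sketch in Java. Everything in it (the class, its methods, the in-memory map) is hypothetical and is not an existing LanguageTool API or dictionary format; it only shows a variant-to-canonical lookup that rules could share instead of each repeating a regex.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration only; not an existing LanguageTool class.
final class SpellingVariants {

    // Maps every known spelling (including the canonical one) to its canonical form.
    private final Map<String, String> canonicalByVariant = new HashMap<>();

    /** Registers a canonical spelling together with its alternative spellings. */
    void add(String canonical, String... variants) {
        canonicalByVariant.put(canonical, canonical);
        for (String variant : variants) {
            canonicalByVariant.put(variant, canonical);
        }
    }

    /** Returns the canonical spelling, or the word itself if it is unknown. */
    String canonical(String word) {
        return canonicalByVariant.getOrDefault(word, word);
    }

    /** True if both spellings resolve to the same canonical form. */
    boolean sameWord(String a, String b) {
        return canonical(a).equals(canonical(b));
    }
}
```

With something like this, a rule would compare canonical forms after a single `add("color", "colour")` registration rather than matching `colou?r` itself.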

Thanks
Andriy
_______________________________________________
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
