Re: equivalent and optional characters in words

Andriy Rysin Tue, 21 May 2013 04:57:28 -0700

On May 21, 2013 4:17 AM, "Marcin Miłkowski" <[email protected]> wrote:
>
> W dniu 2013-05-21 05:26, Andriy Rysin pisze:
> > On 04/21/2013 03:11 AM, Jaume Ortolà i Font wrote:
> >> 2013/4/21 Andriy Rysin <[email protected] <mailto:[email protected]>>
> >>
> >>     1) I would like to treat several apostrophes equally (apostrophes are
> >>     part of the word in Ukrainian), e.g. in dictionary and rules I could use
> >>     ' (0x27) but I would like to be able to parse text that has U+2019 (and
> >>     potentially U+02BC) the same way, I guess I could do a simple replace in
> >>     word tokenizer but I was wondering if there's a better way
> >>
> >> This is what is done in Catalan. So far  I have found no problem.
> >>
> > This seems to work pretty nice for *replacing* chars, but if I also
> > *remove* accent (U+0301) from words in word tokenizer it looks like it
> > messes up the error position in the sentence (at least in the web
> > interface). Is there a right way to remove symbols I don't care about?
>
> Yes, but you'd need to change processing a bit: I had an idea to mark up
> some AnalyzedTokenReadings as ignorable, so that the rules wouldn't see
> them. Basically, a single attribute should suffice, and in several
> places (where you get tokens without spaces, for example) these tokens
> would be excluded. Also, the code for checking for the preceding space
> would need to be checked so that the ignorable symbol would not mess up
> with it.
> Marcin


I'm not sure I understood. I don't want to exclude tokens; I want to remove
a character from a token as if it had never been there. But it looks like the
removed character is not taken into account when the position of an error
after that token in the sentence is calculated.
It feels like if I want to remove a character, I need to remember the previous
token's original position and length and use them later for the position
calculation.
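
To make the bookkeeping concrete, here is a rough sketch of what I mean. It
is plain Java and does not use any LanguageTool classes; NormalizedText,
normalize() and toOriginal() are names I made up for illustration. It
replaces U+2019 and U+02BC with ' (0x27), drops U+0301, and keeps a table
that maps every position in the cleaned-up text back to the original text,
so that an error position computed on the cleaned-up text can be translated
back.

```java
import java.util.Arrays;

/** Hypothetical helper for illustration only; not a LanguageTool class. */
final class NormalizedText {
    /** Text after apostrophe replacement and accent removal. */
    final String text;
    /**
     * origOffsets[i] is the index in the original string of the i-th
     * character of {@code text}; the extra last entry is the original length,
     * so exclusive end positions can be mapped as well.
     */
    final int[] origOffsets;

    private NormalizedText(String text, int[] origOffsets) {
        this.text = text;
        this.origOffsets = origOffsets;
    }

    static NormalizedText normalize(String original) {
        StringBuilder sb = new StringBuilder(original.length());
        int[] offsets = new int[original.length() + 1];
        int n = 0;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c == '\u0301') {
                continue; // combining acute accent: drop it, record nothing
            }
            if (c == '\u2019' || c == '\u02BC') {
                c = '\''; // treat all apostrophe variants as 0x27
            }
            offsets[n++] = i; // remember where this character came from
            sb.append(c);
        }
        offsets[n] = original.length();
        return new NormalizedText(sb.toString(), Arrays.copyOf(offsets, n + 1));
    }

    /** Maps a position in the normalized text back to the original text. */
    int toOriginal(int normalizedPos) {
        return origOffsets[normalizedPos];
    }
}
```

With this, an error found at [from, to) in the cleaned-up text would be
reported at [toOriginal(from), toOriginal(to)) in the user's text. Whether
the tokenizer is the right place to keep such a table is exactly what I'm
unsure about.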

Andriy


