Re: equivalent and optional characters in words

Marcin Miłkowski Tue, 21 May 2013 07:48:08 -0700

On 2013-05-21 13:57, Andriy Rysin wrote:
> On May 21, 2013 4:17 AM, "Marcin Miłkowski" <[email protected]> wrote:
>  >
>  > On 2013-05-21 05:26, Andriy Rysin wrote:
>  > > On 04/21/2013 03:11 AM, Jaume Ortolà i Font wrote:
>  > >> 2013/4/21 Andriy Rysin <[email protected]>
>  > >>
>  > >>     1) I would like to treat several apostrophes equally
>  > >>     (apostrophes are part of the word in Ukrainian), e.g. in the
>  > >>     dictionary and rules I could use ' (0x27), but I would like to
>  > >>     be able to parse text that has U+2019 (and potentially U+02BC)
>  > >>     the same way. I guess I could do a simple replace in the word
>  > >>     tokenizer, but I was wondering if there's a better way.
>  > >>
>  > >> This is what is done in Catalan. So far I have found no problems.
>  > >>
>  > > This seems to work pretty well for *replacing* chars, but if I also
>  > > *remove* the accent (U+0301) from words in the word tokenizer, it looks
>  > > like it messes up the error position in the sentence (at least in the
>  > > web interface). Is there a right way to remove symbols I don't care about?
>  >
>  > Yes, but you'd need to change processing a bit: I had an idea to mark up
>  > some AnalyzedTokenReadings as ignorable, so that the rules wouldn't see
>  > them. Basically, a single attribute should suffice, and in several
>  > places (where you get tokens without spaces, for example) these tokens
>  > would be excluded. Also, the code that checks for a preceding space
>  > would need to be reviewed so that the ignorable symbol does not
>  > interfere with it.
> Marcin
>
> I'm not sure I understood. I don't want to exclude tokens; I want to
> remove a character from a token as if it weren't there. But it looks like
> when the position of an error after that token in the sentence is
> calculated, the removed character is not taken into account.
> It feels like if I want to remove a character I need to remember the
> previous token's position and length and use them later for the position
> calculation.


Right. Then we'd need to add a property to store the original
characters, and then restore them when making substitutions. This is
quite tricky, but it would be useful for our XML mode, where we get
wrong positions just because we lose the info about the original
characters.
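
A minimal sketch of the bookkeeping this would take, assuming the
cleanup runs over the whole text before tokenization. The class and
method names are hypothetical and not part of the LanguageTool API;
the idea is only to keep, for every character of the cleaned text, its
index in the original text, so that positions computed on the cleaned
text can be mapped back:

```java
import java.util.Arrays;

/**
 * Hypothetical helper (not LanguageTool API): cleans a text and remembers,
 * for each character that is kept, where it came from in the original.
 */
final class OffsetPreservingCleaner {

    private final String cleaned;
    // originalOffsets[i] is the index in the original text of cleaned char i;
    // the last entry is a sentinel equal to the original length.
    private final int[] originalOffsets;

    OffsetPreservingCleaner(String original) {
        StringBuilder sb = new StringBuilder(original.length());
        int[] offsets = new int[original.length() + 1];
        int kept = 0;
        for (int i = 0; i < original.length(); i++) {
            char c = original.charAt(i);
            if (c == '\u0301') {
                continue; // drop the combining acute accent
            }
            if (c == '\u2019' || c == '\u02BC') {
                c = '\''; // treat both apostrophe variants like the ASCII one
            }
            offsets[kept++] = i;
            sb.append(c);
        }
        offsets[kept] = original.length(); // sentinel for end positions
        this.cleaned = sb.toString();
        this.originalOffsets = Arrays.copyOf(offsets, kept + 1);
    }

    /** The text that the dictionary and rules would see. */
    String cleaned() {
        return cleaned;
    }

    /** Maps a position in the cleaned text (0..cleaned length) to the original text. */
    int toOriginal(int cleanedPos) {
        return originalOffsets[cleanedPos];
    }
}
```

An error found at [start, end) in the cleaned text would then be
reported at [toOriginal(start), toOriginal(end)) in the original. With
this mapping a removed accent ends up inside the span of the character
it follows, and the sentinel entry lets an end position at the very end
of the text be mapped as well.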

Search the archive: I made a suggestion about this some time ago, but I
forgot how it was supposed to work and never had time to work on it.

Marcin

>
> Andriy
> _______________________________________________
> Languagetool-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/languagetool-devel
>

