On 2013-05-21 13:57, Andriy Rysin wrote:
> On May 21, 2013 4:17 AM, "Marcin Miłkowski" <[email protected]> wrote:
> >
> > On 2013-05-21 05:26, Andriy Rysin wrote:
> > > On 04/21/2013 03:11 AM, Jaume Ortolà i Font wrote:
> > >> 2013/4/21 Andriy Rysin <[email protected]>
> > >>
> > >> 1) I would like to treat several apostrophes equally (apostrophes are
> > >> part of the word in Ukrainian), e.g. in dictionary and rules I could
> > >> use ' (0x27) but I would like to be able to parse text that has U+2019
> > >> (and potentially U+02BC) the same way, I guess I could do a simple
> > >> replace in word tokenizer but I was wondering if there's a better way
> > >>
> > >> This is what is done in Catalan. So far I have found no problem.
> > >>
> > > This seems to work pretty nicely for *replacing* chars, but if I also
> > > *remove* the accent (U+0301) from words in the word tokenizer, it looks
> > > like it messes up the error position in the sentence (at least in the
> > > web interface). Is there a right way to remove symbols I don't care about?
> >
> > Yes, but you'd need to change processing a bit: I had an idea to mark up
> > some AnalyzedTokenReadings as ignorable, so that the rules wouldn't see
> > them. Basically, a single attribute should suffice, and in several
> > places (where you get tokens without spaces, for example) these tokens
> > would be excluded. Also, the code for checking for the preceding space
> > would need to be checked so that the ignorable symbol would not mess
> > with it.
> > Marcin
>
> I'm not sure I understood. I don't want to exclude tokens, I want to
> remove a character from a token as if it wasn't there. But it looks like
> when the position of an error after that token in a sentence is
> calculated, the removed character is not taken into account.
> It feels like if I want to remove a character I need to remember the
> previous token position and length and use them later for position
> calculation.
Right. Then we'd need to add a property to store the original characters, and then restore them when making substitutions. This is quite tricky, but it would be useful for our XML mode, where we report wrong positions just because we lose the info about the original characters.

Search the archive; I made a suggestion about this some time ago. But I forgot how it was supposed to work, and never had time to work on it.

Marcin

> Andriy

_______________________________________________
Languagetool-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/languagetool-devel
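
To make the idea of storing the original characters concrete, here is a minimal sketch in Python. It is not LanguageTool code: the names `normalize`, `to_original_span`, `APOSTROPHES` and `IGNORED` are all hypothetical, and it only assumes the behaviour discussed in this thread (apostrophe variants replaced by U+0027, U+0301 removed). Alongside the cleaned text it keeps, for each kept character, its index in the original text, so a match position found in the cleaned text can be translated back.

```python
# Hypothetical sketch; none of these names come from LanguageTool.
APOSTROPHES = {"\u2019": "'", "\u02bc": "'"}  # variants mapped to U+0027
IGNORED = {"\u0301"}                          # combining acute accent, dropped


def normalize(text):
    """Return (cleaned, offsets); offsets[i] is the index in `text`
    of the character that produced cleaned[i]."""
    cleaned = []
    offsets = []
    for i, ch in enumerate(text):
        if ch in IGNORED:
            continue  # removed character: no entry in offsets
        cleaned.append(APOSTROPHES.get(ch, ch))
        offsets.append(i)
    return "".join(cleaned), offsets


def to_original_span(offsets, text_len, start, end):
    """Translate a [start, end) span of the cleaned text into a span of
    the original text. The end is pushed to the next kept character, so
    removed characters inside or right after the match are covered."""
    orig_start = offsets[start] if start < len(offsets) else text_len
    orig_end = offsets[end] if end < len(offsets) else text_len
    return orig_start, orig_end


# Usage: an accented word followed by a word with a typographic apostrophe.
text = "за\u0301мок і п\u2019ять"
cleaned, offsets = normalize(text)       # cleaned == "замок і п'ять"
start = cleaned.index("п'ять")           # position in the cleaned text
span = to_original_span(offsets, len(text), start, start + len("п'ять"))
# span == (9, 14), and text[9:14] is the word with its original U+2019
```

The offsets list costs one integer per kept character. Replacements never shift positions, so only removals make the cleaned and original indices diverge, which matches the symptom described above.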
