Re: Nim code to Remove Accented Letters

OderWat Mon, 03 Oct 2016 18:30:03 +0200

As I said, neither this nor your original C code will work with UTF-8 at all.

To make it work for UTF-8 you can use the 
[unicode.nim](https://nim-lang.org/docs/unicode.html) 
module. But switching to UTF-8, and therefore Unicode, will make all of that 
much more complex.


For example, your input string will change its length (byte wise), because an 
accented char takes more than one byte in UTF-8 while its unaccented 
replacement takes only one.
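
A minimal sketch of that length difference, assuming the `unicode` module is imported. The bytes are written as escapes so the example does not depend on the encoding of the source file:

```nim
import unicode

let accented = "\xC3\xA9"   # "é" (U+00E9) encoded as UTF-8
let plain = "e"

# len counts bytes, runeLen counts Unicode code points.
doAssert accented.len == 2
doAssert accented.runeLen == 1
doAssert plain.len == 1
```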

There are different ways to solve your problem, including some that do not use 
the unicode module at all and do not even "really" know about UTF-8. One of 
those may even be the best solution for your task.

The most efficient solution for your exact problem would IMHO be an array of 
"utf-8" strings which get searched in your string while using two indices into 
the string. The first is the read position, where you check whether one of the 
array entries matches; the second is the write position of the "resulting" 
string. You could also just create a new string but that would be slightly 
less efficient. If you do, make sure you preallocate the maximum space and 
call setLen on it afterwards.

The procedure is pretty simple: Every time you find a replacement you advance 
the first index by the (raw) byte length of the found string (aka RuneLen) and 
write the replacement char at the second index. The replacement chars come 
from a string that is indexed in the same way as the first array.

If nothing is found you just copy one char and search again. As the result is 
never longer than the original string, that will work and end with the second 
index giving the new length of your string.
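
The two-index procedure could look roughly like the sketch below. The names `accented`, `plain` and `stripAccents` and the four table entries are made up for illustration and are not from the original code. It uses `continuesWith` from `strutils` to compare at the read position without creating a substring.

```nim
import strutils

# Hypothetical tables: accented[i] gets replaced by plain[i].
const accented = ["\xC3\xA0", "\xC3\xA9", "\xC3\xB1", "\xC3\xBC"]  # à é ñ ü
const plain = "aenu"

proc stripAccents(s: var string) =
  var src = 0   # first index: read position
  var dst = 0   # second index: write position, never ahead of src
  while src < s.len:
    var matched = false
    for i, a in accented:
      if s.continuesWith(a, src):
        s[dst] = plain[i]   # write the one-byte replacement
        src += a.len        # skip the whole encoded sequence
        matched = true
        break
    if not matched:
      s[dst] = s[src]       # nothing found: copy one byte
      inc src
    inc dst
  s.setLen(dst)             # dst is the new length

var text = "caf\xC3\xA9 ma\xC3\xB1ana"
stripAccents(text)
doAssert text == "cafe manana"
```

This works in place only because every replacement is shorter than the sequence it replaces, so the write position can never overtake the read position.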

It is slightly inefficient to do that scanning byte wise because you will also 
search for matching substrings inside of other UTF-8 encoded sequences.

To avoid that you could use the same technique as the utf8 iterator uses:
    
    
    # the utf8 iterator from the unicode module
    iterator utf8*(s: string): string =
      var o = 0
      while o < s.len:
        let n = runeLenAt(s, o)
        # s[o .. (o+n-1)] is one encoded char: this is what you need to
        # search for and replace with your unaccented chars
        yield s[o .. (o+n-1)]
        o += n
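
A rough sketch of that rune-stepping variant, building a new string instead of working in place. The names `accented`, `plain` and `stripAccentsByRune` and the table contents are assumptions for illustration only. It steps with `runeLenAt` from the `unicode` module so that lookups only happen at the start of an encoded sequence.

```nim
import unicode

# Hypothetical tables: accented[i] gets replaced by plain[i].
const accented = ["\xC3\xA0", "\xC3\xA9", "\xC3\xB1", "\xC3\xBC"]  # à é ñ ü
const plain = "aenu"

proc stripAccentsByRune(s: string): string =
  result = newStringOfCap(s.len)   # the result is never longer than s
  var o = 0
  while o < s.len:
    let n = runeLenAt(s, o)
    # Clamp the end in case the input stops in the middle of a sequence.
    let last = min(o + n, s.len) - 1
    let ch = s[o .. last]
    let idx = accented.find(ch)    # -1 if ch is not in the table
    if idx >= 0:
      result.add plain[idx]
    else:
      result.add ch                # keep everything else unchanged
    o += n

doAssert stripAccentsByRune("caf\xC3\xA9 ma\xC3\xB1ana") == "cafe manana"
```

The linear `find` is fine for a handful of entries; a larger table would probably want a hash table lookup instead.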
    

I hope this is of some help without writing down a working version :)
