On 2:59 PM, Sorin Schwimmer wrote:
Steven D'Aprano: the original file is 139MB (that's the typical size for it). 
Eliminating diacritics is just a little topping on the cake; the processing is 
something else.

Thanks anyway for your suggestion,
SxN

PS Perhaps I should have mentioned that I'm on Python 2.7


In the message you were replying to, Steven had a much more important suggestion than the one about size, and you apparently didn't notice it. Chris implied the same thing. I'll try a third time.

The file is obviously encoded, and you know the encoding. Judging from the first entry in your table, it's utf-8. If so, then your approach is all wrong. Treating it as a pile of bytes and replacing byte pairs is likely to get you into trouble, since a pair could match the last byte of one character and the first byte of the next. If you substitute such a match, you'll make a hash of the whole region, and quite likely end up with a byte stream that is no longer even valid utf-8.

Fortunately, you can solve that problem, and simplify your code greatly in the bargain, by doing something like what was suggested by Steven.

Change your map of encoded bytes into a new map, unicode_nodia, by calling decode("utf-8") on the keys and using unicode literals (u"...") for the values.
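
For instance, if your current table is a dict of utf-8 byte strings, the conversion might look like the sketch below (the table contents here are made up; substitute your own). One detail: unicode's translate() wants its keys to be Unicode ordinals, so take ord() of each decoded key.

  # hypothetical byte-level table: utf-8 byte pairs -> replacements
  nodia = {
      "\xc4\x82": "A",   # LATIN CAPITAL LETTER A WITH BREVE
      "\xc4\x83": "a",   # LATIN SMALL LETTER A WITH BREVE
  }

  # translate() expects {ordinal: unicode}, so decode each key to a
  # single unicode character and take its ord()
  unicode_nodia = dict(
      (ord(k.decode("utf-8")), v.decode("utf-8"))
      for k, v in nodia.items()
  )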

Read in each line of the file, decode it to the unicode it represents, and do a simple translate once it's valid unicode.

Assuming the line is in utf-8, use
  uni = line.decode("utf-8")
  newuni = uni.translate(unicode_nodia)
  newutf8 = newuni.encode("utf-8")
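
Put together, a minimal sketch of the whole loop might be (the filenames are placeholders; reading line by line keeps memory use modest even for a 139MB file):

  # filenames are placeholders; substitute your real ones
  with open("input.txt", "rb") as fin, open("output.txt", "wb") as fout:
      for line in fin:
          uni = line.decode("utf-8")
          newuni = uni.translate(unicode_nodia)
          fout.write(newuni.encode("utf-8"))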

Incidentally, to see what a given byte pair in your table is, you can do something like:

  >>> import unicodedata
  >>> a = chr(196) + chr(130)
  >>> unicodedata.name(a.decode("utf-8"))
  'LATIN CAPITAL LETTER A WITH BREVE'



DaveA


