Re: Aw: Re: Re: Re: Do you know a tool to decode "UTF-8 twice"

Frédéric Grosshans Wed, 30 Oct 2013 09:08:59 -0700

On 30/10/2013 16:13, "Jörg Knappen" wrote:

Thanks again!
My updated sed pattern generator now looks like:
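# Python 2: write sed rules that undo "UTF-8 twice" for U+00A0..U+016F
r = range(0xa0, 0x170)
file = open("fixu8.sed", "w")
for i in r:
  # UTF-8 bytes misread as latin-1, then encoded as UTF-8 again
  pat1 = "s/" + unichr(i).encode("utf-8").decode("latin-1").encode("utf-8") + "/" + unichr(i).encode("utf-8") + "/g"
  print >>file, pat1
  try:
    # same, but with the bytes misread as windows-1252
    pat2 = "s/" + unichr(i).encode("utf-8").decode("windows-1252").encode("utf-8") + "/" + unichr(i).encode("utf-8") + "/g"
  except UnicodeDecodeError:
    # byte undefined in windows-1252: no separate rule
    pat2 = pat1
  if (pat1 != pat2):
    print >>file, pat2
file.close()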
doing both latin-1 and windows-1252 mangled double UTF-8. This is probably enough for now; the rate of errors is low enough for practical purposes (i.e., lower than the natural error rate introduced by typing errors).
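
For anyone on Python 3, the same idea might be sketched roughly as follows. This is only an illustration, not the script actually used: the function name sed_rules and its parameters are made up here, and the rules are returned as a list of text strings so they can be inspected before being written out.

```python
def sed_rules(first=0xA0, last=0x170):
    """Build sed 's/mangled/good/g' lines for code points first..last-1.

    Hypothetical helper: 'mangled' is the character's UTF-8 bytes
    misread as latin-1 (and, where it differs, as windows-1252).
    """
    rules = []
    for cp in range(first, last):
        good = chr(cp)
        raw = good.encode("utf-8")
        # latin-1 defines every byte, so this decode cannot fail
        via_latin1 = raw.decode("latin-1")
        rules.append("s/{}/{}/g".format(via_latin1, good))
        try:
            via_cp1252 = raw.decode("windows-1252")
        except UnicodeDecodeError:
            # byte undefined in windows-1252: the latin-1 rule is all we have
            continue
        if via_cp1252 != via_latin1:
            rules.append("s/{}/{}/g".format(via_cp1252, good))
    return rules


rules = sed_rules()
```

The list could then be written with open("fixu8.sed", "w", encoding="utf-8") and "\n".join(rules), assuming sed is run in a UTF-8 locale.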

Why do you do both latin1 and windows-1252? Windows-1252 is supposed to be a superset of latin1, so it should be enough. Or is there a problem with the few undefined bytes of windows-1252 (81, 8D, 8F, 90, 9D)?
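
A quick way to look at this is to compare the two codecs byte by byte. The sketch below is Python 3 and assumes Python's built-in "windows-1252" and "latin-1" codec tables are representative of what the mangling software used; the variable names are only illustrative.

```python
undefined = []   # bytes that the windows-1252 codec refuses to decode
differing = []   # bytes that decode, but not to the same character as latin-1

for b in range(0x100):
    raw = bytes([b])
    try:
        as_cp1252 = raw.decode("windows-1252")
    except UnicodeDecodeError:
        undefined.append(b)
        continue
    if as_cp1252 != raw.decode("latin-1"):
        differing.append(b)

print(["%02X" % b for b in undefined])
print(len(differing), "bytes decode differently from latin-1")
```

With Python's tables, the undefined list is exactly 81, 8D, 8F, 90, 9D, and every other byte in 80..9F decodes to a different character than under latin-1 (0x92 becomes U+2019 instead of the control character U+0092, for example). Since UTF-8 continuation bytes fall in 80..BF, that would explain why the two codecs can produce different mangled forms for the same character.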



    Frédéric
