Re: converting to and from octal escaped UTF--8
On Dec 3, 8:10 am, Michael Goerz <[EMAIL PROTECTED]> wrote:
> MonkeeSage wrote:
> > On Dec 3, 1:31 am, MonkeeSage <[EMAIL PROTECTED]> wrote:
> >> On Dec 2, 11:46 pm, Michael Spencer <[EMAIL PROTECTED]> wrote:
> >>> Michael Goerz wrote:
> >>>> Hi,
> >>>>
> >>>> I am writing unicode strings into a special text file that requires
> >>>> non-ascii characters to be written as octal-escaped UTF-8 codes.
> >>>>
> >>>> For example, the letter "Í" (latin capital I with acute, code point 205)
> >>>> would come out as "\303\215".
> >>>>
> >>>> I will also have to read back from the file later on and convert the
> >>>> escaped characters back into a unicode string.
> >>>>
> >>>> Does anyone have any suggestions on how to go from "Í" to "\303\215" and
> >>>> vice versa?
> >>>
> >>> Perhaps something along the lines of:
> >>>
> >>> >>> def encode(source):
> >>> ...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
> >>> ...
> >>> >>> def decode(encoded):
> >>> ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
> >>> ...     return bytes.decode('utf8')
> >>> ...
> >>> >>> encode(u"Í")
> >>> '\\303\\215'
> >>> >>> print decode(_)
> >>> Í
> >>>
> >>> HTH
> >>> Michael
> >>
> >> Nice one. :) If I might suggest a slight variation to handle cases
> >> where the "encoded" string contains plain text as well as octal
> >> escapes...
> >>
> >> def decode(encoded):
> >>     for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
> >>         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
> >>     return encoded.decode('utf8')
> >>
> >> This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
> >> as well as "adf\\303\\215adf".
> >>
> >> Regards,
> >> Jordan
> >
> > err...
> >
> > def decode(encoded):
> >     for octc in re.findall(r'\\(\d{3})', encoded):
> >         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
> >     return encoded.decode('utf8')
>
> Great suggestions from both of you! I came up with my "final" solution
> based on them. It encodes only non-ascii and non-printables, and stays
> in unicode strings for both input and output. Also, low ascii values now
> encode into a 3-digit octal sequence, so that decode can catch them
> properly.
>
> Thanks a lot,
> Michael
>
> import re
>
> def encode(source):
>     encoded = ""
>     for character in source:
>         if (ord(character) < 32) or (ord(character) > 128):
>             for byte in character.encode('utf8'):
>                 encoded += ("\%03o" % ord(byte))
>         else:
>             encoded += character
>     return encoded.decode('utf-8')
>
> def decode(encoded):
>     decoded = encoded.encode('utf-8')
>     for octc in re.findall(r'\\(\d{3})', decoded):
>         decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>     return decoded.decode('utf8')
>
> orig = u"blaÍblub" + chr(10)
> enc = encode(orig)
> dec = decode(enc)
> print orig
> print enc
> print dec

An optimization: in decode(), store the matches as keys in a dict, so
that the string replacement is done only once for each unique escape
sequence...

def decode(encoded):
    decoded = encoded.encode('utf-8')
    matches = {}
    for octc in re.findall(r'\\(\d{3})', decoded):
        matches[octc] = None
    for octc in matches:
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')

Untested...

Regards,
Jordan
--
http://mail.python.org/mailman/listinfo/python-list
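The de-duplication idea above can also be collapsed into a single pass over the data. What follows is only a minimal Python 3 sketch (the thread itself is Python 2) and is not code from any of the posters. It assumes that only three-digit escapes in the range \000 to \377 should be decoded, which is stricter than the \d{3} pattern used in the thread, and the name OCTAL_ESCAPE is made up for illustration.

```python
import re

# Hypothetical helper name: matches a backslash followed by exactly three
# octal digits whose value fits in one byte (\000 .. \377).
OCTAL_ESCAPE = re.compile(rb'\\([0-3][0-7]{2})')


def decode(encoded):
    """Turn octal-escaped UTF-8 text back into a str (Python 3 sketch)."""
    raw = encoded.encode('utf-8')
    # re.sub visits every match exactly once, so no repeated replace() calls
    # and no bookkeeping of unique matches are needed.
    raw = OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode('utf-8')


print(decode('adf\\303\\215adf'))
```

With this pattern a sequence such as \389 is left untouched, whereas \d{3} would match it and int('389', 8) would then raise ValueError.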
Re: converting to and from octal escaped UTF--8
> Michael Goerz <[EMAIL PROTECTED]> (MG) wrote:

>MG> if (ord(character) < 32) or (ord(character) > 128):

If you encode chars < 32 it seems more appropriate to also encode 127.

Moreover your code is quadratic in the size of the string so if you use
long strings it would be better to use join.
--
Piet van Oostrum <[EMAIL PROTECTED]>
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: [EMAIL PROTECTED]
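Both suggestions can be sketched together as follows. This is a Python 3 illustration under stated assumptions, not the code anyone in the thread posted: it treats every code point below 32 or at 127 and above as needing an escape (so DEL and U+0080 are covered too), and it collects the pieces in a list and joins them once instead of growing a string with +=.

```python
def encode(source):
    """Octal-escape control and non-ASCII characters (Python 3 sketch)."""
    parts = []
    for ch in source:
        code = ord(ch)
        if code < 32 or code >= 127:
            # Iterating over bytes yields ints in Python 3, so no ord() here.
            parts.extend('\\%03o' % b for b in ch.encode('utf-8'))
        else:
            parts.append(ch)
    # A single join avoids the repeated copying that += on strings can cause.
    return ''.join(parts)


print(encode('bla\u00cdblub\n'))
```

Note that a literal backslash in the input is passed through unchanged here, as in the thread's version, so text that already looks like an escape would be ambiguous when decoded.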
Re: converting to and from octal escaped UTF--8
MonkeeSage wrote:
> On Dec 3, 1:31 am, MonkeeSage <[EMAIL PROTECTED]> wrote:
>> On Dec 2, 11:46 pm, Michael Spencer <[EMAIL PROTECTED]> wrote:
>>> Michael Goerz wrote:
>>>> Hi,
>>>>
>>>> I am writing unicode strings into a special text file that requires
>>>> non-ascii characters to be written as octal-escaped UTF-8 codes.
>>>>
>>>> For example, the letter "Í" (latin capital I with acute, code point 205)
>>>> would come out as "\303\215".
>>>>
>>>> I will also have to read back from the file later on and convert the
>>>> escaped characters back into a unicode string.
>>>>
>>>> Does anyone have any suggestions on how to go from "Í" to "\303\215" and
>>>> vice versa?
>>>
>>> Perhaps something along the lines of:
>>>
>>> >>> def encode(source):
>>> ...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
>>> ...
>>> >>> def decode(encoded):
>>> ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
>>> ...     return bytes.decode('utf8')
>>> ...
>>> >>> encode(u"Í")
>>> '\\303\\215'
>>> >>> print decode(_)
>>> Í
>>>
>>> HTH
>>> Michael
>>
>> Nice one. :) If I might suggest a slight variation to handle cases
>> where the "encoded" string contains plain text as well as octal
>> escapes...
>>
>> def decode(encoded):
>>     for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
>>         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>>     return encoded.decode('utf8')
>>
>> This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
>> as well as "adf\\303\\215adf".
>>
>> Regards,
>> Jordan
>
> err...
>
> def decode(encoded):
>     for octc in re.findall(r'\\(\d{3})', encoded):
>         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>     return encoded.decode('utf8')

Great suggestions from both of you! I came up with my "final" solution
based on them. It encodes only non-ascii and non-printables, and stays
in unicode strings for both input and output. Also, low ascii values now
encode into a 3-digit octal sequence, so that decode can catch them
properly.

Thanks a lot,
Michael

import re

def encode(source):
    encoded = ""
    for character in source:
        if (ord(character) < 32) or (ord(character) > 128):
            for byte in character.encode('utf8'):
                encoded += ("\%03o" % ord(byte))
        else:
            encoded += character
    return encoded.decode('utf-8')

def decode(encoded):
    decoded = encoded.encode('utf-8')
    for octc in re.findall(r'\\(\d{3})', decoded):
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')

orig = u"blaÍblub" + chr(10)
enc = encode(orig)
dec = decode(enc)
print orig
print enc
print dec
Re: converting to and from octal escaped UTF--8
On Dec 3, 1:31 am, MonkeeSage <[EMAIL PROTECTED]> wrote:
> On Dec 2, 11:46 pm, Michael Spencer <[EMAIL PROTECTED]> wrote:
> > Michael Goerz wrote:
> > > Hi,
> > >
> > > I am writing unicode strings into a special text file that requires
> > > non-ascii characters to be written as octal-escaped UTF-8 codes.
> > >
> > > For example, the letter "Í" (latin capital I with acute, code point 205)
> > > would come out as "\303\215".
> > >
> > > I will also have to read back from the file later on and convert the
> > > escaped characters back into a unicode string.
> > >
> > > Does anyone have any suggestions on how to go from "Í" to "\303\215" and
> > > vice versa?
> >
> > Perhaps something along the lines of:
> >
> > >>> def encode(source):
> > ...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
> > ...
> > >>> def decode(encoded):
> > ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
> > ...     return bytes.decode('utf8')
> > ...
> > >>> encode(u"Í")
> > '\\303\\215'
> > >>> print decode(_)
> > Í
> >
> > HTH
> > Michael
>
> Nice one. :) If I might suggest a slight variation to handle cases
> where the "encoded" string contains plain text as well as octal
> escapes...
>
> def decode(encoded):
>     for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
>         encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
>     return encoded.decode('utf8')
>
> This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
> as well as "adf\\303\\215adf".
>
> Regards,
> Jordan

err...

def decode(encoded):
    for octc in re.findall(r'\\(\d{3})', encoded):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return encoded.decode('utf8')
Re: converting to and from octal escaped UTF--8
On Dec 2, 11:46 pm, Michael Spencer <[EMAIL PROTECTED]> wrote:
> Michael Goerz wrote:
> > Hi,
> >
> > I am writing unicode strings into a special text file that requires
> > non-ascii characters to be written as octal-escaped UTF-8 codes.
> >
> > For example, the letter "Í" (latin capital I with acute, code point 205)
> > would come out as "\303\215".
> >
> > I will also have to read back from the file later on and convert the
> > escaped characters back into a unicode string.
> >
> > Does anyone have any suggestions on how to go from "Í" to "\303\215" and
> > vice versa?
>
> Perhaps something along the lines of:
>
> >>> def encode(source):
> ...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
> ...
> >>> def decode(encoded):
> ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
> ...     return bytes.decode('utf8')
> ...
> >>> encode(u"Í")
> '\\303\\215'
> >>> print decode(_)
> Í
> >>>
>
> HTH
> Michael

Nice one. :) If I might suggest a slight variation to handle cases
where the "encoded" string contains plain text as well as octal
escapes...

def decode(encoded):
    for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
        encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return encoded.decode('utf8')

This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
as well as "adf\\303\\215adf".

Regards,
Jordan
Re: converting to and from octal escaped UTF--8
Michael Goerz wrote:
> Hi,
>
> I am writing unicode strings into a special text file that requires
> non-ascii characters to be written as octal-escaped UTF-8 codes.
>
> For example, the letter "Í" (latin capital I with acute, code point 205)
> would come out as "\303\215".
>
> I will also have to read back from the file later on and convert the
> escaped characters back into a unicode string.
>
> Does anyone have any suggestions on how to go from "Í" to "\303\215" and
> vice versa?
>

Perhaps something along the lines of:

>>> def encode(source):
...     return "".join("\%o" % ord(c) for c in source.encode('utf8'))
...
>>> def decode(encoded):
...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
...     return bytes.decode('utf8')
...
>>> encode(u"Í")
'\\303\\215'
>>> print decode(_)
Í
>>>

HTH
Michael
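For readers on Python 3, the same pair of one-liners might look like the sketch below. It is an adaptation rather than the posted code: iterating over a bytes object already yields integers, so ord() and chr() drop out, and the names encode_all and decode_all are hypothetical, chosen to signal that every byte gets escaped, plain ASCII included.

```python
def encode_all(source):
    """Escape every UTF-8 byte of source as a backslash plus octal digits."""
    return ''.join('\\%o' % b for b in source.encode('utf-8'))


def decode_all(encoded):
    """Inverse of encode_all; expects a string made only of octal escapes."""
    # split('\\') leaves an empty first element before the leading backslash.
    return bytes(int(c, 8) for c in encoded.split('\\')[1:]).decode('utf-8')


escaped = encode_all('\u00cd')
print(escaped)
print(decode_all(escaped))
```

As with the interactive session above, decode_all only round-trips strings in which every character is escaped.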
Re: converting to and from octal escaped UTF--8
MonkeeSage wrote:
> Looks like escape() can be a bit simpler...
>
> def escape(s):
>     result = []
>     for char in s:
>         result.append("\%o" % ord(char))
>     return ''.join(result)
>
> Regards,
> Jordan

Very neat! Thanks a lot...
Michael
Re: converting to and from octal escaped UTF--8
On Dec 2, 8:38 pm, Michael Goerz <[EMAIL PROTECTED]> wrote:
> Michael Goerz wrote:
> > Hi,
> >
> > I am writing unicode strings into a special text file that requires
> > non-ascii characters to be written as octal-escaped UTF-8 codes.
> >
> > For example, the letter "Í" (latin capital I with acute, code point 205)
> > would come out as "\303\215".
> >
> > I will also have to read back from the file later on and convert the
> > escaped characters back into a unicode string.
> >
> > Does anyone have any suggestions on how to go from "Í" to "\303\215" and
> > vice versa?
> >
> > I know I can get the code point by doing
> > "Í".decode('utf-8').encode('unicode_escape')
> > but there doesn't seem to be any similar method for getting the octal
> > escaped version.
> >
> > Thanks,
> > Michael
>
> I've come up with the following solution. It's not very pretty, but it
> works (no bugs, I hope). Can anyone think of a better way to do it?
>
> Michael
>
> import binascii
>
> def escape(s):
>     hexstring = binascii.b2a_hex(s)
>     result = ""
>     while len(hexstring) > 0:
>         (hexbyte, hexstring) = (hexstring[:2], hexstring[2:])
>         octbyte = oct(int(hexbyte, 16)).zfill(3)
>         result += "\\" + octbyte[-3:]
>     return result
>
> def unescape(s):
>     result = ""
>     while len(s) > 0:
>         if s[0] == "\\":
>             (octbyte, s) = (s[1:4], s[4:])
>             try:
>                 result += chr(int(octbyte, 8))
>             except ValueError:
>                 result += "\\"
>                 s = octbyte + s
>         else:
>             result += s[0]
>             s = s[1:]
>     return result
>
> print escape("\303\215")
> print unescape('adf\\303\\215adf')

Looks like escape() can be a bit simpler...

def escape(s):
    result = []
    for char in s:
        result.append("\%o" % ord(char))
    return ''.join(result)

Regards,
Jordan
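One difference between the two escape() versions is worth a quick check: the quoted version pads each byte to three octal digits via zfill(3), while the shorter one uses "%o" with no padding. The Python 3 sketch below compares the two behaviours on a small sample; the function names escape_unpadded and escape_padded are made up for this comparison and operate on bytes rather than Python 2 strings.

```python
def escape_unpadded(data):
    """Mirror of the shorter version: no zero padding of the octal value."""
    return ''.join('\\%o' % b for b in data)


def escape_padded(data):
    """Always emit three octal digits, as zfill(3) did in the longer version."""
    return ''.join('\\%03o' % b for b in data)


sample = 'a\n'.encode('utf-8')
print(escape_unpadded(sample))  # \141\12
print(escape_padded(sample))    # \141\012
```

For bytes of 0o100 and above, which includes every byte of a multi-byte UTF-8 sequence, both forms give the same output. The padding only matters for low byte values, and only if the reader expects exactly three digits after each backslash, as the quoted unescape() does with s[1:4].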
Re: converting to and from octal escaped UTF--8
Michael Goerz wrote:
> Hi,
>
> I am writing unicode strings into a special text file that requires
> non-ascii characters to be written as octal-escaped UTF-8 codes.
>
> For example, the letter "Í" (latin capital I with acute, code point 205)
> would come out as "\303\215".
>
> I will also have to read back from the file later on and convert the
> escaped characters back into a unicode string.
>
> Does anyone have any suggestions on how to go from "Í" to "\303\215" and
> vice versa?
>
> I know I can get the code point by doing
> "Í".decode('utf-8').encode('unicode_escape')
> but there doesn't seem to be any similar method for getting the octal
> escaped version.
>
> Thanks,
> Michael

I've come up with the following solution. It's not very pretty, but it
works (no bugs, I hope). Can anyone think of a better way to do it?

Michael

import binascii

def escape(s):
    hexstring = binascii.b2a_hex(s)
    result = ""
    while len(hexstring) > 0:
        (hexbyte, hexstring) = (hexstring[:2], hexstring[2:])
        octbyte = oct(int(hexbyte, 16)).zfill(3)
        result += "\\" + octbyte[-3:]
    return result

def unescape(s):
    result = ""
    while len(s) > 0:
        if s[0] == "\\":
            (octbyte, s) = (s[1:4], s[4:])
            try:
                result += chr(int(octbyte, 8))
            except ValueError:
                result += "\\"
                s = octbyte + s
        else:
            result += s[0]
            s = s[1:]
    return result

print escape("\303\215")
print unescape('adf\\303\\215adf')
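The fallback in unescape(), where a backslash that is not followed by a valid octal triple is kept as a literal backslash, can be sketched in Python 3 as below. This is an illustrative rewrite under assumptions, not the posted function: it returns bytes, it checks the three characters explicitly instead of relying on ValueError, and OCTAL_DIGITS is a name introduced here.

```python
OCTAL_DIGITS = set('01234567')  # hypothetical helper constant


def unescape(text):
    """Decode \\ooo escapes to bytes, leaving malformed sequences as they are."""
    out = bytearray()
    i = 0
    while i < len(text):
        digits = text[i + 1:i + 4]
        if (text[i] == '\\' and len(digits) == 3
                and set(digits) <= OCTAL_DIGITS
                and int(digits, 8) < 256):
            out.append(int(digits, 8))
            i += 4
        else:
            # Ordinary character, or a backslash without a valid octal triple.
            out.extend(text[i].encode('utf-8'))
            i += 1
    return bytes(out)


print(unescape('adf\\303\\215adf').decode('utf-8'))
```

Advancing an index instead of re-slicing the remaining string on every step also avoids copying the tail of the input repeatedly.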