Re: converting to and from octal escaped UTF--8

2007-12-04 Thread Piet van Oostrum
Michael Goerz [EMAIL PROTECTED] (MG) wrote:

MG> if (ord(character) < 32) or (ord(character) > 128):

If you encode chars < 32, it seems more appropriate to also encode 127.

Moreover, your code is quadratic in the size of the string, so for long
strings it would be better to use join.
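A minimal sketch of the join-based approach, written for Python 3 (the thread's code is Python 2). The function name and the choice to escape code 127 and everything above it are assumptions based on the two remarks above, not the original poster's final code:

```python
def encode(source):
    """Octal-escape control characters, DEL (127) and non-ASCII characters."""
    parts = []
    for character in source:
        code = ord(character)
        if code < 32 or code >= 127:
            # Escape every UTF-8 byte of the character as 3 octal digits.
            parts.extend("\\%03o" % byte for byte in character.encode("utf-8"))
        else:
            parts.append(character)
    # One join at the end instead of repeated string concatenation.
    return "".join(parts)


print(encode("bla\u00cdblub\n"))
```

The pieces are collected in a list and joined once, so the work grows linearly with the length of the input.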
-- 
Piet van Oostrum [EMAIL PROTECTED]
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: [EMAIL PROTECTED]
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: converting to and from octal escaped UTF--8

2007-12-04 Thread MonkeeSage
An optimization: in decode(), store the matches as keys in a dict, so you
only do the string replacement once for each unique escape sequence...

def decode(encoded):
  decoded = encoded.encode('utf-8')
  matches = {}
  for octc in re.findall(r'\\(\d{3})', decoded):
    matches[octc] = None
  for octc in matches:
    decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
  return decoded.decode('utf8')

Untested...
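An alternative that avoids repeated replace() calls altogether is a single re.sub() pass with a callback. This is a Python 3 sketch, not code from the thread; the stricter pattern (first digit 0-3, then two octal digits) is an assumption added here so that every match fits in one byte:

```python
import re

# A backslash followed by three octal digits with a value of at most 0o377.
_OCTAL_ESCAPE = re.compile(rb"\\([0-3][0-7]{2})")


def decode(encoded):
    """Turn octal-escaped UTF-8 text back into a str in one pass."""
    raw = encoded.encode("utf-8")
    # Replace each escape with the single byte it stands for.
    raw = _OCTAL_ESCAPE.sub(lambda m: bytes([int(m.group(1), 8)]), raw)
    return raw.decode("utf-8")
```

Because each escape is substituted where it is found, bytes produced by one replacement are never rescanned as part of another escape.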

Regards,
Jordan


Re: converting to and from octal escaped UTF--8

2007-12-03 Thread Michael Goerz
Great suggestions from both of you! I came up with my final solution
based on them. It encodes only non-ASCII and non-printable characters, and
stays in unicode strings for both input and output. Low ASCII values now
also encode into a 3-digit octal sequence, so that decode can catch them
properly.

Thanks a lot,
Michael



import re

def encode(source):
    encoded = ""
    for character in source:
        if (ord(character) < 32) or (ord(character) > 128):
            for byte in character.encode('utf8'):
                encoded += ("\\%03o" % ord(byte))
        else:
            encoded += character
    return encoded.decode('utf-8')

def decode(encoded):
    decoded = encoded.encode('utf-8')
    for octc in re.findall(r'\\(\d{3})', decoded):
        decoded = decoded.replace(r'\%s' % octc, chr(int(octc, 8)))
    return decoded.decode('utf8')


orig = u"blaÍblub" + chr(10)
enc  = encode(orig)
dec  = decode(enc)
print orig
print enc
print dec
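The point about 3-digit sequences can be checked directly. A small Python 3 sketch (the thread's code is Python 2); PATTERN is the same regular expression that decode() uses, and the other names are only illustrative:

```python
import re

PATTERN = r'\\(\d{3})'           # same pattern as in decode()

unpadded = "\\%o" % 10           # newline without padding: \12
padded = "\\%03o" % 10           # newline zero-padded:     \012

print(re.findall(PATTERN, unpadded))   # []
print(re.findall(PATTERN, padded))     # ['012']
```

Without the padding, the two-digit escape is not found at all, so decode would leave it in the text unchanged.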



Re: converting to and from octal escaped UTF--8

2007-12-02 Thread Michael Goerz
Michael Goerz wrote:
> Hi,
>
> I am writing unicode strings into a special text file that requires
> non-ASCII characters to be written as octal-escaped UTF-8 codes.
>
> For example, the letter Í (latin capital I with acute, code point 205)
> would come out as "\303\215".
>
> I will also have to read back from the file later on and convert the
> escaped characters back into a unicode string.
>
> Does anyone have any suggestions on how to go from "Í" to "\303\215" and
> vice versa?
>
> I know I can get the code point by doing
> "Í".decode('utf-8').encode('unicode_escape')
> but there doesn't seem to be any similar method for getting the octal
> escaped version.
>
> Thanks,
> Michael
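For reference, the numbers in the example work out as follows. This is a small Python 3 sketch (the thread itself uses Python 2), and the variable names are only illustrative:

```python
char = "\u00cd"                        # Í, LATIN CAPITAL LETTER I WITH ACUTE

code_point = ord(char)                 # the Unicode code point
utf8_bytes = char.encode("utf-8")      # two bytes: 0xC3 0x8D
octal = "".join("\\%03o" % b for b in utf8_bytes)

print(code_point)                      # 205
print(char.encode("unicode_escape"))   # b'\\xcd'
print(octal)                           # \303\215
```

The octal escapes describe the UTF-8 bytes (0xC3 is 303 and 0x8D is 215 in octal), not the code point itself, which is why unicode_escape alone does not give the desired form.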

I've come up with the following solution. It's not very pretty, but it
works (no bugs, I hope). Can anyone think of a better way to do it?

Michael

import binascii

def escape(s):
    hexstring = binascii.b2a_hex(s)
    result = ""
    while len(hexstring) > 0:
        (hexbyte, hexstring) = (hexstring[:2], hexstring[2:])
        octbyte = oct(int(hexbyte, 16)).zfill(3)
        result += "\\" + octbyte[-3:]
    return result

def unescape(s):
    result = ""
    while len(s) > 0:
        if s[0] == "\\":
            (octbyte, s) = (s[1:4], s[4:])
            try:
                result += chr(int(octbyte, 8))
            except ValueError:
                result += "\\"
                s = octbyte + s
        else:
            result += s[0]
            s = s[1:]
    return result

print escape("\303\215")
print unescape('adf\\303\\215adf')

Re: converting to and from octal escaped UTF--8

2007-12-02 Thread MonkeeSage
Looks like escape() can be a bit simpler...

def escape(s):
  result = []
  for char in s:
    result.append("\\%o" % ord(char))
  return ''.join(result)

Regards,
Jordan


Re: converting to and from octal escaped UTF--8

2007-12-02 Thread Michael Goerz
Very neat! Thanks a lot...
Michael


Re: converting to and from octal escaped UTF--8

2007-12-02 Thread Michael Spencer
Perhaps something along the lines of:

  >>> def encode(source):
  ...     return "".join("\\%o" % ord(c) for c in source.encode('utf8'))
  ...
  >>> def decode(encoded):
  ...     bytes = "".join(chr(int(c, 8)) for c in encoded.split('\\')[1:])
  ...     return bytes.decode('utf8')
  ...
  >>> encode(u"Í")
  '\\303\\215'
  >>> print decode(_)
  Í

HTH
Michael


Re: converting to and from octal escaped UTF--8

2007-12-02 Thread MonkeeSage
Nice one. :) If I might suggest a slight variation to handle cases
where the encoded string contains plain text as well as octal
escapes...

def decode(encoded):
  for octc in (c for c in re.findall(r'\\(\d{3})', encoded)):
    encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
  return encoded.decode('utf8')

This way it can handle both "\\141\\144\\146\\303\\215\\141\\144\\146"
as well as "adf\\303\\215adf".
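To see why mixed input needs the regex-based variation, here is a Python 3 rendering of the split-based decode (an illustration only; the name decode_split is made up here, and the thread's code is Python 2):

```python
def decode_split(encoded):
    # Every piece after a backslash must consist of octal digits only;
    # anything before the first backslash is dropped by the [1:] slice.
    raw = bytes(int(piece, 8) for piece in encoded.split("\\")[1:])
    return raw.decode("utf-8")


# Input that consists purely of escapes works.
print(decode_split("\\303\\215") == "\u00cd")   # True

# Input with plain text mixed in does not.
try:
    decode_split("adf\\303\\215adf")
except ValueError as exc:
    # The last piece is "215adf", which int(..., 8) rejects.
    print("split-based decode failed:", exc)
```

Matching only the backslash-plus-three-digits sequences leaves the surrounding plain text untouched, which is what the variation above relies on.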

Regards,
Jordan


Re: converting to and from octal escaped UTF--8

2007-12-02 Thread MonkeeSage
err...

def decode(encoded):
  for octc in re.findall(r'\\(\d{3})', encoded):
    encoded = encoded.replace(r'\%s' % octc, chr(int(octc, 8)))
  return encoded.decode('utf8')