Re: Validate string as UTF-8?

2005-11-06 Thread david mugnai
On Sun, 06 Nov 2005 18:58:50 +, Tony Nelson wrote:

[snip]

> Is there a general way to call GLib functions?

ctypes?
http://starship.python.net/crew/theller/ctypes/
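
A rough sketch of the ctypes route: GLib's g_utf8_validate() can be called directly once the shared library is loaded. The lookup name "glib-2.0", the helper names load_glib and is_valid_utf8, and the fallback to a plain decode when GLib cannot be found are assumptions made for illustration, and the code is written for Python 3 rather than the Python 2 of this thread.

```python
import ctypes
import ctypes.util


def load_glib():
    """Return a ctypes handle to GLib, or None if it cannot be loaded."""
    name = ctypes.util.find_library("glib-2.0")  # assumed library name
    if name is None:
        return None
    try:
        lib = ctypes.CDLL(name)
    except OSError:
        return None
    # gboolean g_utf8_validate(const gchar *str, gssize max_len,
    #                          const gchar **end)
    lib.g_utf8_validate.argtypes = [
        ctypes.c_char_p,
        ctypes.c_ssize_t,
        ctypes.POINTER(ctypes.c_char_p),
    ]
    lib.g_utf8_validate.restype = ctypes.c_int
    return lib


_glib = load_glib()


def is_valid_utf8(data):
    """Check a bytes object with GLib if available, else by decoding."""
    if _glib is not None:
        # Passing None for `end`: we only want the yes/no answer.
        return bool(_glib.g_utf8_validate(data, len(data), None))
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8(b"caf\xc3\xa9"))  # well-formed two-byte sequence
print(is_valid_utf8(b"caf\xe9"))      # truncated multi-byte sequence
```

The two paths may disagree on edge cases such as embedded NUL bytes, so treat the fallback as a convenience, not as an exact equivalent.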

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Validate string as UTF-8?

2005-11-06 Thread Fredrik Lundh
Tony Nelson wrote:

> I'd like to have a fast way to validate large amounts of string data as
> being UTF-8.

define "validate".

> I don't see a fast way to do it in Python, though:
>
>     unicode(s, 'utf-8').encode('utf-8')

if "validate" means "make sure the byte stream doesn't use invalid
sequences", a plain

    unicode(s, 'utf-8')

should be sufficient.
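
A minimal sketch of that check wrapped in a function. It assumes Python 3, where the Python 2 call unicode(s, 'utf-8') is spelled s.decode('utf-8') on a bytes object; the name is_valid_utf8 is made up for illustration.

```python
def is_valid_utf8(data):
    """Return True if the bytes object decodes cleanly as UTF-8."""
    try:
        # Strict decoding raises on any invalid byte sequence.
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


print(is_valid_utf8("na\u00efve".encode("utf-8")))
print(is_valid_utf8(b"\xff\xfe not utf-8"))
```

The decoded string is thrown away here; only the presence or absence of the exception matters.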

/F





Re: Validate string as UTF-8?

2005-11-06 Thread Diez B. Roggisch
Tony Nelson wrote:
> I'd like to have a fast way to validate large amounts of string data as
> being UTF-8.
>
> I don't see a fast way to do it in Python, though:
>
>     unicode(s, 'utf-8').encode('utf-8')
>
> seems to notice at least some of the time (the unicode() part works but
> the encode() part bombs).  I don't consider a RE based solution to be
> fast.  GLib provides a routine to do this, and I am using GTK so it's
> included in there somewhere, but I don't see a way to call GLib
> routines.  I don't want to write another extension module.

I somehow doubt that the encode bombs. Can you provide some more
details? Perhaps some of the strings that allegedly don't work?

Besides that, it's unnecessary: unicode(s, 'utf-8') alone should be
sufficient. If there are any undecodable byte sequences in there, that
should find them.
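
To illustrate how the decode step "finds" the bad bytes, here is a small sketch that reports where decoding first fails. It assumes Python 3 (bytes.decode in place of the Python 2 unicode() call); first_invalid_offset is a hypothetical helper name.

```python
def first_invalid_offset(data):
    """Return the byte offset of the first undecodable sequence, or None."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as exc:
        # exc.start is the index of the first offending byte.
        return exc.start
    return None


print(first_invalid_offset(b"abc"))         # None: nothing to report
print(first_invalid_offset(b"abc\xffdef"))  # 3: 0xff can never start a sequence
```

The exception also carries a reason attribute with a short description, which helps when collecting examples of strings that fail.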

Regards,

Diez


Re: Validate string as UTF-8?

2005-11-06 Thread Tony Nelson
In article [EMAIL PROTECTED],
 david mugnai [EMAIL PROTECTED] wrote:

> On Sun, 06 Nov 2005 18:58:50 +, Tony Nelson wrote:
>
> [snip]
>
>> Is there a general way to call GLib functions?
>
> ctypes?
> http://starship.python.net/crew/theller/ctypes/

Umm.  Might be easier to write an extension module.

TonyN.:'[EMAIL PROTECTED]
  '  http://www.georgeanelson.com/


Re: Validate string as UTF-8?

2005-11-06 Thread Waitman Gobble
I have done this using a system call to the program recode. Recode a
file to UTF-8, then do a diff on the original and recoded files. Not an
elegant solution, but it did seem to work properly.

Take care,

Waitman Gobble



Re: Validate string as UTF-8?

2005-11-06 Thread Tony Nelson
In article [EMAIL PROTECTED],
 Fredrik Lundh [EMAIL PROTECTED] wrote:

> Tony Nelson wrote:
>
>> I'd like to have a fast way to validate large amounts of string data as
>> being UTF-8.
>
> define "validate".

That all the data conforms to the UTF-8 encoding format.  I can live with
it if someone has made data that impersonates UTF-8 but isn't really
Unicode.


>> I don't see a fast way to do it in Python, though:
>>
>>     unicode(s, 'utf-8').encode('utf-8')
>
> if "validate" means "make sure the byte stream doesn't use invalid
> sequences", a plain
>
>     unicode(s, 'utf-8')
>
> should be sufficient.

You are correct.  I misunderstood what was happening in my code.  I 
apologise for wasting bandwidth and your time (and I wasted my own time 
as well).

Indeed, unicode(s, 'utf-8') will catch the problem and is fast enough 
for my purpose, adding about 25% to the time to load a file.
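
For large files, the same check can run chunk by chunk while the file is being read, so the data never has to be decoded in a single call. This is only a sketch, assuming Python 3 and a binary file-like object; stream_is_valid_utf8 and the 64 KiB chunk size are illustrative choices, not something from this thread.

```python
import codecs
import io


def stream_is_valid_utf8(stream, chunk_size=64 * 1024):
    """Validate a binary stream as UTF-8 without reading it all at once."""
    # The incremental decoder buffers multi-byte sequences that are
    # split across chunk boundaries.
    decoder = codecs.getincrementaldecoder("utf-8")()
    try:
        while True:
            chunk = stream.read(chunk_size)
            if not chunk:
                break
            decoder.decode(chunk)
        # final=True makes a dangling partial sequence an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True


print(stream_is_valid_utf8(io.BytesIO("caf\u00e9".encode("utf-8"))))
print(stream_is_valid_utf8(io.BytesIO(b"caf\xc3")))
```

With a real file, open it in binary mode ("rb") and pass the file object in place of the io.BytesIO stand-in.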

TonyN.:'[EMAIL PROTECTED]
  '  http://www.georgeanelson.com/