Re: Validate string as UTF-8?
On Sun, 06 Nov 2005 18:58:50 +, Tony Nelson wrote:
> [snip]
> Is there a general way to call GLib functions?

ctypes?
http://starship.python.net/crew/theller/ctypes/

--
http://mail.python.org/mailman/listinfo/python-list
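A rough sketch of what the ctypes route could look like, in present-day Python 3. It assumes a shared GLib 2.x library is installed and can be located under the name "glib-2.0"; the wrapper name glib_utf8_validate is made up for illustration, and the function returns None when no GLib is found rather than failing. g_utf8_validate() is the GLib routine for this check.

```python
import ctypes
import ctypes.util


def glib_utf8_validate(data: bytes):
    """Return True/False from GLib's g_utf8_validate(), or None if GLib is unavailable.

    Hypothetical helper: the library lookup name "glib-2.0" is an assumption
    about the local system.
    """
    path = ctypes.util.find_library("glib-2.0")
    if path is None:
        return None  # no GLib found on this system
    try:
        glib = ctypes.CDLL(path)
    except OSError:
        return None  # library located but could not be loaded

    func = glib.g_utf8_validate
    # gboolean g_utf8_validate(const gchar *str, gssize max_len, const gchar **end)
    func.argtypes = [
        ctypes.c_char_p,
        ctypes.c_ssize_t,
        ctypes.POINTER(ctypes.c_char_p),
    ]
    func.restype = ctypes.c_int
    # Passing None for `end` gives a NULL pointer: we only want the yes/no answer.
    return bool(func(data, len(data), None))
```

Because an explicit length is passed, GLib treats an embedded NUL byte as invalid, which differs from what a plain Python decode does.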
Re: Validate string as UTF-8?
Tony Nelson wrote:
> I'd like to have a fast way to validate large amounts of string data as
> being UTF-8.

define "validate".

> I don't see a fast way to do it in Python, though:
>
> unicode(s,'utf-8').encode('utf-8')

if "validate" means "make sure the byte stream doesn't use invalid sequences", a plain

unicode(s, 'utf-8')

should be sufficient.

/F
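A minimal sketch of the decode-only check, written for Python 3, where the thread's unicode(s, 'utf-8') is spelled data.decode('utf-8'). The helper names utf8_error and is_valid_utf8 are invented here for illustration.

```python
def utf8_error(data: bytes):
    """Return None if data is valid UTF-8, else (byte_offset, reason)."""
    try:
        data.decode("utf-8")  # strict error handling is the default
    except UnicodeDecodeError as exc:
        # exc.start is the offset of the first offending byte
        return exc.start, exc.reason
    return None


def is_valid_utf8(data: bytes) -> bool:
    """True if the whole byte string decodes as UTF-8."""
    return utf8_error(data) is None
```

No encode() step is needed: if the decode succeeds, the bytes were well-formed.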
Re: Validate string as UTF-8?
Tony Nelson wrote:
> I'd like to have a fast way to validate large amounts of string data as
> being UTF-8. I don't see a fast way to do it in Python, though:
>
> unicode(s,'utf-8').encode('utf-8')
>
> seems to notice at least some of the time (the unicode() part works but
> the encode() part bombs). I don't consider a RE based solution to be
> fast. GLib provides a routine to do this, and I am using GTK so it's
> included in there somewhere, but I don't see a way to call GLib
> routines. I don't want to write another extension module.

I somehow doubt that the encode bombs. Can you provide some more details? Maybe some of the strings that allegedly don't work?

Besides that, it's unnecessary: unicode(s, 'utf-8') should be sufficient. If there are any undecodable byte sequences in there, that should find them.

Regards,
Diez
Re: Validate string as UTF-8?
In article [EMAIL PROTECTED], david mugnai [EMAIL PROTECTED] wrote:
> On Sun, 06 Nov 2005 18:58:50 +, Tony Nelson wrote:
> > [snip]
> > Is there a general way to call GLib functions?
>
> ctypes?
> http://starship.python.net/crew/theller/ctypes/

Umm. Might be easier to write an extension module.

TonyN.:'[EMAIL PROTECTED] '
http://www.georgeanelson.com/
Re: Validate string as UTF-8?
I have done this using a system call to the program recode: recode the file to UTF-8 and diff the original against the recoded file. Not an elegant solution, but it did seem to work properly.

Take care,
Waitman Gobble
Re: Validate string as UTF-8?
In article [EMAIL PROTECTED], Fredrik Lundh [EMAIL PROTECTED] wrote:
> Tony Nelson wrote:
> > I'd like to have a fast way to validate large amounts of string data
> > as being UTF-8.
>
> define validate.

All data conforms to the UTF-8 encoding format. I can stand it if someone has made data that impersonates UTF-8 but isn't really Unicode.

> > I don't see a fast way to do it in Python, though:
> >
> > unicode(s,'utf-8').encode('utf-8')
>
> if validate means make sure the byte stream doesn't use invalid
> sequences, a plain unicode(s, 'utf-8') should be sufficient.

You are correct. I misunderstood what was happening in my code. I apologise for wasting bandwidth and your time (and I wasted my own time as well). Indeed, unicode(s, 'utf-8') will catch the problem and is fast enough for my purpose, adding about 25% to the time to load a file.

TonyN.:'[EMAIL PROTECTED] '
http://www.georgeanelson.com/
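For files too large to decode in one piece, the same check can be run chunk by chunk. This is only a sketch in Python 3, not what was used in the thread: the function name file_is_valid_utf8 and the 1 MiB default chunk size are arbitrary choices for illustration. It relies on the standard library's incremental decoder, which carries a multi-byte sequence across a chunk boundary.

```python
import codecs


def file_is_valid_utf8(path, chunk_size=1 << 20):
    """Return True if the file at `path` contains only well-formed UTF-8."""
    decoder = codecs.getincrementaldecoder("utf-8")(errors="strict")
    try:
        with open(path, "rb") as f:
            # Read fixed-size chunks until read() returns b"" at end of file.
            for chunk in iter(lambda: f.read(chunk_size), b""):
                decoder.decode(chunk)
        # final=True makes a truncated trailing sequence raise an error.
        decoder.decode(b"", final=True)
    except UnicodeDecodeError:
        return False
    return True
```

The decoded text is thrown away chunk by chunk, so memory use stays bounded by the chunk size rather than the file size.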