Sakcee wrote:
> Thanks very much for the info, it really helped.
>
> We are using the text from the file to display on a webpage, and we have a
> method for converting the parsed data to UTF-8 and then displaying it. All
> the data looks fine after parsing except the surrogate pair.
> Since I cannot guess what it was supposed to be, is it OK to strip it
> using the regex re.compile(' [\xed|\xa0] ')?

As Martin said: that alters the meaning of the bytes. Whether that bothers you is for you to decide. If, for example, you stripped all vowels from a text, it might still be comprehensible to most people, so if vowels bother you for whatever reason, remove them.

Bt myb y bttr try nd fx th prblm n th frst plc.

Regards,

Diez
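
P.S. To make the point about altered bytes concrete, here is a minimal sketch. It assumes Python 3.4 or later (for the `surrogatepass` error handler), and the sample bytes and the helper names `strip_bytes` and `repair_surrogates` are made up for illustration. The pattern is a simplified character class in the spirit of the proposed regex, not the exact expression from the question. It shows that removing every 0xED and 0xA0 byte also damages an unrelated character, whereas decoding the surrogate pair and recombining it may recover the intended character.

```python
import re

# Hypothetical sample data: "à" (bytes C3 A0), a space, then U+1F600 stored
# as two UTF-8-encoded surrogates (ED A0 BD  ED B8 80) instead of one
# proper 4-byte UTF-8 sequence.
raw = b"\xc3\xa0 " + b"\xed\xa0\xbd\xed\xb8\x80"


def strip_bytes(data: bytes) -> bytes:
    """Remove every 0xED and 0xA0 byte, wherever it occurs."""
    return re.sub(rb"[\xed\xa0]", b"", data)


def repair_surrogates(data: bytes) -> str:
    """Decode, letting surrogates through, then recombine pairs via UTF-16.

    Raises UnicodeDecodeError if a surrogate has no matching partner.
    """
    text = data.decode("utf-8", errors="surrogatepass")
    return text.encode("utf-16", errors="surrogatepass").decode("utf-16")


# Stripping also eats the A0 that belongs to "à", leaving invalid UTF-8.
print(strip_bytes(raw))

# Repairing keeps "à" intact and yields a single code point for the pair.
print(ascii(repair_surrogates(raw)))
```

Whether the recombined character is what the original author meant still depends on how the file was produced, so this is only a starting point for tracking down the real cause.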