Thanks very much for the info, it really helped. We are using the text from the file to display on a web page, and we have a method that converts the parsed data to UTF-8 before displaying it. All the data looks fine after parsing except the surrogate pair. Since I cannot guess what it was supposed to be, is it OK to strip it using a regex, e.g. re.compile(' [\xed|\xa0] ')?
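For reference, a character class such as [\xed|\xa0] matches any single one of the bytes \xed, | or \xa0, so it would also hit bytes that belong to valid characters (for example, "à" is \xc3\xa0 in UTF-8). A narrower approach is to match only the complete three-byte sequence of an encoded surrogate, which never occurs in well-formed UTF-8. The following is a minimal sketch, assuming Python 3 and raw bytes as input; the names ENCODED_SURROGATE and strip_encoded_surrogates are made up for illustration and are not from the thread.

```python
import re

# An encoded surrogate (U+D800..U+DFFF) is always three bytes:
# lead byte 0xED, second byte 0xA0..0xBF, third byte 0x80..0xBF.
# Well-formed UTF-8 only ever follows 0xED with 0x80..0x9F.
ENCODED_SURROGATE = re.compile(rb"\xed[\xa0-\xbf][\x80-\xbf]")


def strip_encoded_surrogates(data: bytes) -> bytes:
    """Drop encoded surrogate code points; leave every other byte untouched."""
    return ENCODED_SURROGATE.sub(b"", data)


# Hypothetical sample: "café", the offending bytes, then " ok".
sample = b"caf\xc3\xa9 \xed\xa0\xa0 ok"
cleaned = strip_encoded_surrogates(sample)
text = cleaned.decode("utf-8")  # strict decoding now succeeds
print(text)
```

Stripping discards whatever character the surrogate was meant to be part of, so this only suits cases where losing that character is acceptable.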
Martin v. Löwis wrote:
> Sakcee wrote:
> > Hi
> >
> > In one of the data files that I have, I am seeing these characters
> > \xed\xa0\xa0. They seem to break the xsl.
> [...]
> > is this a unicode utf-16 surrogate pair ?
>
> Yes and no. This is the UTF-8 encoding of U+D820, which is a high
> surrogate code point. So yes. It's not yet a pair; there would have to
> be a second such code point. So no.
>
> Furthermore, in UTF-8, you should never ever have encoded surrogate
> codes; instead, whoever generated the UTF-8 should have combined the
> two surrogate code points into a single coded character, and should
> have encoded *that* character. So no - this byte sequence isn't
> even valid UTF-8.
>
> > for displaying it on xml/xsl, should I extract only \xa0?
>
> You should tell your parser to reject the file as ill-formed.
>
> > since this is higher than 00-7f range can i just strip it?
>
> Depending on what you want to achieve: sure! It will modify
> the meaning of the bytes, of course.
>
> > under what condition the encoding software put this string in?
>
> If it has a bug.
>
> Regards,
> Martin
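The two claims in the quoted reply (the bytes are not valid UTF-8, and they correspond to the high surrogate U+D820) can be checked directly. This is a small sketch assuming Python 3; the variable names are illustrative only.

```python
raw = b"\xed\xa0\xa0"

# Strict UTF-8 decoding rejects the sequence as ill-formed.
try:
    raw.decode("utf-8")
    strict_ok = True
except UnicodeDecodeError:
    strict_ok = False

# The "surrogatepass" error handler lets the lone surrogate through,
# which makes it possible to inspect the code point.
code_point = ord(raw.decode("utf-8", errors="surrogatepass"))

# High surrogates occupy U+D800..U+DBFF.
is_high_surrogate = 0xD800 <= code_point <= 0xDBFF

print(strict_ok, hex(code_point), is_high_surrogate)
```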