[ https://issues.apache.org/jira/browse/TRAFODION-2477?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15889485#comment-15889485 ]
Hans Zeller commented on TRAFODION-2477:
----------------------------------------

My plan is to do two things:

First, raise errors when we encounter an invalid UCS-2 code point in UCS-2 to UTF-8 conversion. This could lead to new errors in places where we didn't see them before.

Second, I'm planning to add two new CQDs: TRANSLATE_ERROR (default is ON) can be set to OFF to allow invalid code points in translation and replace them with a replacement character. A second CQD, TRANSLATE_ERROR_UNICODE_TO_UNICODE (default ON), is just for Unicode to Unicode translations, which should not normally cause any errors.

If the newly raised errors cause trouble, they can be turned off. Even with the errors turned off, invalid characters no longer cause additional characters to be lost: the invalid characters (invalid UTF-16 surrogate pairs) will be translated into the "replacement character". Luckily, convDoIt has everything that's needed already; all we have to do is set a flag.

> Invalid characters in UCS2 to UTF8 translation are not handled correctly
> ------------------------------------------------------------------------
>
>                 Key: TRAFODION-2477
>                 URL: https://issues.apache.org/jira/browse/TRAFODION-2477
>             Project: Apache Trafodion
>          Issue Type: Bug
>          Components: sql-cmp
>    Affects Versions: 2.0-incubating
>            Reporter: Hans Zeller
>            Assignee: Hans Zeller
>             Fix For: 2.2-incubating
>
>
> When translating from UCS-2 to UTF-8, using CAST or TRANSLATE(...
> UCS2TOUTF8), all valid characters will map easily to a UTF-8 character.
> However, if we encounter invalid code points or invalid UTF-16 surrogate
> pairs, those could raise errors. Right now we just suppress those errors.
> Instead we should either translate them to the Unicode "replacement
> character" U+FFFD or we should raise an error. Ideally, we should have a CQD
> that decides which of these two actions to take.
> Test case:
> create table tbaducs2(a char(10) character set ucs2);
> -- DC00 is a low-order UTF-16 surrogate, on its own this is invalid
> insert into tbaducs2 values(_ucs2 X'DC000041');
> select translate(a using ucs2toutf8) from tbaducs2;
> -- this returns an empty string - no error, no replacement character
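
For illustration only, the two behaviours discussed in this issue (raise an error, or substitute U+FFFD) can be sketched in Python. This is not the Trafodion code and it does not use convDoIt. The names `ucs2_to_utf8`, `raise_on_error` (standing in for a CQD such as TRANSLATE_ERROR), and `TranslateError` are hypothetical. The sketch assumes the input is big-endian 16-bit code units, as the X'DC000041' literal is written, and it treats only unpaired surrogates as invalid.

```python
import struct

REPLACEMENT_CHAR = "\uFFFD"  # Unicode "replacement character"


class TranslateError(ValueError):
    """Hypothetical error type for an invalid code unit in the input."""


def ucs2_to_utf8(data: bytes, raise_on_error: bool = True) -> bytes:
    """Convert big-endian UCS-2/UTF-16 code units to UTF-8.

    raise_on_error=True  : an unpaired surrogate raises TranslateError.
    raise_on_error=False : an unpaired surrogate becomes U+FFFD, and the
                           characters that follow it are still converted.
    """
    if len(data) % 2:
        raise ValueError("input must contain an even number of bytes")
    units = struct.unpack(f">{len(data) // 2}H", data)

    out = []
    i = 0
    while i < len(units):
        u = units[i]
        is_high = 0xD800 <= u <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Valid surrogate pair: combine into one supplementary code point.
            cp = 0x10000 + ((u - 0xD800) << 10) + (units[i + 1] - 0xDC00)
            out.append(chr(cp))
            i += 2
            continue
        if 0xD800 <= u <= 0xDFFF:
            # Unpaired high or low surrogate: invalid on its own.
            if raise_on_error:
                raise TranslateError(f"unpaired surrogate 0x{u:04X} at unit {i}")
            out.append(REPLACEMENT_CHAR)
        else:
            out.append(chr(u))
        i += 1
    return "".join(out).encode("utf-8")
```

With the bytes from the test case, `ucs2_to_utf8(bytes.fromhex("DC000041"), raise_on_error=False)` returns `b'\xef\xbf\xbdA'` (U+FFFD followed by "A"), while the default `raise_on_error=True` raises `TranslateError`. Neither mode silently drops the trailing "A".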