[ 
https://issues.apache.org/jira/browse/XERCESC-1947?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12920219#action_12920219
 ] 

Ben Griffin edited comment on XERCESC-1947 at 10/12/10 10:33 AM:
-----------------------------------------------------------------

Hi Boris,

I'm pretty sure that any serializer that uses TranscodeToStr::transcode(const 
XMLCh *in, XMLSize_t len, XMLTranscoder* trans) will have this problem when the 
transcoder's target encoding is variable-width, and especially when a character 
needs more bytes in the target encoding than it occupies in the source encoding. 
The problem is most easily exposed by the patch.  Essentially, the failure 
happens because the output buffer is too small to hold even one transcoded 
character, so no input can be consumed even though there is input waiting to 
be consumed.

So when using UCS2 --> UTF-8, there is no problem until you get to UTF-8 
encodings of 3 or more bytes, i.e. characters at or above U+0800.  When there 
is a single character to be transcoded, the initial allocSize is not large 
enough to hold that one character, so the transcoder returns 0 'charsRead'.
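
To make the byte counts concrete, here is a minimal sketch of how many UTF-8 
bytes one UCS-2 code unit needs.  The name utf8Length is made up for 
illustration and is not part of Xerces; surrogate pairs are ignored.

```cpp
#include <cstddef>

// Hypothetical helper, not a Xerces function: number of UTF-8 bytes needed
// for one UCS-2 code unit (BMP only; surrogate pairs are not handled here).
constexpr std::size_t utf8Length(char16_t c) {
    return c < 0x0080 ? 1   // U+0000..U+007F
         : c < 0x0800 ? 2   // U+0080..U+07FF
         : 3;               // U+0800..U+FFFF
}
```

With this, utf8Length(0x254B) is 3, so a single character from around U+2500 
needs three bytes of output for two bytes of input.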

This error was exposed to me when querying attributes whose values were single 
Unicode characters from around U+2500.

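My code was doing something like...
        DOMAttr* enoda = enod->getAttributeNode(a_name);
        const XMLCh* x_attrval = enoda->getNodeValue();
        if (x_attrval != NULL && x_attrval[0] != 0 ) {
                std::string attrval;
                // The exception is thrown from this TranscodeToStr call.
                char* value = (char*)TranscodeToStr(x_attrval,"UTF-8").adopt();
                attrval = value;
                XMLString::release(&value); // adopt() hands ownership to the caller
        }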

I am not sure whether the supplied serializer uses TranscodeToStr in that sort 
of way - you are probably better informed than I am about that.  Maybe the 
component that I put the bug under shouldn't be 'Utilities'?  I'm not sure 
that I understand why you are interested in whether it affects 
parsing/serializing.  It certainly affects being able to use 
TranscodeToStr::transcode().  I don't believe that the error is in 
XMLUTF8Transcoder::transcodeTo(), because AFAIK it doesn't have storage for 
semi-consumed characters.  I believe that the error is in 
TranscodeToStr::transcode().
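
To illustrate the failure mode, here is a rough self-contained sketch.  It is 
not the Xerces implementation and not the attached patch: encodeSome and toUtf8 
are hypothetical stand-ins for a transcoder and its caller, and the buffer 
sizes are arbitrary.  The point is only that a caller which sees zero 
characters consumed can grow its buffer and retry instead of treating that as 
a transcoding failure.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a transcoder's transcodeTo(): encodes as many
// whole BMP characters as fit into dst, reports how many source characters it
// consumed, and returns the number of bytes written.  It never writes a
// partial character and does not handle surrogate pairs.
std::size_t encodeSome(const char16_t* src, std::size_t srcLen,
                       unsigned char* dst, std::size_t dstCap,
                       std::size_t& charsEaten) {
    std::size_t out = 0;
    charsEaten = 0;
    while (charsEaten < srcLen) {
        const char16_t c = src[charsEaten];
        const std::size_t need = c < 0x80 ? 1 : c < 0x800 ? 2 : 3;
        if (out + need > dstCap) break;  // next character does not fit
        if (need == 1) {
            dst[out++] = static_cast<unsigned char>(c);
        } else if (need == 2) {
            dst[out++] = static_cast<unsigned char>(0xC0 | (c >> 6));
            dst[out++] = static_cast<unsigned char>(0x80 | (c & 0x3F));
        } else {
            dst[out++] = static_cast<unsigned char>(0xE0 | (c >> 12));
            dst[out++] = static_cast<unsigned char>(0x80 | ((c >> 6) & 0x3F));
            dst[out++] = static_cast<unsigned char>(0x80 | (c & 0x3F));
        }
        ++charsEaten;
    }
    return out;
}

// Hypothetical caller: treats "zero characters consumed" as "buffer too
// small" and doubles the buffer rather than reporting an error.
std::string toUtf8(const std::u16string& in, std::size_t initialCap) {
    std::vector<unsigned char> buf(initialCap ? initialCap : 1);
    std::string result;
    std::size_t pos = 0;
    while (pos < in.size()) {
        std::size_t eaten = 0;
        const std::size_t written = encodeSome(in.data() + pos, in.size() - pos,
                                               buf.data(), buf.size(), eaten);
        if (eaten == 0) {  // nothing fit: enlarge the buffer and retry
            buf.resize(buf.size() * 2);
            continue;
        }
        result.append(reinterpret_cast<const char*>(buf.data()), written);
        pos += eaten;
    }
    return result;
}
```

Calling toUtf8(u"\u254B", 2) starts with a 2-byte buffer, gets zero characters 
consumed on the first pass, grows the buffer to 4 bytes, and then succeeds.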



  
> XMLUTF8Transcoder::transcodeTo  fails with an exception when transcoding 
> single characters that require 3 or more bytes as UTF8.
> --------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: XERCESC-1947
>                 URL: https://issues.apache.org/jira/browse/XERCESC-1947
>             Project: Xerces-C++
>          Issue Type: Bug
>          Components: Utilities
>    Affects Versions: 3.1.0, 3.1.1
>         Environment: Tested on mac os and debian linux. The failure is only 
> manifest on v3.1.x
>            Reporter: Ben Griffin
>            Priority: Critical
>         Attachments: TransService.patch, transtest.cpp
>
>
> This can be demonstrated with the following 2 lines of code.
>       const XMLCh uval [] = { 0x254B, 0x0000}; //BOX DRAWINGS HEAVY VERTICAL AND HORIZONTAL (needs 3 bytes for utf-8)
>       char* uc = (char*)TranscodeToStr(uval,"UTF-8").adopt(); cout << uc << endl << flush; XMLString::release(&uc); //faulty exception;
> The error is: "terminate called after throwing an instance of 'xercesc_3_1::TranscodingException'"
