Thanks for the tip. Actually, I haven't been using ICU till now. I tried out 
the following function:

ucnv_convertEx(UConverter *targetCnv, UConverter *sourceCnv,
               char **target, const char *targetLimit,
               const char **source, const char *sourceLimit,
               UChar *pivotStart, UChar **pivotSource,
               UChar **pivotTarget, const UChar *pivotLimit,
               UBool reset, UBool flush,
               UErrorCode *pErrorCode);

My target converter is UTF-8. My source converter is determined at runtime 
(using the ucsdet_detect() function), since I do not know the charset of the 
input buffer in advance.

This seems to work fine. However, going through the ICU v4 manual, I could not 
find a function to collapse whitespace. 
So the only alternative seems to be this:

Use ICU to detect the charset of the input buffer and convert the buffer to 
Unicode (UTF-16). Then use Xerces to collapse whitespace and, finally, convert 
the whitespace-collapsed Unicode buffer to UTF-8.
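
The collapse step on its own is small enough to sketch without either library. 
The following is only a sketch, assuming the text is already held as UTF-16 
code units and that "collapse" means the XML Schema rule: trim both ends and 
reduce each interior run of space, tab, CR or LF to a single space. The names 
isXmlSpace and collapseWhitespace are hypothetical, not ICU or Xerces functions.

```cpp
#include <string>

// The four characters XML treats as whitespace.
static bool isXmlSpace(char16_t c) {
    return c == u' ' || c == u'\t' || c == u'\n' || c == u'\r';
}

// Hypothetical helper (not an ICU or Xerces API): drops leading and trailing
// whitespace and replaces every interior whitespace run with one space.
std::u16string collapseWhitespace(const std::u16string& in) {
    std::u16string out;
    out.reserve(in.size());
    bool pendingSpace = false;
    for (char16_t c : in) {
        if (isXmlSpace(c)) {
            // Remember the run, but only if some text has already been
            // written; this is what discards leading whitespace.
            pendingSpace = !out.empty();
        } else {
            // Emit the single separating space only once text follows,
            // which is what discards trailing whitespace.
            if (pendingSpace) out.push_back(u' ');
            pendingSpace = false;
            out.push_back(c);
        }
    }
    return out;
}
```

Because the function works on code units and only ever removes or replaces 
ASCII whitespace, surrogate pairs pass through untouched.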

Here's the code:

const UCharsetMatch *ucm;
const char *name = NULL;   // stays NULL if detection fails
ucsdet_setText(csd, szCSTABuffer, sizeBuffer, &status);
ucm = ucsdet_detect(csd, &status);

if (U_FAILURE(status) || ucm == NULL) 
{
         // No match: name stays NULL, so ucnv_open() below falls back to
         // the default converter instead of reading an uninitialised pointer
}
else
{
         name  = ucsdet_getName(ucm, &status);
}

// set up the converter
converter = ucnv_open(name, &status);

// Convert from the source encoding to Unicode 
len = ucnv_toUChars(converter, target, targetSize, szCSTABuffer, sizeBuffer,
                    &status);

//Collapse whitespaces
XMLString::collapseWS(target);

//Using Xerces transcoder to convert the Unicode (UTF-16) to UTF-8
//Pass the length of the collapsed string, not the buffer capacity (targetSize)
uiTotalChars = m_pUTF8Transcoder->transcodeTo((const XMLCh* const)target,
                                              XMLString::stringLen(target),
                                              xmlTranscodedOutput,
                                              uiOutLength,
                                              uiCharsTranscoded,
                                              XMLTranscoder::UnRep_RepChar);

Does this make sense? Is this a good idea? 

Regards,
Swati

-----Original Message-----
From: John Lilley
Sent: Monday, April 05, 2010 7:15 PM

Have you tried using UCharsetDetector to determine the code page of the data?
It sounds like you are transcoding but don't know what the source code page is.
Regarding the crash in your transcoder, can you create a small standalone 
sample with data that crashes consistently?  Then the kind ICU people can see 
if there's a fix to be made along the lines of "Don't crash when given invalid 
input data".

john

-----Original Message-----
From: Swatilekha Doloi
Sent: Monday, April 05, 2010 5:59 AM

Hi,

Apologies for reposting this. I'm hoping that this time it is readable. I'm 
trying to transcode some data into UTF-8. The data I receive is not in UTF-16 
format, so I can't use it as the input to the transcodeTo() function.

Please help me!

>-----Original Message-----
>From: David Bertoni
>
>On 3/25/2010 12:03 AM, Swatilekha Doloi wrote:
>> Hi,
>>
>> Sorry for the delay in responding. My usage of the word 
>> 'non-printable' is probably incorrect. It displays something that 
>> looks like this:  æc¾Òw×s %# S1# ÔwÔi�...@õ OQ  S # õ”3
>>
>> Using XMLString::transcode before giving the buffer to the UTF-8 
>> Transcoder helped. This is my code now:
>OK, this is a very bad idea if your data is not in the machine's local
>code page. You need to provide more information about what the encoding 
>of the data in szCSTABuffer is.

I actually don't know - it comes from a different program running on a 
different computer. And yes, you're right, there's no guarantee that it would 
match my machine's locale settings. I would have to create a UTF16 transcoder 
and use 'transcodeTo' to convert the buffer to UTF-16. But I'm a bit confused 
with the various options:
fgUTF16BEncodingString                           'UTF-16 (BE)'
fgUTF16BEncodingString2                          'UTF-16BE'
fgUTF16EncodingString                            'UTF-16'
fgUTF16EncodingString2                           'UCS2'
fgUTF16EncodingString3                           'IBM1200'
fgUTF16EncodingString4                           'IBM-1200'
fgUTF16EncodingString5                           'UTF16'
fgUTF16EncodingString6                           'UCS-2'
fgUTF16EncodingString7                           'ISO-10646-UCS-2'
fgUTF16LEncodingString                           'UTF-16 (LE)'
fgUTF16LEncodingString2                          'UTF-16LE'

My target system is BE.
Should I use the ones for BE (fgUTF16BEncodingString/ fgUTF16BEncodingString2)? 
Or would these be fine (fgUTF16EncodingString/ fgUTF16EncodingString5)? 
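
On the BE/LE question: the explicit names only matter for serialized bytes. 
"UTF-16BE" and "UTF-16LE" fix the byte order and carry no byte order mark, 
while the unlabeled "UTF-16" leaves the order to a BOM or to a default. How a 
particular library resolves the unlabeled name is worth checking in its 
documentation. The sketch below only illustrates the difference between the 
two orders; ByteOrder and serializeUtf16 are hypothetical names.

```cpp
#include <cstdint>
#include <string>
#include <vector>

enum class ByteOrder { Big, Little };

// Hypothetical helper: writes each UTF-16 code unit as two bytes in the
// requested order. No byte order mark is emitted.
std::vector<std::uint8_t> serializeUtf16(const std::u16string& text,
                                         ByteOrder order) {
    std::vector<std::uint8_t> bytes;
    bytes.reserve(text.size() * 2);
    for (char16_t unit : text) {
        const std::uint8_t hi = static_cast<std::uint8_t>(unit >> 8);
        const std::uint8_t lo = static_cast<std::uint8_t>(unit & 0xFF);
        if (order == ByteOrder::Big) {
            bytes.push_back(hi);   // most significant byte first
            bytes.push_back(lo);
        } else {
            bytes.push_back(lo);   // least significant byte first
            bytes.push_back(hi);
        }
    }
    return bytes;
}
```

For the letter 'A' (U+0041) this gives the bytes 00 41 in big-endian order and 
41 00 in little-endian order.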

Addendum: Also, I would like to know how to do this cascading transcode?
  some-encoding-->UTF-16BE--->UTF-8
After I transcode the buffer to UTF-16, the output is of type XMLByte. The 
Transcoder for UTF-8 expects XMLCh* and not XMLByte* as the input.
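
On the cascade: in Xerces, XMLTranscoder::transcodeFrom() takes external bytes 
(XMLByte*) and produces XMLCh*, while transcodeTo() takes XMLCh* and produces 
bytes. So the usual shape is transcodeFrom() on the source encoding's 
transcoder followed by transcodeTo() on the UTF-8 transcoder, with XMLCh 
(UTF-16) as the pivot in between. To show what the second leg does, here is a 
library-free sketch of UTF-16 to UTF-8; utf16ToUtf8 is a hypothetical name, 
and replacing unpaired surrogates with U+FFFD is an assumption of this sketch.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes UTF-16 code units as UTF-8 bytes.
std::string utf16ToUtf8(const std::u16string& in) {
    std::string out;
    out.reserve(in.size() * 3);  // worst case is 3 bytes per code unit
    for (std::size_t i = 0; i < in.size(); ++i) {
        std::uint32_t cp = in[i];
        const bool isHigh = cp >= 0xD800 && cp <= 0xDBFF;
        const bool isLow  = cp >= 0xDC00 && cp <= 0xDFFF;
        if (isHigh && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine a surrogate pair into one code point above U+FFFF.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (isHigh || isLow) {
            cp = 0xFFFD;  // unpaired surrogate: substitute U+FFFD
        }
        if (cp < 0x80) {
            out.push_back(static_cast<char>(cp));
        } else if (cp < 0x800) {
            out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else if (cp < 0x10000) {
            out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        } else {
            out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
            out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
        }
    }
    return out;
}
```

The reserve() call reflects the sizing rule for the output buffer: a BMP 
character is one code unit and at most three UTF-8 bytes, and a supplementary 
character is two code units and four bytes, so three bytes per code unit is 
always enough.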

One last addition: transcodeTo for UTF-16 crashes sometimes. I don't know why 
this is happening. The call stack shows somewhere inside 
xercesc_2_8::XMLUTF16Transcoder::transcodeTo() a memcpy is crashing. This does 
not happen every time, though.

/** Transcode to UTF-8 */
uiInLength  = strlen(szCSTABuffer);
uiOutLength = uiInLength * UTF16_BYTES_PER_CHARACTER;
//UTF16_BYTES_PER_CHARACTER is set to 4

/** Allocate memory for the output of the transcode operation */
xmlInput = new XMLByte[uiOutLength + 1];

if(xmlInput)
{
        /** Transcode */
        uiTotalChars = m_pUTF16Transcoder->transcodeTo((const XMLCh* const)szCSTABuffer,
                                                       uiInLength,
                                                       xmlInput,
                                                       uiOutLength,
                                                       uiCharsTranscoded,
                                                       XMLTranscoder::UnRep_RepChar);

        xmlInput[uiTotalChars] = '\0';
}
What am I doing wrong? Is it the typecast to XMLCh* from char* when calling 
transcodeTo? Any other way to convert char* to XMLCh*? Please help!
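
On the cast: a cast from char* to XMLCh* only reinterprets the same memory as 
16-bit units, it does not convert anything. strlen() counts bytes, but the 
count passed to transcodeTo() is a number of XMLCh units, so the transcoder 
would read roughly twice as many bytes as the buffer holds. That would be 
consistent with an intermittent crash inside memcpy. Converting means decoding 
each input byte sequence into code units. The sketch below assumes the input 
is ISO-8859-1, chosen only because that mapping is trivial (the real encoding 
here is unknown); latin1ToUtf16 is a hypothetical name.

```cpp
#include <string>

// Hypothetical helper: widens ISO-8859-1 bytes to UTF-16 code units.
// Each Latin-1 byte value is also its Unicode code point, so one byte
// becomes exactly one code unit. Other encodings need a real decoder.
std::u16string latin1ToUtf16(const std::string& bytes) {
    std::u16string out;
    out.reserve(bytes.size());
    for (unsigned char b : bytes) {
        // Going through unsigned char avoids sign extension of bytes
        // above 0x7F on platforms where char is signed.
        out.push_back(static_cast<char16_t>(b));
    }
    return out;
}
```

For any other source encoding, the equivalent step is the decoder for that 
encoding, for example a Xerces transcoder's transcodeFrom() or ICU's 
ucnv_toUChars().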
>
>> /*******************************************************************/
>> if(szCSTABuffer)
>> {
>>
>>      /** Transcode the CSTA Buffer into XMLCh* */
>>      xmlInput = XMLString::transcode(szCSTABuffer);
>>
>>      uiInLength      =       XMLString::stringLen(xmlInput);
>>      uiOutLength     =       uiInLength * UTF8_BYTES_PER_CHARACTER;
>>      //UTF8_BYTES_PER_CHARACTER is set to 4
>>
>>      /** Allocate memory for the output of the transcode operation*/
>>      xmlTranscodedOutput = new XMLByte[uiOutLength + 1];
>>
>>
>>      if(xmlTranscodedOutput)
>>      {
>>              /** Transcode */
>>              // m_pUTF8Transcoder is of type  XMLTranscoder*
>>              uiTotalChars = m_pUTF8Transcoder->transcodeTo(
>>                                                 (const XMLCh* const)xmlInput,
>This cast is not necessary.
>
Thanks I will remove it.
>>                                                  uiInLength,
>>                                                  xmlTranscodedOutput,
>>                                                  uiOutLength,
>>                                                  uiCharsTranscoded,
>>                                                  XMLTranscoder::UnRep_RepChar);
>>
>>              xmlTranscodedOutput[uiTotalChars] = '\0';
>>              XMLString::release(&xmlInput);
>>      }
>> }
>>              
>> Variables are defined as follows:
>> char*                szCSTABuffer            =       NULL;
>> XMLCh*               xmlInput                        =       NULL;
>> XMLByte*             xmlTranscodedOutput     =       NULL;
>> unsigned int uiInLength                      =       0;
>> unsigned int uiOutLength                     =       0;
>> unsigned int uiCharsTranscoded               =       0;
>> unsigned int uiTotalChars            =       0;
>> /*******************************************************************/
