RE: CFHTTP Charset Question

Phillip Holmes Wed, 17 May 2006 12:11:27 -0700

I disagree, Claude. The problem he is having is that the declared source
charset does not match the actual data coming across. I think his issue is
NCR (numeric character reference) data.


Java should recognize the correct charset, and in coordination with that you
can use an isUnmappable() method to make sure you have no garbled text. You
would pass the result of getEncoding() to isUnmappable() as
arguments.charsetA.

The meat of an isUnmappable() method would be something like this:

<cfscript>
        // get instance of java objects
        this.jcharset = createObject('java', 'java.nio.charset.Charset');
        this.byteBuffer = createObject('java','java.nio.ByteBuffer'); 
        this.charBuffer = createObject('java','java.nio.CharBuffer'); 
        this.codingErrorAction = createObject('java','java.nio.charset.CodingErrorAction');
        
        // format into your function starting here
        
        // get needed space of unicode transformation (16 bit Unicode)
        bLength = len(trim(arguments.field)) * 2;                       
        // tell java which charset you are coming from
        charsetBefore = this.jcharset.forName(arguments.charsetA); 
        // allocate memory for output char buffer
        outTextCharBuffer = this.charBuffer.allocate(javaCast('int',bLength));
        // encode data into byte array
        inTextByteBuffer = charsetBefore.encode(arguments.field);
        // tell java which charset you are going to
        charsetAfter = this.jcharset.forName(arguments.charsetB); 
        // get instance of new decoder
        decoderForCharsetAfter = charsetAfter.newDecoder();
        // report unmappable characters in the coder result instead of replacing them
        decoderForCharsetAfter.onUnmappableCharacter(this.codingErrorAction.REPORT);

        // compare unicode results of both datasets and get boolean response
        // from isUnmappable
        decoderCoderResult =
            decoderForCharsetAfter.decode(inTextByteBuffer,outTextCharBuffer,true).isUnmappable().toString();
        // return boolean value into struct
        ret.data.isUnmappable = decoderCoderResult;
</cfscript>
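
For anyone who wants to try the same idea outside ColdFusion, here is a
minimal sketch in plain Java. The class and method names (CharsetCheck,
failsToDecode) are made up for illustration. One assumption worth flagging:
a decoder often reports a bad byte sequence as malformed rather than
unmappable, so this sketch treats either result as a failure by checking
CoderResult.isError() instead of isUnmappable() alone.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CoderResult;
import java.nio.charset.CodingErrorAction;

final class CharsetCheck {
    // Hypothetical helper: true when text encoded as charsetA cannot be
    // decoded cleanly as charsetB.
    static boolean failsToDecode(String field, String charsetA, String charsetB) {
        // encode the text into bytes using the charset you are coming from
        ByteBuffer in = Charset.forName(charsetA).encode(field);

        // decoder for the charset you are going to; report errors instead
        // of silently substituting a replacement character
        CharsetDecoder decoder = Charset.forName(charsetB).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

        // size the output buffer from the decoder's own worst-case estimate
        int capacity = Math.max(1,
                (int) Math.ceil(in.remaining() * decoder.maxCharsPerByte()));
        CharBuffer out = CharBuffer.allocate(capacity);

        // isError() is true for both malformed-input and unmappable results
        CoderResult result = decoder.decode(in, out, true);
        return result.isError();
    }

    public static void main(String[] args) {
        System.out.println(failsToDecode("abc", "ISO-8859-1", "UTF-8"));
        System.out.println(failsToDecode("\u00e9", "ISO-8859-1", "UTF-8"));
    }
}
```

Checking isError() is a deliberate widening of the isUnmappable() test above;
if you only care about unmappable characters, call result.isUnmappable()
instead.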

This would make sure you have no unmappables before you proceed. If you do
have NCR data in the text, you can check for that with a regex and then run
the matches through ncr2unicode to replace those characters (the NCRs
themselves will be ASCII).

Here is an ncr2unicode servlet I compiled.

http://phillipholmes.com/Java/ncr2unicode.rar

You'll need WinRAR to 'unzip' that.


Warmest Regards,
 
Phillip B. Holmes
http://phillipholmes.com


=======================>




-----Original Message-----
From: Claude Schneegans [mailto:[EMAIL PROTECTED] 
Sent: Wednesday, May 17, 2006 1:45 PM
To: CF-Talk
Subject: Re: CFHTTP Charset Question

 >>getEncoding()

This only "Retrieves the Charset as guessed from the underlying
InputStream".
But if the charset is not specified in the response header and if CF does
not interpret characters correctly, it is probably because CF "guesses
wrong", so this won't really help.

The only way I can see would be to get the page first, in whatever charset,
decode the line <?xml version="1.0" encoding="iso-8859-1" ?> and repeat the
HTTP request specifying the right charset.
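
A rough sketch of that idea in plain Java, under a few assumptions: the raw
response bytes are available, the declared encoding is ASCII-compatible (so
the declaration itself can be read as ISO-8859-1), and the names used here
(XmlEncodingSniffer, declaredEncoding, decode) are hypothetical. Note that it
re-decodes the bytes already fetched rather than repeating the HTTP request,
which is a substitution for the second request described above.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class XmlEncodingSniffer {
    // captures the value of encoding="..." inside an <?xml ... ?> declaration
    private static final Pattern DECL = Pattern.compile(
            "<\\?xml[^>]*?\\bencoding\\s*=\\s*[\"']([A-Za-z][A-Za-z0-9._-]*)[\"']");

    // returns the declared encoding name, or null when none is found
    static String declaredEncoding(byte[] body) {
        // only the start of the document matters; ISO-8859-1 maps every
        // byte to a character, so this first pass cannot fail
        int n = Math.min(body.length, 200);
        String head = new String(body, 0, n, StandardCharsets.ISO_8859_1);
        Matcher m = DECL.matcher(head);
        return m.find() ? m.group(1) : null;
    }

    // decode the body with the declared charset, else with the fallback
    static String decode(byte[] body, String fallback) {
        String name = declaredEncoding(body);
        Charset cs = (name != null && Charset.isSupported(name))
                ? Charset.forName(name)
                : Charset.forName(fallback);
        return new String(body, cs);
    }

    public static void main(String[] args) {
        byte[] body = "<?xml version=\"1.0\" encoding=\"iso-8859-1\" ?><a>\u00e9</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(declaredEncoding(body));
        System.out.println(decode(body, "UTF-8"));
    }
}
```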

--
_______________________________________
REUSE CODE! Use custom tags;
See http://www.contentbox.com/claude/customtags/tagstore.cfm
(Please send any spam to this address: [EMAIL PROTECTED]) Thanks.




