[appengine-java] Incorrect UTF-8 conversion of supplementary unicode

Tommy Thu, 01 Sep 2011 13:53:08 -0700

The default UTF-8 to String conversion isn't working for me when the string 
has supplementary Unicode characters (code points U+10000 or greater).  I've 
attached a servlet that demonstrates this problem.


You can test the servlet with the following URL:
http://localhost:8888/stringtest?username=%F0%9F%90%A7
This passes the Unicode character U+1F427, encoded in UTF-8 as the four 
bytes F0 9F 90 A7.
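
For reference, here is a minimal sketch that confirms the byte sequence above 
using only the JDK String functions. The class and method names 
(PenguinUtf8, utf8Hex) are made up for illustration and are not part of the 
attached servlet.

```java
import java.nio.charset.Charset;

public class PenguinUtf8 {
    // Returns the UTF-8 bytes of a single code point as uppercase hex.
    // Hypothetical helper, for illustration only.
    public static String utf8Hex(int codePoint) {
        // Character.toChars yields a surrogate pair for code points >= 0x10000.
        String s = new String(Character.toChars(codePoint));
        byte[] bytes = s.getBytes(Charset.forName("UTF-8"));
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(utf8Hex(0x1F427)); // prints F09F90A7
    }
}
```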

The correct string does come through if you instead pass each half of the 
surrogate pair for U+1F427 encoded separately.  Here is the URL:
http://localhost:8888/stringtest?username=%ED%A0%BD%ED%B0%A7
This passes the byte sequences ED A0 BD and ED B0 A7.  These are the 
three-byte UTF-8 style encodings of the surrogates 0xD83D and 0xDC27, which 
is how the Unicode character U+1F427 is represented in a String in memory.  
However, this is not a valid UTF-8 encoding of the character.
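
To show where those six bytes come from, here is a small sketch that splits 
U+1F427 into its surrogate pair and applies the three-byte UTF-8 bit pattern 
to each half by hand. This surrogate-by-surrogate form is sometimes called 
CESU-8. The names SurrogateBytes and threeByteHex are hypothetical; the 
encoding is done manually because the JDK's own UTF-8 encoder does not emit 
bytes for a lone surrogate.

```java
public class SurrogateBytes {
    // Encodes one UTF-16 code unit with the 3-byte UTF-8 bit pattern
    // (1110xxxx 10xxxxxx 10xxxxxx) and returns it as uppercase hex.
    // Only meaningful for units >= 0x0800, which includes all surrogates.
    public static String threeByteHex(char unit) {
        int b0 = 0xE0 | (unit >> 12);
        int b1 = 0x80 | ((unit >> 6) & 0x3F);
        int b2 = 0x80 | (unit & 0x3F);
        return String.format("%02X%02X%02X", b0, b1, b2);
    }

    public static void main(String[] args) {
        char[] pair = Character.toChars(0x1F427);       // high and low surrogate
        System.out.println(Integer.toHexString(pair[0])); // prints d83d
        System.out.println(Integer.toHexString(pair[1])); // prints dc27
        System.out.println(threeByteHex(pair[0]));        // prints EDA0BD
        System.out.println(threeByteHex(pair[1]));        // prints EDB0A7
    }
}
```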

Is there a way to get getParameter to decode the first version correctly, 
since that is the correct UTF-8 for the character?  The String UTF-8 
encoding/decoding functions handle it correctly, so it seems possible.  I 
could parse the raw URL parameters myself, but that seems like a lot of work 
when the default UTF-8 decoding should work.
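
For what it's worth, parsing the raw parameters by hand might look roughly 
like the sketch below. It is only an illustration under some assumptions: 
the names RawQueryParam and rawParameter are made up, the query string is 
assumed to come from req.getQueryString() in the servlet, and only simple 
name=value pairs separated by '&' are handled (the first match wins).

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;

public class RawQueryParam {
    // Percent-decodes a query-string component as UTF-8.
    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            // Every Java platform is required to support UTF-8.
            throw new IllegalStateException(e);
        }
    }

    // Hypothetical helper: returns the first value of the named parameter
    // from a raw (still percent-encoded) query string, or null if absent.
    public static String rawParameter(String query, String name) {
        if (query == null) {
            return null;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            String key = eq >= 0 ? pair.substring(0, eq) : pair;
            String value = eq >= 0 ? pair.substring(eq + 1) : "";
            if (decode(key).equals(name)) {
                return decode(value);
            }
        }
        return null;
    }

    public static void main(String[] args) {
        String s = rawParameter("username=%F0%9F%90%A7", "username");
        System.out.println(Integer.toHexString(s.codePointAt(0))); // prints 1f427
    }
}
```

In the servlet this would be called as 
rawParameter(req.getQueryString(), "username") in place of 
req.getParameter("username").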

Thanks.


package testserver;

import java.io.IOException;
import java.util.logging.Logger;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

public class StringTestServlet extends HttpServlet
{
	private static final Logger log = Logger.getLogger(StringTestServlet.class.getName());
	
	////////////////////////////////////////////////////////////////////////////////
	public void doGet(HttpServletRequest req, HttpServletResponse resp)
		throws IOException
	{
		resp.setContentType("text/plain; charset=UTF-8");

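		// Print string as retrieved as a parameter
		String username = req.getParameter("username");
		if (username == null)
		{
			// Avoid a NullPointerException when the parameter is missing
			resp.getWriter().println("Missing 'username' parameter");
			return;
		}
		printString(username, resp);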
		
		// Construct a string as it's supposed to be
		byte[] byteArray2 = new byte[4];
		byteArray2[0] = (byte) 0xF0;
		byteArray2[1] = (byte) 0x9F;
		byteArray2[2] = (byte) 0x90;
		byteArray2[3] = (byte) 0xA7;
		String byteString2 = new String(byteArray2, 0, 4, "UTF-8");
		printString(byteString2, resp);
		
		resp.getWriter().println("");
		resp.getWriter().println("");
		
		// Compare the strings
		if (byteString2.equals(username))
		{
			resp.getWriter().println("Strings match");
		}
		else
		{
			resp.getWriter().println("Strings DON'T match");
		}
	}
	
	////////////////////////////////////////////////////////////////////////////////
	private void printString(String str, HttpServletResponse resp) throws IOException
	{
		resp.getWriter().println("");
		
		char[] ach = str.toCharArray();  // the UTF-16 code units of str
		int len = ach.length;            // number of code units in ach
		int j = 0;                       // index of the current code point

		resp.getWriter().println("codePointCount=" + Character.codePointCount(ach, 0, len));
		for (int i = 0; i < len; )
		{
			int cp = Character.codePointAt(ach, i);
			resp.getWriter().println("Char(" + j + ")  " + Integer.toHexString(cp));
			j++;
			i += Character.charCount(cp);  // advance 2 units for a surrogate pair, else 1
		}
	}
}
