-John

On Feb 1, 2008, at 12:59 PM, "Kevin Brown" <[EMAIL PROTECTED]> wrote:

On Feb 1, 2008 12:41 PM, Brian Eaton <[EMAIL PROTECTED]> wrote:

The current fetchJson implementation uses "new
String(results.getByteArray())" to convert the response bytes to a
string for inclusion in the JSON reply to the gadget. The behavior of new String(byte[]) is undefined "when the given bytes are not valid in
the default charset".

The default charset could be anything, and the returned bytes from the
remote server could also be anything.  This is likely to cause
problems (data corruption) for gadgets fetching data from non-English
web sites.
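
For illustration, here is a minimal, self-contained sketch of the problem: the same response bytes decode differently depending on the JVM's default charset. The byte values below are just an example, not anything from the actual code.

    import java.nio.charset.Charset;

    public class DefaultCharsetDemo {
        public static void main(String[] args) {
            byte[] response = { (byte) 0xC3, (byte) 0xA9 }; // "é" encoded as UTF-8
            System.out.println("default charset: " + Charset.defaultCharset());
            // On a UTF-8 JVM this prints "é"; on an ISO-8859-1 JVM it
            // prints "Ã©"; on charsets where the bytes are invalid they are
            // silently replaced -- new String(byte[]) never throws.
            System.out.println(new String(response));
        }
    }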


The default charset is almost always utf-8 in practice (unless you've done
something particularly bizarre, like modifying system properties),

On some OS/JDK combos, this can be picked up from environment variables (ack.)

but
you're right that the back end could be anything. Honestly, the real answer here is that this should *NOT* be a string at all -- it should be a sequence of bytes. RemoteContentFetcher should not care about encoding. What if I'm
using this to fetch non-text data, such as an image file, for the open
proxy?

+1, but this context is about converting to JSON, right? So you can't just push bytes through.



For text data (such as what you would fetch from gadgets.io.makeRequest), it should always be utf-8. This does mean that we need to do encoding detection / conversion in here. It has nothing to do with "non-English" web sites, but rather websites which use regional character encodings (ISO-8859-1 probably being the most problematic since it "looks like" ASCII or UTF-8 until you start using diacritics; BIG5 is another likely problem for Chinese-language sites).
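
A quick sketch of that ISO-8859-1 pitfall, using java.nio's strict decoding (the sample strings are mine, purely illustrative):

    import java.nio.ByteBuffer;
    import java.nio.charset.*;

    public class Latin1Pitfall {
        public static void main(String[] args) throws Exception {
            // Strict UTF-8 decoder: report errors instead of substituting.
            CharsetDecoder utf8 = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);

            // Pure ASCII is byte-identical in ISO-8859-1 and UTF-8, so a
            // misdeclared page decodes "correctly"...
            byte[] ascii = "plain text".getBytes("ISO-8859-1");
            System.out.println(utf8.decode(ByteBuffer.wrap(ascii)));

            // ...until a diacritic appears: 'é' is the single byte 0xE9
            // in ISO-8859-1, which is malformed UTF-8.
            byte[] accented = "café".getBytes("ISO-8859-1");
            utf8.decode(ByteBuffer.wrap(accented)); // throws MalformedInputException
        }
    }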

I'll open up a JIRA issue for this, but I wanted to see whether anyone
had proposals for a solution.  The fix will probably involve using
CharsetDecoder, so we at least have well-defined behavior.  How we
pick the CharsetDecoder to use is an open question.  What to do when
decoding fails is another issue.  I'm tempted to put in a
quick fix that specifies UTF-8 for the character set.  That will
prevent anyone from depending on the current undefined behavior while
we work out what should happen.
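
Something like this minimal sketch, say ("bytes" stands in for results.getByteArray() from the quoted code; the decoder calls are standard java.nio):

    import java.nio.ByteBuffer;
    import java.nio.charset.*;

    public class Utf8QuickFix {
        // Well-defined replacement for new String(bytes): an explicit
        // UTF-8 decode that throws CharacterCodingException on invalid
        // input rather than silently corrupting it.
        static String decodeUtf8(byte[] bytes) throws CharacterCodingException {
            CharsetDecoder decoder = Charset.forName("UTF-8").newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
            return decoder.decode(ByteBuffer.wrap(bytes)).toString();
        }
    }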


If it can't be converted to UTF-8, or we can't detect the encoding, we simply fail the request. This is consistent with the behavior on iGoogle today.
+1.
Well-behaved origins should declare their source charset encoding (though with text/XML it can admittedly get Byzantine). Ones that don't should get 'best effort' at most.
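
For reference, the declaration looks like

    Content-Type: text/html; charset=ISO-8859-1

and a hypothetical helper for honoring it might look like the sketch below. fromContentType is my name, not anything in Shindig, and for brevity it ignores quoted charset values:

    import java.nio.charset.Charset;

    public final class DeclaredCharset {
        // Returns the declared charset, or null when none is declared, so
        // the caller can decide to fail the request or fall back to UTF-8.
        static Charset fromContentType(String contentType) {
            if (contentType == null) return null;
            for (String param : contentType.split(";")) {
                param = param.trim();
                if (param.toLowerCase().startsWith("charset=")) {
                    // Charset.forName throws on unknown/illegal names.
                    return Charset.forName(param.substring("charset=".length()).trim());
                }
            }
            return null;
        }
    }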
