Hi Arun,

Arun Isaac <arunis...@systemreboot.net> writes:

> * module/web/response.scm (text-content-type?): Recognize JSON content
> type as text.

While this would seem reasonable at first glance, it seems to me that
this will result in JSON texts with non-ASCII characters being
mishandled in many cases.

Within Guile, 'text-content-type?' is currently used in two places:

* 'decode-response-body' in (web client), and
* 'response-body-port' in (web response).

In both places, if 'text-content-type?' returns true, the encoding of
the response is assumed to be "ISO-8859-1" if not otherwise specified by
an explicit 'charset' parameter.  This is what RFC 2616 specifies for
text/plain, although RFC 6657 would change the default to US-ASCII, as
it was in RFC 2046, and maybe we should look into that.

However, things are quite different for the application/json MIME type,
as specified in RFCs 4627 and 7159.  Those RFCs specify that JSON text
"SHALL" (i.e. MUST) be encoded in Unicode (UTF-8, UTF-16 or UTF-32),
that the default encoding is UTF-8, and furthermore that no charset
parameter is defined for application/json.

So, we can expect at least some conforming implementations to omit the
'charset' parameter, and yet in that case we must assume that the
encoding is Unicode, and most definitely not ISO-8859-1.

RFC 4627 makes the additional interesting observation (in section 3,
"encoding") that since the first two characters of JSON text will always
be ASCII, and since UTF-8/UTF-16/UTF-32 are the only valid encodings for
JSON text, we can reliably determine the encoding by looking at the
pattern of nul bytes in the first four octets:

  00 00 00 xx  UTF-32BE
  00 xx 00 xx  UTF-16BE
  xx 00 00 00  UTF-32LE
  xx 00 xx 00  UTF-16LE
  xx xx xx xx  UTF-8

Given that any of these encodings above are possible, and that there is
no 'charset' parameter defined for "application/json", it seems to me
that we have no choice but to be prepared to auto-detect the encoding,
as described in RFC 4627 section 3 if the 'charset' parameter is
missing.

What do you think?

      Mark
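
For illustration only, the nul-byte detection from RFC 4627 section 3
could be sketched roughly as follows.  This is Python rather than Guile,
purely to show the logic; the names 'detect_json_encoding' and
'decode_json_body' are hypothetical and not part of any Guile module.
It assumes, as RFC 4627 does, that the first two characters of the JSON
text are ASCII, and it falls back to UTF-8 when fewer than four octets
are available.

```python
def detect_json_encoding(head: bytes) -> str:
    """Guess the Unicode encoding of a JSON text from its first four octets.

    Implements the nul-byte table of RFC 4627 section 3.  Assumption: the
    first two characters of the text are ASCII.  Input shorter than four
    octets is treated as UTF-8, the default encoding.
    """
    if len(head) < 4:
        return "utf-8"

    # True where the octet is a nul byte, False otherwise.
    nulls = tuple(octet == 0 for octet in head[:4])

    patterns = {
        (True, True, True, False): "utf-32-be",   # 00 00 00 xx
        (True, False, True, False): "utf-16-be",  # 00 xx 00 xx
        (False, True, True, True): "utf-32-le",   # xx 00 00 00
        (False, True, False, True): "utf-16-le",  # xx 00 xx 00
    }
    # Any other pattern (including xx xx xx xx) is taken to be UTF-8.
    return patterns.get(nulls, "utf-8")


def decode_json_body(body: bytes) -> str:
    """Decode a JSON response body that carries no 'charset' parameter."""
    return body.decode(detect_json_encoding(body))
```

With a helper along these lines, a body such as '{"a": "é"}' decodes to
the same text whichever of the five encodings the sender used, which is
not the case if ISO-8859-1 is assumed.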