[ 
https://issues.apache.org/jira/browse/HTTPCLIENT-2418?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Gilles Compienne CFX updated HTTPCLIENT-2418:
---------------------------------------------
    Description: 
Hello,

Ticket 2159 (https://issues.apache.org/jira/browse/HTTPCLIENT-2159) has 
resolved several issues around the handling of the content type parameter, but 
there is still one that I can see:

If I have a small JSON payload in the response body (and the content type is 
set by the server to be "Content-Type: application/json") then it is still 
decoded using US-ASCII when UTF-8 should have been used.

It works fine when we create a JSON payload for a request (as the ContentType 
class now specifies UTF-8 as being the default for "application/json") but the 
code fails if the JSON payload is in the response.

This is caused by the line `final Charset charset = (contentType != null ? 
contentType : ContentType.DEFAULT_TEXT).getCharset();` in the `getBodyText()` 
method of the `SimpleBody` class, which sets the `charset` variable to null 
when the content type carries no charset parameter. That in turn causes 
StandardCharsets.US_ASCII to be used in the line that follows (which decodes 
the string), wrecking any emojis or non-English symbols that the string could 
contain.
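
To illustrate the failure mode outside of HttpClient, here is a minimal sketch that uses only the JDK. The class name `AsciiVsUtf8Demo` and the `decode` helper are made up for this illustration and are not part of HttpClient; the sketch only shows what happens when UTF-8 bytes are decoded as US-ASCII.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class AsciiVsUtf8Demo {

    // Hypothetical helper: decodes raw body bytes with the given charset,
    // the way a text accessor would once it has picked a charset.
    public static String decode(byte[] body, Charset charset) {
        return new String(body, charset);
    }

    public static void main(String[] args) {
        // The waving-hand emoji (U+1F44B) is written as a surrogate pair so
        // the source file encoding does not matter.
        String json = "{\"msg\": \"Test emoji \uD83D\uDC4B\"}";
        byte[] wire = json.getBytes(StandardCharsets.UTF_8);

        String asAscii = decode(wire, StandardCharsets.US_ASCII);
        String asUtf8 = decode(wire, StandardCharsets.UTF_8);

        System.out.println(asUtf8.equals(json));   // prints true
        System.out.println(asAscii.equals(json));  // prints false
        // Each of the emoji's four UTF-8 bytes becomes one U+FFFD character.
        System.out.println(asAscii.chars().filter(c -> c == 0xFFFD).count()); // prints 4
    }
}
```

The four replacement characters match what `getBodyText()` shows for the emoji in the output further down.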

In my humble opinion, it is reasonable to be using SimpleBody (and the Simple 
API in general) when we know we are expecting small payloads and all we are 
doing is passing the "string" along to some other clients. In those cases, we 
don't need the advanced capabilities of httpcomponents-jackson or similar (we 
are not even parsing the JSON, just treating it as a string)...

But we still need `getBodyText()` to handle the charset properly, even when it 
is "assumed", and not wreck the string (otherwise only `getBodyBytes()` should 
be offered and `getBodyText()` removed).

I suspect an improved variant of that code would use the (currently deprecated) 
`ContentType.getByMimeType()` method to find out if the mime type is known and 
if it has an associated default encoding (and if so, use it if the charset is 
not present).
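
The fallback order I have in mind could be sketched roughly as follows. This is not HttpClient code and it does not call `ContentType.getByMimeType()`; the class `CharsetFallbackSketch`, the `DEFAULTS` table and both methods are hypothetical, and the header parsing is deliberately naive (it splits on `;` and ignores quoted parameter values containing that character).

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Locale;
import java.util.Map;

public class CharsetFallbackSketch {

    // Hypothetical table of per-MIME-type default charsets. Only the JSON
    // entry is taken from the report; a real table would have more entries.
    static final Map<String, Charset> DEFAULTS =
            Map.of("application/json", StandardCharsets.UTF_8);

    // Resolution order: explicit charset parameter, then the default for a
    // known MIME type, then US-ASCII as the last resort.
    public static Charset resolve(String contentTypeHeader) {
        if (contentTypeHeader == null) {
            return StandardCharsets.US_ASCII;
        }
        String[] parts = contentTypeHeader.split(";");
        String mimeType = parts[0].trim().toLowerCase(Locale.ROOT);
        for (int i = 1; i < parts.length; i++) {
            String param = parts[i].trim();
            if (param.toLowerCase(Locale.ROOT).startsWith("charset=")) {
                String name = param.substring("charset=".length()).trim().replace("\"", "");
                try {
                    return Charset.forName(name);
                } catch (IllegalArgumentException e) {
                    // Unknown or malformed charset name: fall through to the defaults.
                    break;
                }
            }
        }
        return DEFAULTS.getOrDefault(mimeType, StandardCharsets.US_ASCII);
    }

    // Decodes the body bytes using the charset resolved from the header value.
    public static String bodyText(byte[] body, String contentTypeHeader) {
        return new String(body, resolve(contentTypeHeader));
    }
}
```

With the Simple API, the same resolution could presumably be applied to the bytes returned by `getBodyBytes()` as a workaround, until `getBodyText()` handles this case itself.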

I have attached a code sample highlighting the problem in the associated ZIP 
file.

It can be run with Maven:
{noformat}
mvn clean compile assembly:single
java -jar target/httpclient5-demo-1.0-SNAPSHOT-jar-with-dependencies.jar
{noformat}
It needs a dummy server on localhost:12345 that returns a UTF-8 JSON payload 
with some emojis or similar.

If you have a tool like `dummyhttp` and a terminal console set to use UTF-8 
then you can set up the dummy server with the following command:
{noformat}
dummyhttp -p 12345 -v -c 200 -b "{\"msg\": \"Test emoji 👋\"}" -H Content-Type:application/json
{noformat}
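
If `dummyhttp` is not available, a stand-in server could be sketched with the JDK's built-in `com.sun.net.httpserver.HttpServer`. The class name `DummyJsonServer` and its `start` method are hypothetical; the port, payload and header mirror the command above, and the Content-Type is sent without a charset parameter on purpose so that the issue reproduces.

```java
import com.sun.net.httpserver.HttpServer;
import java.io.IOException;
import java.io.OutputStream;
import java.net.InetSocketAddress;
import java.nio.charset.StandardCharsets;

public class DummyJsonServer {

    // Same payload as in the dummyhttp command; the emoji (U+1F44B) is
    // written as a surrogate pair so the source encoding does not matter.
    public static final String BODY = "{\"msg\": \"Test emoji \uD83D\uDC4B\"}";

    // Starts a server on localhost that answers every request with BODY.
    public static HttpServer start(int port) throws IOException {
        HttpServer server = HttpServer.create(new InetSocketAddress("localhost", port), 0);
        server.createContext("/", exchange -> {
            byte[] bytes = BODY.getBytes(StandardCharsets.UTF_8);
            // Deliberately no charset parameter in the header.
            exchange.getResponseHeaders().set("Content-Type", "application/json");
            exchange.sendResponseHeaders(200, bytes.length);
            try (OutputStream out = exchange.getResponseBody()) {
                out.write(bytes);
            }
        });
        server.start();
        return server;
    }

    public static void main(String[] args) throws IOException {
        start(12345);
        System.out.println("Listening on http://localhost:12345/");
    }
}
```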

Running the test client app then prints the following (again assuming your 
terminal is set to a UTF-8 locale):

{noformat}
Fetching: http://localhost:12345/
-----------------------------
Status code : 200
Reason      : OK
Content-Type: application/json
Body (via getBodyText):
{"msg": "Test emoji ����"}
Body (via proper UTF-8 decoding):
{"msg": "Test emoji 👋"}
{noformat}
 


> Another case of invalid handling of charset content type parameter on the 
> Simple Async API
> ------------------------------------------------------------------------------------------
>
>                 Key: HTTPCLIENT-2418
>                 URL: https://issues.apache.org/jira/browse/HTTPCLIENT-2418
>             Project: HttpComponents HttpClient
>          Issue Type: Bug
>          Components: HttpClient (async)
>    Affects Versions: 5.4.1, 5.6
>         Environment: java 21 on macOS 26.4.1
>            Reporter: Gilles Compienne CFX
>            Priority: Major
>         Attachments: json-utf8-issue.zip
>
>


