[jira] [Commented] (IO-638) Infinite loop in CharSequenceInputStream.read for 4-byte characters with UTF-8 and 3-byte buffer.

Gary D. Gregory (Jira) Mon, 18 Jan 2021 13:15:04 -0800


    [ 
https://issues.apache.org/jira/browse/IO-638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267534#comment-17267534
 ]


Gary D. Gregory commented on IO-638:
------------------------------------

[~thayne2]

Thank you for your report.

Please feel free to provide a PR on GitHub with a unit test.

 

 

> Infinite loop in CharSequenceInputStream.read for 4-byte characters with 
> UTF-8 and 3-byte buffer.
> -------------------------------------------------------------------------------------------------
>
>                 Key: IO-638
>                 URL: https://issues.apache.org/jira/browse/IO-638
>             Project: Commons IO
>          Issue Type: Bug
>          Components: Streams/Writers
>    Affects Versions: 2.6
>            Reporter: Thayne McCombs
>            Priority: Major
>
> In the constructor of `CharSequenceInputStream` there is the following code 
> to ensure the buffer is large enough to hold one character:
> {code:java}
> // Ensure that buffer is long enough to hold a complete character
> final float maxBytesPerChar = encoder.maxBytesPerChar();
> if (bufferSize < maxBytesPerChar) {
>     throw new IllegalArgumentException("Buffer size " + bufferSize
>             + " is less than maxBytesPerChar " + maxBytesPerChar);
> }
> {code}
> However, for UTF-8, `maxBytesPerChar()` returns 3.0, not 4.0, even though some 
> characters (such as emoji) require 4 bytes to encode.  As a result, you can 
> create a `CharSequenceInputStream` with a buffer size of 3, but when it tries 
> to fill the buffer with a 4-byte character, `CharsetEncoder.encode` returns an 
> OVERFLOW result without writing anything to the buffer. The read methods then 
> loop forever, since the buffer never receives any data.
>  
> NOTE: as I understand it, the encoder returns 3 rather than 4 because 3 is 
> the maximum number of bytes that a single Java `char` can encode to; a 
> 4-byte UTF-8 sequence requires a surrogate pair of two `char`s.
>  
> This may be a problem for other encodings as well, but I've only tested it 
> for UTF-8.
>  
> Requiring the buffer to be at least twice the maxBytesPerChar would ensure 
> this doesn't happen.
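
The encoder behavior described in the report can be checked without Commons IO, using only `java.nio.charset`. The following is a minimal sketch, not the Commons IO code: the class name `Utf8OverflowDemo` and the helpers `bytesWritten` and `minSafeBufferSize` are hypothetical names chosen for illustration, and `minSafeBufferSize` only mirrors the reporter's suggestion of twice `maxBytesPerChar`, which is not necessarily the fix the project will adopt.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

class Utf8OverflowDemo {

    /**
     * Encodes text as UTF-8 into a fresh buffer of bufferSize bytes and
     * returns how many bytes the encoder actually wrote (hypothetical helper).
     */
    static int bytesWritten(String text, int bufferSize) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(bufferSize);
        CoderResult result = encoder.encode(CharBuffer.wrap(text), out, true);
        System.out.println("bufferSize=" + bufferSize
                + " overflow=" + result.isOverflow()
                + " bytesWritten=" + out.position());
        return out.position();
    }

    /**
     * Minimum buffer size under the reporter's suggestion:
     * at least twice maxBytesPerChar (an assumption, not the library's rule).
     */
    static int minSafeBufferSize(Charset charset) {
        return (int) Math.ceil(2 * charset.newEncoder().maxBytesPerChar());
    }

    public static void main(String[] args) {
        // The value the constructor check in the report compares against.
        System.out.println("maxBytesPerChar="
                + StandardCharsets.UTF_8.newEncoder().maxBytesPerChar());

        // U+1F600 is one code point, two Java chars, four UTF-8 bytes.
        String emoji = "\uD83D\uDE00";
        bytesWritten(emoji, 3); // too small for the 4-byte sequence
        bytesWritten(emoji, 4); // large enough
        System.out.println("minSafeBufferSize="
                + minSafeBufferSize(StandardCharsets.UTF_8));
    }
}
```

With a 3-byte buffer the encoder is expected to report overflow and leave the buffer empty, which is the state a caller that simply retries the fill can never leave. A unit test for a PR could assert the same condition against `CharSequenceInputStream` directly.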



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
