[ https://issues.apache.org/jira/browse/IO-638?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17267534#comment-17267534 ]
Gary D. Gregory commented on IO-638:
------------------------------------

[~thayne2]

Thank you for your report. Please feel free to provide a PR on GitHub with a unit test.

> Infinite loop in CharSequenceInputStream.read for 4-byte characters with UTF-8 and 3-byte buffer.
> -------------------------------------------------------------------------------------------------
>
>                 Key: IO-638
>                 URL: https://issues.apache.org/jira/browse/IO-638
>             Project: Commons IO
>          Issue Type: Bug
>          Components: Streams/Writers
>    Affects Versions: 2.6
>            Reporter: Thayne McCombs
>            Priority: Major
>
> In the constructor of `CharSequenceInputStream` there is the following code to ensure the buffer is large enough to hold one character:
> {code:java}
> // Ensure that buffer is long enough to hold a complete character
> final float maxBytesPerChar = encoder.maxBytesPerChar();
> if (bufferSize < maxBytesPerChar) {
>     throw new IllegalArgumentException("Buffer size " + bufferSize + " is less than maxBytesPerChar " +
>             maxBytesPerChar);
> }
> {code}
> However, for UTF-8, `maxBytesPerChar` returns 3.0, not 4.0, even though some characters (such as emoji) require 4 bytes to encode. As a result, you can create a `CharSequenceInputStream` with a buffer size of 3, but when it attempts to fill the buffer with a 4-byte character, `CharsetEncoder.encode` returns an OVERFLOW result without writing anything to the buffer. This in turn results in an infinite loop in the read methods, since nothing ever gets written to the buffer.
>
> NOTE: as I understand it, the reason the encoder returns 3 and not 4 is that 3 is the maximum number of bytes that a single Java `char` can require, since a 4-byte encoding in UTF-8 corresponds to a surrogate pair of two `char`s.
>
> This may be a problem for other encodings as well, but I've only tested it for UTF-8.
>
> Requiring the buffer to be at least twice the maxBytesPerChar would ensure this doesn't happen.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
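
As a starting point for the requested unit test, the encoder behaviour described in the report can be reproduced with the JDK alone, without Commons IO. The following is only a minimal sketch: the class name `Utf8OverflowDemo`, its helper methods, and the choice of U+1F600 as the 4-byte character are illustrative assumptions, not code from the project. It makes a single `CharsetEncoder.encode` call into buffers of different capacities and reports whether the call overflowed and how many bytes were written.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class; not part of Commons IO.
final class Utf8OverflowDemo {

    // U+1F600 is outside the BMP: two Java chars (a surrogate pair), four UTF-8 bytes.
    static final String EMOJI = new String(Character.toChars(0x1F600));

    /** The value the CharSequenceInputStream constructor compares bufferSize against. */
    static float utf8MaxBytesPerChar() {
        return StandardCharsets.UTF_8.newEncoder().maxBytesPerChar();
    }

    /** Runs one encode call into a buffer of the given capacity and returns the bytes written. */
    static int bytesWritten(String text, int capacity) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        ByteBuffer out = ByteBuffer.allocate(capacity);
        encoder.encode(CharBuffer.wrap(text), out, true);
        return out.position();
    }

    /** Returns true if one encode call into a buffer of the given capacity reports OVERFLOW. */
    static boolean overflows(String text, int capacity) {
        CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
        CoderResult result =
                encoder.encode(CharBuffer.wrap(text), ByteBuffer.allocate(capacity), true);
        return result.isOverflow();
    }

    public static void main(String[] args) {
        System.out.println("maxBytesPerChar = " + utf8MaxBytesPerChar());
        // Capacity 3 passes the constructor check; capacity 6 is twice maxBytesPerChar.
        for (int capacity : new int[] {3, 4, 6}) {
            System.out.println("capacity " + capacity
                    + ": overflow=" + overflows(EMOJI, capacity)
                    + ", bytesWritten=" + bytesWritten(EMOJI, capacity));
        }
    }
}
```

If the report is accurate, the 3-byte case should show an overflow with zero bytes written, which is the state the read loop cannot make progress from, while the 4-byte and 6-byte cases should write all four bytes. An actual regression test for IO-638 would presumably wrap the same input in `CharSequenceInputStream` with a buffer size of 3 and assert that `read` terminates.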