Marcono1234 commented on PR #293:
URL: https://github.com/apache/commons-io/pull/293#issuecomment-1100772224
Sorry for not having responded earlier, and for not properly following up on
this.
The issues mentioned in the description still exist on `master`:
- `ReaderInputStream` erroneously overwrites `lastCoderResult`
This can for example be seen with the following code snippet. The string has
a trailing unpaired surrogate, which is therefore malformed. The encoder is
created to throw on malformed input. However, currently `read()` erroneously
just returns -1 instead of throwing an exception.
```java
// Encoder which throws on malformed or unmappable input
CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
ReaderInputStream in = new ReaderInputStream(new StringReader("\uD800"),
encoder);
System.out.println("Read: " + in.read());
```
- `CharSequenceInputStream.available()` makes erroneous assumptions about
conversion ratio
Currently the implementation assumes that for each `char` at least one
`byte` will be encoded. This might be incorrect for some specific encodings
(have not found an example for this yet though), but is definitely incorrect
for the default malformed input action of `REPLACE`. With this, encodings
replace supplementary code points (consisting of two `char`s) with a single
`byte`. This can been seen here:
```java
Charset charset = Charset.forName("Big5");
CharSequenceInputStream in = new CharSequenceInputStream("\uD800\uDC00",
charset);
System.out.println("Available: " + in.available());
// Note: readAllBytes() is a method added in Java 9
System.out.println("Actually read: " + in.readAllBytes().length);
```
`available()` returns 2, but only 1 byte can be read.
I asked for feedback for the following reasons, but did not describe them
very well, or at all, in the original description:
- I was not sure what your opinion about these larger refactorings are
- The refactored `CharSequenceInputStream` made the `available()`
implementation more correct (see above mentioned issue), but made the results
less useful. This is also why the tests were failing because they were still
expecting specific `available()` results, which cannot necessarily be
guaranteed anymore. The tests could be rewritten to only check the
`available()` result against the maximum, but not impose any minimum anymore.
But I first wanted to hear your opinion on whether that is fine.
- The new `CharSequenceInputStream` implementation is basically a wrapped
`ReaderInputStream` with a bunch of boilerplate code to support `mark` and
`reset`. Additionally the new `mark` implementation considers its parameter
value now, which differs from the previous behavior. Is this new
`CharSequenceInputStream` implementation reasonable?
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org