[GitHub] [commons-io] Marcono1234 commented on pull request #293: Refactor `ReaderInputStream` implementation

2022-09-11 Thread GitBox


Marcono1234 commented on PR #293:
URL: https://github.com/apache/commons-io/pull/293#issuecomment-1242977393

   For completeness sake I have pushed one more commit to this 
`marcono1234/ReaderInputStream-refactor` branch which uses the `bufferSize` 
parameter of the `CharSequenceInputStream` constructor again. (Probably does 
not show up here because this PR is closed.)
   
   Though that branch would still need adjustments to the tests as mentioned in 
my comments above.
   
   Regardless of what will happen with this pull request I have now created two 
issues on Jira which track the bugs mentioned above:
   - [IO-780](https://issues.apache.org/jira/browse/IO-780): ReaderInputStream 
discards some encoding errors
   - [IO-781](https://issues.apache.org/jira/browse/IO-781): 
CharSequenceInputStream.available() returns too large numbers in some cases
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [commons-io] Marcono1234 commented on pull request #293: Refactor `ReaderInputStream` implementation

2022-04-25 Thread GitBox


Marcono1234 commented on PR #293:
URL: https://github.com/apache/commons-io/pull/293#issuecomment-1108955380

   Regardless of what will happen with this pull request, should I report the 
two issues mentioned above on Jira?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [commons-io] Marcono1234 commented on pull request #293: Refactor `ReaderInputStream` implementation

2022-04-16 Thread GitBox


Marcono1234 commented on PR #293:
URL: https://github.com/apache/commons-io/pull/293#issuecomment-1100772608

   Also, will these changes require me to sign the Apache CLA?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



[GitHub] [commons-io] Marcono1234 commented on pull request #293: Refactor `ReaderInputStream` implementation

2022-04-16 Thread GitBox


Marcono1234 commented on PR #293:
URL: https://github.com/apache/commons-io/pull/293#issuecomment-1100772224

   Sorry for not having responded earlier, and for not properly following up on 
this.
   
   The issues mentioned in the description still exist on `master`:
   - `ReaderInputStream` erroneously overwrites `lastCoderResult`
   This can for example be seen with the following code snippet. The string has 
a trailing unpaired surrogate, which is therefore malformed. The encoder is 
created to throw on malformed input. However, currently `read()` erroneously 
just returns -1 instead of throwing an exception.
   ```java
   // Encoder which throws on malformed or unmappable input
   CharsetEncoder encoder = StandardCharsets.UTF_8.newEncoder();
   ReaderInputStream in = new ReaderInputStream(new StringReader("\uD800"), 
encoder);
   System.out.println("Read: " + in.read());
   ```
   - `CharSequenceInputStream.available()` makes erroneous assumptions about 
conversion ratio
   Currently the implementation assumes that for each `char` at least one 
`byte` will be encoded. This might be incorrect for some specific encodings 
(have not found an example for this yet though), but is definitely incorrect 
for the default malformed input action of `REPLACE`. With this, encodings 
replace supplementary code points (consisting of two `char`s) with a single 
`byte`. This can been seen here:
   ```java
   Charset charset = Charset.forName("Big5");
   CharSequenceInputStream in = new CharSequenceInputStream("\uD800\uDC00", 
charset);
   System.out.println("Available: " + in.available());
   // Note: readAllBytes() is a method added in Java 9
   System.out.println("Actually read: " + in.readAllBytes().length);
   ```
   
   `available()` returns 2, but only 1 byte can be read.
   
   I asked for feedback for the following reasons, but did not describe them 
very well, or at all, in the original description:
   - I was not sure what your opinion about these larger refactorings are
   - The refactored `CharSequenceInputStream` made the `available()` 
implementation more correct (see above mentioned issue), but made the results 
less useful. This is also why the tests were failing because they were still 
expecting specific `available()` results, which cannot necessarily be 
guaranteed anymore. The tests could be rewritten to only check the 
`available()` result against the maximum, but not impose any minimum anymore. 
But I first wanted to hear your opinion on whether that is fine.
   - The new `CharSequenceInputStream` implementation is basically a wrapped 
`ReaderInputStream` with a bunch of boilerplate code to support `mark` and 
`reset`. Additionally the new `mark` implementation considers its parameter 
value now, which differs from the previous behavior. Is this new 
`CharSequenceInputStream` implementation reasonable?
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@commons.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org