[PR] fix surrogate pair byte counting in ExtendedBufferedReader array read [commons-csv]

via GitHub Fri, 05 Jun 2026 08:08:37 -0700


digi-scrypt opened a new pull request, #605:
URL: https://github.com/apache/commons-csv/pull/605


   1. With byte tracking on (setTrackBytes plus a charset), read(char[]) 
advances lastChar to the last char in the buffer before counting bytes. The 
per-char helper reads that field instead of the char that actually precedes 
the current one, so a surrogate pair gets matched against the wrong neighbor. 
The counting loop also ran to length instead of offset+length.
   2. As a result, a 4-byte character taken through the char[] path, e.g. a 
multi-character delimiter holding a supplementary character, throws 
CharacterCodingException out of nextRecord, and getBytePosition() reports the 
wrong value.
   
   This change passes the previous char to the helper explicitly and moves the 
counting ahead of the lastChar update. A pair split across two buffer reads is 
covered too, since the first char of the second read still pairs against the 
saved lastChar. Added a regression test next to the existing 
multi-character-delimiter byte-position test.
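
   For illustration only, here is a minimal standalone sketch of the counting 
order described above. It is not the ExtendedBufferedReader code: the class 
SurrogateByteCounter and the methods encodedLength and countBytes are 
hypothetical names, and UTF-8 is assumed as the charset. The saved lastChar is 
passed in as a plain argument rather than read from a field.

```java
import java.nio.CharBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

// Hypothetical sketch; names are illustrative and not taken from commons-csv.
final class SurrogateByteCounter {

    // Encoded length of `current`, given the char that actually precedes it.
    // `previous` is an int so a negative value can mean "no previous char".
    static int encodedLength(CharsetEncoder encoder, int previous, char current)
            throws CharacterCodingException {
        if (Character.isHighSurrogate(current)) {
            // Counted together with the low surrogate that follows.
            return 0;
        }
        if (Character.isLowSurrogate(current)) {
            if (previous >= 0 && Character.isHighSurrogate((char) previous)) {
                char[] pair = {(char) previous, current};
                return encoder.encode(CharBuffer.wrap(pair)).limit();
            }
            // A low surrogate without its high half cannot be encoded.
            throw new CharacterCodingException();
        }
        return encoder.encode(CharBuffer.wrap(new char[] {current})).limit();
    }

    // Counts the bytes for buf[offset .. offset+length). `lastChar` is the
    // final char of the previous read, so a pair split across two reads
    // still matches up. The caller updates its own lastChar afterwards.
    static long countBytes(CharsetEncoder encoder, int lastChar,
                           char[] buf, int offset, int length)
            throws CharacterCodingException {
        long bytes = 0;
        int previous = lastChar;
        for (int i = offset; i < offset + length; i++) {
            bytes += encodedLength(encoder, previous, buf[i]);
            previous = buf[i];
        }
        return bytes;
    }

    public static void main(String[] args) throws CharacterCodingException {
        CharsetEncoder utf8 = StandardCharsets.UTF_8.newEncoder();
        // 'a' + U+1F600 (one surrogate pair) + 'b'
        char[] whole = "a\uD83D\uDE00b".toCharArray();
        System.out.println(countBytes(utf8, -1, whole, 0, whole.length)); // prints 6

        // Same text split between two reads, in the middle of the pair.
        char[] first = {'a', '\uD83D'};
        char[] second = {'\uDE00', 'b'};
        long total = countBytes(utf8, -1, first, 0, first.length)
                + countBytes(utf8, first[first.length - 1], second, 0, second.length);
        System.out.println(total); // prints 6
    }
}
```

   In the sketch the loop bound is offset + length and the previous char is 
tracked locally, which are the two points the fix addresses.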


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
