metsw24-max opened a new pull request, #669: URL: https://github.com/apache/logging-log4cxx/pull/669
Reject UTF-8 encodings of UTF-16 surrogate halves (`U+D800–U+DFFF`) during decoding in `Transcoder::decode()`.

RFC 3629 §3 explicitly forbids surrogate-half values in UTF-8. Prior to this patch, the decoder accepted these sequences and treated them as valid code points, allowing malformed Unicode to enter internal `LogString` representations and later be re-emitted unchanged by downstream components.

This patch fixes the issue by rejecting surrogate-half values during the existing 3-byte UTF-8 validation path.

---

## Changes

### Decoder Validation

Updated the validation check in:

`src/main/cpp/transcoder.cpp`

from:

```cpp
if (rv <= 0x800)
```

to:

```cpp
if (rv <= 0x800 || (0xD800 <= rv && rv <= 0xDFFF))
```

The existing `rv <= 0x800` condition is intentionally left unchanged because it belongs to the separate `utf8-u0800-boundary-check` issue.

The new clause rejects all UTF-16 surrogate-half code points, which are invalid in UTF-8.

---

## Tests Added

Added regression coverage in:

`src/test/cpp/helpers/transcodertestcase.cpp`

### `testDecodeUTF8_RejectSurrogate`

Verifies that the invalid UTF-8 sequence:

```text
ED A0 80
```

(previously decoded as `U+D800`) is now rejected and converted into `LOSSCHAR` substitutions.

### `testDecodeUTF8_SurrogateBoundaries`

Validates correct handling around the surrogate range boundaries:

* `U+D7FF` → accepted
* `U+D800` → rejected
* `U+DBFF` → rejected
* `U+DC00` → rejected
* `U+DFFF` → rejected
* `U+E000` → accepted

This confirms that only surrogate-half values are rejected.

--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]
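For readers who want to see the surrogate check in isolation, the following is a minimal standalone sketch of a 3-byte UTF-8 decode with the rejection rule described in this pull request. It is not the log4cxx implementation: `decodeThreeByte` is a hypothetical helper, it assumes C++17 for `std::optional`, and it reports failure with `std::nullopt` rather than substituting `LOSSCHAR`. It also uses a strict `cp < 0x800` overlong check, which differs from the `rv <= 0x800` condition that the patch deliberately leaves untouched.

```cpp
#include <cstdint>
#include <optional>

// Hypothetical helper (not part of log4cxx): decodes a single 3-byte UTF-8
// sequence and returns std::nullopt when the sequence is not valid UTF-8.
std::optional<std::uint32_t> decodeThreeByte(std::uint8_t b0,
                                             std::uint8_t b1,
                                             std::uint8_t b2)
{
    // The lead byte must look like 1110xxxx and both continuation
    // bytes must look like 10xxxxxx.
    if ((b0 & 0xF0) != 0xE0 || (b1 & 0xC0) != 0x80 || (b2 & 0xC0) != 0x80)
    {
        return std::nullopt;
    }

    // Assemble the 16 payload bits: 4 from the lead byte, 6 from each
    // continuation byte.
    const std::uint32_t cp = ((b0 & 0x0Fu) << 12)
                           | ((b1 & 0x3Fu) << 6)
                           |  (b2 & 0x3Fu);

    // Overlong encoding: values below U+0800 fit in one or two bytes.
    // (Assumption: strict '<' here, unlike the PR's unchanged 'rv <= 0x800'.)
    if (cp < 0x800)
    {
        return std::nullopt;
    }

    // Surrogate halves U+D800..U+DFFF are not valid in UTF-8 (RFC 3629).
    if (0xD800 <= cp && cp <= 0xDFFF)
    {
        return std::nullopt;
    }

    return cp;
}
```

With this sketch, `decodeThreeByte(0xED, 0xA0, 0x80)` (the `ED A0 80` sequence from the regression test) yields no value, while `decodeThreeByte(0xED, 0x9F, 0xBF)` and `decodeThreeByte(0xEE, 0x80, 0x80)` yield `U+D7FF` and `U+E000`, mirroring the boundary cases listed in the tests.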
