UTF-8 well-formedness for SimpleTextCodec

Michael Froh Mon, 18 Dec 2023 09:01:00 -0800

Hi there,

I was recently writing up a short Lucene file format tutorial (
https://msfroh.github.io/lucene-university/docs/DirectoryFileContents.html),
using SimpleTextCodec for educational purposes.


I found that SimpleTextSegmentInfo tries to output the segment ID as raw
bytes, which will often result in malformed UTF-8 output. I wrote a little
fix to output the ID as the text representation of a byte array instead (
https://github.com/apache/lucene/pull/12897). I noticed that binary doc
values have a similar problem: their bytes also get written directly to
the file.
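
To make the problem concrete, here is a minimal, self-contained sketch. It
is not the actual SimpleTextSegmentInfo code or the code from the PR: the
class SegmentIdDemo, its helper methods, and the sample ID bytes are all
made up for illustration, and Arrays.toString is just one possible text
representation. It only uses the JDK, checking whether a byte sequence
decodes as strict UTF-8 before and after converting it to text.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical demo class; not part of Lucene.
class SegmentIdDemo {

    /** Returns true if the bytes decode as strict (well-formed) UTF-8. */
    static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    /**
     * Assumed text form: the decimal values of the bytes, e.g. "[-1, 40]".
     * The result is plain ASCII, so it is always well-formed UTF-8.
     */
    static byte[] asText(byte[] id) {
        return Arrays.toString(id).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Made-up 16-byte ID. 0xFF never appears in UTF-8, and 0xC3 must be
        // followed by a continuation byte, which 0x28 is not.
        byte[] id = {
            (byte) 0xFF, (byte) 0xC3, 0x28, 0x10, 0x7A, (byte) 0x80, 0x01, 0x02,
            0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x09, 0x0A
        };

        System.out.println("raw bytes well-formed:  " + isWellFormedUtf8(id));
        System.out.println("text form well-formed:  " + isWellFormedUtf8(asText(id)));
        System.out.println("text form:              "
            + new String(asText(id), StandardCharsets.UTF_8));
    }
}
```

Since the ID bytes are effectively arbitrary, writing them raw will
usually trip a strict decoder, while the text form cannot. The trade-off
is that a reader has to parse the text form back into bytes.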

Is there any general desire for SimpleTextCodec to output well-formed UTF-8
where possible?

Thanks,
Froh
