Hi there,

I recently wrote a short Lucene file format tutorial (https://msfroh.github.io/lucene-university/docs/DirectoryFileContents.html), using SimpleTextCodec for educational purposes.
I found that SimpleTextSegmentInfo tries to output the segment ID as raw bytes, which will often result in malformed UTF-8 output. I wrote a small fix that outputs the text representation of the byte array instead (https://github.com/apache/lucene/pull/12897).

I noticed that binary doc values have a similar problem: the bytes get written directly. Is there any general desire for SimpleTextCodec to output well-formed UTF-8 where possible?

Thanks,
Froh
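
P.S. A minimal sketch of the problem, for illustration only. It is not the code from the PR, and it does not use any Lucene classes. The class name SegmentIdTextDemo, the helpers toText and fromText, and the sample bytes are all hypothetical, and using Arrays.toString as the "text representation" is an assumption. It shows that arbitrary ID bytes can fail strict UTF-8 decoding, while a textual rendering of the same bytes is plain ASCII and round-trips.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class SegmentIdTextDemo {

    // Returns true only if the bytes form a valid UTF-8 sequence.
    public static boolean isWellFormedUtf8(byte[] bytes) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT)
                .decode(ByteBuffer.wrap(bytes));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // Hypothetical helper: render the bytes as ASCII text such as "[-1, 0, 127]".
    public static String toText(byte[] bytes) {
        return Arrays.toString(bytes);
    }

    // Hypothetical helper: parse the output of toText back into the original bytes.
    public static byte[] fromText(String text) {
        String inner = text.substring(1, text.length() - 1).trim();
        if (inner.isEmpty()) {
            return new byte[0];
        }
        String[] parts = inner.split(",");
        byte[] out = new byte[parts.length];
        for (int i = 0; i < parts.length; i++) {
            out[i] = Byte.parseByte(parts[i].trim());
        }
        return out;
    }

    public static void main(String[] args) {
        // Made-up sample bytes standing in for a segment ID; 0xFF never appears in valid UTF-8.
        byte[] id = {(byte) 0xFF, 0x00, (byte) 0xC3, 0x28, 0x7F};

        System.out.println(isWellFormedUtf8(id)); // false

        String text = toText(id);
        System.out.println(text); // [-1, 0, -61, 40, 127]

        // The textual form is itself well-formed UTF-8 and recovers the same bytes.
        System.out.println(isWellFormedUtf8(text.getBytes(StandardCharsets.UTF_8))); // true
        System.out.println(Arrays.equals(id, fromText(text))); // true
    }
}
```

The same idea would presumably apply to binary doc values, although a more compact encoding (hex or Base64, say) might be preferable there for readability.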