While working on a custom ORC writer for Presto recently, I kept notes on sections of the documentation that might have problems. Some of these may already have been addressed:

## Compression

See https://orc.apache.org/docs/compression.html

I think the hex sequence for 100000 compressed is [0x41 0x0D 0x03]. Also, it is not clear if compressed length is 2 bytes, or .

```
Each header is 3 bytes long with (compressedLength * 2 + isOriginal) stored as a little endian value. For example, the header for a chunk that compressed to 100,000 bytes would be [0x40, 0x0d, 0x03]. The header for 5 bytes that did not compress would be [0x0b, 0x00, 0x00].
```

This section is not clear:

```
The default compression chunk size is 256K, but writers can choose their own value less than 223.
```

Should that be 223K? If so, that seems strange, since I would assume any value smaller than 256K is legitimate.

## String encodings

See https://orc.apache.org/docs/encodings.html#string-char-and-varchar-columns

The first sentence seems to be describing a heuristic used by the default implementation.

## File tail

The docs should make it clear that the maximum length stored for varchar and char is the maximum number of Unicode characters, and specifically not a byte count and not a count of UTF-16 code units (which is what Java uses by default).

```
// the maximum length of the type for varchar or char
optional uint32 maximumLength = 4;
```
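
To illustrate why the distinction for the maximum length matters, here is a small sketch comparing the three ways a string length could be counted. It assumes "Unicode characters" means code points, which is my reading and not something the docs state. `length_measures` is a hypothetical helper, not part of ORC or Presto.

```python
def length_measures(text: str) -> dict:
    """Return the length of `text` under three different counting rules."""
    return {
        # Python strings are sequences of code points, so len() counts those.
        "code_points": len(text),
        # Each UTF-16 code unit is 2 bytes; this matches Java's String.length().
        "utf16_units": len(text.encode("utf-16-le")) // 2,
        # Number of bytes in the UTF-8 encoding.
        "utf8_bytes": len(text.encode("utf-8")),
    }


# 'a', 'é' (U+00E9), and an emoji outside the Basic Multilingual Plane.
sample = "a\u00e9\U0001F600"
print(length_measures(sample))
# {'code_points': 3, 'utf16_units': 4, 'utf8_bytes': 7}
```

A writer that enforced a varchar(3) limit by UTF-16 units or by bytes would reject or truncate this sample, while one that counts code points would accept it.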

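
For the header question under Compression, here is a minimal sketch of the arithmetic as the quoted text describes it. It assumes isOriginal is 1 when the chunk is stored uncompressed and 0 otherwise. `chunk_header` and `parse_chunk_header` are hypothetical helpers written for this note, not functions from any ORC library.

```python
def chunk_header(length: int, is_original: bool) -> bytes:
    """Encode (length * 2 + isOriginal) as a 3-byte little-endian header."""
    value = length * 2 + (1 if is_original else 0)
    if not 0 <= value < (1 << 24):
        raise ValueError("header value does not fit in 3 bytes")
    return value.to_bytes(3, "little")


def parse_chunk_header(header: bytes) -> tuple:
    """Decode a 3-byte header back into (length, is_original)."""
    value = int.from_bytes(header[:3], "little")
    return value >> 1, bool(value & 1)


print(chunk_header(100_000, False).hex(" "))  # 40 0d 03
print(chunk_header(5, True).hex(" "))         # 0b 00 00
```

Under this reading, a chunk that compressed to 100,000 bytes gives 0x40 as the first byte, and 0x41 would be the same length with the isOriginal bit set. Three bytes minus the flag bit leave 23 bits for the length, so lengths below 2**23 fit. That may be what the "223" in the quoted text was meant to say, but this is only a guess.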