Documentations issues

Dain Sundstrom Fri, 16 Jun 2017 12:20:02 -0700

Recently I have been working on a custom writer for Presto, and along the way 
I kept notes on sections of the documentation that might have problems.  Some 
of these may have already been addressed:


## Compression
see https://orc.apache.org/docs/compression.html

I think the hex sequence for 100000 compressed is [0x41 0x0D 0x03].  Also, it 
is not clear if compressed length is 2 bytes, or .
```
Each header is 3 bytes long with (compressedLength * 2 + isOriginal) stored as 
a little endian value.   For example, the header for a chunk that compressed to 
100,000 bytes would be [0x40, 0x0d, 0x03]. The header for 5 bytes that did not 
compress would be [0x0b, 0x00, 0x00]. 
```
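
As a quick check of the quoted formula, here is a minimal Python sketch that 
packs (compressedLength * 2 + isOriginal) into 3 little-endian bytes.  The 
function names are hypothetical and not taken from any ORC implementation; the 
only assumption is that the formula in the docs is taken at face value, with 
isOriginal = 1 meaning "stored uncompressed", as the 5-byte example implies.

```python
from typing import Tuple


def encode_chunk_header(length: int, is_original: bool) -> bytes:
    """Pack (length * 2 + is_original) into 3 little-endian bytes."""
    value = length * 2 + int(is_original)
    # to_bytes raises OverflowError if the value needs more than 3 bytes.
    return value.to_bytes(3, "little")


def decode_chunk_header(header: bytes) -> Tuple[int, bool]:
    """Recover (length, is_original) from a 3-byte header."""
    value = int.from_bytes(header[:3], "little")
    return value >> 1, bool(value & 1)


# A chunk that compressed to 100,000 bytes (isOriginal = 0).
compressed = encode_chunk_header(100_000, is_original=False)
# The same length with the isOriginal bit set, for comparison.
flagged = encode_chunk_header(100_000, is_original=True)
# 5 bytes that did not compress (isOriginal = 1).
original = encode_chunk_header(5, is_original=True)

print([hex(b) for b in compressed])
print([hex(b) for b in flagged])
print([hex(b) for b in original])
```

Under these assumptions the low bit of the first byte is the only difference 
between 0x40 and 0x41, so which value is right depends on whether isOriginal is 
set for that chunk.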

This section is not clear:
```
The default compression chunk size is 256K, but writers can choose their own 
value less than 223.
```
Should that be 223K?  If so, that seems strange since I would assume any 
value smaller than 256K is legit.
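
One possible reading, offered only as an assumption and not confirmed by the 
docs, is that "223" is a superscript lost in rendering and the limit is meant 
to be 2^23.  That would match the 3-byte header described above: 24 bits minus 
the one isOriginal bit leaves 23 bits for the length.  The arithmetic, as a 
small Python sketch with hypothetical constant names:

```python
HEADER_BYTES = 3   # header size stated in the compression docs
FLAG_BITS = 1      # the isOriginal bit

# Bits left over for the chunk length.
LENGTH_BITS = HEADER_BYTES * 8 - FLAG_BITS

# Largest length that fits in those bits.
MAX_CHUNK_LENGTH = 2 ** LENGTH_BITS - 1

DEFAULT_CHUNK_SIZE = 256 * 1024   # the documented default, 256K
LITERAL_223K = 223 * 1024         # the "223K" reading

print(LENGTH_BITS, MAX_CHUNK_LENGTH)
print(DEFAULT_CHUNK_SIZE < 2 ** LENGTH_BITS)   # default fits under 2^23
print(DEFAULT_CHUNK_SIZE < LITERAL_223K)       # default does not fit under 223K
```

If the limit really were 223K, the documented default of 256K would itself be 
out of range, which is another reason to doubt that reading.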


## String encodings
see https://orc.apache.org/docs/encodings.html#string-char-and-varchar-columns

The first sentence of this section seems to be describing a heuristic used by 
the default implementation.

## File tail
The docs should make it clear that the maximum length stored for varchar and 
char is the maximum number of Unicode characters, and specifically not a byte 
count and not a count of UTF-16 code units (which is what Java counts by 
default).
```
// the maximum length of the type for varchar or char
 optional uint32 maximumLength = 4;
```
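
To illustrate why the distinction matters, here is a small Python sketch that 
measures the same string three ways.  The helper name is hypothetical, and 
"characters" is assumed here to mean Unicode code points.

```python
def length_measures(text: str) -> dict:
    """Return three different notions of the length of `text`."""
    return {
        # Python's len() on str counts Unicode code points.
        "code_points": len(text),
        # Size of the UTF-8 encoding in bytes.
        "utf8_bytes": len(text.encode("utf-8")),
        # Number of 16-bit code units, which is what Java's
        # String.length() reports.
        "utf16_code_units": len(text.encode("utf-16-le")) // 2,
    }


# 'a', 'é' (U+00E9), and an emoji outside the Basic Multilingual Plane.
sample = "a\u00e9\U0001F600"
measures = length_measures(sample)
print(measures)
```

A value like this would fit a varchar with maximumLength = 3 if the limit is 
counted in code points, but not if it is counted in bytes or UTF-16 code 
units.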
