twuebi opened a new pull request, #786:
URL: https://github.com/apache/arrow-go/pull/786
### Rationale for this change
On dictionary overflow, arrow-go always flushed the dictionary page and any
buffered dict-encoded data pages before switching to PLAIN, even when no
dict-encoded data page had been cut. On mid-cardinality columns the result was
a 4-encoding chunk layout (PLAIN_DICTIONARY, PLAIN, RLE, PLAIN) that bloated
output by 20-30% versus parquet-mr.
This was noticed while testing iceberg-go's recently added compaction
feature, where some tables with particular high-cardinality columns grew by
about 30% after compaction.
### What changes are included in this PR?
Mirror parquet-mr's FallbackValuesWriter:
- Discard the dictionary and re-encode buffered indices as PLAIN when no
dict-encoded data page has been flushed yet; only emit the dictionary page once
a dict-encoded page is committed.
- Before the first dict-encoded page, fall back to PLAIN if the dictionary plus
the encoded indices are at least as large as the raw input bytes.
- Size dict-encoded pages by raw input bytes (not the RLE indices' encoded
size) so the page cadence matches PLAIN.
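
As a rough illustration of the rules in the list above, the fallback decision can be sketched as a small pure function. This is not the code in this PR: `dictState`, `action`, and `decide` are hypothetical names, and the case where a dict-encoded page has already been committed is assumed to keep the pre-existing behavior (emit the dictionary, then switch to PLAIN).

```go
package main

import "fmt"

// dictState is a hypothetical snapshot of a column writer's dictionary
// state. None of these names come from arrow-go.
type dictState struct {
	dictBytes    int  // encoded size of the dictionary so far
	indexBytes   int  // encoded size of the buffered RLE indices
	rawBytes     int  // size the same values would occupy as PLAIN
	pagesFlushed int  // dict-encoded data pages already committed
	overflow     bool // dictionary exceeded its size limit
}

type action int

const (
	keepDictionary action = iota
	discardDictAndReencodePlain
	emitDictThenSwitchToPlain
)

// decide applies the rules described above.
func decide(s dictState) action {
	if s.pagesFlushed == 0 {
		// Nothing on disk depends on the dictionary yet, so it can be
		// dropped and the buffered values rewritten as PLAIN.
		if s.overflow || s.dictBytes+s.indexBytes >= s.rawBytes {
			return discardDictAndReencodePlain
		}
		return keepDictionary
	}
	if s.overflow {
		// Committed pages reference the dictionary, so it must be kept
		// (assumed to match the behavior before this change).
		return emitDictThenSwitchToPlain
	}
	return keepDictionary
}

func main() {
	s := dictState{dictBytes: 900, indexBytes: 200, rawBytes: 1000}
	fmt.Println(decide(s) == discardDictAndReencodePlain)
}
```

Keeping the decision separate from the I/O makes the "no page committed yet" branch easy to test on its own.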
Adds DictEncoder.FallBackTo / ObservedRawSize and exposes
BinaryMemoTable.Value for the fallback translation.
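
The fallback translation amounts to looking each buffered index up in the dictionary and writing the value back out as PLAIN. A minimal sketch for BYTE_ARRAY values follows, assuming a stand-in `memoTable` type whose `Value` method is a guess at the shape of the lookup and not the actual `BinaryMemoTable.Value` signature; the only format detail relied on is that PLAIN BYTE_ARRAY is a 4-byte little-endian length followed by the bytes.

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// memoTable is a hypothetical stand-in for a dictionary memo table;
// it is not arrow-go's BinaryMemoTable.
type memoTable struct{ values [][]byte }

// Value returns the dictionary entry stored at index i.
func (m *memoTable) Value(i int) []byte { return m.values[i] }

// reencodePlain translates buffered dictionary indices back into
// PLAIN-encoded BYTE_ARRAY values: a 4-byte little-endian length
// prefix followed by the raw bytes of each value.
func reencodePlain(dict *memoTable, indices []int) []byte {
	var out []byte
	for _, idx := range indices {
		v := dict.Value(idx)
		out = binary.LittleEndian.AppendUint32(out, uint32(len(v)))
		out = append(out, v...)
	}
	return out
}

func main() {
	dict := &memoTable{values: [][]byte{[]byte("ab"), []byte("c")}}
	plain := reencodePlain(dict, []int{1, 0, 1})
	fmt.Printf("%d bytes: %q\n", len(plain), plain)
}
```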
### Are these changes tested?
Yes, as part of the PR and also e2e while testing compaction in iceberg-go.
### Are there any user-facing changes?
No public API changes; the only observable difference should be that the
double encoding is gone.