twuebi opened a new pull request, #786:
URL: https://github.com/apache/arrow-go/pull/786

   ### Rationale for this change
   
   On dictionary overflow, arrow-go always flushed the dictionary page and any 
buffered dict-encoded data pages before switching to PLAIN, even when no 
dict-encoded data page had been cut. On mid-cardinality columns the result was 
a 4-encoding chunk layout (PLAIN_DICTIONARY, PLAIN, RLE, PLAIN) that bloated 
output by 20-30% versus parquet-mr.
   
   This was noticed while testing iceberg-go's recently added compaction 
feature, where some tables with particular high-cardinality columns saw a 
30% size increase after compaction.
   
   ### What changes are included in this PR?
   
   Mirror parquet-mr's FallbackValuesWriter:
   
     - Discard the dictionary and re-encode buffered indices as PLAIN when no 
dict-encoded data page has been flushed yet; only emit the dictionary page once 
a dict-encoded page is committed.
     - Before the first dict-encoded page, fall back to PLAIN if the combined 
size of the dictionary and the encoded indices is >= the raw input bytes.
     - Size dict-encoded pages by raw input bytes (not the RLE indices' encoded 
size) so the page cadence matches PLAIN.
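
   The first two rules above can be sketched roughly as follows. This is a 
standalone illustration, not the code in this PR: the `action` type and the 
functions `onDictOverflow`, `beforeFirstDictPage`, and `plainFromIndices` are 
hypothetical names invented for the example, and it assumes the size check is 
evaluated when the first dict-encoded page is about to be cut.

```go
package main

import "fmt"

// action is what a (hypothetical) column chunk writer does next.
type action int

const (
	// keepDictionary: continue dictionary-encoding.
	keepDictionary action = iota
	// discardDictionary: drop the dictionary and re-encode buffered values as PLAIN.
	discardDictionary
	// keepDictSwitchToPlain: committed pages still need the dictionary page;
	// only the pages written from now on are PLAIN.
	keepDictSwitchToPlain
)

// onDictOverflow models the first rule: on overflow the dictionary can be
// dropped entirely only if no dict-encoded data page has been flushed yet.
func onDictOverflow(dictPagesFlushed int) action {
	if dictPagesFlushed == 0 {
		return discardDictionary
	}
	return keepDictSwitchToPlain
}

// beforeFirstDictPage models the second rule: dictionary encoding is kept
// only if dictionary plus encoded indices is smaller than the raw input.
func beforeFirstDictPage(dictBytes, indexBytes, rawBytes int64) action {
	if dictBytes+indexBytes >= rawBytes {
		return discardDictionary
	}
	return keepDictionary
}

// plainFromIndices translates buffered dictionary indices back into the
// values they refer to, which is the input a PLAIN encoder needs.
func plainFromIndices(dict []string, indices []int) []string {
	out := make([]string, len(indices))
	for i, idx := range indices {
		out[i] = dict[idx]
	}
	return out
}

func main() {
	fmt.Println(onDictOverflow(0) == discardDictionary)
	fmt.Println(beforeFirstDictPage(900, 200, 1000) == discardDictionary)
	fmt.Println(plainFromIndices([]string{"a", "b"}, []int{1, 0, 1}))
}
```

   In this sketch a chunk that falls back before any dict-encoded page is 
committed contains only PLAIN data pages and no dictionary page.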
   
   Adds DictEncoder.FallBackTo / ObservedRawSize and exposes 
BinaryMemoTable.Value for the fallback translation.
   
   
   ### Are these changes tested?
   
   Yes, as part of the PR and also e2e while testing compaction in iceberg-go.
   
   ### Are there any user-facing changes?
   
   No public API changes. The only observable difference should be that the 
double encoding is gone.

