xiewajueji commented on issue #45061:
URL: https://github.com/apache/arrow/issues/45061#issuecomment-2562384612

   > And you may be curious about the size that has not been written to a page, or 
that is held by the dictionary. That's EstimatedDataEncodedSize().
   https://github.com/apache/arrow/pull/33897#issuecomment-1440166798
   
   ```cpp
     int64_t estimated_buffered_value_bytes() const override {
       return current_encoder_->EstimatedDataEncodedSize();
     }
   ```
   
   The snippet above is how ColumnWriter estimates its buffered size: it simply 
forwards to the encoder's implementation. Because DictEncoder's 
EstimatedDataEncodedSize() does not count the dictionary itself, the reported 
buffered size is smaller than what is actually buffered.
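   
   To make the gap concrete, here is a minimal, self-contained sketch. 
`ToyDictEncoder` is a hypothetical stand-in, not Arrow's DictEncoder; it 
assumes 4-byte indices (ignoring RLE/bit-packing) and a 4-byte length prefix 
per dictionary entry.
   
   ```cpp
   #include <cstdint>
   #include <string>
   #include <unordered_map>

   // Hypothetical toy encoder for illustration only, NOT Arrow's DictEncoder.
   class ToyDictEncoder {
    public:
     void Put(const std::string& value) {
       auto it = index_of_.find(value);
       if (it == index_of_.end()) {
         // New distinct value: it is stored in the dictionary, not in the page.
         index_of_.emplace(value, static_cast<int32_t>(index_of_.size()));
         // Assumption: each entry costs its bytes plus a 4-byte length prefix.
         dict_bytes_ += static_cast<int64_t>(value.size()) + 4;
       }
       ++num_indices_;
     }

     // Counts only the buffered indices (assumed 4 bytes each).
     int64_t EstimatedDataEncodedSize() const { return num_indices_ * 4; }

     // Bytes held by the dictionary, which the estimate above leaves out.
     int64_t DictEncodedSize() const { return dict_bytes_; }

    private:
     std::unordered_map<std::string, int32_t> index_of_;
     int64_t dict_bytes_ = 0;
     int64_t num_indices_ = 0;
   };
   ```
   
   With two distinct 1000-byte values, this toy reports 8 bytes from 
EstimatedDataEncodedSize() while its dictionary holds 2008 bytes.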
   
   In Apache Parquet (Java), DictionaryValuesWriter is always wrapped in 
FallbackValuesWriter.
   ```java
   // FallbackValuesWriter.java
   
     public long getBufferedSize() {
       // use raw data size to decide if we want to flush the page
       // so the actual size of the page written could be much more smaller
       // due to dictionary encoding. This prevents page being too big when
       // fallback happens.
       return rawDataByteSize;
     }
   
     public void writeBytes(Binary v) {
       // for rawdata, length (4-byte int) is stored, followed by the binary
       // content itself
       rawDataByteSize += v.length() + 4;
       currentWriter.writeBytes(v);
       checkFallback();
     }
   ```
   The Java implementation's size estimate is deliberately pessimistic, so that 
memory consumption cannot grow too large.
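   
   For comparison, the same raw-size accounting could be sketched in C++ roughly 
as follows. `RawSizeTracker` and `ShouldFlushPage` are hypothetical names used 
only for illustration; nothing like this exists in Arrow as far as this comment 
shows.
   
   ```cpp
   #include <cstdint>
   #include <string>

   // Hypothetical helper mirroring the Java accounting; not part of Arrow.
   class RawSizeTracker {
    public:
     // Mirrors writeBytes(): count the value plus a 4-byte length prefix,
     // regardless of how compactly the dictionary encoder stores it.
     void OnWriteBytes(const std::string& value) {
       raw_data_byte_size_ += static_cast<int64_t>(value.size()) + 4;
     }

     // Mirrors getBufferedSize(): report the raw (pessimistic) size.
     int64_t BufferedSize() const { return raw_data_byte_size_; }

     // Call after a page has been flushed.
     void Reset() { raw_data_byte_size_ = 0; }

    private:
     int64_t raw_data_byte_size_ = 0;
   };

   // Flush decision based on the raw size rather than the encoded size.
   inline bool ShouldFlushPage(const RawSizeTracker& tracker,
                               int64_t page_size_limit) {
     return tracker.BufferedSize() >= page_size_limit;
   }
   ```
   
   Because the raw size never shrinks with dictionary encoding, a flush 
decision made this way errs on the side of smaller pages.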
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
