XiangpengHao commented on issue #10921:
URL: https://github.com/apache/datafusion/issues/10921#issuecomment-2217882361

   I want to share some thoughts on when to use `StringViewArray` and when 
not to.
   
   To narrow the scope, we only consider the cost of loading data from Parquet.
   
   To load a `StringArray`, we need to copy the data to a new buffer and build 
the offset array. The extra memory we need to set up is `array len * (string 
len + offset size)`: for `StringArray` (4-byte offsets) it is `array len * 
(string len + 4)`, and for `LargeStringArray` (8-byte offsets) it is `array len 
* (string len + 8)`.
   
   To load a `StringViewArray`, we only need to build the view array and can 
reuse the buffer from the Parquet decoder. The extra memory to set up is `array 
len * view size`, i.e., `array len * 16`. Note that the extra memory per string 
in a `StringViewArray` is independent of string length, i.e., it takes 16 bytes 
no matter how long the underlying string is.
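
   The two formulas above can be written down as a minimal sketch. The 
function names are hypothetical and are not part of arrow-rs or DataFusion. The 
sketch assumes every string has the same length, which real data will not.

```rust
/// Size in bytes of one view in a view-based string array.
const VIEW_SIZE: usize = 16;

/// Extra bytes set up when loading into an offset-based array
/// (`StringArray` with `offset_size = 4`, `LargeStringArray` with
/// `offset_size = 8`): string data is copied and one offset is built
/// per string. Hypothetical helper, assumes uniform `string_len`.
fn offset_array_extra_bytes(array_len: usize, string_len: usize, offset_size: usize) -> usize {
    array_len * (string_len + offset_size)
}

/// Extra bytes set up when loading into a view-based array: one
/// 16-byte view per string, independent of the string length.
fn view_array_extra_bytes(array_len: usize) -> usize {
    array_len * VIEW_SIZE
}
```

   For example, under this model 1000 strings of 8 bytes each need 12000 extra 
bytes as a `StringArray` and 16000 as a `StringViewArray`, while 1000 strings 
of 100 bytes each need 104000 versus 16000.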
   
   For a sufficiently large array, the time to build the array should be 
proportional to the extra memory we set up.  
   
   This means that if the individual strings are small, i.e., smaller than 12 
bytes, `StringArray` is actually faster to build than `StringViewArray`, 
because `string len + 4 < 16` exactly when `string len < 12`. In other words, 
we should use `StringViewArray` only when strings are larger than 12 bytes.
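
   As a rough illustration of the 12-byte break-even, here is a small sketch 
that picks an array type from a sample of values. `prefer_string_view` is a 
hypothetical helper, not a DataFusion or arrow-rs API. Using the average length 
of a sample is an assumption of this sketch, and it only compares against 
`StringArray` with 4-byte offsets.

```rust
/// Size in bytes of one view in a `StringViewArray`.
const VIEW_SIZE: usize = 16;
/// Size in bytes of one offset in a `StringArray` (i32 offsets).
const I32_OFFSET_SIZE: usize = 4;

/// Returns true when, for the sampled strings, the load cost model
/// favors `StringViewArray`, i.e. when the average string length is
/// larger than 12 bytes. Hypothetical helper for illustration only.
fn prefer_string_view(sample: &[&str]) -> bool {
    if sample.is_empty() {
        // No evidence either way; default to the offset-based array.
        return false;
    }
    let total_len: usize = sample.iter().map(|s| s.len()).sum();
    // avg_len + 4 > 16, multiplied through by the sample size so the
    // comparison stays in integer arithmetic.
    total_len + sample.len() * I32_OFFSET_SIZE > sample.len() * VIEW_SIZE
}
```

   At exactly 12 bytes both layouts set up the same amount of extra memory, so 
the sketch keeps `StringArray` in that case.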
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: github-unsubscr...@datafusion.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

