XiangpengHao commented on issue #10921: URL: https://github.com/apache/datafusion/issues/10921#issuecomment-2217882361
I want to share some thoughts on when to use `StringViewArray` and when not to. To narrow the scope, we only consider the cost of loading data from Parquet.

To load a `StringArray`, we need to copy the data to a new buffer and build the offset array. The extra memory we need to set up is `array len * (string len + offset size)`: for `StringArray` it is `array len * (string len + 4)`, and for `LargeStringArray` it is `array len * (string len + 8)`.

To load a `StringViewArray`, we only need to build the view array and can reuse the buffer from the Parquet decoder. The extra memory to set up is `array len * view size`, i.e., `array len * 16`. Note that this cost is independent of string length: it takes 16 bytes per string no matter how long the underlying string is.

For a sufficiently large array, the time to build the array should be proportional to the extra memory we set up. This means that if the individual strings are small, i.e., shorter than 12 bytes (where `string len + 4 < 16`), `StringArray` is actually faster to load than `StringViewArray`. In other words, we should use `StringViewArray` only when strings are larger than 12 bytes.
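The arithmetic above can be written out as a small sketch. This is only the simplified cost model from this comment, not a benchmark and not code from arrow-rs or DataFusion. The function names (`offset_array_extra_bytes`, `view_array_extra_bytes`, `prefer_string_view`) are hypothetical, and it assumes every string in the array has the same length.

```rust
/// Size of one view entry in a `StringViewArray`, in bytes.
const VIEW_SIZE: usize = 16;

/// Extra bytes set up when loading into an offset-based array:
/// each string's bytes are copied and one offset is written per string.
/// `offset_size` is 4 for `StringArray` and 8 for `LargeStringArray`.
/// (Hypothetical helper; assumes all strings have length `string_len`.)
fn offset_array_extra_bytes(array_len: usize, string_len: usize, offset_size: usize) -> usize {
    array_len * (string_len + offset_size)
}

/// Extra bytes set up when loading into a `StringViewArray`:
/// one 16-byte view per string, with the data buffer reused.
fn view_array_extra_bytes(array_len: usize) -> usize {
    array_len * VIEW_SIZE
}

/// Under this model, the view array is cheaper to build only when
/// `string_len + offset_size` exceeds the view size.
fn prefer_string_view(string_len: usize, offset_size: usize) -> bool {
    string_len + offset_size > VIEW_SIZE
}

fn main() {
    let array_len = 1_000_000;
    for string_len in [4, 12, 13, 64] {
        let offsets = offset_array_extra_bytes(array_len, string_len, 4);
        let views = view_array_extra_bytes(array_len);
        println!(
            "string len {string_len}: StringArray {offsets} B, StringViewArray {views} B, prefer view: {}",
            prefer_string_view(string_len, 4)
        );
    }
}
```

With a 4-byte offset, the two costs tie at a string length of 12 bytes, which is where the "larger than 12 bytes" rule comes from. For `LargeStringArray` (8-byte offsets), the same model puts the tie at 8 bytes.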