PavloPolovyi commented on issue #4271:
URL: https://github.com/apache/arrow-adbc/issues/4271#issuecomment-4328727246
@zeroshade We are working together with @skalkin.
Everything we tested was a simple `SELECT * FROM …` with no joins or
aggregations, so transport is the main cost:
- 1K / 100K / 1M rows × 10 columns of mixed types (int, double, bool,
timestamp, varchar)
- 1M rows × 10 string columns at varying cardinality - 10, 50, 100, 500, 1K,
5K, 10K, 50K, 100K, and 1M distinct values
One note on setup: both sides ship chunks of 100K rows to the next stage of
our pipeline - on JDBC via setFetchSize(100000); on the ADBC path, our service
accumulates RecordBatches into matching 100K-row chunks before forwarding. Same
chunk count, same chunk size on the wire, and neither side buffers the whole
result, so the comparison should be fair.
We keep the data columnar from start to finish - each RecordBatch is sent as
Arrow IPC over a WebSocket, then decoded into a DataFrame on the other side for
visualization. This is the "consumer is already columnar" case you mentioned,
and on the smaller datasets (1K, 100K rows), Snowflake ADBC is ~20–30% faster
than our JDBC-based path.
There are two cases where we don't see the win:
1. 1M rows, mixed types. Snowflake is roughly tied with JDBC. Databricks
ADBC is clearly slower - most of the time is spent waiting on the ADBC reader.
Our best guess is that Databricks's CloudFetch path (results delivered through
S3) is the bottleneck when running across regions, while JDBC streams directly
via Thrift. But that's a separate conversation.
2. 1M rows, strings - the topic of this discussion. On the JDBC side, our
Java client dictionary-encodes strings before sending, so the wire payload is
2–3× smaller. With plain UTF-8 strings from ADBC, every downstream step moves
more bytes. Our own workaround, which checks each column and casts it to a
Dictionary, gets Snowflake to ~1.6× JDBC on this test (down from ~3.2× without
it) - much better, but still not even.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]