PavloPolovyi commented on issue #4271:
URL: https://github.com/apache/arrow-adbc/issues/4271#issuecomment-4328727246
@zeroshade We are working together with @skalkin.
Everything we tested was a simple `SELECT * FROM …` with no joins or
aggregations, so transport is the main cost:
- 1K / 100K / 1M rows × 10 columns of mixed types (int, double, bool,
timestamp, varchar)
- 1M rows × 10 string columns at varying cardinality - 10, 50, 100, 500, 1K,
5K, 10K, 50K, 100K, and 1M distinct values
One note on setup: both sides ship chunks of 100K rows to the next stage of
our pipeline - on JDBC via setFetchSize(100000); on the ADBC path, our service
accumulates RecordBatches into matching 100K-row chunks before forwarding. Same
chunk count, same chunk size on the wire, and neither side buffers the whole
result, so the comparison should be fair.
We keep the data columnar from start to finish - each RecordBatch is sent as
Arrow IPC over a WebSocket, then decoded into a DataFrame on the other side for
visualization. This is the "consumer is already columnar" case you mentioned,
and on the smaller datasets (1K, 100K rows), Snowflake ADBC is ~20–30% faster
than our JDBC-based path.
There are two cases where we don't see the win:
1. 1M rows, mixed types. Snowflake is roughly tied with JDBC. Databricks
ADBC is clearly slower - most of the time is spent waiting on the ADBC reader.
Our best guess is that Databricks's CloudFetch path (results delivered through
S3) is the bottleneck when running across regions, while JDBC streams directly
via Thrift. But that's a separate conversation.
2. 1M rows, strings - the topic of this discussion. On the JDBC side, our
Java client dictionary-encodes strings before sending, so the wire payload is
2–3× smaller. With plain UTF-8 strings from ADBC, every downstream step moves
more bytes. Our own workaround, which checks each column and casts it to a
Dictionary, gets Snowflake to ~1.6× JDBC on this test (down from ~3.2× without
it) - much better, but still not even.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]