joellubi commented on issue #1327:
URL: https://github.com/apache/arrow-adbc/issues/1327#issuecomment-1840602154

   Following up on #1322.
   
   The Snowflake Connector that our ADBC driver uses [claims to make
optimizations](https://pkg.go.dev/github.com/snowflakedb/gosnowflake#hdr-Batch_Inserts_and_Binding_Parameters)
when many values are bound to an `INSERT` statement. There are some limitations
on when this optimization applies, but in this case the code does appear to be
going through the connector's optimized path already. Since even that path
doesn't deliver the throughput we would expect, it seems reasonable to handle
ingestion on the ADBC side while addressing some of the connector's existing
limitations.
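
   For reference, the connector's documented bulk path is triggered by binding whole
column slices to a single `INSERT`, roughly as in the sketch below (the DSN, table,
and column names are placeholders; past an internal threshold the connector stages
the bound values itself and issues a COPY):

   ```go
   package main

   import (
       "database/sql"
       "log"

       sf "github.com/snowflakedb/gosnowflake"
   )

   func main() {
       // Placeholder DSN; the ADBC driver builds the real one from its options.
       db, err := sql.Open("snowflake", "user:password@account/db/schema")
       if err != nil {
           log.Fatal(err)
       }
       defer db.Close()

       ids := make([]int, 100000)
       names := make([]string, 100000)
       // ... populate the slices ...

       // Binding whole columns with sf.Array lets the connector batch the values;
       // above its internal threshold it stages them as CSV and runs COPY itself.
       if _, err := db.Exec(
           "INSERT INTO my_table (id, name) VALUES (?, ?)",
           sf.Array(&ids), sf.Array(&names),
       ); err != nil {
           log.Fatal(err)
       }
   }
   ```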
   
   The primary limitations we'd want our solution to overcome:
   1. Currently each batch gets its own temporary stage. We would want to upload
multiple (or all) batches to a single stage and load from there.
   2. The connector converts values to Go types, which must then be written out
as CSV for the stage. We could likely do much better with Arrow type mapping by
writing Parquet directly from Arrow as the stage format (see the sketch after
this list).
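
   A rough sketch of what that staged flow could look like, assuming arrow-go's
`pqarrow` writer for the Parquet conversion (the stage name, local paths, and COPY
options are placeholders, and cleanup/error handling is trimmed):

   ```go
   package ingest

   import (
       "context"
       "database/sql"
       "fmt"
       "os"

       "github.com/apache/arrow/go/v14/arrow/array"
       "github.com/apache/arrow/go/v14/parquet"
       "github.com/apache/arrow/go/v14/parquet/compress"
       "github.com/apache/arrow/go/v14/parquet/pqarrow"
   )

   // ingestViaStage writes each incoming Arrow batch to a Parquet file, uploads
   // all of the files to one temporary stage, and loads the target table with a
   // single COPY INTO.
   func ingestViaStage(ctx context.Context, db *sql.DB, rdr array.RecordReader, table string) error {
       const stage = "ADBC_INGEST_STAGE" // hypothetical stage name
       if _, err := db.ExecContext(ctx, "CREATE TEMPORARY STAGE "+stage); err != nil {
           return err
       }

       props := parquet.NewWriterProperties(parquet.WithCompression(compress.Codecs.Snappy))
       for i := 0; rdr.Next(); i++ {
           rec := rdr.Record()
           path := fmt.Sprintf("/tmp/adbc_batch_%d.parquet", i)
           f, err := os.Create(path)
           if err != nil {
               return err
           }
           // Write the Arrow batch straight to Parquet; no per-value Go conversion.
           w, err := pqarrow.NewFileWriter(rec.Schema(), f, props, pqarrow.DefaultWriterProps())
           if err != nil {
               return err
           }
           if err := w.Write(rec); err != nil {
               return err
           }
           if err := w.Close(); err != nil {
               return err
           }
           _ = f.Close() // harmless if the writer already closed the file

           // Every file lands in the same stage rather than one stage per batch.
           put := fmt.Sprintf("PUT file://%s @%s AUTO_COMPRESS=FALSE", path, stage)
           if _, err := db.ExecContext(ctx, put); err != nil {
               return err
           }
       }

       // One COPY loads all the staged files into the target table.
       copySQL := fmt.Sprintf(
           "COPY INTO %s FROM @%s FILE_FORMAT = (TYPE = PARQUET) MATCH_BY_COLUMN_NAME = CASE_INSENSITIVE",
           table, stage)
       _, err := db.ExecContext(ctx, copySQL)
       return err
   }
   ```

   The files could likely also be uploaded concurrently or streamed rather than
written to local disk first, but the shape of the flow is the same: many Parquet
files, one stage, one COPY.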
   
   Open question: Does `adbc_ingest` need to optimize ingestion of small tables
as well? Currently the connector uses a single `INSERT` query, without staging
any files, for very small tables. Using COPY unconditionally _might_ not perform
well in that scenario. Perhaps we can start with COPY in all cases and add better
handling for small tables later if it actually turns out to be a problem.
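
   If small tables do turn out to matter, one option is a simple row-count cutoff
between the existing bound-`INSERT` path and the staged COPY path; a hypothetical
sketch (the threshold is made up and would need benchmarking):

   ```go
   // Hypothetical heuristic: below some row count, keep the connector's bound
   // INSERT path; at or above it, stage Parquet files and COPY. The threshold is
   // a placeholder and would need benchmarking against real workloads.
   const smallIngestRowThreshold = 10_000

   func useStagedCopy(totalRows int64) bool {
       return totalRows >= smallIngestRowThreshold
   }
   ```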

