This is an automated email from the ASF dual-hosted git repository.
lidavidm pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-adbc.git
The following commit(s) were added to refs/heads/main by this push:
new 43946863 fix(c/driver/sqlite): Wrap bulk ingests in a single
begin/commit txn (#910)
43946863 is described below
commit 4394686313467bfc6025bf96d38ad94d1e46482d
Author: Solomon Choe <[email protected]>
AuthorDate: Tue Jul 25 10:31:57 2023 -0700
fix(c/driver/sqlite): Wrap bulk ingests in a single begin/commit txn (#910)
Fixes #466 by having a single begin/commit txn for ingesting tables,
instead of committing once per row.
# Testing
## R
I first installed the [SQLite R
driver](https://arrow.apache.org/adbc/main/driver/sqlite.html) and
measured the time it takes to bulk ingest the `nycflights13` dataset,
noting the 336776 rowcount:
```r
library(adbcdrivermanager)
db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = "test.db")
con <- adbc_connection_init(db)
flights <- nycflights13::flights
flights$time_hour <- NULL
stmt <- adbc_statement_init(con, adbc.ingest.target_table = "flights")
adbc_statement_bind(stmt, flights)
start_time <- Sys.time()
adbc_statement_execute_query(stmt)
end_time <- Sys.time()
[1] 336776
```
As well as the time it takes to execute the query:
```r
end_time - start_time
Time difference of 1.711345 mins
```
As a followup, I noticed it takes significantly longer (~30 minutes) to
execute the query on my XPS 15 9520:
- Ubuntu 22.04.2
- kernel 5.19.0
- i9-12900HK @ 4.90 GHz
- Intel Alder Lake-P
- 64 GB RAM
Compared to my Macbook Air (1.711345 minutes):
- macOS Ventura 13.4.1
- Apple M2
- 16 GB RAM
Both on the same version of R 4.3.1 (Beagle Scout).
After making my changes, I ran `R CMD INSTALL . --preclean` in
`arrow-adbc/r/adbcdrivermanager`. I also installed the following R
packages:
```r
# install.packages("devtools")
# install.packages("pkgbuild")
```
After which I ran the following commands to validate no build / compile
issues showing up as R packaged my changes:
```r
devtools::build()
devtools::check()
devtools::install()
```
Noting that the file I made changes to showed up as a vendored file:
```r
Vendoring files from arrow-adbc to src/:
- ../../adbc.h
- ../../c/driver/sqlite/sqlite.c
- ../../c/driver/sqlite/statement_reader.c
- ../../c/driver/sqlite/statement_reader.h
- ../../c/driver/sqlite/types.h
- ../../c/driver/common/utils.c
- ../../c/driver/common/utils.h
- ../../c/vendor/nanoarrow/nanoarrow.h
- ../../c/vendor/nanoarrow/nanoarrow.c
- ../../c/vendor/sqlite3/sqlite3.h
- ../../c/vendor/sqlite3/sqlite3.c
All files successfully copied to src/
```
After packaging and installing my changes, I ran through the same bulk
ingest commands for the `nycflights13` dataset and verified that the
table contained the same number of rows as the previous run, also noting
the speedup from 1.7 minutes to 0.2 seconds:
```r
...
> start_time <- Sys.time()
adbc_statement_execute_query(stmt)
end_time <- Sys.time()
[1] 336776
> end_time - start_time
Time difference of 0.2236128 secs
```
---
c/driver/sqlite/sqlite.c | 5 +++++
1 file changed, 5 insertions(+)
diff --git a/c/driver/sqlite/sqlite.c b/c/driver/sqlite/sqlite.c
index 6d0e38ea..5521f99a 100644
--- a/c/driver/sqlite/sqlite.c
+++ b/c/driver/sqlite/sqlite.c
@@ -1101,7 +1101,10 @@ AdbcStatusCode SqliteStatementExecuteIngest(struct
SqliteStatement* stmt,
AdbcStatusCode status = SqliteStatementInitIngest(stmt, &insert, error);
int64_t row_count = 0;
+ int is_autocommit = sqlite3_get_autocommit(stmt->conn);
if (status == ADBC_STATUS_OK) {
+ if (is_autocommit) sqlite3_exec(stmt->conn, "BEGIN TRANSACTION", 0, 0, 0);
+
while (1) {
char finished = 0;
status =
@@ -1120,6 +1123,8 @@ AdbcStatusCode SqliteStatementExecuteIngest(struct
SqliteStatement* stmt,
}
row_count++;
}
+
+ if (is_autocommit) sqlite3_exec(stmt->conn, "COMMIT", 0, 0, 0);
}
if (rows_affected) *rows_affected = row_count;