This is an automated email from the ASF dual-hosted git repository.

lidavidm pushed a commit to branch main
in repository https://gitbox.apache.org/repos/asf/arrow-adbc.git


The following commit(s) were added to refs/heads/main by this push:
     new 43946863 fix(c/driver/sqlite): Wrap bulk ingests in a single 
begin/commit txn (#910)
43946863 is described below

commit 4394686313467bfc6025bf96d38ad94d1e46482d
Author: Solomon Choe <[email protected]>
AuthorDate: Tue Jul 25 10:31:57 2023 -0700

    fix(c/driver/sqlite): Wrap bulk ingests in a single begin/commit txn (#910)
    
    Fixes #466 by having a single begin/commit txn for ingesting tables,
    instead of committing once per row.
    
    # Testing
    ## R
    I first installed the [SQLite R
    driver](https://arrow.apache.org/adbc/main/driver/sqlite.html) and
    measured the time it takes to bulk ingest the `nycflights13` dataset,
    noting the 336776 rowcount:
    ```r
    library(adbcdrivermanager)
    db <- adbc_database_init(adbcsqlite::adbcsqlite(), uri = "test.db")
    con <- adbc_connection_init(db)
    
    flights <- nycflights13::flights
    flights$time_hour <- NULL
    
    stmt <- adbc_statement_init(con, adbc.ingest.target_table = "flights")
    adbc_statement_bind(stmt, flights)
    
    start_time <- Sys.time()
    adbc_statement_execute_query(stmt)
    end_time <- Sys.time()
    [1] 336776
    ```
    As well as the time it takes to execute the query:
    ```r
    end_time - start_time
    Time difference of 1.711345 mins
    ```
    
    As a followup, I noticed it takes significantly longer (~30 minutes) to
    execute the query on my XPS 15 9520:
    - Ubuntu 22.04.2
    - kernel 5.19.0
    - i9-12900HK @ 4.90 GHz
    - Intel Alder Lake-P
    - 64 GB RAM
    
    Compared to my Macbook Air (1.711345 minutes):
    - macOS Ventura 13.4.1
    - Apple M2
    - 16 GB RAM
    
    Both on the same version of R 4.3.1 (Beagle Scout).
    
    After making my changes, I ran `R CMD INSTALL . --preclean` in
    `arrow-adbc/r/adbcdrivermanager`. I also installed the following R
    packages:
    ```r
    # install.packages("devtools")
    # install.packages("pkgbuild")
    ```
    
    After which I ran the following commands to validate no build / compile
    issues showing up as R packaged my changes:
    ```r
    devtools::build()
    devtools::check()
    devtools::install()
    ```
    Noting that the file I made changes to showed up as a vendored file:
    ```r
    Vendoring files from arrow-adbc to src/:
    - ../../adbc.h
    - ../../c/driver/sqlite/sqlite.c
    - ../../c/driver/sqlite/statement_reader.c
    - ../../c/driver/sqlite/statement_reader.h
    - ../../c/driver/sqlite/types.h
    - ../../c/driver/common/utils.c
    - ../../c/driver/common/utils.h
    - ../../c/vendor/nanoarrow/nanoarrow.h
    - ../../c/vendor/nanoarrow/nanoarrow.c
    - ../../c/vendor/sqlite3/sqlite3.h
    - ../../c/vendor/sqlite3/sqlite3.c
    All files successfully copied to src/
    ```
    
    After packaging and installing my changes, I ran through the same bulk
    ingest commands for the `nycflights13` dataset and verified that the
    table contained the same number of rows as the previous run, also noting
    the speedup from 1.7 minutes to 0.2 seconds:
    ```r
    ...
    > start_time <- Sys.time()
    adbc_statement_execute_query(stmt)
    end_time <- Sys.time()
    [1] 336776
    > end_time - start_time
    Time difference of 0.2236128 secs
    ```
---
 c/driver/sqlite/sqlite.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/c/driver/sqlite/sqlite.c b/c/driver/sqlite/sqlite.c
index 6d0e38ea..5521f99a 100644
--- a/c/driver/sqlite/sqlite.c
+++ b/c/driver/sqlite/sqlite.c
@@ -1101,7 +1101,10 @@ AdbcStatusCode SqliteStatementExecuteIngest(struct 
SqliteStatement* stmt,
   AdbcStatusCode status = SqliteStatementInitIngest(stmt, &insert, error);
 
   int64_t row_count = 0;
+  int is_autocommit = sqlite3_get_autocommit(stmt->conn);
   if (status == ADBC_STATUS_OK) {
+    if (is_autocommit) sqlite3_exec(stmt->conn, "BEGIN TRANSACTION", 0, 0, 0);
+
     while (1) {
       char finished = 0;
       status =
@@ -1120,6 +1123,8 @@ AdbcStatusCode SqliteStatementExecuteIngest(struct 
SqliteStatement* stmt,
       }
       row_count++;
     }
+
+    if (is_autocommit) sqlite3_exec(stmt->conn, "COMMIT", 0, 0, 0);
   }
 
   if (rows_affected) *rows_affected = row_count;

Reply via email to