[sqlite] Clarification on "No Isolation On Same Db Connection"

Stephan Mueller Wed, 07 Sep 2016 16:41:51 -0700

Hello all;

I have a scenario that is well covered by the text at 
https://www.sqlite.org/isolation.html in the section "No Isolation Between 
Operations On The Same Database Connection" (relevant paragraphs at end for 
convenience) but am hoping for a bit more clarity.  Here's an overview of my 
scenario.  I'm trying to find the right level of detail; if I've failed on the 
side of too little, I'm happy to provide more.


I have a non-trivial SELECT simulating an outer join 
(http://stackoverflow.com/questions/1923259/full-outer-join-with-sqlite) with 
the overall structure

        SELECT columns FROM t1 LEFT JOIN t2 ON t1.column = t2.column
            WHERE conditions
        UNION ALL
        SELECT columns FROM t2 LEFT JOIN t1 ON t2.column = t1.column
            WHERE t1.column IS NULL
            AND more_conditions
        ORDER BY column

The application fetches a row at a time, doing an UPDATE or INSERT or DELETE 
into t1 (henceforth referred to just  as 'updates') for each row fetched.  I 
_want_ the semantics to be that the SELECT is unperturbed by my updates.  
(Details: there are actually three tables.  The joined-on column is UNIQUE 
(indeed, the PRIMARY KEY) in each table.  The query conditions are such that 
column is UNIQUE in the result set when not doing any updates while iterating.)

I understand that a way to ensure "SELECT is unperturbed" semantics is to use 
separate connections for SELECT and updates.   This is undesirable since I'd 
have to (IIUC) do all my updates (possibly millions) in a single transaction.  
I'd prefer to commit after each update (to avoid redoing any work in case of 
unnatural program abort, for example).  The compromise for performance is to 
commit every N (4096, currently) updates, but potentially millions seems right 
out.

Another way is to fetch all the results into a temporary table tt, then iterate 
over tt while modifying t1.   This is easy to code, and SQLite can fill tt with 
a single INSERT INTO tt SELECT <as above>.  With potentially millions of rows, 
however, I'm rather an approach that doesn't spoil the interactive user 
experience by adding seconds spent on creating the temporary table, if that can 
be avoided.

I'm thus led to the following notion (apologies for my pseudo-code, which is 
neither SQL, nor C, nor the Perl I'm actually coding in):

        prepare SELECT from above
        execute()
        $last = "";
        while (fetchnext) {
                if $fetched.column <= $last                (1)
                        skip to next iteration of loop      (2)
                $last = $fetched.column                       (3)
                update-or-insert-or-delete
        }

That is, if I ever receive a record that ought to have arrived earlier because 
of ORDER BY, it must be a since-SELECT-began update, and should be ignored.

I'm trying to not think in terms of potential implementation, and stick with 
"what must ORDER BY guarantee?"   I'd imagine if it ever returns a recent 
update, it would do so at the proper point in sort order.  If it's decided that 
an "alphabetically-prior-to-where-we-have-been-returning-records recent update" 
record must be returned, it would be returned ASAP -- as if the ORDER BY 
guarantee was that "all remaining records are sorted, and the lowest of those 
returned next".

Is my thinking sound?  -- are the numbered lines in my pseudo-code necessary 
and/or sufficient to achieve the goal of ignoring updates in this case, in a 
righteously precise use of SQL?

Thanks in advance for your thoughts!

Here are selections from the relevant SQLite documentation paragraphs to 
provide context, with comments/queries interspersed.

      No Isolation Between Operations On The Same Database Connection

      ... What if a SELECT statement is started and the 
sqlite3_step()<https://www.sqlite.org/c3ref/step.html> interface steps through 
roughly half of its output, then some 
UPDATE<https://www.sqlite.org/lang_update.html> statements are run by the 
application that modify the table that the SELECT statement is reading, then 
more calls to sqlite3_step()<https://www.sqlite.org/c3ref/step.html> are made 
to finish out the SELECT statement? Will the later steps of the SELECT 
statement see the changes made by the UPDATE or not? The answer is that this 
behavior is undefined.  In particular, whether or not the SELECT statement sees 
the concurrent changes depends on which release of SQLite is running, the 
schema of the database file, whether or not 
ANALYZE<https://www.sqlite.org/lang_analyze.html> has been run, and the details 
of the query. In some cases, it might depend on the content of the database 
file, too. There is no good way to know whether or not a SELECT statement will 
see changes that were made to the database by the same database connection 
after the SELECT statement was started. And hence, developers should diligently 
avoid writing applications that make assumptions about what will occur in that 
circumstance.
When talking about the C language specification, it's traditional to claim that 
"this behaviour is undefined" means "the C compiler is free to format your hard 
drive".  In this context, I hope it means "it is not specified whether later 
steps of the SELECT statement will see the changes made by the UPDATE, but the 
database is not at risk of corruption due to updates-while-iterating".  Text 
below suggests my hopeful interpretation is correct.

Formalizing my query above, I think I'm asking: given I would retrieve records 
1..N iterating over the SELECT results when making no modifications while 
iterating, is it reasonable to assume that with 
updates-to-previously-returned-records-while-iterating, I am still guaranteed 
to receive at least the same records 1..N, in the same order (assuming the 
ORDER BY clause as in my query), possibly (but not necessarily) with additional 
records interspersed, and any interspersed records would need to respect the 
ORDER BY clause as well (which means "in sorted order for 'higher' than 
most-recently-returned, and 'as soon as possible' for 'lower or equal to' 
most-recently returned?

      If an application issues a SELECT statement on a single table like 
"SELECT rowid, * FROM table WHERE ..." and starts stepping through the output 
of that statement usingsqlite3_step()<https://www.sqlite.org/c3ref/step.html> 
and examining each row, then it is safe for the application to delete the 
current row or any prior row using "DELETE FROM table WHERE rowid=?". It is 
also safe (in the sense that it will not harm the database) for the application 
to delete a row that expected to appear later in the query but has not appeared 
yet. If a future row is deleted, however, it might happen that the row turns up 
after a subsequent sqlite3_step(), even after it has allegedly been deleted. Or 
it might not. That behavior is undefined. The application can also INSERT new 
rows into the table while the SELECT statement is running, but whether or not 
the new rows appear in subsequent sqlite3_step()s of the query is undefined. 
And the application can UPDATE the current row or any prior row, though doing 
so might cause that row to reappear in a subsequent sqlite3_step(). As long as 
the application is prepared to deal with these ambiguities, the operations 
themselves are safe and will not harm the database file.
This paragraph focuses on an example with a single table, perhaps because it's 
easy to imagine iterating over this with a simple cursor.  I can also imagine 
that more complex queries will involve more complex data structures to manage 
the iteration within SQLite, and so have a more complex relationship with what 
rows get returned.  Rather than trying to explain (and then have to maintain 
compatibility) I fully understand the point of "assume nothing" here.  
Ultimately, I want to know enough to robustly deal with these ambiguities as 
the last sentence suggests.

many thanks,
stephan();



_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users

[sqlite] Clarification on "No Isolation On Same Db Connection"

Reply via email to