[sqlite] Customizing the location of the .sqlite_history

2016-02-02 Thread James K. Lowden
On Sun, 31 Jan 2016 21:46:18 -0800
Jes Slow  wrote:

> Many applications do this by allowing the user to set an environment
> variable to customize the location, altho personally I would prefer
> another way since environment variables are also global. 

Global?  Environment variables are per-process, and changeable by the
process with putenv(3).  

--jkl


[sqlite] Documentation comment: VFS

2016-02-02 Thread James K. Lowden
On Mon, 1 Feb 2016 14:24:28 +0200
R Smith  wrote:

> > does it say what "VFS" stands for.  Please don't tell me...//
> 
> I'm going to ignore the request and tell you anyway: It clearly
> stands for "Variably F-Syncing Software".

Oh, come now.  In the tradition of RTFM, surely it's 

Very Fine Software

--jkl





[sqlite] Bug: Successfully committed transaction rolled back after power failure

2016-02-02 Thread James K. Lowden
On Sun, 31 Jan 2016 20:27:56 -0700
Scott Robison  wrote:

> On Sun, Jan 31, 2016 at 7:35 PM, Rowan Worth  wrote:
> 
> > On 31 January 2016 at 03:56, James K. Lowden
> >  wrote:
> >
> > > Surely SQLite does both -- fsync on file and directory -- as part
> > > of a commit. That's not in doubt, is it?
> > >
> >
> > No, SQLite does not. On COMMIT it fsyncs the database file and
> > unlinks the journal[1], but does not fsync the directory. This is
> > demonstrated by the logs in Stefan's initial post, and indeed the
> > purpose of his patch was to introduce an fsync on the directory
> > after unlinking the journal.
> >
> 
> Perhaps there could be a "paranoid" journal mode, where it first
> zeros out the header ala persist, then truncates the file, then
> deletes the file.

Well, I would say there are four choices:

1.  Something like you suggest, but just mark the "committed
fact" in the log file.  It would be enough to write "we're done here"
at the end of the file, followed by fsync, followed by unlink.  Then if
the log file is found (undeleted) after a crash, the information in the
file suffices.  

2.  Call fsync after unlink.  

3.  Call fsync after unlink if directory's extended attributes don't
include FS_DIRSYNC_FL *or* the filesystem was mounted with MS_DIRSYNC.  

4.  For systems that support it (Linux 2.6+) just go ahead and set
FS_DIRSYNC_FL.  

The 3rd option underscores the pitfalls of directory synchronization
under Linux.  NetBSD for example has neither option; afaik all
directory operations are synchronous.  (Posix doesn't mention it.)  

Personally, I like to keep things simple.  Update & fsync the log file
one last time before deleting it.  Works on every system, and is
impervious to future permutations of directory synchronization.  
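
For the record, option 1's tail end is only a few lines of POSIX C.  A
sketch with my names, not SQLite's; it assumes fd is the open journal,
positioned at its end:

#include <unistd.h>

/* Option 1: mark the commit in the journal itself, make the mark
 * durable, and only then unlink.  A journal found after a crash is
 * then self-describing. */
int finish_commit(int fd, const char *journal)
{
  static const char done[] = "we're done here";

  if (write(fd, done, sizeof done - 1) != (ssize_t)(sizeof done - 1))
    return -1;
  if (fsync(fd) == -1)       /* the mark is on the platter */
    return -1;
  close(fd);
  return unlink(journal);    /* durability of the unlink no longer matters */
}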


[sqlite] Wish List for 2016: High Level API for Object Oriented Interactive Languages

2016-01-31 Thread James K. Lowden
On Sat, 30 Jan 2016 23:03:29 +
Simon Slavin  wrote:

> On 30 Jan 2016, at 8:13pm, Yannick Duchêne 
> wrote:
> 
> > In my opinion (which some others share), OO is a bag of
> > miscellaneous things which are better tools and better understood
> > when accosted individually. Just trying to define what OO is, shows
> > it: is this about late binding? (if it is, then there sub-program
> > references, first-class functions, or even static polymorphism and
> > signature overloading) About encapsulation? (if it is, then there
> > is already modularity and scopes) About grouping logically related
> > entities? (if it is, there is already modularity, and sometime
> > physically grouping is a bad physical design).
> 
> There are a number of problems in using a relational database for
> object-oriented purposes.  One is that to provide access to stored
> objects you need to access the database in very inefficient ways
> which are slow and are not helped by caching.  You can read about
> some of the problems here:
> 
> 

To the extent "impedance mismatch" is real, it's a matter of looking
through the wrong end of the telescope.  

Programming languages have almost universally ignored relations, logic,
and constraints, leaving programmers with primitives, pointers, and
loops.   Which is cause and which effect?  Do programmers ignorant of
set theory demand primitive languages?  Or do primitive languages beget
ignorant programmers?  I don't know.  What I do know is that a
programming language with built-in support for relational concepts
remains to be invented.  

--jkl


[sqlite] Wish List for 2016: High Level API for Object Oriented Interactive Languages

2016-01-31 Thread James K. Lowden
On Sat, 30 Jan 2016 20:50:17 -0500
Jim Callahan  wrote:

> I am not interested in a complete ORM; what I am interested is when
> the object-oriented language supports a SQL-R-like object. In R, the
> object is called a data.frame and the package "Pandas" supplies a
> similar data frame object to Python.
> 
> R as I have mentioned has fantastic interfaces to SQL databases that
> allow one to pass a query and have the result populate a data frame.
> The data frame in R or Python can be fed to a machine learning
> algorithm (scikit learn on Python) or to a plotting package such as
> ggplot or bokeh.

OK, good.  It sounds like what you're really interested in is the
simplest function possible to move data from SQLite into Pandas.  (I'll
stick with Python, since I know it better than R.)  

But you already have pandas.read_sql_query.  While that function
isn't really what I'd call simple, the complexity afaict -- dates,
floats, and chunks -- can be laid at Python's feet.  At a minimum, you
have to specify the SQL and the column names you want to use in
Pandas.  

> SQLAlchemy package ... did not understand SQLite VIEWS and one had to
> write an explicit loop to build the data frame

You don't need to use SQLAlchemy with SQLite.  And you don't need to
write loops to move query results into a Pandas DataFrame.  

So, is the problem solved, or am I missing something?  

--jkl



[sqlite] Bug: Successfully committed transaction rolled back after power failure

2016-01-31 Thread James K. Lowden
On Sat, 30 Jan 2016 20:00:19 +
Simon Slavin  wrote:

> On 30 Jan 2016, at 7:56pm, James K. Lowden 
> wrote:
> 
> > Given that the fsync has returned successfully, I don't know of any
> > hardware that then will take 1000 ms to complete the write.  That's
> > the basis for my "subsecond interval" assumption. 
> 
> Writing to a RAID which has other write commands queued.

As Keith says, you get what you pay for.  I would say that hardware
that behaves as you describe is broken.  Hardware must expose some
kind of "yes, definitely written" notification interface.  If it does,
that's what fsync abstracts.  If it doesn't, it's broken, because
"write" only means, "whatever, dude".  

A 7200 RPM drive is capable of ~75 I/O operations/second.  To reach my
1-second threshold, the RAID controller would have to have that many
queued operations.  Not only that: it would have to have the chutzpah
to claim to have written the data, knowing it couldn't guarantee its
promise.  Is Bernie Madoff consulting for Seagate?  

You might know more about this than I do.  The last time I dealt with
anything in the vicinity, write-queue depths of 10 were considered
problematic.  100 would have been catastrophic.  

Stefan Meinlschmidt mentioned "on Linux the interval can be, IIRC, up
to 30s".  I think that's a point of confusion.  That 30s sounds to me
suspiciously like the kernel's sync checkpoint interval.  Surely
buffered writes inside the device are not subject to "up to" 30s; the
device (in general) has no such timer.  

--jkl


[sqlite] Find out how many times does SQLite hit the disk?

2016-01-31 Thread James K. Lowden
On Sun, 31 Jan 2016 03:01:30 +0530
dpb  wrote:

> This will help me in deciding if moving to In-memory SQLite will
> improve my application performance. I am done with adding indexes to
> my tables in SQLite DB.

I second Simon's question.  If SQLite isn't fast enough, a good
starting assumption is that it's not being used as efficiently as
possible.  

Remember that problems at the SQL level can introduce delays that are
orders of magnitude greater than the difference in speed between memory
and disk.  That's especially true if your database is small enough to
consider switching to in-memory.  

--jkl


[sqlite] IS a SQLite db of small size as good as reliable cache?

2016-01-31 Thread James K. Lowden
On Sat, 30 Jan 2016 22:23:24 +
Howard Chu  wrote:

> Note that the cache eviction runs quite frequently - once every 5
> seconds or so, and evicts pages regardless of whether there's any
> memory pressure in the system. It's quite possibly the stupidest
> cache manager ever written.

Any insight into what they were thinking?  Back when I used Windows
daily, it used to annoy me that every morning the machine had to warm
up again, to revive the state I'd left it in the night before.  In
NetBSD I learned that unused memory is wasted memory, so why not use it?  

I have a feeling that "fast app launching" is the reason, as though
Windows users were excitedly punching the Start button with a
stopwatch.  But maybe there's more to it than that?  

--jkl



[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread James K. Lowden
On Sun, 31 Jan 2016 17:39:28 +0100
Yannick Duchêne  wrote:

> I saw a page (can't retrieve the URL) suggesting to order table
> columns by names. It was strange to me, as I had the idea of a
> hierarchical access for tables access. But I thought "there must be a
> good reason for them to say this". 

On average, the quality of advice on the Internet is average, for SQL
doubly so.  Some of it is truly terrible.  

Because the advice on this list is peer-reviewed (in the sense that
people who wrote the code participate), it tends to be very good.  

My advice, ahem, is to choose chemistry over alchemy.  If you don't
understand why the advice is well founded, keep checking until you do.
If sometimes good foundations are a little mysterious, the contrary is
not true: unfounded assumptions flush out early.  

I want to answer your question a little bit abstractly, and then circle
back to SQLite.  

We know how the table is physically ordered.  But there's no WHERE
clause; the whole table will be scanned.  Building an output table for
the aggregates will be required regardless.  The only difference would
be if the cardinality of a, b, and c were different.  That is, if
GROUP BY A produces many fewer rows than GROUP BY B, we would expect it
to run faster. Otherwise it's an artifact of the implementation, not
something inherent in the problem.  

Yet your counts indicate the opposite: GROUP BY B is the smallest
output.  How to explain?  Or, um, EXPLAIN?  

Consider that GROUP BY A will cause each A output row to be summed
using the input (from t) sequentially.  As the input is consumed, we
move to the next output row whenever A changes.  There's no seeking
from one output row to another.  

For GROUP BY B, each input row, read sequentially, means a seek
(logically speaking) to the appropriate output row.  If the cost of
that seek is not O(1), it will add to the time used to create the
output.  Because SQLite tables are based on B+ trees, that seek cost
is O(log2 n).  

I'd say that's the source of the difference you're seeing. EXPLAIN
shows there's an optimization for GROUP BY A, probably because the
output can be constructed sequentially.  

And that's a defensible choice, because aggregation-seeking isn't
usually a high cost (as in fact it's not in your case, either).

Aggregations by definition are reductions. Outputs are usually small,
because otherwise the GROUP BY produces incomprehensible results.
10,000 rows of aggregated output isn't unheard of, but it's unusually
many, and to find one row in 10,000 in a binary tree requires at most
14 comparisons (2^14 > 10,000).  More commonly outputs are in the
hundreds of rows, and
need half that many.  It could be faster, but only at the expense of
code complexity and memory footprint.  

HTH, to see the lite.  

--jkl


[sqlite] Is the first column of a composite primary key, special?

2016-01-31 Thread James K. Lowden
On Sun, 31 Jan 2016 10:45:59 -0700
"Keith Medcalf"  wrote:

> Hierarchical, Network, and Network Extended database models use
> pointers in sets rather than duplicating the data.  This makes them
> orders of magnitude faster (when properly designed) than a Relational
> Model database, 

I was cheering you on ... 


> but means that there is no recovery possible where a
> pointer-chain becomes corrupted -- with the relational model
> everything has a copy (multiple duplicates) of the data so you just
> drop the corrupted thing and re-create it.

... but I have to take exception to your characterization of
the theory.  :-(  

The relational model, as you well know, doesn't describe
implementation.  It's math.  It says what relations are, and how
they're manipulated.  One data type, and an algebra closed over that
domain.  Math doesn't have pointers or duplication or corruption; those
are computer concepts, and as such are entirely outside the model.  

For example, nothing in the model prohibits or describes using
compression to minimize I/O, or a hash of interned strings.  It's up to
the implementation to find the best way to support relational
operations, and to define "best".  SQLite, after all, supports in-memory
databases, about as fragile and unrecoverable a thing as imaginable! 

In explaining the relational model, it's true Codd does mention
pointers by way of contrasting the relational model with the others you
mentioned.  Very much intentionally, the relational model consists only
of values: every join is "value based", meaning the join of A to B is
expressed strictly in terms of the values in each one.   It does not
matter if A is the "parent" and B is the "child"; the syntax is the
same either way.  

We might say Codd's hand was forced, in that *math* is value-based.  But
he emphasized it as a feature that should be manifested in the query
language, and SQL consequently was (and for the most part, still is)
value-based.  That choice is for the user's sake, because it's easier
for humans to reason about values than about pointers.  

In those old pre-relational systems -- names for which Codd had to come
up with, by the way, because they had no "model" per se, no math --
relationships between "tables" (to use a modern term) were expressed by
pointers of some kind.  The connection was manifested in the database.
If you followed it in your application, you got DBMS support, and it
was simple(ish) and fast. If you wanted an unsupported connection --
count of orders by widget, say -- you were forced to write loops, and
well advised to get coffee while the query ran.  

When we say the relational model "has no pointers", we're referring to
the user's interaction with the data.  All tables are created equal,
you might say, and all joins are the same.  That's the simplification
that permits the advice you so often give: to express the problem
logically.  Pre-relational systems offered no such option.  

We now return you to your regularly scheduled programming.  

--jkl


[sqlite] Wish List for 2016: High Level API for Object Oriented Interactive Languages

2016-01-30 Thread James K. Lowden
On Thu, 28 Jan 2016 16:47:40 -0500
Jim Callahan  wrote:

> I am hopeful this new JDBC based interface will provide as
> satisfactory high level channel between SQLite3 and Python.

As someone who's written a couple of OO DBMS libraries and uses the
Python SQLite module, I wonder what you're hoping to see.  What example
do you have in mind for OO functionality not already there?

ODBC was created for one reason only: to define a binary interface to
DBMS client libraries.  Afaik the same is true for JDBC. 

That interface-to-a-library property has two downsides: 

1.  Configuration complexity.  With SQLite, there is no configuration
file.  Most of the features of a connection -- server, port, username,
password, etc. -- don't apply.  When a connection fails, it takes some
expertise to tease out whether the problem lies in the ODBC
configuration, the client library, the server, or the network.  With
JDBC you have still more: the added complexity of Java setup.  

2.  Opacity.  ODBC is near the top of my list for vague error
messages.  Favorite: "General ODBC error".  If I ever meet General
ODBC, I'm going to give him a piece of my mind, if I have any left.  

The second is terribly important.  Take error handling.  Instead of
getting back, say, SQLITE_BUSY, and RTFM, you get an ODBC error and
then maybe, with a little more work, the underlying status code and
message.  And you're *still* not done, because did I mention the ODBC
driver is opaque?  Your ODBC call uses the native library in
unspecified, undocumented ways.  When you get the native error status,
you then have to reason about what the ODBC driver must have done, and
from there back to what you need to do.  It's a whole layer of black
magic and obfuscation that only lengthens your day.  

And it's not just errors.  How do you map the ODBC API onto SQLite's
API?  (I've never seen an ODBC driver document the relationship of the
ODBC functions to the native ones.) What to do with sqlite3_exec or
sqlite3_blob_open? If you know the SQLite API, you'll spend quite a bit
of time discovering how it's been mapped onto the ODBC API.  And when
you're done, you'll discover pieces missing.  

You already have one layer of mediation in the Python sqlite module.
That module is thankfully a pretty thin veneer over the C API, and the
errors it produces can be straightforwardly traced to the C function it
exposes.  You have the convenience of using the connection as a context
manager, of row objects (although dict would have been better), of
fetchall.  What sort of OO garnish would you add?  

--jkl



[sqlite] Bug: Successfully committed transaction rolled back after power failure

2016-01-30 Thread James K. Lowden
On Thu, 28 Jan 2016 08:00:08 +
Meinlschmidt Stefan  wrote:

> > But I ask you, what action could the application possibly take, in
> > that subsecond interval, that it matters?
> 
> Under the QNX OS using a QNX6 filesystem with default configuration,
> that "subsecond interval" is actually up to 10s. For any
> non-journalling filesystem (SD cards, anyone?) mounted without
> immediate write-through (for efficiency) on Linux the interval can
> be, IIRC, up to 30s. 

Are we talking about the same thing?  write(2) makes no guarantee about
when the bits will actually be written.  It transfers the data to the
kernel.  fsync(2) *does* commit the data to media and does not return
until the hardware indicates the write is complete.  

That's the data, the extent of the inode.  For metadata, e.g. mtime,
Linux requires a second fsync, on the containing
directory.  

Surely SQLite does both -- fsync on file and directory -- as part of a
commit. That's not in doubt, is it?  

Given that the fsync has returned successfully, I don't know of any
hardware that then will take 1000 ms to complete the write.  That's the
basis for my "subsecond interval" assumption.  

The OP's question involved Linux, and that I think is covered. As for
QNX, if the default configuration has broken fsync semantics, it's hard
to see what SQLite can do about that.  The DBMS has to rely on the OS to
write the data, and there has to be some way to ascertain that's been
done.  


> The application could for example write to some non-SQLite
> storage  Or (what we did) it could shut off device power.

Yes.  In general it's better not to maintain state in two places; if
the application writes "committed to database" status after committing
to the database, it's going to have to contend with inconsistencies on
startup after a crash.  And your scenario is reasonable: you committed
the data, and now you can turn off the device until it's needed again.  

In such a constrained environment, you need reliable status from the
device.  If you can't get that, you're definitely hosed.  If you can
get it, what does it look like?  Perhaps the SQLite VFS can expose that
as a callback or a knob.  

--jkl




[sqlite] Bug: Successfully committed transaction rolled back after power failure

2016-01-27 Thread James K. Lowden
On Wed, 27 Jan 2016 08:51:16 +
Bernard McNeill  wrote:

> Using the standard defaults (which avoid WAL), is there any
> possibility whatsoever of that last SQL transaction being lost?

I have an unusual answer: Yes, and it doesn't matter.  

Let's suppose, as you did, that the application got back "Commit
Successful" followed quickly by a power failure.  You want to know,
could the transaction be lost anyway?  But I ask you, what action could
the application possibly take, in that subsecond interval, that it
matters?  The failure could have happened an instant earlier, just
before "Commit Successful", and it would have to recover from that.  I
can think of no scenario in which the committed-but-lost transaction is
very different from the uncommitted-and-lost transaction.  

There is no God's-eye view of application state.  The important service
provided by the DBMS is not "what's committed is definitely saved", but
rather that "what's committed is definitely *consistent*".  If on
recovery the DBMS finds that,  contrary to its prior report, the state
of the database does not warrant including the final transaction, it
rolls it back, and presents the data in an internally consistent
state.  The application doesn't have to cope with the ledger being out
of balance and suchlike.  

--jkl


[sqlite] Slight problem with sqlite3_compileoption_get

2016-01-25 Thread James K. Lowden
On Sun, 24 Jan 2016 21:26:41 +
Bart Smissaert  wrote:

>   http://www.cdecl.org/
> 
> It doesn't work though with complex arguments like this:
> void (*xFunc)(sqlite3_context*,int,sqlite3_value**)

In isolation, that's a syntax error because e.g. sqlite3_value is not
defined (on that line, which is the only one being parsed).  This
works: 

void (*xFunc)(long*,int,long**)

declare xFunc as pointer to function (pointer to long, int,
pointer to pointer to long) returning void

but it's hard to say the output is clearer than the input.  

--jkl



[sqlite] Find SQLITE_BUSY reason?

2016-01-22 Thread James K. Lowden
On Fri, 22 Jan 2016 06:24:08 +
Simon Slavin  wrote:

> 
> On 22 Jan 2016, at 4:01am, Rowan Worth  wrote:
> 
> > To a point I agree, but in reality there's a fixed amount of work
> > involved with each write transaction. I recently profiled an
> > operation involving ~75,000 rows that took ~8.5 minutes to
> > complete, and found that 80% of the time was spent waiting for
> > COMMIT to complete. Rewriting the code so that all the work
> > happened in a single transaction immediately dropped the overall
> > time down to ~1.66 minutes.
> 
> Nice stats.
> 
> This is, of course, all about waiting for a rotating disc to be in
> the right place.  4500 rotations per minute is 75 rotations per
> second or 13ms per rotation.  If a transaction which involves a
> single row being written involves five write operations then it can
> take up to 65ms just waiting for the rotating disc.  Call it an
> average of 33ms per transaction.  Do that 75,000 times and you're
> waiting up to 2475 seconds == 40 minutes.

All true, but I think you're exaggerating if you're implying that's
what the user will see.  A call to write(2) doesn't necessarily involve
the rotating media; it merely transfers the data from userspace to the
kernel buffer cache (using Linux as an example).  Even fsync, on
consumer-grade disks, may return when the data have been flushed to the
device's cache, before they come to rest on the platter.  Both buffers
ameliorate the effects of latency and track-to-track seek.  

At the towering height of a SQLite transaction, there's a lot of
machinery between the caller and the platter.  Given the time involved
for the OP's commit (minutes) and the capacity of the raw disk (about
100 MB/s), unless the amount of data is in the 10's of GB (8.5 minutes
at 100 MB/s is ~50 GB), the performance issues are unlikely to lie in
the I/O substrate.  

--jkl


[sqlite] Improving performance of my windows service using SQLite DB to save settings

2016-01-22 Thread James K. Lowden
On Thu, 21 Jan 2016 22:21:31 +0530
Deepak  wrote:

> How can improve the service performance here? Keeping in mind,
> 
> 
>- There will be no new inserts (ignoring addition of few hundred
> URLs once a day while overwriting the older ones). Size of the DB
> will be more or less constant in range 1-2MB or at most 5 MB.
>- Most of the times service is reading values.
>- Another service update some of these settings. (say once an hour)
> 
> 
> Intention:
> Super fast response to the user.

From your description, I honestly think you're barking up the wrong
tree.  How fast is it now, and how fast would "super fast" be?  How
much complexity are you willing to tolerate to get there?  

To answer your question: since the data are updated infrequently, you
could build your own cache in program memory using a data structure of
your own choosing.  A hash table, for instance, has O(1) performance
and no I/O.  Keeping 5 MB of data in memory and hashing the keys is well
within the capacity of most Windows systems.  You can even use
WaitForSingleObject to recognize when the cache needs to be rebuilt.  
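
A minimal sketch of such a cache, for concreteness: djb2 hash, linear
probing, fixed table size.  Resizing, deletion, and the rebuild
trigger are omitted, and the names are mine (with MSVC, strdup is
spelled _strdup):

#include <stdlib.h>
#include <string.h>

#define NBUCKET 4096   /* power of two, comfortably > item count */

struct setting { char *name; char *value; };
static struct setting cache[NBUCKET];

static unsigned long djb2(const char *s)
{
  unsigned long h = 5381;
  while (*s)
    h = h * 33 + (unsigned char)*s++;
  return h;
}

const char *cache_get(const char *name)
{
  unsigned long i = djb2(name) % NBUCKET;
  while (cache[i].name) {              /* linear probe */
    if (strcmp(cache[i].name, name) == 0)
      return cache[i].value;
    i = (i + 1) % NBUCKET;
  }
  return NULL;                         /* not cached */
}

void cache_put(const char *name, const char *value)
{
  unsigned long i = djb2(name) % NBUCKET;
  while (cache[i].name && strcmp(cache[i].name, name) != 0)
    i = (i + 1) % NBUCKET;
  if (!cache[i].name)
    cache[i].name = strdup(name);
  free(cache[i].value);                /* free(NULL) is harmless */
  cache[i].value = strdup(value);
}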

--jkl





[sqlite] syntax error on SELECT with WHERE EXISTS

2016-01-18 Thread James Walker
On 1/18/2016 1:59 PM, R Smith wrote:
>
>
> On 2016/01/18 11:42 PM, James Walker wrote:
>> Why do I get a syntax error from this SQL?
>>
>> SELECT * FROM SnappedFrames WHERE EXISTS (SELECT 1 FROM
>> FilmSizeOverrides);
>>
>> SQLite says the error is near SELECT, but doesn't say which SELECT. If
>> I say either
>
> It works fine for me - What version of SQLite do you have?
> SELECT sqlite_version();

Aha, thanks, that's probably my problem.  It's SQLite 2.8.17.  Legacy 
code.  I forgot that I wasn't using SQLite 3, or I would have mentioned it.

> Are either of SnappedFrames or FilmSizeOverrides perhaps views rather
> than tables? If not, care to post the schema for the tables?
 >
> What type of program do you use for the SQL - the command-line interface
> or some DB management tool, or is this via your own code perhaps?


Originally my own code, but I did some testing in "Navicat for SQLite Lite".


[sqlite] syntax error on SELECT with WHERE EXISTS

2016-01-18 Thread James Walker
Why do I get a syntax error from this SQL?

SELECT * FROM SnappedFrames WHERE EXISTS (SELECT 1 FROM FilmSizeOverrides);

SQLite says the error is near SELECT, but doesn't say which SELECT.  If 
I say either

SELECT * FROM SnappedFrames WHERE 1;

or

SELECT 1 FROM FilmSizeOverrides;

then there's no error.


[sqlite] POSIX and MMU (was Re: out of the void: xDlSym)

2016-01-17 Thread James K. Lowden
On Sat, 16 Jan 2016 14:21:26 -0700
Scott Robison  wrote:

> > Huh.  An example of which is the "medium model" of the Intel 8086:
> > 20-bit code pointers and 16-bit data pointers.  A machine for which
> > C compilers existed, and on which no Posix system will ever run
> > (because it lacks an MMU).  Thanks for that.
> >
> 
> Sorry for the OT diversion, but I'm just curious as I don't have
> historical POSIX standards for reference. Does POSIX really *require*
> an MMU? Certainly Unix like systems were written for 8086 class
> computers, but given that POSIX was first standardized in 1988 I'm
> just curious as to whether or not an MMU is a requirement or just
> really nice to have.

AFAIK Posix makes no statement regarding hardware.  And Posix systems
do run on "machines" today without an MMU, if you count bochs as a
"machine" and disregard the MMU supplied by the underlying hardware.
Posix requires process separation, which is most efficiently
implemented in hardware, and was definitely needed back in the days of
4.77 MHz processors.  


Xenix ran on the 8086, to be sure.  As best I remember, though, the
Xenix that ran on a stock IBM PC was single-threaded: no fork(2).
Sorry, no links; the www can be skimpy on pre-1990 arcana.  Apparently
there were real multitasking versions of Unix running on the 8086,
where the motherboard included some form of MMU.  I'm sure they
performed like dancing bears.  

--jkl



[sqlite] Updating the contents of a field in table A with the value of the same field in table B

2016-01-17 Thread James K. Lowden
On Sat, 16 Jan 2016 14:26:20 -0700
"Keith Medcalf"  wrote:

> > the result is nondeterministic if more than one row in S matches.
> > The update applies all rows in S matching T.  Of course, only the
> > last one is preserved.  Of course, because order is nonsemantic,
> > there's no way to know *which* S row will "win".
> 
> And the problem is exactly what?  The statement will do exactly and
> precisely what it has been instructed to do.  That it was given bad
> instruction by someone incompetent to be giving instruction is
> irrelevant.  

It's not a "bad instruction" to update one table from another when the
join criteria aren't guaranteed, by DRI, to be 1:1 correspondence.
Either the instruction can be carried out consistent with the
database's constraints, or it cannot.  A logically incoherent
instruction -- such as "update row 1 in T from rows 2-5 in S" -- should
be rejected as erroneous.  The system should not apply row 5 and ignore
2-4 arbitrarily.  

It is a bad implementation that allows an update statement to have
nondeterministic results.  I would think we agree on that.  

> The real problem with implementing this sort of update is that it in
> order to prevent implementation errors (caused by the fact that the
> set algebra is performed one row at a time, and not as a set) you
> have to generate and intermediate update set and then apply that set
> after the fact -- if a column being updated is used in a join
> constraint (but not if it is used in a selection constraint).

I'm not sure I understand.  I guess by "set algebra is performed one
row at a time" you're describing implementation, but of course all
implementation is one row at a time (if that).  Do you simply mean the
semantics of the update preclude efficient implementation?  Sure,
that's something to be aware of.  Is it a "problem", per se?  

I'm on record asking for more correct, even if less efficient,
implementation in SQLite.  Specifically, it shouldn't be necessary to
drop unique constraints to increment consecutive values.  

--jkl


[sqlite] Updating the contents of a field in table A with the value of the same field in table B

2016-01-16 Thread James K. Lowden
On Sat, 16 Jan 2016 10:18:28 +0200
audio muze  wrote:

> Why is it that SQLite does not support a FROM clause in an update
> statement?

I can't answer why, but I can tell you it's fraught with potential
error.  Inventing syntax runs the risk of supporting undesirable
behavior.  

SQL Server has such a syntax.  Unfortunately, when you say:

update T ... from S

the result is nondeterministic if more than one row in S matches.  The
update applies all rows in S matching T.  Of course, only the last one
is preserved.  Of course, because order is nonsemantic, there's no way
to know *which* S row will "win".  

The ANSI standard syntax

> UPDATE target
>SET resource = (SELECT resource
>  FROM source
> WHERE id = target.id)

has the merit that the sub-select must produce 1 row.  If it does not,
the server raises an error to the effect that SET requires a scalar
input.  


--jkl


[sqlite] out of the void: xDlSym

2016-01-16 Thread James K. Lowden
On Fri, 15 Jan 2016 21:41:41 -0500
Richard Damon  wrote:

> there are machines where it doesn't work (you just need a larger
> program space than data space).

Huh.  An example of which is the "medium model" of the Intel 8086:
20-bit code pointers and 16-bit data pointers.  A machine for which C
compilers existed, and on which no Posix system will ever run (because
it lacks an MMU).  Thanks for that.  

Until this very moment I thought the code/data schism of void* was a
"rights reservation" on the part of the compiler writers, permitting
them optimization opportunities.  But there are historical examples,
and one can't rule out future ones.  The warning is attached to
-pedantic for a reason.  

Yet we live in the real present.  I admit there's probably no reason to
make code and data pointers interchangeable.  But it sure would be nice
if the fine folks on the C standards committee would provide a more
convenient syntax, and standardize the existing practice that dlsym(2)
has exemplified for a quarter century.  

--jkl


[sqlite] out of the void: xDlSym

2016-01-15 Thread James K. Lowden
I spent a fair number of hours scrutinizing xDlSym today, and I'd just
like to confirm my understanding.  Despite having worked with C on and
off since the Reagan administration, I was unprepared for 

   void (*(*xDlSym)(sqlite3_vfs*,void*, const char *zSymbol))(void);

IIUC xDlSym is a pointer to a function taking 3 arguments, returning a
pointer to "a function", as it says in src/os_unix.c.  That function
has a peculiar signature, 

void f(void);

You may imagine my resistance.  That's one function I'm sure I've never
needed, nor ever will!  :-)  

The comments also indicate that this definition was created to satisfy
gcc under C90 in pedantic mode.  That suggests that once upon a time
xDlSym would have been defined more conventionally as

   void * (*xDlSym)(sqlite3_vfs*,void*, const char *zSymbol);

but for the never-a-function-pointer-a-void*-shall-be rule of C?

BTW, cdecl.org offers its own interpretation:

"declare xDlSym as pointer to function (pointer to void, pointer
to void, pointer to const char) returning pointer to function (void)
returning void"   

which, for me at least, is one of those answers that makes sense only
after you know the answer.  
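
For what it's worth, a typedef untangles the declaration (a sketch;
dlsym_fn is my name, not SQLite's):

/* The inner type: pointer to a function taking and returning void. */
typedef void (*dlsym_fn)(void);

/* xDlSym: pointer to a function of three arguments returning dlsym_fn;
 * equivalent to the original declaration. */
dlsym_fn (*xDlSym)(sqlite3_vfs*, void*, const char *zSymbol);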

--jkl


[sqlite] Best way to terminate a dead-transaction: commit or rollback?

2016-01-15 Thread James K. Lowden
On Fri, 15 Jan 2016 20:39:15 +0100
Olivier Mascia  wrote:

> write transactions committed after the read transaction began, are
> not seen by that read transaction. Allowing it to have a stable view
> on the whole database.  

As Keith correctly surmised, you need repeatable read.  Never having
had the need, I never explored that aspect of transactions.  (I've
changed isolation levels on other products.  I don't remember the
circumstances well enough to know whether a transaction would have
accomplished the same end.)  

Live & learn.  Thanks for the explanation.  

--jkl



[sqlite] Best way to terminate a dead-transaction: commit or rollback?

2016-01-15 Thread James K. Lowden
On Thu, 14 Jan 2016 16:54:04 +0100
Olivier Mascia  wrote:

> Let a transaction (started with BEGIN TRANSACTION) which did only
> reads. Is it any better to end it by COMMIT TRANSACTION or ROLLBACK
> TRANSACTION, or is it completely insignificant?

Why not do the right thing and remove the begin & commit transaction
statements?  

An explicit transaction implies a unit of work, two or more changes to
the database that must cohere for the database be consistent.
Read-only operations (by which I suppose you mean SELECT) don't modify
the database and don't belong in a transaction.  

--jkl



[sqlite] whish list for 2016

2016-01-12 Thread James K. Lowden
On Fri, 8 Jan 2016 08:28:29 +0100
Dominique Devienne  wrote:

> > One way to do that would be to honor a special user-created table,
> > say "PRAGMAS", with name-value pairs that are automatically applied
> > when the database is opened.
> >
> 
> Interesting idea. A "special"-to-SQLite sqlite_pragmas table, similar
> to sqlite_master. 

Similar in name, yes.  sqlite_master is special in that it's built into
the database file structure, though.  You couldn't add another such
table without changing the file structure, something OFL understandably
doesn't want to do.  

I'm suggesting that the presence of a user-created 
"configuration table" would govern the behavior of the database,
analogous to a configuration file.  

--jkl


[sqlite] whish list for 2016

2016-01-07 Thread James K. Lowden
On Sun, 20 Dec 2015 19:12:39 +0100
Big Stone  wrote:

> Best whishes for 2016!

If it's not too late to join the party...

1.  create table T (t primary key);
update T set t = t+1;

for consecutive values of t.  

2.  Specific constraint references in error messages.  


3.  Correct math.  E.g.: 


$ sqlite3 db 'select typeof(1/0)'

typeof(1/0)
---
null   

should produce an error.  There are other examples, but none comes to
mind just now.  

I would like to see a strict mode, too.  I would also like to be able
to make "strictness" a property of the database, not the connection.
One way to do that would be to honor a special user-created table, say
"PRAGMAS", with name-value pairs that are automatically applied when
the database is opened.  

IMHO update isolation (#1) and mathematical correctness (#3) are
prerequisites for analytical functions.  SQLite can't be dependably
used for quantitative work if domain errors are silently ignored.  

--jkl


[sqlite] Problem with accumulating decimal values

2015-12-16 Thread James K. Lowden
On Wed, 16 Dec 2015 15:05:34 +0100
"E.Pasma"  wrote:

> 16 dec 2015, Keith Medcalf:
> >> BEGIN;
> >> UPDATE fmtemp SET bal = ROUND(bal,2) + 123.45;
> >> (repeat a 1.000.001 times
> >> END;
> >> SELECT bal FROM fmtemp;
> >> 123450123.45
> >
> > You should NEVER round as you have done above.  You may get lucky  
> > and the errors may cancel each other out, or you may get more
> > usual results where the error equals the theoretical max of the sum
> > of the absolute value of all the truncated values, which can be
> > quite significant depending on the scale of the number you are
> > dealing with (and theior scales relative to each other).
> 
> Hello, I was only trying to digest JKL's post and the result looks  
> good. The example prints the value as it is in the database and
> shows that there is no accumulated error there. I do not see a
> counter example (not yet).
> 
> Ok this does not work of any scale of numbers. But a solution with  
> integers neither does

Keith's advice is well taken.  Keep the real number; round for
presentation.  I always say, "store what you know".  

Yes, starting from zero you can add 123.45 to a double-precision
floating point number for a very, very long time, about 81,004,455,245
times, before the error will appear in pennies. When it does, you'll
have 13 digits left of the decimal.  That's on the order of the US
GDP.  We don't measure things like that down to the penny, so no one
will know if you're wrong.  ;-)  
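
If you'd rather check than take my word for it, a few lines of C echo
the 1,000,001-addition experiment quoted above (the output claim
assumes ordinary IEEE-754 doubles):

#include <stdio.h>

int main(void)
{
  double bal = 0;
  long i;

  for (i = 0; i < 1000001; i++)
    bal += 123.45;

  /* The exact sum is 123450123.45; the accumulated binary error is
   * far below half a penny, so the rounded output matches it. */
  printf("%.2f\n", bal);
  return 0;
}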

The thing to keep in mind is that you get ~15 decimal places of
precision.  The decimal floats.  You can put it way on the left, and
measure tiny things accurately.  You can put it on the right, and
measure astronomical things accurately.  Unless you care about
millimeters to the moon, it will do the job.  

Whole books have been written on numerical accuracy.  I suspect if that
mattered to your application you'd know about it.  My advice is to let
the engineers worry about it -- they did, years ago -- and accept
rounded output unless and until you have an example of a computation
for which that doesn't work.  

--jkl




[sqlite] Problem with accumulating decimal values

2015-12-16 Thread James K. Lowden
On Wed, 16 Dec 2015 20:33:40 +0200
R Smith  wrote:

> > Ok this does not work of any scale of numbers. But a solution with 
> > integers neither does
> 
> I think the bit that Keith tried to highlight is that we should
> always refrain from storing errors. 

Keith recommended against storing *rounded* values.  If you store $0.30
in SQLite as REAL, you store a binary approximation.  It's a fine thing
to keep unless you care about picodollars.  

> Calculated errors are fine because we can at any time revisit the
> calculation procedures, we can refine and perhaps opt for more
> significant digits - but we can ALWAYS guarantee the accuracy-level
> of the calculated result. However, storing wrong values (or let's
> call them "approximate" values if you like) is pure evil. 

I'm not sure what you mean.  

There's no problem storing a C double from memory and later fetching
it.  The same 64 bits pass through the interface unchanged.  (Well,
maybe not the *same* bits, but who can tell?!)  Once replaced back in C
memory, the computation can resume where it left off unaffected.  
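
The round trip is easy to check through the C API, if you care to.  A
sketch, error checking omitted; prints "bits match" on IEEE systems:

#include <sqlite3.h>
#include <stdio.h>
#include <string.h>

int main(void)
{
  sqlite3 *db;
  sqlite3_stmt *ins, *sel;
  double in = 0.30, out;

  sqlite3_open(":memory:", &db);
  sqlite3_exec(db, "CREATE TABLE t(x REAL)", 0, 0, 0);

  sqlite3_prepare_v2(db, "INSERT INTO t VALUES (?)", -1, &ins, 0);
  sqlite3_bind_double(ins, 1, in);
  sqlite3_step(ins);
  sqlite3_finalize(ins);

  sqlite3_prepare_v2(db, "SELECT x FROM t", -1, &sel, 0);
  sqlite3_step(sel);
  out = sqlite3_column_double(sel, 0);   /* same 64 bits back */
  sqlite3_finalize(sel);
  sqlite3_close(db);

  printf("bits %s\n", memcmp(&in, &out, sizeof in) ? "differ" : "match");
  return 0;
}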

What you usually don't want to do is compute based on rounded numbers.
If you store a rounded number to the database, you may lose
information.  Even if you don't -- even when the rounded number is the
right one -- such errors as accumulate at the edge of accuracy normally
wind up not mattering.  That's why C does all computation in
double precision, even when the operands are single-precision.  

The opposite mistake -- losing information --
can easily lead to results that are spectacularly wrong.  

--jkl




[sqlite] Bug with DATETIME('localtime')

2015-12-16 Thread James K. Lowden
On Sun, 13 Dec 2015 20:11:32 -0700
Scott Robison  wrote:

> > It's not fixed, although glacial progress is being made.  Even though
> > we've had the TZ database & Posix datetime functions since 1986, 30
> > years later we're still struggling with it, and not only on Windows.
> 
> The problem would be that SQLite could not depend on the presence of
> TZ functions even if they were added to the standard:

I think the time when "the standard" mattered regarding *libraries* has
passed. Nowadays, new functions do or don't get added to libc largely
based on what GNU does, and to a lesser extent on the BSD projects.  

> 1. SQLite generally avoids non ANSI C so as to be compatible with the
> largest possible number of platforms. ANSI C (aka C89 or C90 for the
> ISO version) will never be updated to add new requirements.

SQLite maintains its own implementation of not a few functions for the
sake of compatibility.  I don't know whether this should be one of
them, but there is more than enough precedent.  

> 2. Let's say that that the next version of the C standard does add TZ
> functionality. 

I haven't peeked to find out how SQLite implements date arithmetic.  I
assume it parses strings in the database, calls mktime(3), and
subtracts time_t values.  That's pretty vanilla, and doesn't *require*
the TZ database.  

The downside of using mktime is that it locks you into a "time zone
perspective", if you will.  The timezone that will be used to convert
a (UTC-based) time_t value to "local time" is set globally.  If you
want to compare two local times, you have to manipulate that global
variable between conversions.  

The new mktime_z(3) function from NetBSD unglobalizes the timezone: it
adds a timezone parameter.  That makes it much more convenient to use
(if that's what you need!)  It's been accepted afaict by IANA, but I
found no discussion of it at GNU.  

While the NetBSD (and IANA, obviously) implementation uses the TZ
database, that's not a requirement.  The function's definition makes no
reference to its implementation.  

mktime_z could be emulated on Windows without IANA's help.  Which it
would have to be, because Windows doesn't use the TZ database:  

save TZ
set TZ to something
_tzset() // Microsoft! 
mktime
restore TZ
_tzset
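
Fleshed out a little, that dance might look like so.  A sketch; mind
that Microsoft's CRT accepts only the old "EST5EDT"-style TZ syntax,
not tz database names, and that juggling TZ this way is not
thread-safe:

#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <time.h>

/* Emulate a per-zone mktime on Windows by saving and restoring TZ. */
time_t mktime_in_zone(struct tm *tm, const char *zone)
{
  char saved[256] = "TZ=";
  const char *old = getenv("TZ");
  if (old)
    strncat(saved, old, sizeof saved - 4);   /* save TZ */

  char tzvar[256];
  snprintf(tzvar, sizeof tzvar, "TZ=%s", zone);
  _putenv(tzvar);                            /* Microsoft! */
  _tzset();

  time_t t = mktime(tm);

  _putenv(old ? saved : "TZ=");              /* restore; "TZ=" clears it */
  _tzset();
  return t;
}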

A quick glance at the documentation suggests
TzSpecificLocalTimeToSystemTimeEx might be useful, too.  

Someone will complain that would be slow, and something about
threads.  My understanding is that the OP got the wrong answer, and I
would say slow is better than broken.  And it won't be slow: there's no
I/O; not even a context switch.  

As Keith said, as of now you have to roll your own.  SQLite does not
support date arithmetic across time zones.  Should it?  Should it as an
extension?  I don't know.  I was just trying to understand (and
explain) what the C foundation looks like, why/how it's broken, and
what would be required to fix it.  

--jkl










[sqlite] Problem with accumulating decimal values

2015-12-15 Thread James K. Lowden
On Fri, 11 Dec 2015 16:21:30 +0200
"Frank Millman"  wrote:

> sqlite> UPDATE fmtemp SET balance = balance + 123.45;
> sqlite> SELECT bal FROM fmtemp;
> 5925.599

To a question like that you'll receive a lot of answers about numerical
accuracy.  And it's true that there are ways to "do the math" without
using floating point representation.  It's also true that it's rarely
necessary, which is why floating point representation exists and *is*
widely used.  You may find it works for you too, unless you have to
adhere to a specific rounding policy.  

Per your example, you're working with 2 decimal places of precision.
5925.599 rounds off to 5925.60, not too shabby.  If you keep
adding 123.45 to it,
you'll find you can go on forever before the answer is wrong in the
second decimal place.  

IEEE 754 is a solid bit of engineering.  It's capable of representing
15 decimal digits of precision.  That's good enough to measure the
distance to the moon ... in millimeters.  

You could have an exceptional situation, but that would be
exceptional.  Usually, double-precision math works just fine, provided
you have some form of round(3) at your disposal when it comes time to
render the value in decimal form.  

--jkl


[sqlite] Bug with DATETIME('localtime')

2015-12-13 Thread James K. Lowden
On Thu, 10 Dec 2015 06:34:44 -0700
"Keith Medcalf"  wrote:

> The only way to convert datetime data on windows is to use a
> third-party package that does it properly, or write it yourself.
> Using the WinAPI functions is equivalent to "writing it yourself"
> because they do not actually do anything -- you have to manage all
> the meaningful data and deal with the vagaries of the
> under-documented closed proprietary function implementations in
> windows (which Microsoft even admits do not work properly).

Keith, your answer was so disheartening that I was impelled to revisit
where the C standard is on time zones.  I remembered it was messy, but
thought it had surely been fixed.  

It's not fixed, although glacial progress is being made.  Even though
we've had the TZ database & Posix datetime functions since 1986, 30
years later we're still struggling with it, and not only on Windows.  

The C standard library defines functions for "local time", defined
globally with tzset(3).  To work with two time zones (even if one of
them is UTC) requires manipulating the TZ environment variable,
and "setting the time" with tzset(3) for each one. 

The latest version of the tz database[1] incorporates changes that
originated with NetBSD in 2014.  NetBSD introduced some new functions
e.g. mktime_z(3) that add a timezone argument to the traditional time
functions of the C standard library.  This lets you allocate a timezone
variable for each zone you're interested in, and thereby to handle the
two time zones without touching global structures.  

I don't know the status of these functions beyond NetBSD.  From the
docs[2] they don't seem to have yet been incorporated in GNU libc.
Perhaps it would be possible to incorporate them in SQLite?  

Below are two programs -- one Posix, one NetBSD 7.0 -- that produce the
output the OP expects.  Neither one requires any heavy lifting,
although I think you'll agree the second is more interesting.  

--jkl

[Posix]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[] )
{
  // > 2014-10-25 20:00:00
  struct tm tm = { .tm_hour = 20,
   .tm_mday = 25,
   .tm_mon = 9,
   .tm_year = 114 };

  putenv("TZ=no time like the present");
  tzset();

  time_t time = mktime(&tm);
  printf( "time is %lu\n", (long)time );

  struct tm *greenwich_tm = gmtime(&time);

  printf( "time in %s is %s",
  greenwich_tm->tm_zone, asctime(greenwich_tm) );

  putenv("TZ=:Europe/Moscow");
  tzset();

  struct tm *moscow_tm;
  moscow_tm = localtime(&time);

  printf( "time in %s is %s",
  moscow_tm->tm_zone, asctime(moscow_tm) );

  return EXIT_SUCCESS;
}
[xisoP]

make && ./moscow
c99 -D_XOPEN_SOURCE=600 -D_BSD_SOURCE -g -o moscow main.c
time is 1414267200
time in GMT is Sat Oct 25 20:00:00 2014
time in MSK is Sun Oct 26 00:00:00 2014


[NetBSD]
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(int argc, char *argv[] )
{
  // > 2014-10-25 20:00:00
  struct tm tm = { .tm_hour = 20,
   .tm_mday = 25,
   .tm_mon = 9,
   .tm_year = 114 };

  time_t time = mktime_z(NULL, &tm);
  printf( "time is %lu\n", (long)time );

  struct tm *greenwich_tm = gmtime(&time);

  timezone_t moscow_tz = tzalloc("Europe/Moscow");
  const char *name = tzgetname(moscow_tz, 1);
  printf("Moscow time zone name is '%s'\n", name);

  struct tm *moscow_tm, data;
  moscow_tm = localtime_rz(moscow_tz, &time, &data);

  printf( "time in %s is %s",
  greenwich_tm->tm_zone, asctime(greenwich_tm) );

  printf( "time in %s is %s",
  moscow_tm->tm_zone, asctime(moscow_tm) );

  tzfree(moscow_tz);

  return EXIT_SUCCESS;
}
[DSBteN]

make && ./moscow
c99 -g -o moscow main.c
time is 1414267200
Moscow time zone name is 'MST'
time in GMT is Sat Oct 25 20:00:00 2014
time in MSK is Sun Oct 26 00:00:00 2014

[1] https://www.iana.org/time-zones/repository/tz-link.html

[2]https://www.gnu.org/software/libc/manual/html_node/Broken_002ddown-Time.html#Broken_002ddown-Time


[sqlite] ABOUT ROWID

2015-12-11 Thread James K. Lowden
On Fri, 11 Dec 2015 05:14:24 -0700
"Keith Medcalf"  wrote:

> Far better is to run the queries multiple times in succession (say,
> 1000) and then average the result.

Good advice.  Sometimes it seems like caching is "cheating": we
don't know the performance of something if we're using the cache.  But
the cache is part of reality; it will affect performance during
real-world use of the software.  Pretending it doesn't exist will lead
to optimizing the wrong thing.  

OTOH, testing can leave the cache "hotter" than it would be under
normal use, depending on what "normal" is.  If the OS is answering to a
lot of disparate requests, SQLite will suffer more cache misses than if
it's the only application running.  If the data are large enough, even
one table scan could remove more interesting bits from the cache.  

If you want to test the effect of caching, one suggestion that's been
missing from this thread is umount(8).  Unmounting the device is bound
to release any kernel resources associated with it.  Remounting it will
bring only the barest information from the filesystem into memory.  

One difference between unmounting and using the Big Red Switch is
that the drive's own cache is in play.  But that cache is relatively
small and afaik has a write lifetime measured in milliseconds.  Unless
you're reading the same 8 MB over and over, the effect of the drive
cache won't skew your test results.  

--jkl


[sqlite] regular expression in check constraint?

2015-11-24 Thread James Hartley
I would like to add a check constraint which determines if a string
contains all digits, ie.

sqlite> select zip_code from zip_codes where regexp(zip_code,
'^[[:digit:]]+$');

However, this generates the error:

Error: no such function: regexp

Searching through sqlite.org suggests that this function may not be included.

So, is there any equivalent to this kind of functionality?  I can check at
the application level beforehand, but there is some solace knowing that the
database can check the condition at the time of insertion too.
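
For the record, I gather the application can register regexp() itself
with sqlite3_create_function.  A minimal stand-in I sketched -- it
answers only the all-digits question, with no real regex engine behind
it:

#include <ctype.h>
#include <sqlite3.h>

static void regexp_func(sqlite3_context *ctx, int argc, sqlite3_value **argv)
{
  /* As called above, argv[0] is the string and argv[1] the pattern
   * (which this stub ignores). */
  const unsigned char *s = sqlite3_value_text(argv[0]);
  int ok = s != 0 && *s != 0;        /* NULL or empty fails */

  (void)argc;
  for ( ; ok && *s; s++)
    if (!isdigit(*s))
      ok = 0;
  sqlite3_result_int(ctx, ok);
}

/* after sqlite3_open:
 *   sqlite3_create_function(db, "regexp", 2, SQLITE_UTF8, 0,
 *                           regexp_func, 0, 0);
 */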

Thanks!


[sqlite] How hard is it to add the constraint name to the 'FOREIGN KEY constraint failed' message?

2015-11-23 Thread James K. Lowden
On Tue, 17 Nov 2015 22:29:10 -0500
Richard Hipp  wrote:

> With CHECK, and UNIQUE, and NOT NULL constraints, you do know that the
> constraint has failed right away.  And for those constraints, SQLite
> does provide more detail about exactly which constraint failed.  But
> for FK constraints, you never know if a constraint that is failing
> right now might be resolved before the end of the transaction.

As a matter of fact, violation of UNIQUE & PK constraints is enforced
rather too strictly in SQLite.  Any UPDATE statement that modifies more
than one row of a UNIQUE column can fail because two rows may
*transiently* have the same value.  SQLite will fail the statement
even though the completed transaction leaves the constraint
unviolated.  

sqlite> create table T(t int not null primary key);
sqlite> insert into T values (1), (2);
sqlite> update T set t = t+1;
Error: UNIQUE constraint failed: T.t

--jkl


[sqlite] [AGAIN] SQLite on network share

2015-11-15 Thread James K. Lowden
On Fri, 13 Nov 2015 13:19:33 -0800
Roger Binns  wrote:

> On talking to sites that had the competitor devices, we'd find they
> did notice increases in programs crashing and data file issues, but
> had written it off as the kind of thing that happens with Windows.  

Q:  Why doesn't Microsoft write fault-tolerant software?  

A:  No need, they have fault-tolerant customers. 

(I don't know whom to credit.) 

--jkl


[sqlite] sql programming help

2015-11-14 Thread James
I'm doing a select like this:
select e.name, e.release from execution as e where run_id
in (select substr(status,7,5) from resource where status like "Busy (%");

I want to get a column from each row of the subselect (from 
resource) to match each result row.


resource table:
name      status
serverA   Busy (28610)
serverB   Busy (28648)

execution table:
run_id   release   name
28610    rel1      run_name1
28648    rel2      run_name2

The query returns:
run_name1 rel1
run_name2 rel2

I want:
serverA   run_name1 rel1
serverB   run_name2 rel2


I can do a query for each row returned by my subselect but I'm hoping to 
do it in one SQL statement.



[sqlite] "Shipping a New Mindset" - SQLite in Windows 10

2015-11-12 Thread James K. Lowden
On Tue, 10 Nov 2015 22:20:59 +0100
"Christian Werner"  wrote:

> When SQLite is the spoon, will they give us Linux binary
> compatibility for the fork?  

Fork?  Isn't that what CreateProcess is for?  

--jkl



[sqlite] "Shipping a New Mindset" - SQLite in Windows 10

2015-11-12 Thread James K. Lowden
On Tue, 10 Nov 2015 13:45:52 -0700
Warren Young  wrote:

> This from the same company that gave us ODBC, ESQL, OLE DB, MDAC/Jet,
> DAO, RDO, ADO, ADO.NET, ADO Entity Framework, LINQ, the registry,
> Access, SQL Server Express?

The real irony is that every one of those technologies had (I'll bet)
more resources expended on it than SQLite has had.  

> Obviously getting SQLite into Windows is a great thing.  It?s just
> that it would have been even nicer a decade ago.  

It's more like 25 years.  The registry, with all its obvious defects,
made its appearance in Windows 3.1, which the oracle Wikipedia puts at
1992.  At the time Microsoft already had the Jet engine, demonstrating
the feasibility of implementing relational technology on the machines
of the day.  

How many needless, mysterious Windows problems can be linked to the
registry's tree structure and failure to enforce consistency?  How many
programmers over those decades could have benefited from built-in DBMS
technology?  How much less schlock might we have today if every Windows
programmer had been unwittingly exposed to set theory as a basic
feature of the OS?  

I'm happy to see Microsoft has seen the light in some sense, even
though there's no chance the SQLite folks will be justly compensated.
I'll be impressed when SQLite is a kernel driver and
GetPrivateProfileString is a SELECT statement.  

--jkl


[sqlite] Simple Math Question

2015-11-07 Thread James K. Lowden
On Fri, 06 Nov 2015 22:16:57 -0700
"Keith Medcalf"  wrote:

> I wrote a function called "ulps" which can be used as an extension to
> SQLite3

Bravo, Keith!  

One suggestion, if I may.  If you expect "ulps" to be used to test for
equality, perhaps it would be more convenient to ignore the sign.
Something like

fequal(x, y, delta) === abs(ulps(x - y)) < delta

might express the idea more directly?  
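
In C, one way such a comparison might look -- a sketch of the idea
using the usual compare-the-bit-patterns trick, not Keith's ulps()
itself:

#include <stdint.h>
#include <stdlib.h>
#include <string.h>

/* Map a double onto an integer scale that is monotonic in the
 * double's ordering, so adjacent doubles differ by 1. */
static int64_t ulp_order(double d)
{
  int64_t i;
  memcpy(&i, &d, sizeof i);
  return i < 0 ? INT64_MIN - i : i;
}

/* True when x and y are within delta units in the last place.
 * (The subtraction can overflow for operands of opposite sign and
 * huge magnitude; good enough for a sketch.) */
int fequal(double x, double y, int64_t delta)
{
  return llabs(ulp_order(x) - ulp_order(y)) < delta;
}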

--jkl


[sqlite] Question about Style

2015-11-02 Thread James K. Lowden
On Fri, 30 Oct 2015 14:08:16 -0400
Ramar Collins  wrote:

>  I'm using C and a small library to get the majority of the work
> done.   My question is, do you have any suggestions or know where to
> find more lore on how to nicely embed SQL in a program like this?

I store my SQL in separate files, one per statement.  Very handy for
testing & development, too, btw.  

At initialization, I read all the SQL into a list.  In C it might be an
array of structures like

struct query_t { char *name; char *sql; } *queries;

where name is the filename and sql is the content of the file.  With
just a little work and the help of bsearch(3), you can refer to your
queries by name and keep the literal SQL out of your code.  As others
mentioned, it's better (and easier) to use prepared queries than
string-slinging for parameters.  

If you don't want to depend on separate files of SQL at runtime, write a
script to generate a .c file of a static array of query_t, as above.  
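
The lookup itself is a few lines with bsearch(3), assuming the array
is kept sorted by name (a sketch using the struct above):

#include <stdlib.h>
#include <string.h>

struct query_t { char *name; char *sql; };

static int cmp_query(const void *a, const void *b)
{
  const struct query_t *qa = a, *qb = b;
  return strcmp(qa->name, qb->name);
}

const char *find_sql(const struct query_t *queries, size_t n,
                     const char *name)
{
  struct query_t key = { (char *)name, NULL };
  const struct query_t *q = bsearch(&key, queries, n,
                                    sizeof *queries, cmp_query);
  return q ? q->sql : NULL;
}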

HTH.  

--jkl


[sqlite] Simple Math Question

2015-10-28 Thread James K. Lowden
On Fri, 23 Oct 2015 10:43:44 -0700
Scott Hess  wrote:

> You're right, any base-2 representation right of the decimal should be
> precise to represent in base-10.  But it's the kind of thing where if
> you find yourself counting on it, you probably made a grave error
> earlier in your design :-).

I'm either brave or naive enough to think I can still add to this
discussion.  If we accept what you say, above, then why should 

>  (9.2+7.8+0+3.0+1.3+1.7)

in particular present any problem?  There's no division.  Each value
has an exact decimal representation.  I'm prepared to assert that any
permutation of their sums also has an exact decimal representation.
Therefore they should have an exact binary representation, too.  

To the OP, I want to point out that whether or not a fraction can be
represented exactly is a function of the base.   Consider that 1/3 has no
finite decimal representation.  But in base 3 it's just 

0.1

--jkl



[sqlite] Problems making sqlite.lib using Embarcadero C++ 7.00

2015-10-27 Thread James K. Lowden
On Tue, 27 Oct 2015 12:24:03 +0100
Clemens Ladisch  wrote:

> > You've solved the problem by compiling for a single-threaded
> > environment.
> 
> No; the single-threaded environment is the problem.

That's a matter of opinion.  Another way to look at it: threads set
back computing by 20 years.  

Protected memory was invented for good reason.  Then it was uninvented,
and we're still dealing with the fallout.  Software transactional
memory has so far failed to pass muster, and communicating sequential
processes has still not seen implementation support in most
languages.  

--jkl


[sqlite] Problems making sqlite.lib using Embarcadero C++ 7.00

2015-10-24 Thread James K. Lowden
On Thu, 22 Oct 2015 15:23:38 +0200
"Marco Turco"  wrote:

> The problem is when I link the generated library. I receive the
> following error related to the first two warnings so I'm unable to
> generate the executable file.

You've solved the problem by compiling for a single-threaded
environment.  I wanted to point you toward the generic solution to the
warnings and errors in case you want to use __beginthreadex.  I suspect
the advice you got about changing your compiler switches is correct;
that's what you would do with Microsoft's compiler.  

> Warning W8065 K:\sqlite_see_2013\sqlite3.c 23909: Call to function
> '_endthreadex' with no prototype in function sqlite3ThreadProc

That indicates you're not including the .h file that declares the
_endthreadex function.  

Unlike some other advice you might get, I'm not so sanguine about
prototype warnings. The C language permits, but does not require, the
use of function prototypes.  When you ignore a warning about a
missing prototype, you deny yourself the parameter checking the
compiler could otherwise provide.  

> Error: Unresolved external '__endthreadex'
> referenced from K:\ACTIVEXP\SOURCE\SQLITE.SEE\SQLITE3.LIB|sqlite3

To be pedantic, that is a linker error, not a compiler error.  In this
case you're missing the library that supplies _endthreadex.  According
to Microsoft [1], it's supplied by the C run-time library.  That almost
certainly means you need to set your compile options accordingly.  

Why change compilation just to link to another library?  Normally you
don't; you just add a library name to the linker command.  In the case
of Microsoft and the C runtime, though, some of the symbols differ
between single- and multi-threaded (and release and debug) builds.  You
call function foo, but it might generate e.g. foo_debug_mt, or might
issue locks, or might replace the call with a built-in intrinsic or
macro.  

That's why Microsoft (and likely Embarcadero) use a compiler option.
Because of the tight coupling of compile modes and required library,
they need to be controlled together, and the simplest way to do that is
to give one option to the compiler and let it DTRT. 

> At the end I have added the parameter -DSQLITE_THREADSAFE=0 to
> exclude the multithread code from the library 

Yup, that's the other way; change the SQLite code to be
single-threaded.  

HTH.  

--jkl

[1] https://msdn.microsoft.com/en-us/library/hw264s73.aspx


[sqlite] Is it legal SQL to select a specific row of a group via a HAVING clause?

2015-09-29 Thread James K. Lowden
On Tue, 29 Sep 2015 15:05:42 +0800
Rowan Worth  wrote:

> Imagine a DB storing a history of currency exchange rate info.
> 
> CREATE TABLE Prices (
> id INTEGER PRIMARY KEY,
> day INTEGER,
> currency TEXT,
> price FLOAT);

Better for your purpose would be: 

CREATE TABLE Prices (
day INTEGER not NULL,
currency TEXT not NULL,
price FLOAT not NULL,
PRIMARY KEY(day, currency)
);

You'll note that your queries don't use the id column, and nothing in
your table definition ensures there's only one price per currency per
day.  

If it were me, I'd make day TEXT and add a constraint that requires the
month portion be between 1 and 12.
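
Something like this -- a sketch, assuming ISO-8601 'YYYY-MM-DD' strings
for day:

    CREATE TABLE Prices (
        day      TEXT NOT NULL
                 CHECK (CAST(substr(day, 6, 2) AS INTEGER) BETWEEN 1 AND 12),
        currency TEXT NOT NULL,
        price    FLOAT NOT NULL,
        PRIMARY KEY (day, currency)
    );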

> if you want to know the latest prices, is this legal?
> 
> SELECT currency, price FROM Prices GROUP BY currency HAVING day =
> MAX(day);

What you want is called existential quantification.  The standard SQL
for it is a correlated subquery, viz:

SELECT *
FROM Prices as p 
where exists (
select 1 from Prices
where currency = p.currency
GROUP BY currency
HAVING p.day = MAX(day)
);

That version might be slower in SQLite than LIMIT 1, although it
shouldn't be.  You'll have to test it.  The advantage to you is that
it's standard SQL.  It will work on any SQL DBMS, and will help you
think about the issue in terms of sets instead of having to rely on
a crutch like LIMIT.  

--jkl


[sqlite] UPDATE silently failing

2015-09-21 Thread James K. Lowden
On Mon, 21 Sep 2015 17:44:13 -0400
Hugues Bruant  wrote:

> UPDATE cv SET cv_t=? where cv_s=? and cv_o=?;
> 
> Most of the time the row is updated as expected but in some rare cases
> we've seen this statement fail silently, as in:
>   - the row exists
>   - the row it is not updated
>   - step returns SQLITE_OK
>   - changes returns 0

Commit?  

This behavior is consistent with an uncommitted UPDATE.  

--jkl



[sqlite] How to know when a sqlite3_stmt has been recompiled ?

2015-09-21 Thread James K. Lowden
On Mon, 21 Sep 2015 11:19:16 +0200
"R.Smith"  wrote:

> On 2015-09-21 01:53 AM, Nicolas Jäger wrote:
> > hi,
> >
> >
> > Scott Robison wrote:
> >
> >> 3. Some time passes and some external process may alter the schema.
> >>
> > is it really possible ? if the answer is yes (I thought when a
> > process opens a communication with the DB, other processes can't
> > alter the DB), for me that would mean that the design(s) of the
> > program(s) which communicates with the DB is bad.
> 
> This is the entire point of a database - allowing multiple things to
> see the data. A system which only allows one thread to read data is
> just a file-handler.

Let's understand plainly: It is an isolation failure.  All the rows
from a SELECT statement are the product of an atomic action.  The
schema can't change *during* an atomic action.  

Of course, isolation is an illusion, and other DBMSs take different
approaches to it. They either disallow schema changes to tables
against which read locks are held, or they snapshot the results and
apply the change, or they queue the change as a pending write until the
read lock is released.  I don't know of another, though, that just
throws in the towel.  

Not to be snarky about it, but depending on what you mean by
"file-handler", I can think of a few systems, such as stdio, that
provide access by more than one thread to a single file.  They don't
provide much sense of isolation, though, except in append-mode.  

--jkl



[sqlite] disable update's & delete's via triggers?

2015-09-20 Thread James Hartley
I appreciate the discussion thus far.

I tried the following syntax:

CREATE TRIGGER update_history_table_trigger INSTEAD OF UPDATE ON
history_table
BEGIN
UPDATE history_table SET new_value = 1 WHERE 0 = 1;
END ;

CREATE TRIGGER delete_history_table_trigger INSTEAD OF DELETE ON
history_table
BEGIN
DELETE FROM history_table WHERE 0 = 1;
END ;

...received the following error:  "cannot create INSTEAD OF trigger on
table: history_table"

Googling revealed the following link from 2012:

http://stackoverflow.com/questions/9487381/why-sqlite-error-cannot-create-instead-of-trigger-on-table

...which states that INSTEAD OF triggers are to be used with views, not
tables.  Is this still true?  I'm using SQLite 3.8.9.

Thanks, again.

On Sun, Sep 20, 2015 at 6:34 PM, Simon Slavin  wrote:

>
> On 21 Sep 2015, at 12:21am, James Hartley  wrote:
>
> > I am assuming that dropping the trigger
> > will re-enable any action which had been disabled by creating the
> trigger.
> >
> > Is this incorrect?
>
> Nope.  That's one way to do it.  Another is to set "PRAGMA
> recursive_triggers = OFF" then have a TRIGGER make changes to the table.
>
> Simon.


[sqlite] I don't understand how to use NOT EXISTS

2015-09-20 Thread James K. Lowden
On Sun, 20 Sep 2015 19:33:35 +0100
Simon Slavin  wrote:

> On 20 Sep 2015, at 7:15pm, James K. Lowden 
> wrote:
> 
> > Simon Slavin  wrote:
> > 
> >> Constructions like this
> >> 
> >>> INSERT INTO _TAG_EXISTS_RESULT_ ( NAME, BOOL )
> >>> SELECT 'evil little sister'
> >> 
> >> should be rewritten so that you are not trying to do a SELECT in
> >> the middle of your INSERT.
> > 
> > Why in the world would you say that?  That's the SQL assignment
> > function, the equivalent of 
> > 
> > A = A + B
> 
> I would argue that that would be UPDATE, not INSERT.  But I see your
> side of the issue.

I actually think that's the crux of the matter, Simon.  If you think
you're operating on a row, then A = A + B is UPDATE, yes.  If you think
you're operating on tables, then A = A + B is INSERT, and UPDATE
is more like 

A = (A - C) + B

where C is the set of rows being replaced by B. 

> Suppose the SELECT doesn't return anything.  Do you still want to do
> the INSERT ? 

If I said WHERE NOT EXISTS, why would I still want to "do the insert"?!
I'm with Yoda here: there is no try.  

> Do you now have to look up the values to INSERT elsewhere ?

No, I do not, not if the values I didn't insert can be specified, e.g.

insert into S 
select * from T where exists 
(select 1 from R where ... );

If the values can't be specified in like manner, that would suggest to
me a problem with the database design.  

TIMTOWTDI, for sure.  More than one way to think about it, too.  But I
see no downside to using INSERT...SELECT, except that it's startling to
someone unfamiliar with it.  

--jkl


[sqlite] disable update's & delete's via triggers?

2015-09-20 Thread James Hartley
On Sun, Sep 20, 2015 at 5:20 PM, Simon Slavin  wrote:

>
> Then your next concern is over how to zero-out the log if you really
> needed to do that.
>
That is a valid concern; however, I am assuming that dropping the trigger
will re-enable any action which had been disabled by creating the trigger.

Is this incorrect?


[sqlite] Handling the whole select query from an index

2015-09-20 Thread James K. Lowden
On Sat, 19 Sep 2015 14:24:24 +0100
Rob Willett  wrote:

> Thanks. We've got 100 requests a second which may be enough. We'll
> keep looking though for any time.

I suppose you know that 100/sec is about 0.1% of what the machine is
capable of.  

You spoke of read-only data that changes infrequently, and you wanted
maximum speed.  I would sort them into a static C array, and use
std::lower_bound to search it.  I would expose that as a function in a
shared library, and publish updates by updating the shared library.  I
would expect at least 100,000 invocations per second, with the added
benefit that the iterator returned by lower_bound instantly answers the
question of existence for the provided string.  

Everything DRH said of advantages to using SQLite is true.  If what you
want is to minimize lookup time on static data, though, searching sorted
data will give you better locality of reference and fewer machine
instructions than any interpreted b-tree.  

HTH.  

--jkl


[sqlite] disable update's & delete's via triggers?

2015-09-20 Thread James Hartley
As I look at the flowchart for triggers at the following:

https://www.sqlite.org/lang_createtrigger.html

It appears that the body of a trigger must specify some kind of statement.

I ask because I am wanting to log all activity on another table.  I can
successfully do this with triggers.  What I am curious to do is disable
UPDATE's & DELETE's on the log tables themselves.

Can anyone confirm that this is not allowed with the current trigger grammar?

Thanks!


[sqlite] I don't understand how to use NOT EXISTS

2015-09-20 Thread James K. Lowden
On Wed, 16 Sep 2015 08:23:04 +1000
Barry Smith  wrote:

> As for your original query: think about just the select clause (you
> can run it independently). This will return ('magnetohydrodynamics',
> 1) for each row in the table. 

It took me a bit to understand what you meant.  I also think there's a
better answer than resorting to LIMIT 1.  

To clarify, the OP's query is (reformatted):

> INSERT INTO TAGS ( NAME, COUNT ) 
> SELECT 'magnetohydrodynamics', 1
> FROM TAGS -- <- the error 
> WHERE NOT EXISTS (
>   SELECT * FROM TAGS 
>   WHERE NAME = 'magnetohydrodynamics' 
> );

The existence test is against the whole TAGS table.  As long as the
name 'magnetohydrodynamics' appears in the table, SELECT will return
zero rows.  But -- your point -- if the name tested does *not* appear in
the table, SELECT will return as many rows as are in the table.  

The solution is simply to say what's meant instead: 

INSERT INTO TAGS ( NAME, COUNT ) 
SELECT 'magnetohydrodynamics',1 
WHERE NOT EXISTS (
SELECT 1 FROM TAGS 
WHERE NAME = 'magnetohydrodynamics' 
);

As to the OP's question about where he went wrong, the query as
presented should not have created the results he showed.  Those results
could be explained, though, if "magnetohydrodynamics" was misspelled in
the WHERE clause.  I suspect that's what went wrong.  

--jkl


[sqlite] I don't understand how to use NOT EXISTS

2015-09-20 Thread James K. Lowden
On Wed, 16 Sep 2015 19:40:23 +0100
Simon Slavin  wrote:

> Constructions like this
> 
> > INSERT INTO _TAG_EXISTS_RESULT_ ( NAME, BOOL )
> > SELECT 'evil little sister'
> 
> should be rewritten so that you are not trying to do a SELECT in the
> middle of your INSERT.

Why in the world would you say that?  That's the SQL assignment
function, the equivalent of 

A = A + B

Anything you'd do to rewrite it would result in something more complex,
doubly so if it included the atomicity guarantee of the original.  

--jkl



[sqlite] bug in PRAGMA ignore_check_constraints?

2015-09-16 Thread James K. Lowden
On Fri, 11 Sep 2015 02:02:05 +0100
Simon Slavin  wrote:

> Looking at
> 
> 
> 
> maybe the 'constraints' that the documentation refers to are the ones
> specifically declared using CHECK in the table definition.  Perhaps
> NOT NULL and UNIQUE don't count.

That's the conventional understanding.  SQL didn't always have CHECK,
as you know, but it had NULL from the very start.  I cannot think of a
product that ever permitted missing values in a column marked NOT NULL;
there was no SET NULL OK or whatever.  

Constraint checking -- including referential integrity contraints --
used to be done with triggers.  In those days, you dropped enforcement
by dropping (or changing) the trigger.  With the arrival of Declarative
Referential Integrity, that wasn't an option because enforcement was
part of the table definition, not procedural code.  Hence the need to
enable it selectively with system settings.  
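
To see the conventional understanding in action -- a sketch, assuming a
hypothetical table t declared with age INTEGER NOT NULL CHECK (age >= 0):

    PRAGMA ignore_check_constraints = true;
    INSERT INTO t (age) VALUES (-1);    -- allowed: the CHECK is skipped
    INSERT INTO t (age) VALUES (NULL);  -- still fails: NOT NULL is enforced
    PRAGMA ignore_check_constraints = false;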

--jkl



[sqlite] Third test of json and index expressions, now it works

2015-09-14 Thread James K. Lowden
On Sun, 13 Sep 2015 10:46:21 +0200
Clemens Ladisch  wrote:

> James K. Lowden wrote:
> > select a, r from (
> > SELECT a, random() as r FROM a
> > ) as R
> > WHERE r <> r;
> >
> > will yield zero rows, every time.
> 
>   $ sqlite3
>   SQLite version 3.8.12 2015-09-12 19:50:58
>   ...
>   sqlite> create table a(a);
>   sqlite> insert into a values (1),(2),(3);
>   sqlite> select a, r from (SELECT a, random() as r FROM a) as R
>      ...> WHERE r <> r;
>   1|-909199267849538533
>   2|8366540922678358399
>   3|-6124149463908475628

Oh, Jiminy Crickets.  That's Just Wrong.  Thank you for the correction
viz SQLite.  I hope we can look forward to seeing it fixed.  

R.r doesn't refer, semantically, to the RANDOM function.  It refers to
the product of RANDOM.  

Apparently, deep in the bowels of SQLite, the query evaluation logic
doesn't recognize -- and deal correctly with -- nondeterministic
functions.  Afaik there are only 2: DATE(now) and RANDOM.  And they've
both caused problems.  

Deterministic functions are idempotent.  No matter how often you call,
say, ABS(x), you get the same result for the same x, every time.  The
execution engine can invoke ABS as often as it likes; the worst
that will happen is a little inefficiency.  

When it comes to nondeterministic functions, though, it matters a great
deal how often they're called, else you get nonsensical answers like
R.r not being equal to itself.  

--jkl


[sqlite] Third test of json and index expressions, now it works

2015-09-12 Thread James K. Lowden
On Wed, 09 Sep 2015 09:56:12 +0200
"Domingo Alvarez Duarte"  wrote:

> With your knowledge could you give your view about how evaluation of
> calculated/function columns should be done to have correct results.  
...
> CREATE TABLE a(b);  
...
> SELECT a, random() as r FROM a WHERE r <> r;  

That's a syntax error, because there is no column a.r.  

Once you reorganize it to make it syntactically valid, the answer
becomes clear:

select a, r from ( 
SELECT a, random() as r FROM a   
) as R 
WHERE r <> r;

will yield zero rows, every time.  Whatever value RANDOM produces goes
in the "r" column and any value, including "r", is always equal to
itself.  

--jkl


[sqlite] Third test of json and index expressions, now it works

2015-09-08 Thread James K. Lowden
On Sat, 5 Sep 2015 09:07:11 -0700
Darko Volaric  wrote:

> I'm asking why the SQL standard restricts the use of aliases in this
> way and what the benefit of this restriction is.

Rationales in SQL are hard to come by.  The language was promulgated by
a private firm, and the standard evolved under the aegis of what was,
for all intents and purposes, a private club.  Rationales for most
warts in the language boil down to "because we say so".  

Nonetheless, there is a good reason!  

There are no aliases in SQL, Horatio.  

In every SQL database, column names are unique and unambiguous.  If you
know the name of a table and a column, you've identified it.  A query
can name *new* columns, but it can't create aliases for existing
ones.  Consider, 

> SELECT a+b AS x FROM t1 WHERE x=99;

Here, "x" is a new column, the product of the SELECT.  By the rules of
SQL, it's *not* a macro for a+b, and it's not an alias.  It's the name
of the column formed by computing a+b, a new column of a new table.  

Now consider, 

SELECT x from (
SELECT a+b AS x FROM t1 
) as TEETH_GNASHER
WHERE x=99;

The outer query does not refer to t1, and thus not to "a" nor "b".  It
sees only the new table, with its sole column, "x".  And, although it's
a bit verbose, it also satisfies the sacred DRY criterion.  

Is that good?  The "no aliases" rule has one thing going for it: it's
consistent.  It's easy to understand and remember, and it reduces
opportunities for ambiguity.  SQL is a logic-based language, and
ambiguity in logic is anathema because it's too easy to form
syntactically valid constructions that produce incorrect (and
unintended) results.  

Nearly every SQL programmer uses some other programming language as the
"real" language for his application.  There's a temptation to make
informal, sometimes unwitting assumptions about the rules of SQL drawn
from that other language.  The best way to understand any language
though, including SQL, is on its own terms.  So double-quotes denote
identifiers, single-quotes strings, "||" concatenation, and there are
no aliases.  It's not easy to slough off unwarranted associations with
other languages, but once that's done, SQL is impossible to
misconstrue.  

--jkl


[sqlite] Variables in statements

2015-09-08 Thread James K. Lowden
On Thu, 03 Sep 2015 16:44:50 +
Peter Haworth  wrote:

> The statement I'm using is:
> 
> SELECT * FROM myTable WHERE myKey IN (:1)
> 
> If the value I supply to be used as :1 is a single integer, the SELECT
> finds the correct rows.  If the value is a comma separated list of
> integers, e.g 1,2 the SELECT statement does not return any rows and no
> error is returned.  

If you think about the SQLite C interface that Livecode must be
using, you'll see why lists don't work in this context.  The data in
your list are conveyed to SQLite as a pointer to the list data.  The
caller indicates how that pointer is to be interpreted through the bind
function used.  (http://www.sqlite.org/c3ref/bind_blob.html)  Livecode
must bind the parameter as one of those scalar types, probably
integer.  SQLite then interprets the data as (say) an integer whose
bytes are the ASCII encoding of the characters 

'1'  ','  '2'  ',' [...]

which is probably a very big number, and one that does not match any
myKey values.  Hence no row returns and no error produced.  
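
The usual workaround, if Livecode lets you build the statement text at
runtime, is one placeholder per list element -- a sketch with three
values, each bound separately with sqlite3_bind_int:

    SELECT * FROM myTable WHERE myKey IN (?, ?, ?);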

> I suspect this is a Livecode problem

If Livecode has a way to prevent passing a "list" type as an integer
parameter then, yes, I'd say so.  If not, they'd have to call it a
pibcak problem and say, "well, don't do that".  :-)  

--jkl


[sqlite] order by not working in combination with random()

2015-08-26 Thread James K. Lowden
On Wed, 26 Aug 2015 13:39:09 +0100
Simon Slavin  wrote:

> 
> On 26 Aug 2015, at 11:27am, tonyp at acm.org wrote:
> 
> > Plus, it apparently recognizes if the random() expression in the
> > ORDER BY is the same as the SELECT one and again sort correctly
> > (without re-evaluating) and without needing an alias.
> 
> Ah, but I would call /that/ a bug !

I think I can convince you Postgres is right and SQLite is not.  :-)
I'll give you 4 reasons:

1.  Theory
2.  Authority
3.  Atomicity
4.  Consistency

Theory.  Like the relation of relational algebra, a SQL table has no
order. (Not only tables, but any table-like thing: view, CTE,
subquery.)  If you take a table and permute its order, it's still the
same table. ORDER BY permutes the order in a particular way.  It does
not change the table.  Relationally, the input and output of ORDER BY
must be identical.  

Authority.  As a matter of fact, CJ Date says the output of ORDER BY is
*not* a table, but a cursor.  That makes sense, and also explains why
ORDER BY can, according to the SQL standard, appear only once in a
query, and only at the end.  You could think of ORDER BY not as a sort
operator but as a row-returning operator that magically returns the
"next" row per a specification.  

The SQL-92 standard distinguishes between the SELECT statement with
all its many clauses, and ORDER BY:

"""
 Format

  ::=
   [  ]
"""

SQLite would do well to adopt this notion in its query processor.  It
would simplify the system, and would have avoided this error.  

Atomicity.  The rule in SQL is that all functions act *as if* they were
invoked only once.  We had this confusion some time back with 
"date(now)".  There's only one "now" in a query, and there's only one
random() in a row-column.  

The column-specification of SELECT is applied *once* to each row.  The
column-specification of ORDER BY (and GROUP BY) is *not* applied; it
merely *designates* the column by using the same specification text.
Consider this example:

sqlite> select 2+2 order by '2+2';
2+2   
--
4 

'2+2' here is only the column name.  It's not an instruction to create
data.  

Consistency.  Any two semantically equivalent queries that produce
different outputs from the same input constitute prima facia evidence
of a bug.  However you restate a query, if it is logically equivalent,
it should produce the same results.  The fact that you can rejigger
queries with RANDOM() and get different answers tells you immediately
that something is wrong.  

It's sometimes useful to think about how a DBMS works, and imagine
traversing trees, sorting, etc.  It helps in understanding how to
express a query, and in how it will be processed.  But you have to
throw all that away when thinking about query semantics.  The user has
*one* chance only to state the query; he has no control over how it's
executed.  As he is constrained, so is the system: it cannot let
implementation details affect query semantics.  That's the way it is
and should be, because it best serves the interests of both parties.  

--jkl


[sqlite] order by not working in combination with random()

2015-08-25 Thread James K. Lowden
On Mon, 17 Aug 2015 12:01:58 +0200
Clemens Ladisch  wrote:

> Just because the ORDER BY clause refers to a column of the
> SELECT clause does not mean that the value is not computed
> a second time.

Let's at least recognize that as a bug.  ORDER BY shouldn't interpret
SQL or invoke functions.  It's not even properly part of the SELECT
statement, but is rather just a post-SELECT filter.  Any changes to the
behavior of ORDER BY effected by changing the syntax in semantically
equivalent ways are only further evidence of the bug.  

ORDER BY should evaluate the *data* only.  In the case of random(),
like any other function it gets invoked once, by SELECT, and produces a
value.  That value is the one that ORDER BY should operate on.  

--jkl


[sqlite] Boolean constraint regression between 3.7.7.1 and 3.8.6?

2015-08-25 Thread James K. Lowden
On Mon, 17 Aug 2015 17:19:49 +0200
Olivier Barthelemy  wrote:

> CREATE TABLE VariableDetails (dtl_id INTEGER PRIMARY KEY ASC
> AUTOINCREMENT, storage_implicit BOOLEAN CHECK (storage_implicit =
> 'true' OR storage_implicit = 'false'), storage_type TEXT NOT NULL);
> 
> Insert statement
> INSERT INTO VariableDetails (storage_implicit, storage_type) VALUES
> (true, INT_64);
> ( Values are not passed directly in the statement. The boolean in
> particular is passed using sqlite3_bind_int() )

I think you got to the right place with this, but I'm not sure it was
made clear that the reason is evident in the above text.  

The constraint is against the strings 'true' and 'false'.  The insert
statement inserts "true", no quotes; as you say, it's a C symbol
interpreted by the compiler.  It's bound to the prepared statement with
sqlite3_bind_int.  The value in the bound location will be interpreted
as an integer, not as a pointer to a character array!  
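
The mismatch is visible from pure SQL, too.  A sketch against the
schema above:

    -- passes the CHECK: the value is the string 'true'
    INSERT INTO VariableDetails (storage_implicit, storage_type)
    VALUES ('true', 'INT_64');

    -- fails the CHECK: the integer 1 equals neither 'true' nor 'false'
    INSERT INTO VariableDetails (storage_implicit, storage_type)
    VALUES (1, 'INT_64');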

--jkl


[sqlite] pragma temp_store_directory is deprecated, what is the alternative?

2015-08-18 Thread James K. Lowden
On Sat, 15 Aug 2015 01:17:28 +0100
Simon Slavin  wrote:

> > BTW, Posix is almost silent on the question.  It says TMPDIR will
> > define the location of a temporary store, but not how. 
> 
> I'm okay if the documentation simply says something like ...
> 
> For Darwin (Mac), it's always /tmp/
> For Linux, see the TMPDIR environment variable
> For Windows see [whatever it is]

Agreed, although IIUC it could be simpler than that, see next.  

> However I suspect things may be more complicated than that.  For
> instance, does .NET respect the OS's choice of temporary directory no
> matter which OS it's running under ?  I have no idea.

What I'm suggesting is that there is no "OS's choice", really.  

There are a few functions in the C standard library, e.g. tmpfile(3),
that may consult the environment.  The variable's name varies by
implementation.  Some implementations, notably GNU's (if the
documentation is correct), do not consult the environment.  

I would guess .NET is written atop the Win32 API and uses
GetTempFileName or somesuch.  That uses GetTempPath, whose return value
is affected by TMP and TEMP.
(https://msdn.microsoft.com/en-us/library/windows/desktop/aa364992(v=vs.85).aspx).  

GetTempPath and tmpnam(3) on Windows both honor TMP, but the fallback
policies differ.  So it's not really a question of what the OS's choice
is, because the *OS* offers no "temporary file" function.  It's really a
question of which library function is called, and how that function is
implemented.  

But none of that matters unless those functions are used.  An
application -- or library, as in SQLite's case -- need not use them,
which in any case aren't all that helpful.  AIUI SQLite does *not* use
those functions, but rather has its own way to determine where temporary
files go.  In that case the rule could be quite simple and
OS-independent.  For instance, 

1.  Use "SQLITE_TMPDIR" if defined 
2.  Use current working directory otherwise

where the value is set by sqlite3_initialize and cannot be changed
thereafter. 

--jkl



[sqlite] pragma temp_store_directory is deprecated, what is the alternative?

2015-08-14 Thread James K. Lowden
On Mon, 27 Jul 2015 20:35:30 +0100
Simon Slavin  wrote:

> On 27 Jul 2015, at 8:03pm, Zsbán Ambrus  wrote:
> 
> > Does this work with the same environment variable name on both unix
> > and windows?  I'm asking because unix and windows programs each use
> > different conventions for what environment variable to care about
> > when determining the temporary directory, and these conventions
> > differ on unix and windows, though I believe TEMP is the customary
> > variable on windows
...
> If it's going to be documented, it would also be nice to see it on
> 
> 
> 
> which, according to my browser, doesn't mention the word
> 'environment'.

It would be nice to have clearly documented the means by which the
location of temporary files is controlled.  I assumed some flavor of
tmpfile(3) was used, but evidently SQLite doesn't rely on that
function.  

BTW, Posix is almost silent on the question.  It says TMPDIR will
define the location of a temporary store, but not how.  The C standard
library doesn't mention it for functions like tmpfile.   BSDs such
as OS X do, but according to the docs GNU glibc doesn't.  

The issue of thread safety when using environment variables is
spurious.  In a few places I can think of in libc (time zone, locale),
it consults the environment once on startup (or first use) and uses that
variable henceforward.  SQLite could do something similar, or otherwise
ensure that whenever the environment is read all threads get a
consistent view.  

--jkl


[sqlite] Tomcat 6 consumes too much memorys

2015-07-31 Thread James Qian Wang
The same query and database are fine if running from the command line
sqlite3 or from java command line (via jdbc).

Any one has the same issue please?

Thanks a lot in advance

-- 
James Qian Wang
Mobile: 44 7986 099 233


[sqlite] Query tooooo sloooooooooow

2015-07-30 Thread James Qian Wang
I agree.  Thank you all.



On Thu, Jul 30, 2015 at 3:44 PM, Clemens Ladisch  wrote:

> James Qian Wang wrote:
> > 0|0|0|SCAN TABLE contact AS c USING COVERING INDEX elid2
> > 0|1|1|SEARCH TABLE history AS h USING COVERING INDEX elid (elid=?)
>
> There is no more efficient way to execute this query.
>
> > both very slow
>
> What file system? What disks? Any network?
>
>
> Regards,
> Clemens



-- 
James Qian Wang
Mobile: 44 7986 099 233


[sqlite] Query tooooo sloooooooooow

2015-07-30 Thread James Qian Wang
my desktop (windows 7)
SQLite version 3.8.7.4 2014-12-09 01:34:36 and explain query plan showed:

0|0|0|SCAN TABLE contact AS c USING COVERING INDEX elid2
0|1|1|SEARCH TABLE history AS h USING COVERING INDEX elid (elid=?)

my linux box needs upgrade.

regardless, both very slow


On Thu, Jul 30, 2015 at 11:03 AM, Clemens Ladisch 
wrote:

> James Qian Wang wrote:
> > select count(*) from contact c left join history h on (h.elid=c.elid);
>
> Please show the output of EXPLAIN QUERY PLAN for this query.
>
> If it does not look like this:
>   0|0|0|SCAN TABLE contact AS c USING COVERING INDEX elid2
>   0|1|1|SEARCH TABLE history AS h USING COVERING INDEX elid (elid=?)
> you should get a newer SQLite version.
>
>
> Regards,
> Clemens



-- 
James Qian Wang
Mobile: 44 7986 099 233


[sqlite] Query tooooo sloooooooooow

2015-07-30 Thread James Qian Wang
Hi All,

Here is my query:
select count(*) from contact c left join history h on (h.elid=c.elid);

Here are the table structures:
sqlite> .schema history
CREATE TABLE history (name varchar(32), email varchar(128) unique, elid
int(12));
CREATE INDEX elid on history (elid);
sqlite> .schema contact
CREATE TABLE contact (name varchar(32), email varchar(128) unique , elid
int(12));
CREATE INDEX elid2 on contact (elid);

Table size:
contact 20m
history  10m

elid integer and each elid has about 20k records

Ideas please?

Thanks a lot in advance
-- 
James Qian Wang
Mobile: 44 7986 099 233


[sqlite] Suggestions for Fast Set Logic?

2015-07-12 Thread James K. Lowden
On Sat, 11 Jul 2015 19:02:59 -0600
Scott Robison  wrote:

> > I don't follow you.  A complex query is an assemblage of clauses.
> > Whether or not broken down "to individual sub queries", the search
> > arguments are the same.  They are few in kind, even though they may
> > be large in number.  The analysis for what indexes to create will
> > be the same either way.  In fact, Keith already proposed a
> > plausible design.
> 
> I'm just saying that knowing which index to create in advance is
> impossible if the user can specify arbitrarily complex where clauses.

In that case, I can make your job easier, starting today.  You don't
need to consider query complexity to design your indexes.  :-) 

The complexity of the query per se is immaterial.   If we had a
database with one table with only one column of integers, the
*complexity* of the queries against it could still be arbitrarily
complex, because that's what SQL allows.  But the choice of what to
index would be trivial.  

The database in question is nearly that simple.  It consists of just a
few tables, and AIUI the search terms are limited to keywords and rating
values. Maybe there are a few others.  The combinations in which those
could be used in a query are limitless.  The number of indexes appears
to be about 3.  

Determination of indexes is always driven by search arguments.  What
will be in the WHERE clause?  What will be in the JOIN?  The number of
ANDs and ORs doesn't matter, nor does the number of tables joined.  No
matter how complex the queries, the number of indexes is always limited
by the searchable columns.  

--jkl


[sqlite] Suggestions for Fast Set Logic?

2015-07-11 Thread James K. Lowden
On Fri, 10 Jul 2015 09:54:27 -0600
Scott Robison  wrote:

> As described, the user will be able to create arbitrarily
> complex queries. Since it is impossible to know in advance what
> indexes might be required, breaking it down to individual sub queries
> with simple where clauses that can be virtually guaranteed to use a
> simple index

I don't follow you.  A complex query is an assemblage of clauses.
Whether or not broken down "to individual sub queries", the search
arguments are the same.  They are few in kind, even though they may be
large in number.  The analysis for what indexes to create will be the
same either way.  In fact, Keith already proposed a plausible design.  

My concerns would be two:

1.  OR clauses are essentially UNION.  I don't know how sophisticated
SQLite is about using different indexes for different OR-separated
criteria or parts of a UNION.  Maybe separate queries would be
better.  

2.  LIKE clauses usually don't engage indexes.  The OP should
understand that pattern-matching queries will often entail a table scan
unless ANDed to an indexed term.  
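
For example -- a sketch, assuming a one-keyword-per-row table
item_keyword(item_id, keyword) with an index on keyword:

    -- a leading wildcard defeats the index: every row is scanned
    SELECT item_id FROM item_keyword WHERE keyword LIKE '%bird%';

    -- equality on a constant can use the index directly
    SELECT item_id FROM item_keyword WHERE keyword = 'bird';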

--jkl


[sqlite] Suggestions for Fast Set Logic?

2015-07-10 Thread James K. Lowden
On Thu, 9 Jul 2015 22:28:04 +0100
Simon Slavin  wrote:

> If you're going to do it properly you have a keywords column which
> contains strings like
> 
> ,
> ,animal,
> ,animal,predator,
> ,animal,predator,bird,

Why do it that way?  I would recommend a schema like Keith's, with one
keyword per row in a separate table.  

The OP mentioned that searching 250,000 rows took 2 seconds.  Because a
LIKE clause can't use an index, every query of a column such as you
suggest will require a linear scan, plus pattern-matching overhead.  A
normalized plan like Keith's permits indexing by keyword, which should
be blazing fast when testing for equality to string constants.  
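
A sketch of that normalized design; the names are illustrative, not
Keith's exact schema:

    CREATE TABLE item_keyword (
        item_id INTEGER NOT NULL,
        keyword TEXT    NOT NULL,
        PRIMARY KEY (item_id, keyword)
    );
    CREATE INDEX item_keyword_kw ON item_keyword (keyword);

    -- indexed equality search: no pattern matching, no linear scan
    SELECT item_id FROM item_keyword WHERE keyword = 'predator';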

--jkl


[sqlite] Slow real world performance - Any suggestions please (warning long)

2015-07-05 Thread James K. Lowden
On Fri, 3 Jul 2015 11:35:21 +0100
Rob Willett  wrote:

> It basically has taken no time to run 10,551 selects. This implies
> that we had issues on the database before with either fragmentation
> or indexes being poorly setup.  

Perhaps you said so and I missed it, but to me it's clear that almost
all the improvement comes from using RAG_Idx1.  Because it indexes the
Text column, you're now doing a binary search instead of a linear
scan.  

By changing the collation, you went from scanning every row to
searching a binary tree.  That took you from O(n) to O(log2 n).  If
each row in your 16 GB database is 100 bytes, allowing for overhead you
might have 80 million rows?  To satisfy your query, on average that
would be an improvement from 40,000,000 I/O operations to about 27.  

When something that used to take 122 seconds starts going a million
times faster, it begins to look pretty much instantaneous.   :-)  

--jkl


[sqlite] Could frequent 'database is locked' errors mean SQLite is not a good fit for my application?

2015-07-02 Thread James K. Lowden
On Thu, 2 Jul 2015 10:09:12 -0400
Kathleen Alexander  wrote:

> Essentially, I have written an application in C++ that interfaces
> (reads and writes) with a SQLite database, and I am getting lots of
> 'database is locked' errors. [...]
> 
> My application runs on Linux (ubuntu 13.10), and is driven by a bash
> script that spawns many (~60 on a 64 core workstation) instances of a
> serial, C++ program, each of which opens its own connection to the
> database and performs reads and writes.

It may be that SQLite is not a good fit for your application.
Concurrent update is SQLite's Achilles heel.  

Each insert/update/delete in SQLite requires exclusive write access to
the whole database.  WAL mode lets readers proceed while a writer works;
rollback mode blocks them too.  Either way, only
one process can update the database at a time.  If you have 60
processes, 59 will wait.  

Depending on your requirements, that may still be OK.  If the updates
run quickly enough for your purposes, then increasing the timeout may
do the trick.  Estimate the processing time and number of updates to
compute an overall throughput.  If that's acceptable, problem solved.  
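
The timeout itself is one line of SQL, or the equivalent
sqlite3_busy_timeout() call; the 5000 ms below is an arbitrary choice:

    -- wait up to 5 seconds for the lock before returning SQLITE_BUSY
    PRAGMA busy_timeout = 5000;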

Most DBMSs, the heavy kind, devote extensive resources to support
concurrent update.  Usually contention is managed at the page or row
level, and a whole section of the manual covers how the system
implements SQL's "isolation levels".  Just exactly how many processes
can update the database at once is a function of almost everything:
DBMS configuration, table design, index definition, and isolation
level, not to mention hardware capacity and the OS I/O subsystem.  Your
mileage will most certainly vary.  

> B. If SQLite is not a good fit for my program, do you have any
> suggestions of an alternative database engine that is free or might
> be free or inexpensive for academic use?

*If* that's the case, your best option IMO is Postgres.  If you want to
handle ~1000 concurrent updates, though, you will have to pay attention
to the details, and may have to arrange to minimize resource contention
in the DBMS.  It all depends on the throughput and response-time
requirements.  

HTH.  

--jkl


[sqlite] Question about the list

2015-06-21 Thread James K. Lowden
On Fri, 19 Jun 2015 12:14:22 +1000
 wrote:

> Some of the messages I receive have been cross-posted to two lists
> (eg from jkl):
> 
> sqlite-users at mailinglists.sqlite.org
> sqlite-users at sqlite.org

Oy, sorry for the confusion.  There were old settings in my mailer,
thanks for setting me straight, as it were.  

--jkl



[sqlite] Mozilla wiki 'avoid SQLite'

2015-06-18 Thread James K. Lowden
On Wed, 17 Jun 2015 22:05:12 -0700
Darren Duncan  wrote:

> I also believe the world is ripe to have SQL alternatives, it's just
> a matter of ones appearing that are compelling to users for real work
> and not just an academic exercise. The fact we're still generally
> with SQL means this hasn't happened yet, but that doesn't mean it
> won't.

Yes, I've been keeping track of TTM-inspired projects, and tried to
convince Ingres to commercialize its D implementation.  What makes
David's andl unusual is that it's a new language atop a DBMS that is
used in production.  

Still, I'm not so sure the world is ready for a better SQL.  The
evidence stands against, certainly.  We already discarded one -- QUEL --
which hardly anyone remembers.  A great deal of effort has gone into
replacing SQL with less powerful constructs with no theoretical
foundation, and into neutering the DBMS with ORM toys.  

Do not underestimate SQL's enormous inertia and network effect.
Recognize that even if the syntax is better, engineering
challenges remain if the promise of fidelity to the relational model is
to be realized.  

The inertia stems from laziness and ignorance, never in short supply.  A
fraction of those who write SQL understand the math and logic
underlying it.  Forums like Stack Overflow are rife with questions that
demonstrate as much; the answers are often not much better.  If you're
not thinking in terms of relational algebra and are unaware of the
benefits of using logic to accomplish your task, changing syntaxes will
only compound your problems.  If you *are* thinking in those terms,
it's still work to learn a new language.  It's not clear that a better
syntax -- if that's all you get -- would be seen as a worthwhile effort
for very many people.  

The network effect is important, too.  The market recognizes knowledge
of SQL as a skill (even if it underestimates its value).  That skill is
portable across jobs and DBMS implementations.  It is a way that
programmers communicate to each other across the organization and
time.  A new language has to clear a threshold of "better" to be
accepted.  

There are two engineering challenges that come to mind: performance and
set-theory semantics.  

As David outlines in his blog, it's no mean feat to swap out SQL from
any DBMS, including SQLite.  Because the query language is
assumed/known to be SQL, query planning, rewriting, transformation, and
optimization is bound up in the grammar of SQL itself.  Readers of this
list appreciate how much work goes into that.  

Set-theory touches on the implementation, too.  DISTINCT is 
implied in andl.  It was left out of SQL because it requires a sort.
The ramifications are nontrivial; consider joining on a nonkey and
counting the results.  Also remember that SQLite violates even SQL's
set-theoretic features, such as when 

update T set X = X + 1

fails if X is declared unique.  
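
A sketch of that failure:

    CREATE TABLE T (X INTEGER UNIQUE);
    INSERT INTO T (X) VALUES (1), (2), (3);
    -- as a set operation this maps {1,2,3} to {2,3,4}, which is fine,
    -- but row-at-a-time checking sees 1+1 collide with the old 2
    UPDATE T SET X = X + 1;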

So I would say replacing SQL with andl or similar is in fact not
analogous to replacing Perl with Python or C++ with Go.  The requisite
knowledge in the user community to appreciate the improvement is
greater (and relatively nonexistent).  And the engineering effort to
build a DBMS is far more than to build a compiler.  

There's a reason Larry Ellison can affort a yacht almost 500 feet
long.  

--jkl


[sqlite] Is recursive CTE fully capable?

2015-06-16 Thread James K. Lowden
On Mon, 15 Jun 2015 11:03:17 +1000
 wrote:

> >>>Unless the recursion is circular, I don't see how an SQL query
> >>>over a finite database could fail to terminate.  
> 
> What does this mean? It is trivial to write a recursive CTE that does
> not terminate, and the property of "circularity" is not what makes the
> difference.

Hmm, for a correctly written recursive query not to terminate, is it not
a requirement that the data contain a cycle?  I can't prove it, but no
counterexample springs to mind.  

In the positive: a correct recursive query always terminates if the
data represent a directed acyclic graph.  

By "correct" I mean the CTE expresses a recursive relation.  If you
recurse over

with R (a, b) as (select 1 as a, 1 as b)

you have no right to expect termination.  But you might be able to fry
an egg on the processor.  
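
Two sketches of the distinction; the edge(parent, child) table is
hypothetical:

    -- terminates: the recursion follows the edges of a finite DAG,
    -- and UNION discards any row already produced
    WITH RECURSIVE below(node) AS (
        SELECT child FROM edge WHERE parent = 1
        UNION
        SELECT e.child FROM edge AS e JOIN below AS b ON e.parent = b.node
    )
    SELECT node FROM below;

    -- never terminates: each step manufactures a brand-new row
    WITH RECURSIVE r(n) AS (
        SELECT 1 UNION ALL SELECT n + 1 FROM r
    )
    SELECT n FROM r;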

--jkl



[sqlite] Mozilla wiki 'avoid SQLite'

2015-06-16 Thread James K. Lowden
On Tue, 16 Jun 2015 09:56:38 +1000
 wrote:

> The question is: what should a database language do? Andl can already
> match or surpass SQL on database programming tasks, but is that
> interesting enough?
> 
> What would make a database programming better, or best?

Two things I've often pointed to are namespaces and regular
expressions.  Another is compound datatypes.  

SQL and C both suffer from a single variable namespace.  We get around
it by using prefixes, e.g., "local_memcpy" or "annualized_returns".  

C++ added namespaces to the language.  I suggest SQL's successor do the
same, but use the Unix filesystem's hierarchical namespace as a model.
Putatively, 

ATTACH DATABASE 'foo.db' as /db/local/foo;
CHANGE DATABASE /db/local/foo;
CREATE TABLE annualized/returns ;

As far as I can tell, all the basic file and link management features
of the filesystem have analogous utility in a database.  (I would extend
that idea to permission bits, about which we could have a robust
discussion if you're interested.)  

Regular expressions likewise belong in a query language.  The LIKE
operator, an NIH relic of SQL's IBM origins, belongs on the ash heap of
history.  Best to follow Russ Cox's advice and restrict the regex
syntax to constructs with O(n) complexity.  

Finally, compound datatypes would simplify and encourage the use of
natural keys.  Something along these lines, 

CREATE UDT stock_key ( asof datetime, cusip char(8) );
CREATE TABLE prices( stock_key, price float );
CREATE TABLE returns( days int, return float, stock_key 
references prices );

Constraints defined on the compound user-defined type would of course
apply to whatever table it appears in.  

I thought I'd pass these along because you asked and because I don't
remember seeing them in TTM.  

I assume you're supporting row-generators.  Do you intend to support
table-comparison, too?  What about insert/update as assignment?  

--jkl



[sqlite] Mozilla wiki 'avoid SQLite'

2015-06-16 Thread James K. Lowden
On Tue, 16 Jun 2015 10:36:45 +0200
Marco Bonardo  wrote:

> The contents of the page came out from actual bugs and misuses we hit
> in years of use and experience with it and were discussed with
> attention. Most also have workarounds or suggested fixes.

While I'm sure that's true, the page is filled with dubious
assertions.  

Almost everything it says about I/O is evidence-free. The implications
of a file (or the database) using multiple extents on the disk
("fragmentation") is the best example.  For an application like FF, how
much data, and how many seeks, are we really talking about? How many of
them are in name only, because the OS reads whole tracks and buffers
hundreds of megabytes in the page cache?  

Is fsync "very expensive"?  Not likely.  Compared to network latency,
it's near zero.  On many consumer hard disks, it *is* zero, because the
drive lies about completed writes.  

Do small, simple writes perform better in JSON?  If so, by how much?
What complexity then leaks back into the application to manage deletes
and index seeks, not to mention concurrency?  

The page doesn't mention the value of constraints for correctness.  It
doesn't mention the convenience to the developer of being able to
inspect and query the data outside the application.  

Yes, sure, right tool for the job.  In deciding, best to know what the
tool does, and what the job is.  

--jkl



[sqlite] Is recursive CTE fully capable?

2015-06-14 Thread James K. Lowden
On Fri, 12 Jun 2015 01:45:50 +0100
Simon Slavin  wrote:

> There are plenty of queries which can be expressed in a SQL database
> but can't be answered without a computer which can reprogram itself

Are there?  Do you mean there are SQL queries like that?  Or do you
mean there are such queries that could be asked of a relational database
but cannot be expressed in SQL?  

There are queries that cannot be formulated in first order predicate
logic, and recursion is the single capability of SQL that exceeds
FOPL power.  Unless the recursion is circular, I don't see how an SQL
query over a finite database could fail to terminate.  

--jkl


[sqlite] DROP statement on Virtual Tables

2015-06-09 Thread James K. Lowden
On Tue, 9 Jun 2015 15:13:47 +
Hick Gunter  wrote:

> xConnect is called whenever SQLite decides it needs to do something
> with the existing virtual table. There must have been a successful
> xCreate call (in another session, another thread or another process)
> before. xConnect is not allowed to fail, even if whatever storage
> implements the persistant data is no longer present. It should just
> take note of that fact and let the other routines react appropriately.

I don't see how xConnect can not be allowed to fail.   The
documentation says it may return SQLITE_ERROR, and it must allocate the
sqlite3_vtab structure, which can fail.  If a required resource is not
available, ISTM returning an error is well advised, lest a call to
a later function fail more mysteriously.  

That said, the putative CSV file cannot be deleted between xCreate and
xConnect if the vtable holds a descriptor to the file.  In Linux, the
file may be unlinked, but the inode is preserved while any open
descriptor remains.  In Windows, the user is prevented from deleting an
open file.  

Of course the file may go away anyway.  It might be on a USB drive and
the user may pull it.  In general, the hardware may fail.  That's
another reason xConnect may return an error.  

--jkl



[sqlite] User-defined types

2015-06-05 Thread James K. Lowden
On Thu, 4 Jun 2015 15:11:55 -0700
Darko Volaric  wrote:

> Are you seriously saying that that SQL syntax is friendly? How can you
> defend SQL syntax other than on grounds of history or
> standardization?

The first and best defense of SQL is that it has at least some
basis in the relational model.  It expresses relational project,
select, join, union, and intersection directly, and with contortions
relational division.  Like Algol-60, it's an improvement on its
predecessors and on many of its successors.  

> If you're more comfortable and familiar with JSON
> the yes it is easier and you can avoid an unnecessary conversion step.

I wonder how many applications you've profiled for which SQL generation
and parsing were a significant share of the run time.  Usually once the
data are of any appreciable size I/O becomes the most important
component.  That's why modern DBMSs have so many features to minimize
I/O.  

While you're working on your new syntax, I hope you'll keep Richard's
query in mind.  SQL, verbose as it is, is pretty clean compared to
most ORM syntaxes I've seen.  A syntax that can express his query more
succinctly and is also "better" along the lines you describe would be
an achievement.  

--jkl


[sqlite] User-defined types

2015-06-05 Thread James K. Lowden
On Fri, 5 Jun 2015 13:07:59 -0400
Stephen Chrzanowski  wrote:

> If N-Tier software development is 'annoying' and you are not happy,
> either get help by other members of your team, or, find a different
> hobby, because anything less than 3-tier programming dealing with
> multiple languages, technologies and functionality just isn't going
> away.  Quite honestly, and this is just my opinion, but I think it is
> absolutely wrong of you to go into a place of employment angry or
> annoyed at ANY level because of the multiple languages and
> technologies used to bring your product to life. Get mad at the
> politics, not the tools being used.  

Whew!  Most days this list is as well behaved as a mature poodle, and
then once in a while someone writes a rant that segfaults on the first
paragraph.  

You're saying complexity here to stay, it's inevitable.  Simon is
saying it's unnecessary (ergo annoying).  The more you know about
inherent and accidental complexity, the lower your threshold for being
annoyed by the latter.  There's no reason your 6-tier application
couldn't be written in a single language.  You could have one syntax
for data structures, one representation (and semantics) for missing
data, one binary representation for each type.  You could throw an
exception at tier-0 that could be caught in tier-5.  

The jumble of technologies we use is very much a happenstance accident
of industrial history.  No one designed it that way.  Why else would we
have no less than two completely different abstractions of the machine
-- the OS and the browser -- connected by nothing more sophisticated
than a stream of bytes?

> Second... Come on... Really?  This "switching...requires a lot of
> effort" comment is a pretty weak sauce excuse coming from any
> developer

It's not an excuse, "sauce" (whatever that means) or otherwise.  It's a
fact.  It's called "cognitive load" and it's been recognized since the
dawn of software.  

--jkl


[sqlite] AUTOINC vs. UUIDs

2015-05-21 Thread James K. Lowden
On Thu, 21 May 2015 13:44:15 +0200
Mark Straver  wrote:

> > Oh and the other feature! of UUIDs is merging databases... if I
> > have a database deployed to 15 customers, each with their own ORGID
> > I can easily throw them in the same database without worrying about
> > fixing up primary keys and all records related to those keys.
> 
> You wouldn't with sequential PKs either, but instead of stuffing 2
> types of data into a single field (which I think is also a bad idea)
> you simply have an extra field with ORGID. The PK should always be
> unique to the target DB, unrelated to the source DBs, when merging.

Exactly.  One extra column makes the key unique across (in this
example) organizations.  Plus it adds information -- the organization
ID -- that would otherwise be lost in the hash.  

We have a whole Internet running on nonrandom addresses, a world wide
web running on client-determined nonrandom addresses.  Instead of
relying on hashes and patching up the inevitable conflicts, the
designers of TCP/IP and HTTP wisely decided on human-readable,
deterministic identifiers.  Strange indeed it is to watch those who
build atop that infrastructure assert that hashes are necessary and
integers insecure.  

--jkl


[sqlite] AUTOINC vs. UUIDs

2015-05-21 Thread James K. Lowden
On Wed, 20 May 2015 19:05:29 +0100
Simon Slavin  wrote:

> Posting this not because I agree with it but because the subject has
> come up here a couple of times.
> 
>  >
> 
> "Today, I?ll talk about why we stopped using serial integers for our
> primary keys, and why we?re now extensively using Universally Unique
> IDs (or UUIDs) almost everywhere."

"A relational database is a graph where nodes are called
entities and edges relations."

It's hard to know where to start.  But that's definitely not it.  

--jkl



> 
> Simon.


[sqlite] Virtual Table query - why isn't SQLite using my indexes?

2015-05-20 Thread James K. Lowden
On Tue, 19 May 2015 20:44:17 +
Eric Hill  wrote:

> But then what about a query like this:
> 
>   SELECT * FROM T1
>   LEFT OUTER JOIN T2 ON ( T2.a = T1.a ) AND ( T2.b =
> T1.b ) AND ( T2.c = T1.c );
> 
> xBestIndex will get called here for T1 with 3 constraints, c, b, and
> a, in that order.  In this case, though, it seems (to the
> uninitiated, at least ;-) that the "best index" would be:
> 
>   CREATE INDEX T1_all ON T1 (c, b, a);

Given that query, any index that includes a, b, and c would serve
equally well.  The order of the criteria in the ON clause is
immaterial.  

> I guess xBestIndex is saying "Tell me about indexes that you already
> have available so I can take advantage of them", 

Right.  

> I have *no* indexes, but I am willing to make whatever indexes would
> be most helpful, if I could just figure that out.

Depending on the physical characteristics of thing you're
representing as a virtual table, you might not want any indexes. If
every access requires either a sequential scan or a random access to
computed location, all search criteria are either perfect or
inconsequential.  If that's fast enough, it's fast enough.  If you want
to make it faster, indexes are certainly an option.  

Don't start by guessing, though.  Rather than beginning by trying to
anticipate what the query-generation tool will produce (and thus what
xBestIndex combinations will be interrogated), it's better to put the
horse before the cart and consider the data.  Once you've characterized
your data, you apply that information to that supplied by xBestIndex.  

If you have a table in 1NF, you have a key.  What are the columns that
identify a row?  That's your primary key, and your first index. There
may be other sets of columns that uniquely identify a row; these also
could use an index to enforce/verify uniqueness.  

Suppose your primary key is {a,b,c} and you want to construct an index
for it.  At this point I will be talking out of school, because I
don't know anything about the SQLite query planner.  But if I go astray
I'm sure others will correct me.  

The choice of the first column in the index is the main concern, and it
is influenced by the dominant search criteria i.e., the kinds of queries
that will be submitted.  

Queries in general are of two kinds: point queries, returning one row
(specifying values for every column in the index), and range queries
(specifying an incomplete set of values for the index).  Point queries
are indifferent to column order: find row in index and return. 
Range queries are the interesting ones.  

In a B+ tree such as SQLite uses for its indexes, rows will be sorted
in column order.  If your index is {a,b,c}, all the "a" rows will be
together, with the "b" values sorted within each "a", and so on.  When
the query says, 

where a between x and y

the I/O engine can scoop all the row references out of contiguous
pages in the index.  Whereas if the query says, 

where b between x and y

the index is all but useless.  Because it's sorted by "a", it would have
to be scanned from beginning to end to find all the "b"'s that meet the
criteria.  
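
Concretely, with a hypothetical index on {a,b,c}:

    create index T_abc on T (a, b, c);

    select * from T where a between 10 and 20;   -- contiguous in the index
    select * from T where b between 10 and 20;   -- the index can't help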

The other important criterion to answering xBestIndex, as Hick
mentioned, is cardinality.  You can create for each column a histogram
of the frequency distribution for the values in that column.  If the
column domain is two values (e.g., 'male' and 'female'), for example,
the probability of it matching a queried value X might be 50%.  (I think
I read that at Google 4 employees in 5 are men, in which case a
value of 'female' would be a much more useful 20%.)  You can use the
combination of selective cardinality and your intuition (or experience)
for probable search criteria to create other indexes.  Your answer to
xBestIndex will be based on the product of the selectivity of the
contiguous columns (starting from the first) in the applicable index.  

Cardinality is only one part of the equation, however; the other is the
operator.  Equality is at least an opportunity for a point query, but
inequality implies a range.  How big a range?  If we're talking about
ages, for example, 

where age > x

If X is 0, that's not very useful, whereas if X is 100, it's very
selective of human beings (assuming age is in years).  Problem is, X
isn't supplied (for good reason) to xBestIndex.  How to answer?  

You can only work with what you know: the cardinality of the column,
the operator, and the rowcount.  Rows and cardinality you know, and
they apply directly to equality comparisons.  For inequality, you need
a heuristic.  A quarter of a century ago, during the Late Bronze Age,
Sybase used a simple heuristic: every inequality selected 25% of the
table.  That meant a BETWEEN query (two inequalities) was 6.25% = 0.25
* 0.25.  I doubt they pulled that number out of a hat.  Absent better
information, that might be your best option.  
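
As a sketch, here is how that heuristic might look inside xBestIndex,
assuming the virtual table keeps a per-column distinct-value count
(that bookkeeping is not shown):

    /* Crude per-constraint selectivity, Sybase-style: equality selects
    ** 1/cardinality of the rows; any inequality selects 25% of them. */
    static double selectivity(double nDistinct, unsigned char op){
        if( op==SQLITE_INDEX_CONSTRAINT_EQ ){
            return 1.0 / (nDistinct > 0 ? nDistinct : 1);
        }
        return 0.25;                   /* GT, GE, LT, LE */
    }

Multiply the selectivities of the usable constraints by the rowcount
and report the product in pIdxInfo->estimatedCost.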

I hope that's of some use.  Complicated, for sure, and no small amount
of work.  

--jkl

[sqlite] Suggestion for .import improvement

2015-05-12 Thread James K. Lowden
On Mon, 11 May 2015 22:24:53 +0300
 wrote:

> Current .import behavior is that when the imported table does not
> exist, it will be created using the first row from the data file as
> column names for the new table.

Huh, somehow I'd forgotten that!  

> Once the initial table is created by the first .import, every
> subsequent time .import will add the header row again and again as
> data.

I think your suggested solution is a bit arcane, though.  A single
change to the header row, perhaps capitalization or adding an
underscore, would cause the header row to be treated as data.  And the
whole create-table-from-headers feature, while admittedly sometimes
convenient, is also a lion trap.  Most of the time, I'd rather be told
the tablename I typed doesn't exist than to automatically create a new
one with the wrong name or in the wrong database.  

It would be nice to be able to control that feature instead of adding to
its automagic.  In addition, it would be nice to control which lines of
the data file are imported.  

My first suggestion would be to add a setting, perhaps 

.import_mode create|require|skip_header|check_header
or
.autocreate on|off

The first option's values are:

create: .import *always* creates a table (else error)
require: .import *never* creates a table
skip_header: like require, and skip the first line of a file
check_header: like require, and verify that strings in the first
line of the file match the column names

That's pretty comprehensive, but also a little rigid.  Sometimes
there's more to choose about the file than how to treat the first
line.  That's more than a humble .import_mode setting can bear.  

Deselecting the header line isn't so hard, after all.  One way is to
replace 

sqlite3 db ".import '${FILE}' tablename"
with
sed 1d ${FILE} | sqlite3 db ".import '/dev/stdin' tablename"

But what if it's not a one-liner?  How to choose lines through
the .import syntax?  

.import FILE TABLE

I would take a page out of Perl's syntax, and let FILE be a command if
the first character is '|', 

sqlite3 db ".import '| sed 1d ${FILE}' tablename"

Drop the leading pipe character of the FILE string, hand the rest to
exec(2), and read its output from a pipe.  That gives you the full
capability of the command line to manipulate the file as it's imported.  
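
The dispatch inside the shell's .import handler could be as small as
this (a sketch only; zFile stands for the FILE argument as typed):

    FILE *in;
    if( zFile[0]=='|' ){
        in = popen(zFile+1, "r");      /* FILE is a command; read its stdout */
    }else{
        in = fopen(zFile, "rb");
    }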

--jkl


[sqlite] Signal handling and mmap

2015-05-09 Thread James K. Lowden
On Tue, 5 May 2015 12:05:51 -0700
Charles Munger  wrote:

> At https://www.sqlite.org/mmap.html, the documentation mentions:
> 
> "An I/O error on a memory-mapped file cannot be caught and dealt with
> by SQLite. Instead, the I/O error causes a signal which, if not
> caught by the application, results in a program crash."
> 
> It seems to me that a naively implemented handler might introduce
> database coherency bugs and other issues. Is there anywhere where I
> can read about how one might implement such a signal handler
> correctly, to gracefully recover the same way any as other sqlite I/O
> error?

Not that I know of.  ISTM one goal should be to convert the signal into
something that results in a rollback.  Then perhaps the user can
remove unneeded files and release enough space for the operation to
continue.  But if the problem is failing hardware, no amount of
signal-handling will reliably prevent file corruption.  
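
A minimal sketch of that conversion, assuming a single-threaded
application that closes the connection afterward (siglongjmp out of
library code leaves SQLite's internal state suspect, so treat this as
a sketch, not a recommendation; db, stmt, and rc are as in the
surrounding code):

    #include <setjmp.h>
    #include <signal.h>

    static sigjmp_buf env;
    static void on_fault(int sig){ siglongjmp(env, 1); }

    /* around the SQLite call: */
    signal(SIGBUS, on_fault);          /* mmap I/O errors arrive as SIGBUS */
    if( sigsetjmp(env, 1) == 0 ){
        rc = sqlite3_step(stmt);
    }else{
        sqlite3_exec(db, "ROLLBACK", 0, 0, 0);   /* then close the db */
        rc = SQLITE_IOERR;
    }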

--jkl


[sqlite] Multiple instances of the same program accessing the same db file

2015-05-05 Thread James K. Lowden
On Mon, 4 May 2015 02:05:54 +0100
Simon Slavin  wrote:

> On 4 May 2015, at 1:30am, James K. Lowden 
> wrote:
> 
> > That is the way most remote filesystems are designed and implemented
> > and documented.  Cf. http://www.ietf.org/rfc/rfc1813.txt:
> > 
> >   4.11 Caching policies
> > 
> >   The NFS version 3 protocol does not define a policy for
> >   caching on the client or server. In particular, there is no
> >   support for strict cache consistency between a client and
> >   server, nor between different clients. See [Kazar] for a
> >   discussion of the issues of cache synchronization and
> >   mechanisms in several distributed file systems.
> > 
> > If you can find documentation for cache semantics for CIFS, I'd be
> > interested to see it.  
> 
> In contrast to NFS both SMB and AFP are designed to support networks
> properly, with caches only on the computer which hosts the file and
> locking correctly implemented.  

Are you saying that SMB clients have no filebuffer cache?  Or that
it exists, but is reliably invalidated by writes from other clients?  

From what I remember from configuring Samba, SMB is 1) very
complicated (lots of versions and knobs) and 2) undocumented in
numerous ways. The protocol is better documented these days, thanks
to the EU, but I've never seen anything resembling semantic
guarantees.  So I'm skeptical of any assertion that it Just Works, even
subject to the constraints you mention.  

--jkl


[sqlite] Multiple instances of the same program accessing the same db file

2015-05-03 Thread James K. Lowden
On Sun, 03 May 2015 13:09:43 -0700
Scott Doctor  wrote:

> I wonder why the network developers have not appropriately addressed
> this issue. Looks like I need to research this problem more before
> implementing. 

From one perspective -- a common use case -- they *have* addressed this
issue.  But first let's be clear on what's going on and why network
filesystems are absolutely, positively unreliable for SQLite.  

Consider a 1-byte file containing the letter 'A'.  Two processes open
it read-write, and read the file.  The first time read(2) is called, the
'A' is read from the server's file into the client NFS (or whatever)
driver. Thereafter, on each machine, each iteration of 

lseek(fd, 0, SEEK_SET);
read(fd, buf, 1);

will read from the filebuffer cache, not from the server, at least for
a while.  

Now the first machine writes 'B' to the file.  On that machine, the
above iteration returns 'B' because that machine's filebuffer cache was
updated when the 'B' was sent to the server.  The first machine's cache
is coherent with the file contents on the server.  

The second machine is *not* notified of the update.  Reads continue to
be satisfied from cache, and the cache continues to contain 'A'.  The
cache is inconsistent with the file contents on the server.  

That is the way most remote filesystems are designed and implemented
and documented.  Cf. http://www.ietf.org/rfc/rfc1813.txt:

   4.11 Caching policies

   The NFS version 3 protocol does not define a policy for
   caching on the client or server. In particular, there is no
   support for strict cache consistency between a client and
   server, nor between different clients. See [Kazar] for a
   discussion of the issues of cache synchronization and
   mechanisms in several distributed file systems.

If you can find documentation for cache semantics for CIFS, I'd be
interested to see it.  

This is why I keep reminding the list that problems with remote
filesystems aren't due to "bugs".  They work as designed and as
documented.  They just don't work as expected, when "expected"
means, "just like local filesystems (only slower)".  

A moment's thought reveals why they work this way.  Network-wide
client-cache coherency has O(N) complexity, where N is the number of
clients.  It's fraught with timing and performance issues.  

We haven't even mentioned SQLite per se.  I don't know where SQLite
tracks table-update status.  I do know that some of the filebuffer
cache data are in SQLite variables.  Some invariants about
thread-of-control consistency of those variables do not hold when the
underlying file is changed by an "external" process, which is what a
SQLite client running on another machine is.  The situation invites
textbook write-after-read cache-coherency problems.  

The problems that SQLite has with network filesystems isn't the least
mysterious if you scratch the surface.  RFC 1813 is in the public
domain, and references several papers that are freely available.  One
evening will convince you.  

You might ask, if the problems are hard but solvable, why not solve
them anyway, and perhaps make the semantics configurable on a
correctness-performance tradeoff?  I think the answer is that most
applications don't require it!  Most remote-file applications read the
entire file into memory, modify the data in memory, and write the file
back out minutes or hours later when the user saves it.  For them,
last-writer-wins is OK.  For applications that need to coordinate
commingled writes to a shared file, the dominant solution is a
server-based DBMS such as Postgres.  

--jkl


[sqlite] Segfault during FTS index creation from huge data

2015-05-03 Thread James K. Lowden
On Thu, 30 Apr 2015 12:47:57 -0600
Scott Robison  wrote:

> Perhaps you are correct and "sigsegv" is not the literal signal that
> is triggered in this case. I don't care, really. The fact is that an
> apparently valid pointer was returned from a memory allocation
> function yet can result in an invalid access for whatever reason (out
> of memory, in this case). The Linux OOM killer may kill the offending
> process (which is what one would expect, but one would also expect
> malloc to return null, so we already know not to expect the
> expected). Or it may kill some other process which has done nothing
> wrong! Sure, the OS is protecting the two processes address space
> from one another, but it seems to me that if one process can kill
> another process, there is a problem.

I have no argument with you, Scott.  It's neither the first nor last
thing Linux implemented in the name of efficiency that undid what
previously were guarantees.  My only angels-on-the-head-of-a-pin
argument is that the OOM-killer doesn't invalidate malloc-returned
pointers in particular.  It sweeps with a much broader brush, you might
say.   ;-)  

SIGSEGV *is* significant to the OP because it doesn't signify heap
exhaustion.  If that signal was triggered in the heap, it indicates
heap corruption.  If it was triggered in the stack, it suggests the
stack might have been exhausted, perhaps before a pure OOM condition was
reached.  

--jkl


[sqlite] Multiple instances of the same program accessing the same db file

2015-05-03 Thread James K. Lowden
On Sat, 02 May 2015 19:24:19 -0700
Scott Doctor  wrote:

> Each computer has its own copy of the program, but may 
> share the database file which may be located on any of the computers.
> So I am wondering whether I should implement my own locking logic in
> my program.

Yes, you need to arrange within the application not to have two
processes writing to the remote database at the same time.  The locking
SQLite uses depends on semantics not provided by the remote
filesystem.  

With a local filesystem, when two processes are updating a file, each
process's update is visible to the other in the unified buffer cache
supplied by the OS.  In a networked filesystem, there is no unified
buffer cache: updates from process A, while manifest in the file, are
not necessarily reflected in the cache used by process B on another
machine.  A subsequent update from B based on its outdated cache could
well create an incoherent file image.  

The only safe answer is arrange for each update to begin by locking the
file in the application.  Then open the database, update it, close the
database, and unlock the file.  By the same token, after any update
every reading process should close and re-open the database before
continuing to rely on the database.  

One way to do that would be to keep an update.count file adjacent to
the database file.  Lock it, read it, increment it, proceed with the
database update, and release it.  Before each read, lock the file for
reading, and check the counter value.  If it's changed, close and
re-open the database, execute the SELECT, and release the file.  
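
The locking itself might look like this on the writer's side (a
fragment, error handling omitted; fcntl(2) record locks are used here
because they are the flavor NFS is supposed to honor through its lock
daemon):

    unsigned count = 0;
    struct flock lk = { .l_type = F_WRLCK, .l_whence = SEEK_SET };
    int fd = open("update.count", O_RDWR | O_CREAT, 0644);
    fcntl(fd, F_SETLKW, &lk);          /* block until we own the lock */
    read(fd, &count, sizeof count);
    count++;
    /* ... open the database, update it, close it ... */
    lseek(fd, 0, SEEK_SET);
    write(fd, &count, sizeof count);
    lk.l_type = F_UNLCK;
    fcntl(fd, F_SETLK, &lk);
    close(fd);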

That's just an outline; I might have overlooked something.  The
complexity and pitfalls explain why applications that need
inter-machine consistency connect to a DBMS daemon that manages its
database files locally.  

HTH.  

--jkl


[sqlite] Does column order matter for performance?

2015-04-30 Thread James K. Lowden
On Thu, 30 Apr 2015 09:45:14 -0700
Pol-Online  wrote:

> I wasn?t able to find the answer to this question online: does the
> column order matter for SQLite performance? E.g. should you put fixed
> width columns like INTEGER before TEXT or BLOB?  

I'm going to check that box as No.  

SQLite keeps its rows in pages and reads whole pages.  Within a page it
references rows and columns by offset.  Every location within a page is
accessed in constant time.  So column order doesn't matter because
page-offset doesn't matter.  

Most database performance problems are design problems.  If you pay
attention to cardinality and index definition, the rest will usually
take care of itself.   

--jkl


[sqlite] Segfault during FTS index creation from huge data

2015-04-30 Thread James K. Lowden
On Wed, 29 Apr 2015 20:29:07 -0600
Scott Robison  wrote:

> > That code can fail on a system configured to overcommit memory. By
> > that standard, the pointer is invalid.
> >
> 
> Accidentally sent before I was finished. In any case, by "invalid
> pointer" I did not mean to imply "it returns a bit pattern that could
> never represent a valid pointer". I mean "if you dereference a
> pointer returned by malloc that is not null or some implementation
> defined value, it should not result in an invalid memory access".

Agreed.  And I don't think that will happen with malloc.  It might, and
I have a plausible scenario, but I don't think that's what happened.  

In the bizarre context of the Linux OOM killer, the OS may promise more
memory than it can supply.  This promise is made by malloc and
materialized by writes to memory allocated through the returned
pointer, because at the time of writing the OS must actually (and may
fail to) allocate the memory from RAM or swap.  

Exhaustion of overcommitted memory does *not* result in SIGSEGV,
however.  The OOM killer selects a process for SIGKILL, and the
straw-on-the-camel's-back process that triggered the OOM condition is
not necessarily the one that is selected.  

As far as "invalid" goes, I don't see how we can single out pointers
from malloc.  In the presence of overcommitted memory, *all* addresses,
including that of the program text, are invalid in the sense that they
are undependable.  The process may be killed through no fault of its
own by virtue of a heuristic.  I think it's fair to say it makes the
machine nondeterministic, or at least adds to the machine's
nondeterminism.  

Can writing through a pointer returned by malloc (within the
allocated range) ever result in SIGSEGV?  Maybe.   I have a plausible
scenario in the context of sparse files and mmap, which malloc uses.  

Let us say that you have two processes on a 64-bit machine, and a 1 TB
filesystem.  Each process opens a new file, seeks to the position 1 TB -
1, and writes 1 byte.  Each process now owns a file whose "size" is 1 TB
and whose block count is 1.  Most of the filesystem is empty, yet the
two files together lay claim to 200% of the available space.  These are known
as "sparse" files; the unwritten locations are called "holes".  

Now each process calls mmap(2) on its file for the entire 1 TB.  Still
OK.  mmap will not fail.  The holes in the files return 0 when read.
When written to, the OS allocates a block from filesystem and maps it
to a page of memory.  As each process begins writing 1's sequentially
to its memory, successive blocks are allocated.  Soon enough the last
block is allocated and the filesystem will be really and truly full.  
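
In code, each process's setup amounts to this fragment (a sketch;
headers and error handling omitted):

    off_t sz = ((off_t)1 << 40) - 1;                  /* 1 TB - 1 */
    int fd = open("big.dat", O_RDWR | O_CREAT, 0644);
    lseek(fd, sz, SEEK_SET);
    write(fd, "", 1);               /* "size": 1 TB; blocks allocated: 1 */
    char *p = mmap(0, sz + 1, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    for( off_t i = 0; i <= sz; i++ )
        p[i] = 1;                   /* each new page forces a block allocation */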

At the next allocation, no block can be allocated and no page mapped.
What to do?  When calling write(2) on a full filesystem we expect
ENOSPC, but there's nowhere to return an error condition when writing
to memory.  Consequently the OS has no choice but to signal the
process.  On Linux that signal is SIGBUS, though some systems deliver
SIGSEGV.  

What does that have to do with malloc?  GNU malloc uses mmap for large
allocations; the pointer it returns is supplied by mmap for an
anonymous mapping to blocks in the swap partition.  If malloc creates
sparse files, writes through malloc'd pointers could end the same way.
However, I do not know that that's what malloc does.  

I do not think that's what's happening in the OP's case.  I suspect the
OP's process sailed past any memory-allocation constraints because of
the overcommitted memory configuration, and eventually ran aground when
the stack was exhausted.  Others have already suggested fixing the
overcommit setting as a first step.  Others might be:

1.  Examine the core dump to determine if the SIGSEGV was triggered by
a write to heap or stack memory.  Or not, as the case may be.  ;-)  

2.  Investigate the malloc algorithm and/or replace it with one that
does not use sparse files.  

3.  Increase the stack space allocated to the process.  

It's an interesting problem.  I hope we learn the answer.   

--jkl


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread James K. Lowden
On Wed, 29 Apr 2015 01:27:03 +0300
Artem  wrote:

> >> Error: near line 1: out of memory
> 
> > That's not a segfault, though, is it.
> 
> When I did the same in linux version of SQLite - I saw
> the "Segmentation Fault" error.

The above message indicates that SQLite trapped an errno of ENOMEM
after malloc returned NULL.  Otherwise it would not report "near line
1", something the kernel could never do.  

One possible way you could see SIGSEGV on Linux -- even if every call
to malloc is handled correctly and no out-of-bounds reference is ever
made -- is if the process runs out of stack space.[1]  That seems within
the realm of possiblity if we're talking about giant allocations.  

It's also plausible that Windows would report ENOMEM while LInux does
not.  If, as is commonly done, Linux is configured to overcommit
memory, malloc will seldom or never (depending
on /proc/sys/vm/overcommit_memory) return NULL.  Given the same RAM and
swap space, Windows would report ENOMEM when Linux would not.  In the
Linux case, the process could plausibly continue (and not be the OOM
victim) up until the point that it exhausts its stack.  

You could check the value of your overcommit_memory and set it to 2 if
that's not already the case.  That might shed some light.  
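
For example (0 is the usual default, meaning heuristic overcommit):

    $ cat /proc/sys/vm/overcommit_memory
    0
    $ sudo sysctl -w vm.overcommit_memory=2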

Developers: What value of overcommit_memory is used in testing?  

---jkl

[1] http://www.win.tue.nl/~aeb/linux/lk/lk-9.html


[sqlite] Segfault during FTS index creation from huge data

2015-04-29 Thread James K. Lowden
On Wed, 29 Apr 2015 02:39:50 -0600
Scott Robison  wrote:

> On linux, malloc may return a non null yet invalid pointer and only
> fail when the memory is accessed because it wasn't really available.

Citation needed.  I believe SIGSEGV is possible with anonymous mmap and
overcommitted memory.  ISTR I've read as much.  But when I looked for
a definitive reference while answering this message, I could not find
one.  

It's not quite accurate to say the pointer is "invalid".  It's valid;
it just refers to memory the system may not be able to supply when
committed_memory > memory.  

--jkl



[sqlite] SQLite and Scientific Computing: Arrays and Complex Numbers

2015-04-29 Thread James K. Lowden
On Tue, 28 Apr 2015 09:24:56 +0200
Dominique Devienne  wrote:

> On Tue, Apr 28, 2015 at 4:16 AM, James K. Lowden
>  wrote:
> 
> > A major hurdle is the memory model: because array-programming
> > libraries normally mandate the data be in contiguous memory,
> > there's a cost to converting to/from the DBMS's B+ tree.  The more
> > array-like the physical storage of the DBMS, the more it cedes
> > transactional and update capability.
> 
> Well, just look at how Oracle solved that problem. The row data
> doesn't hold the blob itself, like in SQLite, but an index to
> separate blob pages. This proverbial level of indirection brings
> tremendous benefits, because you can then update a blob w/o having to
> rewrite the whole blob (you copy the "blob" page(s) being updated
> only, and copy the row into a new "data" page with an updated "blob
> index" with just a few entries changed to point to the updated pages. 

Does Oracle provide access to the BLOB as a contiguous memory image in
the Oracle server's virtual address space?

The benefit you're describing reduces the I/O required to update the
BLOB and to traverse the table without referencing it.  That's a *very*
old technique; it was available in Sybase in 1980s.  To support use of 
arbitrary linear algebra functions on those BLOBs, they would have to
supplied to user-defined functions in linear memory with known
addresses.  

--jkl


[sqlite] SQLite and Scientific Computing: Arrays and Complex Numbers

2015-04-27 Thread James K. Lowden
On Mon, 27 Apr 2015 16:35:11 -0400
Jim Callahan  wrote:

> So, the application of arrays and complex numbers go far beyond the
> one question posed to this list with arrays of integers and reals
> being far more common than arrays of complex numbers.
> 
> Complex numbers are included as types in C, C#, FORTRAN and Python
> and are being added to server databases (Oracle and Microsoft SQL
> Server) as user defined types, so programmers and server database
> administrators may develop an expectation that complex types or user
> defined types will be supported SQLite3.

Useful as complex numbers and arrays are to scientific work, SQLite is
deficient in ordinary math.  Division by zero yields NULL, for
example.  The usual suite of mathematical functions such as those
declared in math.h is absent, never mind linear algebra.  Even
enforcing the rule
that a column always hold a number is inconvenient at best.  

> I would expect that users of complex numbers would be an
> extremely small subset of data base users

Yes, and database users are a small subset of scientific programmers.
As some of the papers you cited allude to, there is some work (thanks
in part to Jim Gray) to answer the needs of that subset.  Most of that
work, though -- already a subset of subset -- focuses on giant
datasets.  "SQLite for science" is thus a small population by almost any
measure.  

IMO a "scientific DBMS" comprises two separate good ideas.  Neither one
has been solved.  

The first is efficient computation over "array-like" stuff, alerting
the computation engine that it is working with arrays, not sets, and/or
directly invoking existing libraries e.g. BLAS.  A major hurdle is the
memory model: because array-programming libraries normally mandate the
data be in contiguous memory, there's a cost to converting to/from the
DBMS's B+ tree.  The more array-like the physical storage of the DBMS,
the more it cedes transactional and update capability.  

The second is expression of mathematics in a database language.  SciQL
goes some distance in that direction by making tables and arrays
interchangeable, but the battle is not yet won.  I suspect a better
answer would arise from APL-for-databases than SQL-for-math. But, see
above, not exactly a popular topic.  

--jkl


[sqlite] Thoughts on storing arrays of complex numbers

2015-04-24 Thread James K. Lowden
On Fri, 24 Apr 2015 13:37:40 +
"Drago, William @ CSG - NARDA-MITEQ"  wrote:

> I'm trying to avoid re-inventing the wheel. Is there a best or
> generally accept way to store arrays of complex numbers? 

A table in First Normal Form has no repeating groups.  That means no
row has an array of any kind.  Arrays in the relational model are
represented in columns, one element per row.  

A column whose type is nonrepeating is said to be "atomic", but that's
something of a convention.  An atom is supposed to be indivisible, but
we can take substrings of string, parts of dates, and exponents of
floating point numbers.  So nonrepeating datatypes aren't necessarily
atomic, exactly.  They're just not repeating.  ;-)  

The question of your complex array then comes down to two apects: how
to represent the complex number, and how to represent the array.  The
case for the array is simple: keep one complex number per row.  The
case for the "divisible atomic" complex number depends on a choice: how
you want the DBMS to treat the components of the complex type.  

The most general solution -- and therefore probably the best one -- is
to keep the real and imaginary components each in its own REAL column.
That lets you sort and select the complex numbers using SQLite's
built-in functions without limitation.  For example, if we call those
components "a" and "b", you could say, 

select * from T where "a" between 1.0 and 2.0

Such a table would be 

create table complex_array
( name TEXT not null
, ordinal INTEGER not null
, real_part REAL not null
, imaginary REAL not null
, primary key( name, ordinal )
);

That's the textbook solution on a DBMS without user-defined types.  

An alternative is to conceive of the complex type as a datatype, and
represent it in SQLite as BLOB or TEXT.  That severely limits
SQLite's ability to compare and select the values, although that
limitation can be somewhat alleviated with user-defined functions e.g., 

select * from T where "C" = complex(1.0, -0.5)

If 1) you're not interested in letting the DBMS inspect the data, and
2) you have some convenient C function to losslessly convert your
complex type to a string or bit-string, then a single-column
representation might be more convenient.  

--jkl




[sqlite] json_* functions in sqlite

2015-04-23 Thread James K. Lowden
On Tue, 21 Apr 2015 18:09:33 -0700
Ashish Sharma  wrote:

> Many times I store JSON data in sqlite. It will be useful if sqlite
> came with functions which understand JSON. Presto has a nice set
> https://prestodb.io/docs/current/functions/json.html

In case you don't know, you could implement functions such as Presto
provides yourself with a set of user-defined functions, without any
help from the SQLite project.  
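
A skeleton of such a function (the JSON parsing is the part you would
have to supply; json_get is an invented name):

    #include <sqlite3.h>

    static void json_get(sqlite3_context *ctx, int argc, sqlite3_value **argv){
        const char *doc  = (const char *)sqlite3_value_text(argv[0]);
        const char *path = (const char *)sqlite3_value_text(argv[1]);
        (void)argc;
        /* parse doc, walk to path, and return the value found there */
        sqlite3_result_null(ctx);      /* placeholder result */
    }

    /* registered once per connection: */
    sqlite3_create_function(db, "json_get", 2, SQLITE_UTF8, 0,
                            json_get, 0, 0);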

As to whether JSON should be a supported datatype in SQLite, the answer
is clearly No.  SQLite, let us note, barely recognizes datatypes at
all, and lacks even a date type.  But more than that, JSON as "native"
datatype has two fundamental problems: definition and complexity.  

Definitionally, we should require any datatype have meaningful
operations comparable to those of numbers and strings.  What does it
mean to "add" two JSON objects?  When is one less than another?  Do you
seriously want to propose as a *type* a thing that can't be a primary
key?  

The problem of complexity is that everything in the JSON blob can,
fundamentally, be represented as tables in SQLite.  I realize "modern"
tools read/write JSON, that it's the OODBMS of the Javascript set.  But
that doesn't change the fact that the JSON tree is a graph, and we know
every graph can be represented with tables. 

Why does that matter?  Because a tree-as-datatype would add a whole new
theoretical structure (graph theory) that is 100% redundant to the
relational model embodied in SQLite.  You get a bunch of new functions
and ways to get at the data.  What you do *not* get is additional query
power.  In fact you get less, because graph theory gives you less: no
subsets and no joins, to name just two.  

That's not to say there should be some rule preventing you from storing
JSON in your SQLite database.  You may find it convenient, especially if
supported with some functions that make it possible to compare (or
perhaps update) components of it, because it represents some giant
state-property that for most purposes can be treated as an integral
unit.  It is to say that every JSON-specific feature you add duplicates 
one already present (in a different form) in SQLite.  Add enough of
them and you'll replace the DBMS with itself, if you see what I mean.  

--jkl


[sqlite] Destroy all evidence of a database

2015-04-23 Thread James K. Lowden
On Wed, 22 Apr 2015 16:56:07 +0100
Simon Slavin  wrote:

> You have made me realise, however, that a nice attack against
> encrypted SQLite databases might be to crash a SQLite application
> while it's processing and examine any journal files, shared memory
> file and temporary index files.  It might be interesting to review
> the various encryption systems widely available for SQLite and figure
> out which of them would be vulnerable to such an attack.

Encryption found its way into DBMS featuredom about 10 years ago, I
guess.  I've always thought it was patently stupid.  A DBMS's job is
to store data.  Encryption probably should be done in the application.
Failing that, whole-filesystem encryption solves the problem in a
general way.  For SQLite, an encrypted loopback filesystem would solve
your problem neatly, except for that "no virtual filesystem"
stipulation.  

Coming back to the problem at hand, Scott Hess suggested that you modify
the SQLite VFS to trap all calls to the underlying open(2).  In that
way ISTM you could add each opened filename to a list processed by
a function whose address is passed to atexit(3).  Assuming the task
terminates normally (unsignalled), all registered files would be deleted
by that function. If signals also need to be dealt with (IIUC they do
not) then I would fork the process that uses SQLite and arrange for the
parent to delete the files when the child terminates.  

What I like about the VFS idea is that it's minimally dependent on
SQLite particulars.  However the code changes in the future, its
relationship to the VFS will be quite stable.  You don't need to know
how many OS files are opened, or by what name.  Just trap and record
each one.  
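
In outline (names invented; the wrapper-VFS boilerplate and the
pRealVfs setup are omitted):

    static char *azName[64];
    static int nName;

    static void delete_all(void){      /* registered with atexit(3) */
        while( nName > 0 ) unlink(azName[--nName]);
    }

    static int traced_open(sqlite3_vfs *pVfs, const char *zName,
                           sqlite3_file *pFile, int flags, int *pOutFlags){
        (void)pVfs;
        if( zName && nName < 64 ) azName[nName++] = strdup(zName);
        return pRealVfs->xOpen(pRealVfs, zName, pFile, flags, pOutFlags);
    }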

HTH.  

--jkl

