On 2016/08/04 5:05 PM, Kevin O'Gorman wrote:
Lots of answers, so thanks all around.  Some more info:

1. All partitions have at least 3 GB free, and it's not changing.  /tmp is
3 TiB and empty.
2. I have a RAID partition, for size, but no RAID controller.  As a hobby
project, I don't have spare parts, and I fear the results of a failure of a
hardware RAID without a spare, so I use Linux mdadm to manage software RAID
across three 4-TB drives.
3. Positions are 64 bytes always, so your size guesses are right.  They are
in no particular order.  I like the suggestion of a separate position
table, because they're going to appear in multiple qmove records, with an
average of about 3 or 4 appearances I think.  Maybe more.

I'm going to retry, using some of the suggestions above: smaller
transactions, IGNORE, position table rowids in the moves table, smaller
subsets being loaded, developing ideas of how time depends on data size.
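
[Those loading suggestions can be sketched together. This is a minimal illustration using Python's built-in sqlite3 module; the table and column names (pos, moves) are made up, not from the poster's actual schema, and the batch size is arbitrary.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # use a file path for the real dataset
conn.executescript("""
    CREATE TABLE pos(p BLOB UNIQUE);              -- one row per distinct position
    CREATE TABLE moves(from_id INTEGER, to_id INTEGER);  -- refers to pos rowids
""")

def pos_id(cur, p):
    # INSERT OR IGNORE skips positions already present; the SELECT then
    # returns the rowid whether the row was just inserted or pre-existing.
    cur.execute("INSERT OR IGNORE INTO pos(p) VALUES (?)", (p,))
    cur.execute("SELECT rowid FROM pos WHERE p = ?", (p,))
    return cur.fetchone()[0]

BATCH = 10000  # commit every N records to keep transactions small but not tiny
records = [(b"aaa", b"bbb"), (b"aaa", b"ccc"), (b"bbb", b"ccc")]  # stand-in data
cur = conn.cursor()
for i, (src, dst) in enumerate(records, 1):
    cur.execute("INSERT INTO moves VALUES (?, ?)",
                (pos_id(cur, src), pos_id(cur, dst)))
    if i % BATCH == 0:
        conn.commit()
conn.commit()
```

Storing the small integer rowid in moves instead of the 64-byte position shrinks the big table considerably when each position appears 3-4 times.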

If it doesn't go well fairly quickly, I'll probably go back to flat files
and writing the queries the hard way.  At least I know what I'm dealing
with there.

Note that there is nothing here that SQLite isn't handling extremely well for your purpose - going back to flat files is akin to going back to the dark ages. Databases of this size (and even much larger) work perfectly well all around, and some people here can quote you even more staggering figures. It does require some special processing, which might be thwarted by your current system layout or design choices; done right, however, it should not be significantly slower than writing flat files - so try to persist until you find what "done right" entails.

Remember that once this has worked, you can query the data with SQL... that is an amazingly powerful feature over flat files, and once an index exists (however long it took to build), queries will be lightning fast - something a flat file can never give you.


QUESTIONS:
If it's best in general to build indexes as a separate step, does this also
apply to primary indexes?  Can a table without a primary index have one
added later?  Isn't ROWID the real primary, presuming it has ROWIDs?  And
if so, then is a primary index on a ROWID table just for compliance with
standard SQL, and really no better than any other index?  Obviously, I'm a
bit confused about this.

A primary index is nothing other than a normal unique index which has a guaranteed 1-to-1 key-to-row ratio and look-up priority status. Every SQL table must have such a key to guarantee access to any individual row, and if you omit the primary key, some internal mechanism is used instead - in SQLite's case, that mechanism is the rowid. A relatively recent SQLite feature even lets you get rid of the rowid overhead entirely, by explicitly specifying a PRIMARY KEY and then adding WITHOUT ROWID after the table declaration.

The advantage of NOT using an explicit primary key from the start is that the rowid is simply incremented by the engine on each insert - it is auto-supplied and a plain integer. An explicit primary key, by contrast, has to find the correct sort-order position for every added row, which costs a good bit of time on really large insert operations. The single bulk sort that builds the PK index after the fact takes a lot less time than the repeated look-up-and-insert operations performed per row while loading the table.
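
[The pattern above, sketched with Python's sqlite3 module - the table name and the 64-byte position encoding are made up for illustration: bulk-load into a bare rowid table first, then build the unique index in one pass.]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE t(position BLOB)")  # no PK: rowid assigned by increment

# Bulk insert inside one transaction; no per-row index maintenance happens here.
with conn:
    conn.executemany("INSERT INTO t VALUES (?)",
                     ((("%064d" % i).encode(),) for i in range(10000)))

# One big sort now, instead of a b-tree lookup-and-insert per row earlier.
conn.execute("CREATE UNIQUE INDEX t_position ON t(position)")
```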

I hope this answers the question, but feel free to ask more if anything remains unclear.


While I'm at it, I may as well ask if ROWID has any physical significance,
such that a VACUUM operation might change it.  Or is it just an arbitrary
ID inserted by SQLite and added to each record, when they exist at all?

It is arbitrary and supplied by the DB engine in general, but with one caveat: VACUUM may renumber the rowids of a table that has no explicit INTEGER PRIMARY KEY alias for them. Once the rowid is aliased by such a declared column (or you supply the values yourself too - and I often advocate that as the better practice), it is real column data and will never be changed by anything; changing it then would go against all SQL and RDBMS premises.

If you declare a table with a primary key specified /exactly/ like this:
CREATE TABLE myTable ("ID" INTEGER PRIMARY KEY, .... )
Then the "ID" in this case becomes an alias for the row_id and manipulating/reading the value in it is in fact reading / manipulating the actual row_id.

(The "ID" name can be anything you like, but the "INTEGER PRIMARY KEY" part needs to be written exactly like that.)
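
[The aliasing is easy to verify; a minimal sketch with Python's sqlite3 module, using a made-up table:]

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# "ID" declared exactly as INTEGER PRIMARY KEY becomes an alias for the rowid.
conn.execute('CREATE TABLE myTable ("ID" INTEGER PRIMARY KEY, name TEXT)')
conn.execute("INSERT INTO myTable(name) VALUES ('x')")

# Reading ID and rowid returns the same value: they are the same storage.
row = conn.execute("SELECT ID, rowid FROM myTable").fetchone()
```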


The current dataset is intended to solve one particular issue in the
overall project.  It looks like I'd want to build each such dataset
separately, as there will likely be a few hundred, and I gather that adding
to these tables will be pretty slow once the indexes have been built.  Or
is it sensible to drop indexes, add data and rebuild?

No, this is not true - it will be really fast to add another few rows... it is only slow when you add the initial few zillion rows, by simple laws of quantity. If, however, you will be adding rows at an amazing rate, I would suggest a different way of doing it.

Also, you can gain a LOT of speed by simply using the C API directly in your exporter, rather than writing text from it and importing that after the fact - not to mention the advantage of being able to track speed and progress. It's easy, too; is there any reason you are not going this route? (If you are not confident with C or the API, I'm sure 99% of the people here could type out a piece of code in under a minute that does the inserts via the API, if you need it.)
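
[The idea translates to any binding, not just C. A hedged sketch in Python's sqlite3 module - the generator stands in for the solver's output, and in the C version each yielded record would become a sqlite3_bind_blob()/sqlite3_step()/sqlite3_reset() cycle on one prepared statement:]

```python
import sqlite3

def emit_positions():
    # Stand-in for the exporter's record stream; real positions would
    # come from the solver instead of this synthetic loop.
    for i in range(1000):
        yield (("%064d" % i).encode(),)

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE q(position BLOB)")
with conn:  # one transaction around the whole batch
    # executemany reuses a single prepared statement, the same pattern
    # as a prepare-once, bind-and-step-per-row loop in the C API.
    conn.executemany("INSERT INTO q VALUES (?)", emit_positions())
```

The win is skipping the text round-trip entirely: records go straight from the generating program into the database.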

Cheers,
Ryan
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
