On Mon, 22 Apr 2019 21:25:31 +0000
"Lee, Jason" <jason...@lanl.gov> wrote:

> I have a set of several million database files sitting on my
> filesystem. Each thread will open a previously unprocessed database
> file, do some queries, close the database, and move on to the next
> unprocessed database file.

Fascinating.  One wonders what Feynman would have said.  

Even with gobs of RAM and solid-state storage, I/O will quickly become
the bottleneck, because the processor is roughly 3 orders of magnitude
faster than RAM and 6 orders of magnitude faster than the disk.  Once
you exhaust the I/O bus, it's exhausted.  

I would build a pipeline and let processes do the work.  Write a program
to process a single database: open, query, output, close.  Then define
a make(1) rule to convert one database into one output file.  Then run
"make -j N", where N is the number of simultaneous processes (or
"jobs").  I think you'll find ~10 processes is all you can sustain.  

You could use the sqlite3 utility as your "program", but it's not very
good at detecting errors and returning a nonzero exit status to the
OS.  Hence a bespoke program.  A bespoke program can also write the
data in binary form, suitable for concatenation into one big file as
input to your numerical process.  That will go a lot faster.  
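If you do start with the sqlite3 utility, a thin shell wrapper can at
least turn SQL errors into a nonzero exit status via the -bail option.
A sketch, with a made-up table and query as placeholders:

    #!/bin/sh
    # query2bin (sqlite3-utility variant): process one database, write
    # CSV to stdout.  -bail stops at the first SQL error and exits
    # nonzero, which is what make needs to mark the target as failed.
    # Table and column names here are placeholders.
    db=$1
    exec sqlite3 -bail -batch -csv "$db" \
        'SELECT col1, col2 FROM mytable;'

Error reporting is still coarser than what the SQLite C API gives a
bespoke program, and CSV is a stand-in for the binary records that make
the concatenation step fast.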

Although there's some overhead to invoking a million processes, it's
dwarfed by the I/O time.  

The advantage of doing the work under make is that it's reusable and
restartable.  If you bury the machine, you can kill make and restart it
with a lower number of jobs.  If you find some databases are corrupt or
incomplete, you can replace them, and make will reprocess only the new
ones.  If you add other databases at a later time, make will process
only those.  You can add subsequent steps, too; make won't start from
square one unless it has to.  

With millions of inputs, the odds are you will find problems.
Perfectly good input over a dataset that size has probably occurred
before in recorded history, but not frequently.  

I assume your millions of databases are not in a single directory; I'd
guess you have thousands of directories.  They offer convenient work
partitions, which you might need; I have no idea how make will respond
to a dependency tree with millions of nodes.  A per-directory driver,
sketched below, sidesteps the question.  
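A sketch of that driver, assuming each directory holds its own
databases and a Makefile (or one reachable via -f) that globs *.db in
that directory; it runs directories sequentially and jobs within each
directory in parallel:

    #!/bin/sh
    # One make invocation per directory keeps each dependency graph
    # small.  Failed directories are logged so they can be rerun.
    for d in data/*/
    do
        make -C "$d" -j 10 || echo "FAILED: $d" >> failed-dirs.txt
    done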

--jkl
