> How does each thread know whether the file has been "previously processed" or 
> not?


The paths are pushed onto a queue and each thread pops the next one off. I am 
also looking into the queuing code to see if there are issues there.
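
Roughly, the dispatch looks like the following sketch (a simplified 
illustration, not the actual code; the names are made up):

    /* Simplified sketch of the work queue: paths are pushed on before the
       workers start, and each thread pops the next one under a mutex. */
    #include <pthread.h>
    #include <stddef.h>

    struct path_queue {
        char          **paths;  /* all database file paths, filled up front */
        size_t          count;  /* total number of paths */
        size_t          next;   /* index of the next unprocessed path */
        pthread_mutex_t lock;
    };

    /* Returns the next unprocessed path, or NULL when the queue is empty. */
    static const char *pop_path(struct path_queue *q)
    {
        const char *p = NULL;
        pthread_mutex_lock(&q->lock);
        if (q->next < q->count)
            p = q->paths[q->next++];
        pthread_mutex_unlock(&q->lock);
        return p;
    }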


> In other words, if you "get rid of" all the sqlite3 processing and replace it 
> with a 5 ms sleep, does increasing the number of threads exhibit the same 
> symptom?


The timings were for sqlite3_open_v2, not for the whole process. The current 
code is effectively just an sqlite3_open_v2 followed by an sqlite3_close, and 
yet the time it takes to complete sqlite3_open_v2 still increases with the 
number of threads.
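
Concretely, each worker is doing little more than this per file (a 
stripped-down sketch with the real queries and error handling omitted):

    #include <sqlite3.h>
    #include <stdio.h>
    #include <time.h>

    /* Open one database read-only, time the open call, and close it. */
    static void open_close_timed(const char *path)
    {
        sqlite3 *db = NULL;
        struct timespec t0, t1;

        clock_gettime(CLOCK_MONOTONIC, &t0);
        int rc = sqlite3_open_v2(path, &db, SQLITE_OPEN_READONLY, NULL);
        clock_gettime(CLOCK_MONOTONIC, &t1);

        double ms = (t1.tv_sec - t0.tv_sec) * 1e3 +
                    (t1.tv_nsec - t0.tv_nsec) / 1e6;
        fprintf(stderr, "%s: open %s in %.3f ms\n",
                path, rc == SQLITE_OK ? "ok" : "failed", ms);

        sqlite3_close(db);  /* safe to call even if the open failed */
    }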


> Even with gobs of RAM and solid-state storage, I/O will quickly
> bottleneck because the processor is 3 orders of magnitude faster than
> RAM and 6 orders faster than the disk.  Once you exhaust the I/O bus,
> it's exhausted.


I/O is not the bottleneck. I have 8 NVMe drives in RAID0, and I have not been 
able to put any meaningful load on them, because the threads spend the majority 
of their time in sqlite3_open_v2, sqlite3_close, and sqlite3_prepare_v2.


Jason Lee

________________________________
From: sqlite-users <[email protected]> on behalf of 
James K. Lowden <[email protected]>
Sent: Monday, April 22, 2019 4:53:42 PM
To: [email protected]
Subject: Re: [sqlite] Multiple Independent Database Instances

On Mon, 22 Apr 2019 21:25:31 +0000
"Lee, Jason" <[email protected]> wrote:

> I have a set of several million database files sitting on my
> filesystem. Each thread will open a previously unprocessed database
> file, do some queries, close the database, and move on to the next
> unprocessed database file.

Fascinating.  One wonders what Feynman would have said.

Even with gobs of RAM and solid-state storage, I/O will quickly
bottleneck because the processor is 3 orders of magnitude faster than
RAM and 6 orders faster than the disk.  Once you exhaust the I/O bus,
it's exhausted.

I would build a pipeline, and let processes do the work. Write a program
to process a single database: open, query, output, close.  Then define
a make(1) rule to convert one database into one output file.  Then run
"make -j dd" where "dd" is the number of simultaneous processes (or
"jobs").  I think you'll find ~10 processes is all you can sustain.

You could use the sqlite3 utility as your "program", but it's not very
good at detecting errors and returning a nonzero return status to the
OS. Hence a bespoke program.  Also, you can get the data into binary
form, suitable for concatenation into one big file for input into your
numerical process.  That will go a lot faster.
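
Such a program could be as small as the sketch below (the query and the output 
record are placeholders; substitute whatever your schema and numerical step 
actually need):

    /* Hypothetical per-database program for the pipeline: open one
       database, run a query, write the rows to stdout in binary form,
       and exit nonzero on any failure so make notices. */
    #include <sqlite3.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s dbfile\n", argv[0]);
            return 1;
        }

        sqlite3 *db = NULL;
        if (sqlite3_open_v2(argv[1], &db, SQLITE_OPEN_READONLY, NULL) != SQLITE_OK) {
            fprintf(stderr, "%s: %s\n", argv[1],
                    db ? sqlite3_errmsg(db) : "out of memory");
            sqlite3_close(db);
            return 1;
        }

        sqlite3_stmt *stmt = NULL;
        const char *sql = "SELECT value FROM samples";  /* placeholder query */
        if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK) {
            fprintf(stderr, "%s: %s\n", argv[1], sqlite3_errmsg(db));
            sqlite3_close(db);
            return 1;
        }

        int rc;
        while ((rc = sqlite3_step(stmt)) == SQLITE_ROW) {
            double v = sqlite3_column_double(stmt, 0);
            fwrite(&v, sizeof v, 1, stdout);  /* binary, ready to concatenate */
        }

        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return rc == SQLITE_DONE ? 0 : 1;
    }

One output file per database, produced by the make rule, can then be 
concatenated into the big input file when the run finishes.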

Although there's some overhead to invoking a million processes, it's
dwarfed by the I/O time.

The advantage of doing the work under make is that it's reusable and
restartable.  If you bury the machine, you can kill make and restart it
with a lower number of jobs.  If you find some databases are corrupt or
incomplete, you can replace them, and make will reprocess only the new
ones.  If you add other databases at a later time, make will process
only those.  You can add subsequent steps, too; make won't start from
square 1 unless it has to.

With millions of inputs, the odds are you will find problems.
Perfectly good input over a dataset that size has probably occurred before
in recorded history, but not frequently.

I assume your millions of databases are not in a single directory; I'd
guess you have 1000s of directories.  They offer convenient work
partitions, which you might need; I have no idea how make will respond
to a dependency tree with millions of nodes.

--jkl

_______________________________________________
sqlite-users mailing list
[email protected]
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users