> How does each thread know whether the file has been "previously processed"
> or not?

The paths are pushed onto a queue and each thread pops the top off. I am
also looking into the queuing code to see whether there are issues there.

> In other words, if you "get rid of" all the sqlite3 processing and
> replace it with a 5 ms sleep, does increasing the number of threads
> exhibit the same symptom?

The timings were for sqlite3_open_v2, not for the whole process. The
current code is effectively just an sqlite3_open_v2 followed by an
sqlite3_close, and yet the time sqlite3_open_v2 takes to complete still
increases with the number of threads.

> Even with gobs of RAM and solid-state storage, I/O will quickly
> bottleneck because the processor is 3 orders of magnitude faster than
> RAM and 6 orders faster than the disk. Once you exhaust the I/O bus,
> it's exhausted.

I/O is not the bottleneck. I have 8 NVMe drives in RAID0, and I have not
been able to put any meaningful load on them, because the threads spend
the majority of their time inside sqlite3_open_v2, sqlite3_close, and
sqlite3_prepare_v2.
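
Boiled down, each worker thread is doing roughly the following (a
simplified sketch only: the fixed path list, the open flags, and the
thread count are placeholders, and the real code also runs queries
between the open and the close):

    /* Simplified worker: pop a path, time sqlite3_open_v2, close.     */
    /* Placeholder paths and flags; the real code also runs queries.   */
    #include <pthread.h>
    #include <sqlite3.h>
    #include <stdio.h>
    #include <time.h>

    static const char *paths[] = { "a.db", "b.db", "c.db" };  /* millions, really */
    static size_t npaths = sizeof paths / sizeof *paths;
    static size_t next_path = 0;
    static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *unused)
    {
        (void)unused;
        for (;;) {
            pthread_mutex_lock(&qlock);             /* pop the next path */
            size_t i = (next_path < npaths) ? next_path++ : npaths;
            pthread_mutex_unlock(&qlock);
            if (i == npaths)
                break;

            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            sqlite3 *db = NULL;
            int rc = sqlite3_open_v2(paths[i], &db,
                                     SQLITE_OPEN_READONLY | SQLITE_OPEN_NOMUTEX,
                                     NULL);
            clock_gettime(CLOCK_MONOTONIC, &t1);

            if (rc == SQLITE_OK)
                printf("%s: open took %.3f ms\n", paths[i],
                       (t1.tv_sec - t0.tv_sec) * 1e3 +
                       (t1.tv_nsec - t0.tv_nsec) / 1e6);
            sqlite3_close(db);          /* required even if open failed */
        }
        return NULL;
    }

    int main(void)
    {
        enum { NTHREADS = 8 };          /* vary this to see the slowdown */
        pthread_t t[NTHREADS];
        for (int i = 0; i < NTHREADS; i++)
            pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < NTHREADS; i++)
            pthread_join(t[i], NULL);
        return 0;
    }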

Jason Lee

________________________________
From: sqlite-users <[email protected]> on behalf of James K. Lowden <[email protected]>
Sent: Monday, April 22, 2019 4:53:42 PM
To: [email protected]
Subject: Re: [sqlite] Multiple Independent Database Instances

On Mon, 22 Apr 2019 21:25:31 +0000
"Lee, Jason" <[email protected]> wrote:

> I have a set of several million database files sitting on my
> filesystem. Each thread will open a previously unprocessed database
> file, do some queries, close the database, and move on to the next
> unprocessed database file.

Fascinating. One wonders what Feynman would have said.

Even with gobs of RAM and solid-state storage, I/O will quickly
bottleneck because the processor is 3 orders of magnitude faster than
RAM and 6 orders faster than the disk. Once you exhaust the I/O bus,
it's exhausted.

I would build a pipeline, and let processes do the work. Write a
program to process a single database: open, query, output, close. Then
define a make(1) rule to convert one database into one output file.
Then run "make -j dd" where "dd" is the number of simultaneous
processes (or "jobs"). I think you'll find ~10 processes is all you can
sustain.

You could use the sqlite3 utility as your "program", but it's not very
good at detecting errors and returning a nonzero return status to the
OS. Hence a bespoke program; a rough sketch is at the bottom of this
message. Also, you can get the data into binary form, suitable for
concatenation into one big file for input into your numerical process.
That will go a lot faster. Although there's some overhead to invoking a
million processes, it's dwarfed by the I/O time.

The advantage of doing the work under make is that it's reusable and
restartable. If you bury the machine, you can kill make and restart it
with a lower number of jobs. If you find some databases are corrupt or
incomplete, you can replace them, and make will reprocess only the new
ones. If you add other databases at a later time, make will process
only those. You can add subsequent steps, too; make won't start from
square one unless it has to.

With millions of inputs, the odds are you will find problems.
Perfectly good input over a dataset that size has probably occurred
before in recorded history, but not frequently.

I assume your millions of databases are not in a single directory; I'd
guess you have thousands of directories. They offer convenient work
partitions, which you might need; I have no idea how make will respond
to a dependency tree with millions of nodes.
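
By way of illustration, the bespoke program could be as small as this
(a sketch only; the query, the output format, and the make rule shown
in the comment are all placeholders):

    /* Bespoke per-database program: open, query, write one line of
     * output, close.  The query, output format, and names are
     * placeholders.  A make pattern rule can then map each database to
     * an output file, e.g. (program and suffix names are illustrative):
     *
     *     %.out: %.db
     *             ./db2out $< > $@
     *
     * and "make -j N" runs N copies at once.
     */
    #include <sqlite3.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        if (argc != 2) {
            fprintf(stderr, "usage: %s dbfile\n", argv[0]);
            return 1;
        }

        sqlite3 *db = NULL;
        if (sqlite3_open_v2(argv[1], &db, SQLITE_OPEN_READONLY, NULL)
                != SQLITE_OK) {
            fprintf(stderr, "%s: %s\n", argv[1], sqlite3_errmsg(db));
            sqlite3_close(db);
            return 1;       /* nonzero status, so make notices failures */
        }

        sqlite3_stmt *stmt = NULL;
        const char *sql = "SELECT count(*) FROM sqlite_master";  /* placeholder */
        if (sqlite3_prepare_v2(db, sql, -1, &stmt, NULL) != SQLITE_OK) {
            fprintf(stderr, "%s: %s\n", argv[1], sqlite3_errmsg(db));
            sqlite3_close(db);
            return 1;
        }

        while (sqlite3_step(stmt) == SQLITE_ROW)
            printf("%s\t%d\n", argv[1], sqlite3_column_int(stmt, 0));

        sqlite3_finalize(stmt);
        sqlite3_close(db);
        return 0;
    }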

--jkl
