On Mon, 22 Apr 2019 21:25:31 +0000 "Lee, Jason" <jason...@lanl.gov> wrote:
> I have a set of several million database files sitting on my
> filesystem. Each thread will open a previously unprocessed database
> file, do some queries, close the database, and move on to the next
> unprocessed database file.

Fascinating. One wonders what Feynman would have said.

Even with gobs of RAM and solid-state storage, I/O will quickly become the bottleneck, because the processor is three orders of magnitude faster than RAM and six orders of magnitude faster than the disk. Once you exhaust the I/O bus, it's exhausted.

I would build a pipeline and let processes do the work. Write a program to process a single database: open, query, output, close. Then define a make(1) rule to convert one database into one output file, and run "make -j dd", where "dd" is the number of simultaneous processes (or "jobs"). I think you'll find ~10 processes is all you can sustain.

You could use the sqlite3 utility as your "program", but it's not very good at detecting errors and returning a nonzero exit status to the OS. Hence a bespoke program. A bespoke program can also emit the data in binary form, suitable for concatenation into one big file for input into your numerical process. That will go a lot faster. Although there's some overhead to invoking a million processes, it's dwarfed by the I/O time.

The advantage of doing the work under make is that it's reusable and restartable. If you bury the machine, you can kill make and restart it with a lower number of jobs. If you find some databases are corrupt or incomplete, you can replace them, and make will reprocess only the new ones. If you add other databases at a later time, make will process only those. You can add subsequent steps, too; make won't start from square one unless it has to.

With millions of inputs, the odds are you will find problems. A perfectly clean dataset of that size has probably occurred before in recorded history, but not frequently.

I assume your millions of databases are not in a single directory; I'd guess you have thousands of directories.
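The make(1) side might look like the GNU make sketch below. The db/ and out/ layout and the ./process_one program name are assumptions; recipe lines must be tab-indented.

```make
# Hypothetical layout: inputs under db/<dir>/, outputs mirrored under out/.
DBS  := $(wildcard db/*/*.db)
OUTS := $(patsubst db/%.db,out/%.bin,$(DBS))

all: $(OUTS)

# One rule converts one database into one binary output file.
out/%.bin: db/%.db
	@mkdir -p $(dir $@)
	./process_one $< $@
```

Then "make -j 10 all" keeps ten workers running at a time; rerunning after a failure or after replacing a bad database rebuilds only the missing or out-of-date outputs.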
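A bespoke worker along these lines might be sketched with Python's stdlib sqlite3 module. The table name, column names, and binary record layout below are hypothetical placeholders; the point is that any failure produces a nonzero exit status that make can see.

```python
"""Process one SQLite database: open, query, write binary output, close.

Exits nonzero on any error so make(1) can detect the failure.
The `samples` table and the <int64, double> record layout are
hypothetical placeholders for whatever your schema actually holds.
"""
import sqlite3
import struct
import sys


def process(db_path: str, out_path: str) -> None:
    con = sqlite3.connect(db_path)
    try:
        cur = con.execute("SELECT id, value FROM samples")  # hypothetical query
        with open(out_path, "wb") as out:
            for row_id, value in cur:
                # Fixed-size binary records concatenate cleanly into one big
                # file for the downstream numerical process.
                out.write(struct.pack("<qd", row_id, value))
    finally:
        con.close()


if __name__ == "__main__" and len(sys.argv) == 3:
    try:
        process(sys.argv[1], sys.argv[2])
    except (sqlite3.Error, OSError) as exc:
        sys.exit(f"{sys.argv[1]}: {exc}")  # nonzero exit status for make
```

Unlike the sqlite3 shell, this fails loudly: a corrupt or incomplete database raises sqlite3.Error, the message goes to stderr, and make stops (or records the target as failed under -k).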
Those directories offer convenient work partitions, which you might need; I have no idea how make will respond to a dependency tree with millions of nodes.

--jkl