Have you tried using a RAM disk in your pipeline? Before adding a database to the work queue, move or copy the file to somewhere in /tmp/, then point the worker thread at it. Accesses to it should be much faster.
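Something along these lines is what I had in mind (just a rough, untested sketch): it assumes /tmp is RAM-backed (tmpfs) on your box and borrows Keith's READONLY/NOMUTEX flags from below; copy_to_ramdisk and process_one are names I made up.

/* Rough, untested sketch: copy each database to RAM-backed /tmp before
 * opening it.  Assumes /tmp is tmpfs; copy_to_ramdisk is a made-up helper. */
#include <sqlite3.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: copy src into /tmp and return the new path in dst.
 * Real code would need a unique name per thread to avoid collisions. */
static int copy_to_ramdisk(const char *src, char *dst, size_t dstlen)
{
    const char *base = strrchr(src, '/');
    snprintf(dst, dstlen, "/tmp/%s.ramcopy", base ? base + 1 : src);

    FILE *in = fopen(src, "rb");
    FILE *out = fopen(dst, "wb");
    if (!in || !out) {
        if (in) fclose(in);
        if (out) fclose(out);
        return -1;
    }
    char buf[1 << 16];
    size_t n;
    while ((n = fread(buf, 1, sizeof buf, in)) > 0)
        fwrite(buf, 1, n, out);
    fclose(in);
    fclose(out);
    return 0;
}

/* One unit of work: open the /tmp copy read-only, single-threaded (NOMUTEX),
 * run the queries, then throw the copy away. */
static int process_one(const char *path)
{
    char tmp[4096];
    if (copy_to_ramdisk(path, tmp, sizeof tmp) != 0)
        return 1;

    sqlite3 *db = NULL;
    int rc = sqlite3_open_v2(tmp, &db,
                             SQLITE_OPEN_READONLY | SQLITE_OPEN_NOMUTEX, NULL);
    if (rc == SQLITE_OK) {
        /* ... sqlite3_prepare_v2 / sqlite3_step the queries here ... */
    }
    sqlite3_close(db);   /* harmless no-op if db is NULL */
    remove(tmp);         /* drop the RAM copy */
    return rc == SQLITE_OK ? 0 : 1;
}

If the copy step itself ends up dominating, that would at least tell you the time really is going into the filesystem rather than into SQLite.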
Hope you’re making progress.

-Tommy

On Apr 22, 2019, at 6:44 PM, Keith Medcalf <kmedc...@dessus.com> wrote:

Interesting. If you can guarantee that a single database will only ever be accessed from one single thread, give it a try with SQLITE_OPEN_NOMUTEX | SQLITE_OPEN_READONLY in the flags parameter of sqlite3_open_v2 ... Don't know if it will make a difference, but it might.

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a lot about anticipated traffic volume.

> -----Original Message-----
> From: sqlite-users [mailto:sqlite-users-boun...@mailinglists.sqlite.org] On Behalf Of Lee, Jason
> Sent: Monday, 22 April, 2019 17:33
> To: sqlite-users@mailinglists.sqlite.org
> Subject: Re: [sqlite] Multiple Independent Database Instances
>
>> How does each thread know whether the file has been "previously processed" or not?
>
> The paths are pushed onto a queue and each thread pops the top off. I am also looking into the queuing code to see if there are issues.
>
>> In other words, if you "get rid of" all the sqlite3 processing and replace it with a 5 ms sleep, does increasing the number of threads exhibit the same symptom?
>
> The timings were for sqlite3_open_v2, not for the whole process. The current code is effectively just an sqlite3_open_v2 followed by an sqlite3_close, and yet the time it takes to complete sqlite3_open_v2 still increases with the number of threads.
>
>> Even with gobs of RAM and solid-state storage, I/O will quickly bottleneck because the processor is 3 orders of magnitude faster than RAM and 6 orders faster than the disk. Once you exhaust the I/O bus, it's exhausted.
>
> I/O is not the bottleneck. I have 8 NVMe drives in RAID0. I have not been able to drive the disks in the slightest, because the threads spend the majority of their time in sqlite3_open_v2, sqlite3_close, and sqlite3_prepare_v2.
>
> Jason Lee
>
> ________________________________
> From: sqlite-users <sqlite-users-boun...@mailinglists.sqlite.org> on behalf of James K. Lowden <jklow...@schemamania.org>
> Sent: Monday, April 22, 2019 4:53:42 PM
> To: sqlite-users@mailinglists.sqlite.org
> Subject: Re: [sqlite] Multiple Independent Database Instances
>
> On Mon, 22 Apr 2019 21:25:31 +0000 "Lee, Jason" <jason...@lanl.gov> wrote:
>
>> I have a set of several million database files sitting on my filesystem. Each thread will open a previously unprocessed database file, do some queries, close the database, and move on to the next unprocessed database file.
>
> Fascinating. One wonders what Feynman would have said.
>
> Even with gobs of RAM and solid-state storage, I/O will quickly bottleneck because the processor is 3 orders of magnitude faster than RAM and 6 orders faster than the disk. Once you exhaust the I/O bus, it's exhausted.
>
> I would build a pipeline and let processes do the work. Write a program to process a single database: open, query, output, close. Then define a make(1) rule to convert one database into one output file. Then run "make -j dd", where "dd" is the number of simultaneous processes (or "jobs"). I think you'll find ~10 processes is all you can sustain.
>
> You could use the sqlite3 utility as your "program", but it's not very good at detecting errors and returning a nonzero return status to the OS. Hence a bespoke program. Also, you can get the data into binary form, suitable for concatenation into one big file for input into your numerical process.
> That will go a lot faster.
>
> Although there's some overhead to invoking a million processes, it's dwarfed by the I/O time.
>
> The advantage of doing the work under make is that it's reusable and restartable. If you bury the machine, you can kill make and restart it with a lower number of jobs. If you find some databases are corrupt or incomplete, you can replace them, and make will reprocess only the new ones. If you add other databases at a later time, make will process only those. You can add subsequent steps, too; make won't start from square 1 unless it has to.
>
> With millions of inputs, the odds are you will find problems. Perfectly good input over a dataset that size has probably occurred before in recorded history, but not frequently.
>
> I assume your millions of databases are not in a single directory; I'd guess you have 1000s of directories. They offer convenient work partitions, which you might need; I have no idea how make will respond to a dependency tree with millions of nodes.
>
> --jkl

_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users