Have you tried using a RAM disk in your pipeline? Before adding a database 
to the work queue, move or copy the file to somewhere in /tmp/, then point 
the worker thread at it. Accesses should be much faster.
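
A rough sketch of the idea (the /tmp name here is hypothetical, and each 
worker needs a unique one; also check that /tmp is actually RAM-backed on 
your system):

    #include <stdio.h>
    #include <unistd.h>
    #include <sqlite3.h>

    /* byte-for-byte copy of src to dst */
    static int copy_file(const char *src, const char *dst) {
        FILE *in = fopen(src, "rb"), *out = fopen(dst, "wb");
        char buf[65536];
        size_t n;
        if (!in || !out) {
            if (in) fclose(in);
            if (out) fclose(out);
            return -1;
        }
        while ((n = fread(buf, 1, sizeof buf, in)) > 0)
            fwrite(buf, 1, n, out);
        fclose(in);
        fclose(out);
        return 0;
    }

    /* copy one database to RAM, query the copy, then clean up */
    static void process(const char *path) {
        const char *tmp = "/tmp/worker-0.db";  /* hypothetical name */
        sqlite3 *db;
        if (copy_file(path, tmp) == 0 &&
            sqlite3_open_v2(tmp, &db, SQLITE_OPEN_READONLY, NULL)
                == SQLITE_OK) {
            /* ... run the queries here ... */
            sqlite3_close(db);
        }
        unlink(tmp);
    }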

Hope you’re making progress.
-Tommy

On Apr 22, 2019, at 6:44 PM, Keith Medcalf <kmedc...@dessus.com> wrote:


Interesting.  If you can guarantee that each database will only ever be 
accessed from one single thread, give it a try with 
SQLITE_OPEN_NOMUTEX | SQLITE_OPEN_READONLY in the flags parameter of 
sqlite3_open_v2 ...

Don't know if it will make a difference, but it might.
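
For concreteness, a minimal sketch of that open call (error handling kept 
short):

    #include <sqlite3.h>

    static sqlite3 *open_ro_nomutex(const char *path) {
        sqlite3 *db = NULL;
        int rc = sqlite3_open_v2(path, &db,
                                 SQLITE_OPEN_READONLY | SQLITE_OPEN_NOMUTEX,
                                 NULL);        /* NULL = default VFS */
        if (rc != SQLITE_OK) {
            sqlite3_close(db);  /* a handle may exist even on failure */
            return NULL;
        }
        return db;
    }

NOMUTEX skips the per-connection mutex, so it is only safe when no two 
threads ever touch the same connection at the same time.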

---
The fact that there's a Highway to Hell but only a Stairway to Heaven says a 
lot about anticipated traffic volume.


> -----Original Message-----
> From: sqlite-users [mailto:sqlite-users-
> boun...@mailinglists.sqlite.org] On Behalf Of Lee, Jason
> Sent: Monday, 22 April, 2019 17:33
> To: sqlite-users@mailinglists.sqlite.org
> Subject: Re: [sqlite] Multiple Independent Database Instances
> 
>> How does each thread know whether the file has been "previously
>> processed" or not?
> 
> 
> The paths are pushed onto a queue and each thread pops the top off. I
> am also looking into the queuing code to see if there are issues.
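> Roughly, each pop is just a mutex around the shared list; a minimal
> sketch of that pattern (not the actual code) looks like:
>
>     #include <pthread.h>
>
>     /* hypothetical shared state; the real queue code differs */
>     static pthread_mutex_t qlock = PTHREAD_MUTEX_INITIALIZER;
>     static char **paths;           /* database paths to process */
>     static size_t next, npaths;    /* next index, total count   */
>
>     /* pop the next unprocessed path, or NULL when drained */
>     static const char *pop_path(void) {
>         const char *p = NULL;
>         pthread_mutex_lock(&qlock);
>         if (next < npaths)
>             p = paths[next++];
>         pthread_mutex_unlock(&qlock);
>         return p;
>     }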
> 
> 
>> In other words, if you "get rid of" all the sqlite3 processing and
>> replace it with a 5 ms sleep, does increasing the number of threads
>> exhibit the same symptom?
> 
> 
> The timings were for sqlite3_open_v2, not for the whole process. The
> current code is effectively just an sqlite3_open_v2 followed by an
> sqlite3_close, and yet the time it takes to complete sqlite3_open_v2
> still increases with the number of threads.
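>
> For reference, the measurement brackets the call itself, roughly like
> this (a simplified sketch, not the production code):
>
>     #include <time.h>
>     #include <sqlite3.h>
>
>     /* time a single open, in nanoseconds */
>     static long long time_open_ns(const char *path) {
>         struct timespec t0, t1;
>         sqlite3 *db = NULL;
>         clock_gettime(CLOCK_MONOTONIC, &t0);
>         sqlite3_open_v2(path, &db, SQLITE_OPEN_READONLY, NULL);
>         clock_gettime(CLOCK_MONOTONIC, &t1);
>         sqlite3_close(db);
>         return (t1.tv_sec - t0.tv_sec) * 1000000000LL
>              + (t1.tv_nsec - t0.tv_nsec);
>     }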
> 
> 
>> Even with gobs of RAM and solid-state storage, I/O will quickly
>> bottleneck because the processor is 3 orders of magnitude faster
>> than RAM and 6 orders faster than the disk.  Once you exhaust the
>> I/O bus, it's exhausted.
> 
> 
> I/O is not the bottleneck. I have 8 NVMe drives in RAID0, and I have
> not been able to stress the disks in the slightest, because the
> threads spend the majority of their time in sqlite3_open_v2,
> sqlite3_close, and sqlite3_prepare_v2.
> 
> 
> Jason Lee
> 
> ________________________________
> From: sqlite-users <sqlite-users-boun...@mailinglists.sqlite.org> on
> behalf of James K. Lowden <jklow...@schemamania.org>
> Sent: Monday, April 22, 2019 4:53:42 PM
> To: sqlite-users@mailinglists.sqlite.org
> Subject: Re: [sqlite] Multiple Independent Database Instances
> 
> On Mon, 22 Apr 2019 21:25:31 +0000
> "Lee, Jason" <jason...@lanl.gov> wrote:
> 
>> I have a set of several million database files sitting on my
>> filesystem. Each thread will open a previously unprocessed database
>> file, do some queries, close the database, and move on to the next
>> unprocessed database file.
> 
> Fascinating.  One wonders what Feynman would have said.
> 
> Even with gobs of RAM and solid-state storage, I/O will quickly
> bottleneck because the processor is 3 orders of magnitude faster than
> RAM and 6 orders faster than the disk.  Once you exhaust the I/O bus,
> it's exhausted.
> 
> I would build a pipeline, and let processes do the work. Write a
> program to process a single database: open, query, output, close.
> Then define a make(1) rule to convert one database into one output
> file.  Then run "make -j dd" where "dd" is the number of simultaneous
> processes (or "jobs").  I think you'll find ~10 processes is all you
> can sustain.
> 
> You could use the sqlite3 utility as your "program", but it's not
> very good at detecting errors and returning a nonzero return status
> to the OS.  Hence a bespoke program.  Also, you can get the data into
> binary form, suitable for concatenation into one big file for input
> into your numerical process.  That will go a lot faster.
> 
> Although there's some overhead to invoking a million processes, it's
> dwarfed by the I/O time.
> 
> The advantage of doing the work under make is that it's reusable and
> restartable.  If you bury the machine, you can kill make and restart
> it with a lower number of jobs.  If you find some databases are
> corrupt or incomplete, you can replace them, and make will reprocess
> only the new ones.  If you add other databases at a later time, make
> will process only those.  You can add subsequent steps, too; make
> won't start from square 1 unless it has to.
> 
> With millions of inputs, the odds are you will find problems.
> Perfectly good input over a dataset that size has probably occurred
> before in recorded history, but not frequently.
> 
> I assume your millions of databases are not in a single directory;
> I'd guess you have 1000s of directories.  They offer convenient work
> partitions, which you might need; I have no idea how make will
> respond to a dependency tree with millions of nodes.
> 
> --jkl
> 



_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
