On Tue, Apr 1, 2014 at 11:58 PM, Kevin Xu <accol...@gmail.com> wrote:
> Hi everyone, I have been working on a small bioinformatics project that
> attempts to store FASTQ (http://en.wikipedia.org/wiki/FASTQ_format) data
> in a SQLite database.
>
> The app works by reading in the FASTQ file (through Boost memory-mapped
> regions), running the parser over it, binding the variables using
> sqlite3_bind_xxx, and inserting them with a prepared statement wrapped
> within BEGIN...END statements.
>
> The FASTQ file is parsed using a Boost::Spirit parser I wrote that spits
> out structs with 13 members that get bound to the fields of a sqlite3
> prepared statement. One of those fields is a BLOB column that contains
> about 151 bytes of data per row (if you look at the FASTQ format, I
> basically combined the quality and sequence strings using a simple
> proprietary "compression" scheme that achieves 50% compression). There
> are no indexes other than the default rowid.
>
> I have several datasets I tested this on, with 100K, 500K, and 91 million
> sequences (each sequence is made up of 4 lines of ASCII text in the
> original file). The data is paired-end, which means that the dataset
> comes in pairs - if one FASTQ file contains 100K sequences, there will be
> a paired FASTQ file containing another 100K (from the "other end" of the
> DNA fragment). I have currently achieved:
>
> 100K rows - 1.11s
> 500K rows - 5.12s
> 182M rows - 1772s

So you are creating a 27.5 GB database in under a half hour?  What is
your database page size?

> which works out to approximately 90K-100K inserts per second. The
> testing platform is currently an i7 laptop with an SSD, and it might
> transition to a computing cluster in a university environment if it
> works out. I am trying to see if it is possible to achieve even higher
> throughput - my professor was adamant that he managed to insert 3
> million rows in 7 seconds (which works out to over 420K inserts per
> second), though he could not find the code that did it or records of
> his insertions.
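[Editor's note: the insert loop Kevin describes can be sketched with Python's built-in sqlite3 module. The real app uses the C API from C++, so the table layout, the pragma choices, and the fake_records helper below are all illustrative assumptions, not his code. The levers are the same either way: one transaction around the whole batch, one reused prepared statement, and pragmas set up front - PRAGMA page_size in particular only takes effect before the first page is written, which is why the page size question matters.]

```python
import os, sqlite3, time

DB = "fastq_demo.db"
if os.path.exists(DB):
    os.remove(DB)

con = sqlite3.connect(DB)
# page_size must be set before the first page is written (or after VACUUM)
con.execute("PRAGMA page_size = 8192")
con.execute("PRAGMA journal_mode = WAL")  # or OFF for a one-shot bulk load
con.execute("PRAGMA synchronous = OFF")   # unsafe on power loss; acceptable
                                          # for data that can be rebuilt

con.execute("""CREATE TABLE reads (
    instrument TEXT, run_id INTEGER, flowcell TEXT, lane INTEGER,
    tile INTEGER, x INTEGER, y INTEGER, packed BLOB)""")

def fake_records(n):
    blob = bytes(151)             # stands in for the ~151-byte packed seq+qual
    for i in range(n):
        yield ("HWI-ST1", 4, "FC1", 1, 2, i, i * 2, blob)

n = 100_000
t0 = time.perf_counter()
with con:                         # one BEGIN ... COMMIT around the whole batch
    con.executemany(
        "INSERT INTO reads VALUES (?,?,?,?,?,?,?,?)", fake_records(n))
dt = time.perf_counter() - t0
print(f"{n} rows in {dt:.2f}s ({n / dt:,.0f} rows/s)")
```

executemany reuses one compiled statement internally, which is the same pattern as the C app's sqlite3_bind_xxx / sqlite3_step / sqlite3_reset loop inside an explicit transaction.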
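[Editor's note: Kevin's actual "proprietary compression" is not shown, but a plausible sketch of a 50% scheme is to fit each base and its quality score into one byte: 2 bits for A/C/G/T and 6 bits for a Phred score in 0-63, halving the 2 bytes per base that separate sequence and quality strings cost. N bases and scores above 63 would need an escape code in a real implementation. The pack/unpack names below are hypothetical.]

```python
# Hypothetical pack/unpack: one byte per base = 2-bit base + 6-bit quality.
BASES = "ACGT"
B2I = {b: i for i, b in enumerate(BASES)}

def pack(seq: str, qual: str) -> bytes:
    # qual uses Sanger / Illumina 1.8+ encoding: Phred score + 33
    out = bytearray()
    for base, q in zip(seq, qual):
        phred = ord(q) - 33
        assert 0 <= phred < 64, "scores >= 64 would need an escape code"
        out.append((B2I[base] << 6) | phred)
    return bytes(out)

def unpack(blob: bytes) -> tuple:
    seq = "".join(BASES[b >> 6] for b in blob)
    qual = "".join(chr((b & 0x3F) + 33) for b in blob)
    return seq, qual

seq, qual = "ACGTACGT", "IIIIFFFF"
blob = pack(seq, qual)
assert len(blob) == len(seq)      # 1 byte/base vs 2 for seq+qual = 50%
assert unpack(blob) == (seq, qual)
```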
You might try programming a virtual table to read your FASTQ files, then
transfer content from the virtual table into your real table using:

    INSERT INTO realtable SELECT * FROM virtualtable;

Information on virtual tables is at http://www.sqlite.org/vtab.html and
you can find example code in some of the extensions at
http://www.sqlite.org/src/tree?ci=trunk&name=ext/misc and in some of the
test modules at
http://www.sqlite.org/src/tree?ci=trunk&name=src&re=test_

> When I profiled the application (using Instruments), and after
> inverting the call tree, the time spent within the program breaks down
> as follows (-> means called by):
>
> 2170ms 15.1% - sqlite3VdbeExec -> sqlite3_step -> insert function
> 2142ms 14.9% - pwrite -> unixWrite -> pager_write/pager_write_pagelist
> 1925ms 14.9% - std::string::insert -> boost::spirit
>  539ms  3.7% - pack (my compression function)
> ...
>
> The performance profile seems to suggest that I am already spending the
> majority of the time actually doing inserts inside sqlite3 - I don't
> really see any way to improve insert throughput any further.
>
> Hope someone more knowledgeable in this area can help me out with this!
>
> Kevin Xu
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users

--
D. Richard Hipp
d...@sqlite.org
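[Editor's note: Python's sqlite3 module cannot register a virtual table - that needs the C API described at the vtab.html link above, or a wrapper such as apsw - but the shape of the suggestion, moving all rows with a single INSERT ... SELECT so the transfer runs entirely inside SQLite's VDBE, can be mimicked with a staging table in an attached database. The staging table is a stand-in; with a real virtual table the SELECT would read the FASTQ file directly.]

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE realtable (name TEXT, packed BLOB)")

# Stand-in for the virtual table: a staging table in an attached memory db.
con.execute("ATTACH ':memory:' AS staging")
con.execute("CREATE TABLE staging.src (name TEXT, packed BLOB)")
con.executemany("INSERT INTO staging.src VALUES (?, ?)",
                [(f"read{i}", bytes(4)) for i in range(1000)])

# The single-statement bulk transfer suggested above: no per-row round
# trips through application code, just one statement in one transaction.
with con:
    con.execute("INSERT INTO realtable SELECT * FROM staging.src")

print(con.execute("SELECT count(*) FROM realtable").fetchone()[0])  # -> 1000
```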