On Tue, Apr 1, 2014 at 11:58 PM, Kevin Xu <accol...@gmail.com> wrote:

> Hi everyone, I have been working on a small bioinformatics project that
> attempts to store FASTQ (http://en.wikipedia.org/wiki/FASTQ_format) data
> in a SQLite database.
>
> The app works by reading the FASTQ file (through a Boost memory-mapped
> region), running the parser over it, binding each parsed value with
> sqlite3_bind_xxx, and stepping a prepared INSERT statement, all wrapped
> within BEGIN...END statements.
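>
> In outline the loop looks something like this (a minimal sketch, not
> the real code - the actual table has 13 columns and the names here are
> illustrative):
>
>     #include <sqlite3.h>
>     #include <string>
>     #include <vector>
>
>     struct Record {          // stand-in for the 13-member parser struct
>         std::string name;
>         std::string packed;  // ~151-byte packed sequence+quality blob
>     };
>
>     void insert_all(sqlite3 *db, const std::vector<Record> &recs) {
>         sqlite3_exec(db, "BEGIN", 0, 0, 0);       // one big transaction
>         sqlite3_stmt *stmt = 0;
>         sqlite3_prepare_v2(db,
>             "INSERT INTO reads(name, packed) VALUES(?1, ?2)",
>             -1, &stmt, 0);
>         for (const Record &r : recs) {
>             sqlite3_bind_text(stmt, 1, r.name.c_str(), -1, SQLITE_STATIC);
>             sqlite3_bind_blob(stmt, 2, r.packed.data(),
>                               (int)r.packed.size(), SQLITE_STATIC);
>             sqlite3_step(stmt);    // execute the INSERT
>             sqlite3_reset(stmt);   // reuse the compiled statement
>         }
>         sqlite3_finalize(stmt);
>         sqlite3_exec(db, "END", 0, 0, 0);         // commit everything
>     }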
>
> The FASTQ file is parsed using a Boost::Spirit parser I wrote that
> emits structs with 13 members, which get bound to the parameters of a
> sqlite3 prepared statement. One of those fields is a BLOB column that
> holds about 151 bytes of data per row (if you look at the FASTQ format,
> I basically combine the qualities and sequences using a simple
> proprietary "compression" scheme that achieves 50% compression). There
> are no indexes other than the default rowid.
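>
> One detail in the bind: the fifth argument to sqlite3_bind_blob()
> controls whether SQLite copies those 151 bytes (illustrative - the
> right choice depends on the buffer's lifetime):
>
>     // buffer outlives sqlite3_step(): bind in place, no copy
>     sqlite3_bind_blob(stmt, 2, ptr, len, SQLITE_STATIC);
>     // buffer may be reused before step(): let SQLite take a copy
>     sqlite3_bind_blob(stmt, 2, ptr, len, SQLITE_TRANSIENT);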
>
> I have several datasets I tested this on, with 100K, 500K, and 91M
> sequences (each sequence is made up of 4 lines of ASCII text in the
> original file). The data is paired-end, meaning it comes in pairs: if
> one FASTQ file contains 100K sequences, there is a paired FASTQ file
> containing another 100K from the "other end" of the DNA fragment. I
> have currently achieved:
>
> 100K rows - 1.11s
> 500K rows - 5.12s
> 182M rows - 1772s
>

So you are creating a 27.5 GB database in under a half hour?

What is your database page size?
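
The page size can only be set before the first table is created (or by
a subsequent VACUUM), so if you want to experiment with it, do something
like this right after opening the database (a sketch):

     /* try a larger page; takes effect only on a new or empty file */
     sqlite3_exec(db, "PRAGMA page_size=8192;", 0, 0, 0);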


>
> which works out to approximately 90K-100K inserts per second. The
> testing platform is currently an i7 laptop with an SSD, and the project
> might transition to a computing cluster in a university environment if
> it works out. I am trying to see if it is possible to achieve even
> higher throughput - my professor was adamant that he once managed to
> insert 3 million rows in 7 seconds (over 420K inserts per second),
> though he could not find the code that did it or any record of those
> runs.
>

You might try programming a virtual table to read your FASTQ files, then
transfer content from the virtual table into your real table using:

     INSERT INTO realtable SELECT * FROM virtualtable;

Information on virtual tables is at http://www.sqlite.org/vtab.html and you
can find example code in some of the extensions at
http://www.sqlite.org/src/tree?ci=trunk&name=ext/misc and in some of the
test modules at http://www.sqlite.org/src/tree?ci=trunk&name=src&re=test_
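
A heavily abridged skeleton of such a read-only module, to give the
flavor (the parsing bodies, error handling, and argv handling are
omitted, and the two columns stand in for your 13 - a starting point,
not working code):

     #include <sqlite3.h>
     #include <string.h>

     typedef struct FastqCursor {
       sqlite3_vtab_cursor base;  /* must be the first member */
       /* ... mapped FASTQ file, current offset, parsed record ... */
       int eof;
     } FastqCursor;

     static int fastqConnect(sqlite3 *db, void *pAux, int argc,
                             const char *const*argv,
                             sqlite3_vtab **ppVtab, char **pzErr){
       int rc = sqlite3_declare_vtab(db,
           "CREATE TABLE x(name TEXT, packed BLOB)" /* ...13 cols... */);
       if( rc==SQLITE_OK ){
         *ppVtab = (sqlite3_vtab *)sqlite3_malloc(sizeof(sqlite3_vtab));
         memset(*ppVtab, 0, sizeof(sqlite3_vtab));
       }
       return rc;
     }
     static int fastqDisconnect(sqlite3_vtab *p){
       sqlite3_free(p);
       return SQLITE_OK;
     }
     static int fastqOpen(sqlite3_vtab *p, sqlite3_vtab_cursor **ppCur){
       FastqCursor *c = (FastqCursor *)sqlite3_malloc(sizeof(FastqCursor));
       memset(c, 0, sizeof(*c));
       *ppCur = &c->base;
       return SQLITE_OK;
     }
     static int fastqClose(sqlite3_vtab_cursor *cur){
       sqlite3_free(cur);
       return SQLITE_OK;
     }
     static int fastqFilter(sqlite3_vtab_cursor *cur, int idxNum,
                            const char *idxStr, int argc,
                            sqlite3_value **argv){
       /* rewind to the first record and parse it */
       return SQLITE_OK;
     }
     static int fastqNext(sqlite3_vtab_cursor *cur){
       /* parse the next 4-line record; set eof at end of file */
       return SQLITE_OK;
     }
     static int fastqEof(sqlite3_vtab_cursor *cur){
       return ((FastqCursor *)cur)->eof;
     }
     static int fastqColumn(sqlite3_vtab_cursor *cur,
                            sqlite3_context *ctx, int i){
       /* sqlite3_result_text()/sqlite3_result_blob() for column i */
       return SQLITE_OK;
     }
     static int fastqRowid(sqlite3_vtab_cursor *cur,
                           sqlite3_int64 *pRowid){
       *pRowid = 0;  /* running record number */
       return SQLITE_OK;
     }
     static int fastqBestIndex(sqlite3_vtab *tab, sqlite3_index_info *p){
       return SQLITE_OK;  /* full scan only; nothing to optimize */
     }

     static sqlite3_module fastqModule = {
       0,                            /* iVersion */
       fastqConnect, fastqConnect,   /* xCreate, xConnect */
       fastqBestIndex,
       fastqDisconnect, fastqDisconnect,
       fastqOpen, fastqClose,
       fastqFilter, fastqNext, fastqEof,
       fastqColumn, fastqRowid,
       0, 0, 0, 0, 0, 0, 0           /* xUpdate..xRename: read-only */
     };

Register it with sqlite3_create_module(db, "fastq", &fastqModule, 0),
have xCreate/xConnect pick the filename out of argv, and then:

     CREATE VIRTUAL TABLE temp.ft USING fastq('reads_1.fastq');
     INSERT INTO realtable SELECT * FROM ft;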



>
> When I profiled the application (using Instruments) and inverted the
> call tree, the time spent within the program broke down as follows
> ("->" means "called by"):
>
> 2170ms 15.1% - sqlite3VdbeExec -> sqlite3_step -> insert function
> 2142ms 14.9% - pwrite -> unixWrite -> pager_write/pager_write_pagelist
> 1925ms 13.4% - std::string::insert -> boost::spirit
>  539ms  3.7% - pack (my compression function)
> ...
>
> The performance profile seems to suggest that I am already spending
> the majority of the time actually doing inserts inside sqlite3 - I
> don't see any way to improve insert throughput much further.
>
> Hope someone more knowledgeable in this area can help me out with this!
>
> Kevin Xu



-- 
D. Richard Hipp
d...@sqlite.org
_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
