Hi everyone,

I have been working on a small bioinformatics project that attempts to store FASTQ (http://en.wikipedia.org/wiki/FASTQ_format) data in a SQLite database.
The app works by reading in the FASTQ file (through Boost memory-mapped regions), running the parser over it, binding the variables using sqlite3_bind_xxx, and inserting them via a prepared statement wrapped within BEGIN...END statements. The FASTQ file is parsed using a Boost::Spirit parser I wrote, which emits structs with 13 members that get bound to the fields of the sqlite3 prepared statement. One of those fields is a BLOB column that contains about 151 bytes of data per row (if you look at the FASTQ format, I basically combined the quality and sequence strings using a simple proprietary "compression" scheme that achieves 50% compression). There are no indexes other than the default rowid.

I have several datasets I tested this on, with 100K, 500K, and 91 million sequences (each sequence is made up of 4 lines of ASCII text in the original file). The data is paired-end, which means the datasets come in pairs: if one FASTQ file contains 100K sequences, there will be a paired FASTQ file containing another 100K sequences from the "other end" of the DNA fragment.

I have currently achieved:

  100K rows  -  1.11s
  500K rows  -  5.12s
  182M rows  -  1772s

which works out to approximately 90K-100K inserts per second. The testing platform is currently an i7 laptop with an SSD, and it might transition to a computing cluster in a university environment if this works out.

I am trying to see if it is possible to achieve even higher throughput - my professor was adamant that he once managed to insert 3 million rows in 7 seconds (which works out to over 420K inserts per second), though he could not find the code that did it or any records of his insertions.
When I profiled the application (using Instruments) and inverted the call tree, the time spent within the program breaks down as follows (-> means "called by"):

  2170ms  15.1%  - sqlite3VdbeExec -> sqlite3_step -> insert function
  2142ms  14.9%  - pwrite -> unixWrite -> pager_write/pager_write_pagelist
  1925ms  14.9%  - std::string::insert -> boost::spirit
   539ms   3.7%  - pack (my compression function)
  ...

The performance profile seems to suggest that I am already spending the majority of the time actually doing inserts inside sqlite3 - I don't really see any way to improve insert throughput any further. Hope someone more knowledgeable in this area can help me out with this!

Kevin Xu

_______________________________________________
sqlite-users mailing list
sqlite-users@sqlite.org
http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users