Hi RSmith,

Thank you very much for your insights, and I agree with you on the role
of disk speed vs. CPU in this process. However, in the bioinformatics
application I am maintaining, we have huge databases that easily
exceed the size of regular SSD drives (512 GB, for example), so we have
to rely on disk arrays of regular hard drives, with less-than-optimal
random access speed.

One of the problems we have is the performance of analyzing large
tables one by one. For example, we are puzzled that it takes anywhere
from 40 seconds to 10 minutes to copy tables of similar sizes from one
database to another. We are still trying to understand the problem, but
I suspect that the layout of records plays a role here. For
example, if we create 1000 tables and insert records into them in batches,
the data of each table will likely be spread evenly across the whole
database, so almost all pages need to be read when we go through the
records of a single table. One of the things I am trying to do (to improve
the performance of analysis) is to copy tables to several temporary
databases before analysis, and I am hoping that parallelization can
speed up the copy process. Ideally, if we are reading from one
database and feeding several destination databases, we only need to
read the source database once, thanks to the caching of disk contents.
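
To make the plan concrete, the kind of worker I have in mind is roughly
the following, essentially the "SELECT on one connection, INSERT on
another" pattern you describe, just run in several processes at once.
The file names, table names and batch size are made up, it assumes the
destination tables already exist and that each worker gets a disjoint
set of tables, and it opens the source read-only through a URI filename
(which needs a Python/SQLite combination that understands URI filenames):

    import sqlite3
    from multiprocessing import Process

    def copy_tables(source_path, dest_path, tables, batch_size=100000):
        # Read-only connection to the shared source, writable connection
        # to this worker's own destination database.
        src = sqlite3.connect("file:%s?mode=ro" % source_path, uri=True)
        dst = sqlite3.connect(dest_path)
        for table in tables:
            cur = src.execute("SELECT * FROM %s" % table)
            placeholders = ",".join("?" * len(cur.description))
            insert = "INSERT INTO %s VALUES (%s)" % (table, placeholders)
            while True:
                rows = cur.fetchmany(batch_size)
                if not rows:
                    break
                dst.executemany(insert, rows)
            dst.commit()
        src.close()
        dst.close()

    if __name__ == '__main__':
        # Spread 1000 (hypothetical) tables tbl_0 .. tbl_999 over 4 destinations.
        groups = [["tbl_%d" % i for i in range(j, 1000, 4)] for j in range(4)]
        procs = [Process(target=copy_tables,
                         args=("source.db", "dest_%d.db" % j, grp))
                 for j, grp in enumerate(groups)]
        for p in procs:
            p.start()
        for p in procs:
            p.join()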

Anyway, I found a patch that adds a 'READONLY' keyword to 'ATTACH
DATABASE' at
http://old.nabble.com/Re%3A-Question-regarding-modified-ATTACH-command-p26178713.html.
It applies to sqlite-3.6.17, but I will try to adapt it to the current
version of SQLite and see if it helps. I will also try to VACUUM the
main database so that the data of each table is clustered together (I am
not sure whether this will actually happen, or whether it helps
performance), which will take days to complete.
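
As an aside, instead of patching, it might be possible to get a read-only
attach through a URI filename, assuming our SQLite build and Python
version support them (the file and table names below are made up; this is
only a sketch under that assumption):

    import sqlite3

    # Open the destination read-write as the main database; uri=True asks
    # SQLite to interpret URI filenames on this connection, which should
    # also apply to ATTACH statements executed on it.
    dest = sqlite3.connect("dest_0.db", uri=True)

    # Attach the shared source database read-only via mode=ro.
    dest.execute("ATTACH DATABASE 'file:source.db?mode=ro' AS src")

    # Copy a (hypothetical) table; create it in the destination if needed.
    dest.execute("CREATE TABLE IF NOT EXISTS some_table AS "
                 "SELECT * FROM src.some_table WHERE 0")
    dest.execute("INSERT INTO some_table SELECT * FROM src.some_table")
    dest.commit()
    dest.close()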

Thanks,
Bo


On Mon, Aug 5, 2013 at 7:46 PM, RSmith <rsm...@rsweb.co.za> wrote:
> Hi Bo,
>
> Read-Only belongs to the connection, so I don't think you can do what you
> want to do here, but that should not be a problem. It would be very easy to
> run the SELECT query on one DB connection and in code bind the results from
> it to another DB connection's INSERT query. However, what you are trying to
> do is moot unless you have multiple source drives/servers/whatever.
> The reason I say this is that it takes no real processor effort to copy the
> data; you could probably easily copy a hundred times the data on a mediocre
> system with a mediocre processor. The bottleneck is always the file system
> or drive hardware or service. Even a modern, very fast SSD in RAID is several
> orders of magnitude slower than the throughput of the processor bus or even the
> system bus (on last count a standard* i7's system bus could shift some
> 6.4GB/s in a normal setup, whereas the above described drive would probably
> do 1GB/s on paper - an insane speed for a drive, but a bit lacklustre in
> processor terms). So maybe if you are reading it from different sources it
> might make some difference, although the destination file would still be on
> a single device, which, unless it is completely in memory**, would be very
> much the bottleneck. Multi-threading will not bring any useful improvement.
>
> If you are hoping for the query prepares to go faster, that is unlikely too,
> as each query will have to be prepared fully per connection, and after that,
> multi-threading the execution of the queries puts you back at square one,
> as described above. Trust me, if there were any real advantage to be gained
> from this, SQLite's developers would have multi-threaded it aeons ago.
>
> Would this be a standard sort of thing your system needs to do, or a
> once-in-a-while maintenance type of routine?
>
>
> * - The same approximate ratios hold true for other manufacturers (AMD,
> NVIDIA, etc.) and other architectures such as ARM, and are in general worse
> for any other type of drive.
>
> ** - Actually, come to think of it, even the memory system and bus would
> be significantly slower than the rate at which a processor can push those
> bytes, so you still won't win with multi-threading. Any form of
> multi-threading primarily benefits the processor's ability to push more
> FLOPS through different pipelines; this helps calculations a lot, but is
> the least concern when copying data.
>
>
>
> On 2013/08/06 01:15, Bo Peng wrote:
>>
>> Dear sqlite experts,
>>
>> I would like to copy some tables from a sqlite database to others
>> (e.g.  'INSERT INTO to_db.table SELECT * FROM from_db.table'  where
>> to_db is attached to from_db). To improve the performance of this
>> process, I am trying to run several processes, each reading from
>> "from_db" and writing to a different "to_db". However, I have to connect
>> to from_db in read-only mode (python sqlite module with
>> check_same_thread set to False) in order to allow multiple processes to
>> access this database simultaneously. Then, if I attach a 'to_db' to
>> 'from_db', the 'to_db' is still read-only, so I cannot write to it. Is
>> there any way I can attach a database in read-write mode to a
>> read-only main database? Or, if we make the destination database the
>> main read-write database, can I attach a database to a read-write
>> database in read-only mode?
>>
>> Many thanks in advance,
>> Bo