A few observations....
1. My condolences; with those dimensions you are heading for "big
data"/Hadoop land.

2. Worry about the number of rows; that's what feeds into the big-O:
O(n). Assuming your 150 columns translate into a 1.5 KB to 2 KB record
length, your 300 GB file must hold 150 to 200 million records. That's a
lot of n for O(n).
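
Back of the envelope:

    300 GB / 2.0 KB per record = 150 million records
    300 GB / 1.5 KB per record = 200 million records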

3. SSDs are faster than spinning disks at random access because they
have essentially zero track-to-track seek time. On the other hand, once
the read-write head is over the correct sector, spinning disks have very
highly evolved I/O for consecutive (unfragmented) sectors. By contrast,
the controllers that provide the interface to solid-state storage are
not always as evolved, so an SSD may still have lower sequential
bandwidth (for example, a "Class 10" SD card designed for HD video is,
unless otherwise marked, only rated at 10 MB per second).
http://en.wikipedia.org/wiki/Secure_Digital

However, a device speed of 10 MB per second would put even a SCSI
device near the bottom of the speed hierarchy.
http://en.wikipedia.org/wiki/SCSI

Serial ATA (SATA) is an even faster interface.
http://en.wikipedia.org/wiki/Serial_ATA

So, I would recommend double-checking the specs of your SSD rather than
making the reasonable, but not yet universally true, assumption that
solid-state has to be faster than mechanical.

One strategy that might work is to keep an entirely separate (not
joined) 8-column table: develop your queries (targets) against that
table, write the primary keys of the rows you are interested in out to
a third table (CREATE TABLE keepers AS SELECT primarykey FROM
narrowtable WHERE yourquery;), and then JOIN that row-reduced key table
to the main table. If the row-reduced table has millions fewer records
(primary keys), that's millions of rows in the main 150-column table
against which the compiled SQL query never has to be evaluated. A
sketch follows below.
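
In SQLite terms, the round trip might look roughly like this (table,
column, and filter names are placeholders, not your actual schema):

    -- Narrow table: just the primary key and the 8 search columns
    CREATE TABLE narrowtable AS
    SELECT primarykey, col1, col2, col3, col4, col5, col6, col7, col8
      FROM maintable;

    -- Materialize the keys that satisfy the target query
    CREATE TABLE keepers AS
    SELECT primarykey
      FROM narrowtable
     WHERE col1 = 'target' AND col5 > 100;  -- stand-in for yourquery

    -- Pull only the matching rows from the wide table
    SELECT m.*
      FROM keepers AS k
      JOIN maintable AS m ON m.primarykey = k.primarykey;

The last step assumes primarykey is maintable's INTEGER PRIMARY KEY (or
at least indexed), so the JOIN is a series of keyed lookups rather than
another full scan of the 150-column table.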

HTH,

Jim Callahan
Orlando, FL

On Wed, Aug 6, 2014 at 11:51 PM, Paul Dillon <paul.dil...@gmail.com> wrote:

> Hello,
>
> I would like to know if splitting a big table into two smaller ones, and
> then using a join in my queries would speed up performance.
>
> My table is 100-300GB in size and has about 150 columns.  There are 8
> fields that I frequently use in my queries, which require full table scans
> to complete.  I usually query each field once per data load, so the time to
> index them is not worth it.  (FYI I am data mining).
>
> So my questions are:
>
> 1. Will moving these 8 query fields to a smaller table improve query
> performance when joined to the larger table?  My logic is that this small
> table would only be about 5% the size of the full table, so the full table
> scan to service the query might be faster.
>
> 2. Would it be worthwhile having that smaller table in a separate .db file,
> so that the data is closer together on my SSD drive?
>
> 3. Would the data loading performance be heavily impacted if I had to
> insert into two tables instead of one?  I use "INSERT OR REPLACE" for my
> loading, with a single index.
>
> Many Thanks,
>
> Paul
> _______________________________________________
> sqlite-users mailing list
> sqlite-users@sqlite.org
> http://sqlite.org:8080/cgi-bin/mailman/listinfo/sqlite-users
>