Thanks everyone for your answers,
I made some changes to the database based on the information you
gave me. It improved the performance of the query by about 20% (the
query now takes 4 seconds instead of 5).
Here is some more information regarding all the suggestions I
received:
- The version of SQLite I use is the one provided by Debian (current
stable: wheezy).
- I need the database to be opened in read-only mode, so I did not
enable WAL mode.
- The existing indexes cannot be removed because they are used by
other queries.
- However, I did remove the unused indexes for the test results below.
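As an aside on the read-only point: a read-only open can be enforced at the connection level with a URI filename (mode=ro). A minimal Python sketch (the file path and one-column table are made up for the example):

```python
import os
import sqlite3
import tempfile

# Create a small throwaway database to open read-only afterwards.
path = os.path.join(tempfile.mkdtemp(), "flows.db")
con = sqlite3.connect(path)
con.execute("CREATE TABLE flows (timestamp INTEGER)")
con.commit()
con.close()

# mode=ro makes SQLite refuse writes at open time, independent of
# file-system permissions.
ro = sqlite3.connect(f"file:{path}?mode=ro", uri=True)
try:
    ro.execute("INSERT INTO flows VALUES (1)")
except sqlite3.OperationalError as e:
    print(e)  # "attempt to write a readonly database"
```

Note that read-only and WAL are not strictly exclusive, but a read-only reader of a WAL database needs the -shm file to exist or be creatable, so staying in the default rollback-journal mode is the simpler choice here.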
Note that my benchmarks are run in batch mode, with sqlite3 as well as
with mysql. I stop and restart the mysql daemon to avoid most caching
(I hope).
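For comparison, the query can also be timed in-process, which excludes shell and client start-up overhead. A minimal sketch on synthetic data (the row contents are random; only the column names are taken from the query below):

```python
import random
import sqlite3
import time

# In-memory stand-in for the real flows table, filled with random rows.
con = sqlite3.connect(":memory:")
con.execute("""CREATE TABLE flows (
    timestamp INTEGER, protocol INTEGER, portLan INTEGER,
    nbBytesDecompOut INTEGER, nbBytesCompIn INTEGER)""")
rng = random.Random(42)
con.executemany(
    "INSERT INTO flows VALUES (?, ?, ?, ?, ?)",
    ((rng.randint(1383770600, 1384770600), rng.choice((6, 17)),
      rng.randint(1024, 65535), rng.randint(0, 10**6), rng.randint(0, 10**6))
     for _ in range(10000)))
con.commit()

start = time.perf_counter()
rows = con.execute("""
    SELECT SUM(nbBytesDecompOut + nbBytesCompIn) AS vol, portLan AS item
    FROM flows
    WHERE timestamp >= 1383770600 AND timestamp <= 1384770600
      AND protocol IN (17, 6)
    GROUP BY portLan ORDER BY vol DESC LIMIT 6""").fetchall()
print(f"{len(rows)} rows in {time.perf_counter() - start:.3f}s")
```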
# For mysql, I use:
/etc/init.d/mysql stop; /etc/init.d/mysql start; \
time echo 'SELECT SUM(nbBytesDecompOut+nbBytesCompIn) as vol, portLan \
as item FROM flows WHERE timestamp>=1383770600 AND \
timestamp<=1384770600 AND protocol IN (17, 6) GROUP BY portLan \
ORDER BY vol DESC LIMIT 6;' |
mysql testperf
2783898050 33722
1374153827 33736
886842830 39155
655809252 51800
363040479 53153
358988337 59757
real 0m1.067s
user 0m0.000s
sys 0m0.000s
# For sqlite, I use:
time echo 'SELECT SUM(nbBytesDecompOut+nbBytesCompIn) as vol, portLan
as item FROM flows WHERE timestamp>=1383770600 AND
timestamp<=1384770600 AND protocol IN (17, 6) GROUP BY portLan
ORDER BY vol DESC LIMIT 6;' |
sqlite3 /var/db/udcast/flow_stats_OA_1M.db
2783898050|33722
1374153827|33736
886842830|39155
655809252|51800
363040479|53153
358988337|59757
real 0m4.405s
user 0m1.812s
sys 0m2.580s
Here is the time spent in the query as a function of the number of rows
matching the WHERE clause (ANALYZE was run beforehand):
PERIOD (s)      MIN TS      MAX TS     ROWS     TIME
----------------------------------------------------
      3600  1384767000  1384770600    35113  0:00.06
      7200  1384763400  1384770600    67611  0:00.11
     21600  1384749000  1384770600   154592  0:00.69
     43200  1384727400  1384770600   270728  0:01.18
     86400  1384684200  1384770600   501871  0:02.20
       all  1383770600  1384770600  1000000  0:04.44
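One untested idea on my side (an assumption, not something anyone suggested): the query reads only five of the 41 columns, so a covering index over exactly those columns should let SQLite answer it from the index alone, without loading the wide rows. Sketch (the index name is my own invention):

```python
import sqlite3

con = sqlite3.connect(":memory:")
# Only the five columns the query touches; the real table has 41.
con.execute("""CREATE TABLE flows (
    timestamp INTEGER, protocol INTEGER, portLan INTEGER,
    nbBytesDecompOut INTEGER, nbBytesCompIn INTEGER)""")
# Hypothetical covering index: every column the query reads.
con.execute("""CREATE INDEX idx_flows_cover ON flows
    (timestamp, protocol, portLan, nbBytesDecompOut, nbBytesCompIn)""")

plan = con.execute("""EXPLAIN QUERY PLAN
    SELECT SUM(nbBytesDecompOut + nbBytesCompIn) AS vol, portLan AS item
    FROM flows
    WHERE timestamp >= 1383770600 AND timestamp <= 1384770600
      AND protocol IN (17, 6)
    GROUP BY portLan ORDER BY vol DESC LIMIT 6""").fetchall()
for row in plan:
    print(row)  # the plan should mention "USING COVERING INDEX idx_flows_cover"
```

The trade-off is one more (wide) index to maintain on every insert, which may matter for the collecting daemon.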
The 20% improvement is nice, but MySQL (even without caching) is still
far ahead for now (4 times faster). Other ideas are welcome.
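Regarding the SQLITE_ENABLE_STAT4 suggestion from earlier in the thread: whether the installed library was built with it can be checked via PRAGMA compile_options. Quick sketch:

```python
import sqlite3

# List the compile-time options of the linked SQLite library.
con = sqlite3.connect(":memory:")
opts = [row[0] for row in con.execute("PRAGMA compile_options")]
print("ENABLE_STAT4" in opts)  # options are reported without the SQLITE_ prefix
```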
Thanks again!
2015-04-01 12:52 GMT+02:00 GB <gbinfo at web.de>:
> In case of SELECTing "all available" I recommend invoking a different
> statement without the timestamp-part instead of providing some min and max
> values for timestamp. This avoids tricking the query planner into some
> wrong decisions (if you have an index with protocol as the first column).
>
> And how about WAL mode? If concurrency is of any concern for you, this
> definitely is something worth a try.
>
> -- GB
>
>
> Jeff Roux wrote on 31.03.2015 at 12:48:
>
>> Thanks everyone for the answers.
>>
>> I won't be able to run any tests today; I will come back to you soon
>> with additional information. Just to say that, in the worst case, the
>> WHERE clause selects the entire data set, i.e. 1000000 entries. The
>> user can select a time range of 1 hour, 3 hours, 1 day, and "all
>> available". Note: before being added to the database, the timestamps
>> are aggregated over a 180-second period, and many rows have the same
>> timestamp (~ one row per TCP session).
>>
>> All the columns are defined as INTEGER. There are 41 columns in total
>> in the flow table. If I remember correctly, there is no primary key
>> defined for this table.
>>
>> 2015-03-31 8:32 GMT+02:00 GB <gbinfo at web.de>:
>>
>>> From what I see, I assume that timestamp gives the highest selectivity.
>>> Taking into account that protocol is SELECTed on and portLan is GROUPed
>>> BY, I'd try an index (timestamp, protocol, portLan) (not sure if portLan
>>> helps here, but it's worth a try, I think). Don't forget to ANALYZE, of
>>> course. Are your columns of INTEGER affinity? If they are of TEXT
>>> affinity, they will store everything as TEXT, which may make a
>>> difference in both space consumption and speed. Is your SQLite lib
>>> built with SQLITE_ENABLE_STAT4 enabled? If not, give it a try. It
>>> sometimes makes a big difference.
>>>
>>> Is it possible that data collection and retrieval happen at the same
>>> time?
>>> If so, try running the database in WAL mode, it should help with
>>> concurrency issues.
>>>
>>> -- GB
>>>
>>>
>>> Jeff Roux wrote on 30.03.2015 at 11:46:
>>>
>>> Hi everyone,
>>>>
>>>> I have a daemon that collects information and stores it in a SQLite
>>>> database. The table has 1 million rows.
>>>>
>>>> This daemon is running on an HP server with 12 cores, 32 GB of RAM,
>>>> and an SSD drive. I have performance issues with some queries. For
>>>> instance, the following query takes more than 5 seconds to complete
>>>> with SQLite3 (in that particular case, the WHERE clause selects all
>>>> the data in the database, i.e. 1000000 rows):
>>>>
>>>> SELECT SUM(nbBytesDecompOut + nbBytesCompIn) as vol, portLan as item
>>>> FROM flows
>>>> WHERE timestamp>=1383770600 AND timestamp<=1384770600 AND protocol IN
>>>> (17, 6)
>>>> GROUP BY portLan ORDER BY vol DESC LIMIT 6;
>>>>
>>>> I have done some tests with or without "INDEXED BY" clauses and got
>>>> nearly the same results.
>>>>
>>>> I compared the performance with MySQL, and the same query takes
>>>> less than 1 second to complete.
>>>>
>>>> Could you give me some directions for optimizing this kind of query
>>>> with SQLite3 when there is a large amount of data in the table? I
>>>> need to triple the number of rows, and then the performance will
>>>> become unacceptable for my application.
>>>>
>>>> Thanks in advance.
>>>> _______________________________________________
>>>> sqlite-users mailing list
>>>> sqlite-users at mailinglists.sqlite.org
>>>> http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
>>>>