On 2018/07/29 10:34 AM, Gerlando Falauto wrote:
> Hi,
>
> I'm totally new to sqlite and I'd like to use it for some logging

Welcome Gerlando. :)

> application on an embedded linux-based device. Data comes from
> multiple (~10), similar sources at a steady rate. The rolling data
> set would be in the size of 20 GB. Such an amount of storage would
> suffice to retain data from the previous week or so.
>
> Reading the documentation https://www.sqlite.org/whentouse.html
> somehow suggests the usage of sharding:
>
>> Concurrency is also improved by "database sharding": using separate
>> database files for different subdomains. For example, the server
>> might have a separate SQLite database for each user, so that the
>> server can handle hundreds or thousands of simultaneous connections,
>> but each SQLite database is only used by one connection.
>
> In my case I would be doing sharding on the data source and/or the
> day of the timestamp, so as to have individual files in the size of
> a few hundred MB. This way, deleting the oldest data would be as
> easy as deleting the corresponding file.

I think you are perhaps missing a core idea here: the only use case that really requires sharding is very high write-concurrency from multiple sources, and even then, to have any helpful effect the sharding needs to split along "write sources", not along events, time-frames or the like.

SQLite will very happily run a 20GB (or much larger) database written to from many sources, happily delete old data from it and pump new data in without needing much in the way of "help", AND then produce fast queries without much fanfare.
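To illustrate, here is a minimal sketch of the single-DB alternative (the schema, file name and one-week retention are my assumptions, not from your post): one table for all sources, an index on the timestamp, and a periodic DELETE instead of dropping per-day shard files.

```python
import sqlite3
import time

conn = sqlite3.connect("rolling_demo.db")
conn.execute("""
    CREATE TABLE IF NOT EXISTS log (
        source  TEXT NOT NULL,
        ts      REAL NOT NULL,   -- seconds since the epoch
        payload BLOB
    )
""")
# The index on ts is what keeps the periodic purge fast.
conn.execute("CREATE INDEX IF NOT EXISTS log_ts ON log(ts)")

# Insert a sample row.
conn.execute("INSERT INTO log(source, ts, payload) VALUES (?, ?, ?)",
             ("sensor-1", time.time(), b"reading"))

# Drop everything older than one week.
one_week_ago = time.time() - 7 * 24 * 3600
conn.execute("DELETE FROM log WHERE ts < ?", (one_week_ago,))
conn.commit()
conn.close()
```

The deleted pages are reused for new inserts, so the file size stabilizes around the rolling-week working set.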

The question that needs to be answered specifically is: how many data input sources are there? That is, how many processes will attempt to write to the database at the same time? Two processes obviously canNOT write at the same moment, so when a concurrent write happens, one process has to wait a few milliseconds. This compounds as more write sources are added or as the write frequency increases.
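That waiting is normally handled for you. As a sketch (file name is my assumption): in Python's sqlite3 module, the `timeout` argument installs SQLite's busy handler, so a connection waits up to N seconds for the competing writer's transaction to finish instead of failing immediately with "database is locked", and WAL journal mode lets readers and the writer proceed concurrently (writers still serialize against each other).

```python
import sqlite3

# Wait up to 5 s for a competing writer rather than erroring out.
conn = sqlite3.connect("busy_demo.db", timeout=5.0)

# WAL mode: readers no longer block the writer and vice versa.
conn.execute("PRAGMA journal_mode=WAL")

conn.execute("CREATE TABLE IF NOT EXISTS log(source TEXT, ts REAL, payload BLOB)")
conn.execute("INSERT INTO log VALUES ('sensor-1', 0.0, NULL)")
conn.commit()
conn.close()
```

The WAL setting is persistent: it is recorded in the database file, so later connections inherit it.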

If a single process is writing data to a single DB, however many different sources feed it, there is zero reason for sharding. If many processes are running, each with its own connection to the DB, AND they have high concurrency (i.e. high-frequency updates from many connections, which raises the incidence of simultaneous write attempts on a single DB file), then it starts becoming a good idea to allocate two or more DB files and split the connections between them, effectively lowering the write-collision frequency on any single file.
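If it did come to that, splitting connections between files can be as simple as the following hypothetical sketch (the shard count, file naming and schema are all my invention): each source name is mapped onto one of a few shard files, so two busy sources rarely contend for the same database file.

```python
import sqlite3

SHARDS = 4  # assumed shard count

def shard_connect(source: str) -> sqlite3.Connection:
    """Open the shard file that the given source writes to."""
    # sum of character codes: a stable mapping across process runs
    # (unlike Python's salted hash() for strings)
    shard = sum(source.encode()) % SHARDS
    conn = sqlite3.connect(f"shard_{shard}.db")
    conn.execute("CREATE TABLE IF NOT EXISTS log(source TEXT, ts REAL, payload BLOB)")
    return conn

conn = shard_connect("sensor-1")
conn.execute("INSERT INTO log VALUES ('sensor-1', 0.0, NULL)")
conn.commit()
conn.close()
```

Note this shards by write source, as described above, not by date: a per-day split would not reduce collisions at all, since all of today's writers would still land in today's file.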

Incidentally, all DBs offer some form of concurrency alleviation (load balancing, partitioning, etc.) which often also serves other purposes.

To get to the point: with the above in mind, do you still feel like you need to go the sharding route? Could you perhaps quote figures, such as: how many bytes would a typical data update be? How many updates per second, from how many different processes? Do you have a maintenance window (as in, is there a time of day, or a slot on a weekend or such, when you can go a few minutes without logging so one can clean up the old logs, perhaps VACUUM and re-ANALYZE)?
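For reference, such a maintenance job can be this small (the file name is my assumption): after the purge, VACUUM reclaims the freed pages and ANALYZE refreshes the statistics the query planner uses.

```python
import sqlite3

# isolation_level=None puts the connection in autocommit mode, so VACUUM
# is not blocked by an implicitly opened transaction.
conn = sqlite3.connect("maint_demo.db", isolation_level=None)
conn.execute("CREATE TABLE IF NOT EXISTS log(source TEXT, ts REAL, payload BLOB)")
conn.execute("VACUUM")    # rewrites the file, reclaiming free pages
conn.execute("ANALYZE")   # refreshes the query planner's statistics
conn.close()
```

VACUUM rewrites the whole file, which is why it wants a quiet window on a 20GB database.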

This will allow much better advice, and someone on here is bound to already have something just like that running and will be able to quickly give some experience hints.


_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
