On 2018/07/30 12:39 AM, Gerlando Falauto wrote:

The question that needs to be answered specifically is: How many data
input sources are there? As in, how many Processes will attempt to write
to the database at the same time? Two processes can obviously NOT write
at the same time, so if a concurrent write happens, one process has to
wait a few milliseconds. This gets compounded as more and more write
sources are added or as write-frequency increases.
When you say "a few milliseconds", how exactly do you measure that?

What is the timeframe where the second process has to wait if the
first one is doing a batch of a thousand writes?

This is very hard to say, hence my other questions in this regard. The time needed depends firstly on how much data needs to be inserted; then on how many inserts are required (if batched); then on how many Indexes you have on the table (each Index needs to be modified for every insert); then on whether the inserts contain only data or also functions that need to be calculated (especially UDFs); then on whether any triggers need to run upon Inserting; and most importantly of all, on how quick the hardware is that has to drink all this data (the hardware IO is the typical bottleneck).

A query doing a single insert of a few bytes with no Indexes, no triggers and no functions will be stupendously fast, whereas any increase in one or more of the above will slow things down. How much exactly is something you need to test; any guesswork will not be useful. What I can say is that if you don't have any Unique Indexes or triggers, the insert speed /should/ not change much with size.
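To make that concrete, here is a minimal sketch using Python's built-in sqlite3 module (the file names, schema and row count are made up purely for illustration) that times the same batch of inserts on a bare table versus the same table carrying two extra indexes:

# Rough sketch: time the same inserts with and without extra indexes.
# File names, schema and row count are arbitrary, for illustration only.
import os, sqlite3, time

def time_inserts(dbfile, extra_indexes):
    if os.path.exists(dbfile):
        os.remove(dbfile)
    con = sqlite3.connect(dbfile)
    con.execute("CREATE TABLE log(ts REAL, source TEXT, payload TEXT)")
    if extra_indexes:
        con.execute("CREATE INDEX idx_ts ON log(ts)")
        con.execute("CREATE INDEX idx_source ON log(source)")
    rows = [(time.time(), "proc1", "x" * 100) for _ in range(50000)]
    start = time.perf_counter()
    with con:                      # one transaction for the whole batch
        con.executemany("INSERT INTO log VALUES (?, ?, ?)", rows)
    con.close()
    return time.perf_counter() - start

print("no extra indexes :", time_inserts("bench_plain.db", False))
print("two extra indexes:", time_inserts("bench_indexed.db", True))

Run it on the actual target hardware; the numbers on a developer machine will tell you very little about the box that has to absorb the real load.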


To get to the point: With the above in mind, do you still feel like you
need to go the sharding route? Could you perhaps quote figures such as
how many bytes would a typical data update be? How many updates per
second, from how many different Processes?  Do you have a maintenance
Window (as in, is there a time of day or on a weekend or such that you
can go a few minutes without logging so one can clean up the old logs,
perhaps Vacuum and re-Analyze?)
In the current use case there's a single process. The way I see it, in
the near future it would probably increase to 3-4 processes,

Why? Are you adding other logging sources, or is this just a feeling that more processes would handle things better in some way?

each doing 10-100 writes per second or so. Each write would be around
1KB-20KB (one single text field, I guess).
100 x 20KB inserts = ~2MB inserted per second per process, x 4 processes = ~8MB per second, issuing around 400 file-lock cycles... Let's assume only 1 Index (apart from the rowid) and no triggers; then this should be easily handled, even on slower hardware. It is significant enough, though, that it should be tested before-hand with real-world operating parameters in place.
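For what it's worth, a quick sketch of such a test, with the figures above plugged in (~20KB of text per row, ~400 single-row transactions per second in total, one extra index; the file name and the WAL pragma are my own assumptions, not something from this thread):

# Rough feasibility sketch for the workload quoted above: ~400 single-row
# transactions per second, each carrying ~20KB of text, with one extra index.
# File name and journal_mode are assumptions made for this sketch.
import os, sqlite3, time

DB = "ingest_test.db"
ROWS_PER_SECOND = 400          # 4 processes x 100 writes/second
PAYLOAD = "x" * 20000          # ~20KB text field

if os.path.exists(DB):
    os.remove(DB)
con = sqlite3.connect(DB)
con.execute("PRAGMA journal_mode=WAL")
con.execute("CREATE TABLE log(ts REAL, payload TEXT)")
con.execute("CREATE INDEX idx_ts ON log(ts)")

start = time.perf_counter()
for _ in range(ROWS_PER_SECOND):
    with con:                  # one transaction (one lock cycle) per insert
        con.execute("INSERT INTO log VALUES (?, ?)", (time.time(), PAYLOAD))
elapsed = time.perf_counter() - start
print("one second's worth of data (%d rows, ~%.1f MB) took %.2fs"
      % (ROWS_PER_SECOND, ROWS_PER_SECOND * 20 / 1024.0, elapsed))

If one second's worth of data takes well under a second on the real hardware, you have headroom; if not, batching (below) becomes essential rather than just nice-to-have.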

I wonder if writing data in batches would be helpful though.
Most certainly. Fewer file-lock operations is better. Plan whatever you do so that logs are collected together and fired into the DB in single transactions at, say, 1-to-5-second intervals. How to decide the interval: How much data-time can you afford to lose if the system crashes? How fast is the data needed downstream by whatever is reading the DB?
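A minimal sketch of that idea, assuming a single in-process writer thread that drains a queue and commits whatever it has collected every couple of seconds (the queue, schema, file name and interval are illustrative assumptions):

# Batching sketch: producers push log rows onto a queue; one writer thread
# flushes them into SQLite in a single transaction every FLUSH_INTERVAL seconds.
# Names, schema and interval are illustrative assumptions.
import queue, sqlite3, threading, time

log_queue = queue.Queue()
FLUSH_INTERVAL = 2.0           # data-time you can afford to lose on a crash

def writer(dbfile="logs.db"):
    con = sqlite3.connect(dbfile)
    con.execute("CREATE TABLE IF NOT EXISTS log(ts REAL, source TEXT, payload TEXT)")
    while True:
        time.sleep(FLUSH_INTERVAL)
        batch = []
        while not log_queue.empty():
            batch.append(log_queue.get_nowait())
        if batch:
            with con:          # one transaction -> one lock cycle per batch
                con.executemany("INSERT INTO log VALUES (?, ?, ?)", batch)

threading.Thread(target=writer, daemon=True).start()

# Producers anywhere in the process then just do:
log_queue.put((time.time(), "proc1", "some log text"))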

But yeah, I guess sharding makes much less sense right now. :-)

Agreed.

Cheers,
Ryan
