On Mon, Jul 30, 2018 at 1:58 AM, R Smith <ryansmit...@gmail.com> wrote:

> On 2018/07/30 12:39 AM, Gerlando Falauto wrote:
>
>>
>>> The question that needs to be answered specifically is: How many data
>>> input sources are there? As in, how many Processes will attempt to write
>>> to the database at the same time? Two processes can obviously NOT write
>>> at the same time, so if a concurrent write happens, one process has to
>>> wait a few milliseconds. This gets compounded as more and more write
>>> sources are added or as write-frequency increases.
>>>
>> When you say "a few milliseconds", how exactly do you measure that?
>>
>> How long would the second process have to wait if the first one is doing
>> a batch of a thousand writes?
>>
>
> This is very hard to say, hence my other questions in this regard. The
> time needed depends firstly on how much data needs to be inserted; then on
> how many inserts are required (if batched); then on how many Indexes you
> have on the table (each Index needs to be modified for each insert); then
> on whether the inserts contain only data or also functions that need to be
> calculated (especially UDFs); and then on whether any triggers need to run
> upon Inserting. Most importantly of all: how quick is the hardware that
> has to drink all this data?  (The hardware IO is the typical bottleneck).
>
> A query doing a single insert of a few bytes, with no Indexes, no triggers
> and no functions, will be stupendously fast, whereas any increase in one or
> more of the above will slow things down. How much exactly is something you
> need to test; any guesswork will not be useful. What I can say is that if
> you don't have any Unique Indexes or triggers, the insert speed /should/
> not change much with size.
>

I see, thanks for the heads-up. What is it about Unique Indexes, though?
Why are they any worse than a regular index?
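
In the meantime I'll try to get a feel for the raw insert speed with
something along these lines (a rough Python sketch; the table, payload and
row count are placeholders, not the real schema). Comparing the two numbers
should also show how much batching buys:

    import sqlite3, time

    # Rough timing sketch -- table, columns and payload size are made up.
    conn = sqlite3.connect("bench.db")
    conn.execute("CREATE TABLE IF NOT EXISTS log (ts REAL, payload TEXT)")

    payload = "x" * 10_000              # ~10 KB, mid-range of 1KB-20KB
    n = 1000

    # Worst case: one transaction (one lock/sync cycle) per insert.
    start = time.perf_counter()
    for _ in range(n):
        with conn:
            conn.execute("INSERT INTO log VALUES (?, ?)",
                         (time.time(), payload))
    print("per-row commits:   ", time.perf_counter() - start, "s")

    # Batched: the same rows in a single transaction.
    start = time.perf_counter()
    with conn:
        conn.executemany("INSERT INTO log VALUES (?, ?)",
                         ((time.time(), payload) for _ in range(n)))
    print("single transaction:", time.perf_counter() - start, "s")

    conn.close()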



>
>>> To get to the point: With the above in mind, do you still feel like you
>>> need to go the sharding route? Could you perhaps quote figures such as:
>>> how many bytes would a typical data update be? How many updates per
>>> second, from how many different Processes? Do you have a maintenance
>>> Window (as in, is there a time of day, or a weekend or such, when you
>>> can go a few minutes without logging, so one can clean up the old logs,
>>> perhaps Vacuum and re-Analyze)?
>>>
>> In the current use case there's a single process. The way I see it, in
>> the near future it would probably increase to 3-4 processes,
>>
>
> Why? Are you adding more logging sources? Or is this just a feeling that
> more processes would handle it better in some way?
>

Could be adding more logging sources in the near future.



>
>> each doing 10-100 writes per second or so. Each write would be around
>> 1KB-20KB (one single text field, I guess).
>>
> 100 x 20KB inserts = 2MB inserted x 4 processes = 8MB per second inserted,
> issuing around 400 file-locks per second... Assuming only 1 Index (apart
> from the rowid) and no triggers, this should be easily handled, even on
> slower hardware. It is significant enough, though, that it should be tested
> beforehand with real-world operating parameters in place.
>

I see.
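
Just to restate the worst-case arithmetic for myself (ignoring index, page
and journal overhead):

    writes_per_sec = 100            # upper estimate, per process
    write_size     = 20 * 1024      # 20 KB, worst case
    processes      = 4
    total = writes_per_sec * write_size * processes
    print(total / 1_000_000)        # ~8.2 MB inserted per second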


>> I wonder if writing data in batches would be helpful though.
>>
> Most certainly. Fewer file-lock operations are better. Plan to collect the
> logs together and fire them into the DB in single transactions at, say, 1
> to 5 second intervals.
> How to decide the interval: How much data (in time) can you afford to lose
> if the system crashes? How soon is the data needed downstream by whatever
> reads the DB?
>
>
1 second would be my educated guess.
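Concretely, something like this is what I had in mind for the writer side
(just a sketch; the queue, table and interval are made up for illustration).
Each queue entry would be a (timestamp, text) tuple pushed by the logging
code:

    import queue, sqlite3, time

    log_queue = queue.Queue()       # filled by the logging source(s)

    def writer(db_path, interval=1.0):
        conn = sqlite3.connect(db_path)
        conn.execute("CREATE TABLE IF NOT EXISTS log (ts REAL, payload TEXT)")
        while True:
            time.sleep(interval)
            # Drain whatever has accumulated since the last commit.
            rows = []
            try:
                while True:
                    rows.append(log_queue.get_nowait())
            except queue.Empty:
                pass
            if rows:
                # One transaction (one lock/commit) per interval,
                # no matter how many rows piled up.
                with conn:
                    conn.executemany(
                        "INSERT INTO log(ts, payload) VALUES (?, ?)", rows)
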
While on the topic, I'm still not 100% sure I got the table schema right;
perhaps you can shed some more light.
The data I'd like to collect is sampled sensor data (sampling rate being
4000 Hz in the worst case).
One approach would be to timestamp each and every sample. This would be
very convenient for subsequent analysis, but I'm afraid it would kill
performance and waste a lot of disk space.
A different approach would be to produce far fewer rows (let's say, again,
one per second) and store a whole batch of samples in a single record.
But in that case I would need some way of
a) storing this data (JSON, BLOB...)
b) flattening (or pivoting?) it back, so that consecutive rows can be
analyzed in sequence as one long stream of samples
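
For approach b), what I was picturing is roughly this (a sketch only,
assuming SQLite's JSON1 functions are available; table and column names are
made up):

    import json, sqlite3, time

    conn = sqlite3.connect("sensors.db")
    conn.execute("CREATE TABLE IF NOT EXISTS samples (ts REAL, batch TEXT)")

    # One row per second, holding up to ~4000 samples as a JSON array.
    batch = [0.1 * i for i in range(4000)]      # fake sensor readings
    with conn:
        conn.execute("INSERT INTO samples VALUES (?, ?)",
                     (time.time(), json.dumps(batch)))

    # Flattening it back: json_each() expands each array element into its
    # own row, so consecutive batches read as one long stream of samples.
    for ts, idx, value in conn.execute(
            "SELECT s.ts, j.key, j.value"
            "  FROM samples AS s, json_each(s.batch) AS j"
            "  ORDER BY s.ts, j.key"):
        pass    # value is the idx-th sample of the batch written at ts

A BLOB of packed binary samples would be more compact than JSON, but then
the flattening would have to happen in application code rather than in the
query itself.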

Any suggestion about that?

Thanks again!
Gerlando



>> But yeah, I guess sharding makes much less sense right now. :-)
>>
>
> Agreed.
>
> Cheers,
> Ryan
>
_______________________________________________
sqlite-users mailing list
sqlite-users@mailinglists.sqlite.org
http://mailinglists.sqlite.org/cgi-bin/mailman/listinfo/sqlite-users
