[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800851#comment-13800851 ]

Aleksey Yeschenko commented on CASSANDRA-6134:
----------------------------------------------

bq. It runs every 1/2 of the write timeout and replays all batches written 
within 0.9 * write timeout from now. This way we ensure that batched updates 
will be replayed by the moment the client times out on the coordinator.

This isn't what we want to ensure, though. The current timeout (write timeout 
* 2) is there to account for the maximum batchlog write timeout plus the 
actual data write timeout. Avoiding extra mutations is IMO more important 
than having less delay in the failure scenario (and slow writes would happen 
more often than outright failures). And you definitely don't want to hammer 
an already slow node with twice the load. So -1 on this particular change.
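
To spell out the timing argument (names here are illustrative, not the actual 
code):

{code}
// A batch must stay in the log until both the batchlog write itself and the
// data writes could have timed out -- hence the current 2x factor.
static boolean eligibleForReplay(long batchWrittenAtMillis, long writeRpcTimeoutMillis)
{
    long cutoff = System.currentTimeMillis() - 2 * writeRpcTimeoutMillis;
    return batchWrittenAtMillis <= cutoff;
}
{code}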

bq. It submits all mutations from a single batch in parallel (like 
StorageProxy does). The old implementation replayed them one by one, so a 
client could see half-applied batches in the CF for a long time (depending 
on the size of the batch).

This is fine. But yeah, we could/should parallelize batchlog replay more (can 
be done w/out modifying the schema).
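
Roughly what I mean, as a hypothetical sketch (sendMutation() stands in for 
whatever async write path ends up being used):

{code}
// Fire off every mutation of the batch first, then wait for all of them,
// so the batch becomes visible near-atomically instead of row by row.
static void replayBatchParallel(List<RowMutation> mutations) throws Exception
{
    List<Future<?>> inFlight = new ArrayList<Future<?>>(mutations.size());
    for (RowMutation rm : mutations)
        inFlight.add(sendMutation(rm)); // async send, returns a Future
    for (Future<?> f : inFlight)
        f.get(); // wait for all sends to complete (or fail)
}
{code}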

bq. It fixes a subtle race bug with incorrect hint TTL calculation

Care to elaborate? I think there was a tricky open bug related to this, but I 
can't find the JIRA #.

To avoid random reads, we could read the mutation blob in 
replayAllFailedBatches() and pass it to replayBatch() (I thought we were 
already doing that). To make replay more async, as you suggest, we could read 
several batches and initiate their replay asynchronously instead of replaying 
them one by one (but w/ a RateLimiter in place).
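
Sketched out (helper names are mine, not actual code), the replay loop could 
look roughly like this:

{code}
// Read the blob as part of the scan, hand it straight to replayBatch(),
// and keep several replays in flight, throttled by a Guava RateLimiter.
RateLimiter limiter = RateLimiter.create(maxBatchesPerSecond);
List<Future<?>> inFlight = new ArrayList<Future<?>>();
for (UntypedResultSet.Row row : scanBatchlogPage()) // hypothetical pager
{
    limiter.acquire(); // throttle before starting another replay
    final UUID id = row.getUUID("id");
    final ByteBuffer data = row.getBytes("data"); // no second, random read
    inFlight.add(replayExecutor.submit(new Runnable()
    {
        public void run() { replayBatch(id, data); }
    }));
}
FBUtilities.waitOnFutures(inFlight);
{code}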

To avoid iterating over the already replayed batches (tombstones), we could 
purge the replayed batches directly from the memtable (although I'd need to 
see a benchmark proving it's worth doing first).

Other stuff, in no particular order:

- making the table COMPACT STORAGE limits our flexibility wrt future batchlog 
schema changes, so -1 on that
- we should probably rate-limit batchlog replay w/ RateLimiter
- +1 on moving forceBatchlogReplay() to batchlogTasks as well (this was an 
omission from CASSANDRA-6079, ninja-committed it in 
7e057f504613e68082a76642983d353f3f0400fb)
- +1 on running cleanup() on startup
- -1 on deriving the write time from the UUID for TTL calculation (the system 
time can actually jump, but uuids will always increase, which is not what we 
want for TTL calc; see the sketch below)
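
On that last point, to make it concrete: a timeuuid's embedded timestamp is 
forced to be monotonically increasing, so after a backwards clock jump it can 
run ahead of the actual wall-clock write time. The conversion itself is 
standard:

{code}
// UUID.timestamp() is 100ns intervals since the UUID epoch (1582-10-15);
// uuid generation only ever lets it increase, even if the system clock
// jumps backwards -- so this is not reliably "when the write happened".
static long unixMillisFromTimeUUID(java.util.UUID uuid)
{
    long UUID_EPOCH_OFFSET = 0x01b21dd213814000L; // 100ns units from 1582-10-15 to 1970-01-01
    return (uuid.timestamp() - UUID_EPOCH_OFFSET) / 10000;
}
{code}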

In general:

I like some of the suggested changes, and would like to see the ones that are 
possible w/out the schema change implemented first (and many of them are). 
I'm strongly against altering the batchlog schema unless the benchmarks can 
clearly prove that the version with the partitioned schema is significantly 
better than what we could come up with without altering the schema. We should 
also avoid any potentially brittle/breaking extra migration code during the 
already slow-ish startup.

Could you give it a try, [~m0nstermind]? Namely,
- replaying several mutations read in replayAllFailedBatches() simultaneously 
instead of 1-by-1
- avoiding the random read by passing the read blob to replayBatch()
- measuring the effect of purging the replayed batch from the memtable (when 
not read from disk)

If this gives us most of the win of the version with the altered schema, then 
I'll be satisfied with just those changes. If the benchmarks say that we have 
a lot of extra relative and absolute efficiency to gain from the schema 
change, then I won't argue with the data.

> More efficient BatchlogManager
> ------------------------------
>
>                 Key: CASSANDRA-6134
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6134
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Oleg Anastasyev
>            Priority: Minor
>         Attachments: BatchlogManager.txt
>
>
> As we discussed earlier in CASSANDRA-6079, this is the new BatchlogManager.
> It stores batch records in 
> {code}
> CREATE TABLE batchlog (
>   id_partition int,
>   id timeuuid,
>   data blob,
>   PRIMARY KEY (id_partition, id)
> ) WITH COMPACT STORAGE AND
>   CLUSTERING ORDER BY (id DESC)
> {code}
> where id_partition is the minute-since-epoch of the id uuid. 
> So when it scans for batches to replay, it scans within a single partition 
> for a slice of ids from the last processed date till now minus the write 
> timeout.
> So neither a full batchlog CF scan nor lots of random reads are made on a 
> normal cycle. 
> Other improvements:
> 1. It runs every 1/2 of the write timeout and replays all batches written 
> within 0.9 * write timeout from now. This way we ensure that batched 
> updates will be replayed by the moment the client times out on the 
> coordinator.
> 2. It submits all mutations from a single batch in parallel (like 
> StorageProxy does). The old implementation replayed them one by one, so a 
> client could see half-applied batches in the CF for a long time (depending 
> on the size of the batch).
> 3. It fixes a subtle race bug with incorrect hint TTL calculation



