[ https://issues.apache.org/jira/browse/CASSANDRA-6134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13800851#comment-13800851 ]
Aleksey Yeschenko commented on CASSANDRA-6134:
----------------------------------------------

bq. It runs every 1/2 of write timeout and replays all batches written within 0.9 * write timeout from now. This way we ensure that batched updates will be replayed by the moment the client times out from the coordinator.

This isn't what we want to ensure, though. The current timeout (write timeout * 2) is there to account for the maximum batchlog write timeout plus the actual data write timeout. Avoiding extra mutations is IMO more important than having less delay in the failure scenario (and slow writes would happen more often than outright failures). And you definitely don't want to hammer an already slow node with twice the load. So -1 on this particular change.

bq. It submits all mutations from a single batch in parallel (like StorageProxy does). The old implementation played them one-by-one, so the client can see half-applied batches in the CF for a long time (depending on the size of the batch).

This is fine. But yeah, we could/should parallelize batchlog replay more (can be done w/out modifying the schema).

bq. It fixes a subtle racing bug with incorrect hint TTL calculation

Care to elaborate? I think there was a tricky open bug related to this, but I can't find the JIRA #.

To avoid random reads, we could read the mutation blob in replayAllFailedBatches() and pass it to replayBatch() (I thought we were already doing that). To make replay more async, as you suggest, we could read several batches and initiate their replay asynchronously instead of replaying them one by one (but w/ RateLimiter in place). To avoid iterating over the already replayed batches (tombstones), we could purge the replayed batches directly from the memtable (although I'd need to see a benchmark proving that it's worth doing first).
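The async-but-throttled replay suggested above could look roughly like this. This is a hypothetical sketch, not code from the patch or from Cassandra's actual BatchlogManager: class and method names are illustrative, and the limiter is a minimal stand-in for Guava's RateLimiter.

{code}
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch: initiate replay of several already-read batches asynchronously,
// rate-limiting the *initiation* so a recovering cluster isn't hammered.
public class ThrottledReplay {

    // Minimal blocking rate limiter (stand-in for Guava's RateLimiter):
    // allows at most permitsPerSecond acquisitions per second.
    static class SimpleRateLimiter {
        private final long intervalNanos;
        private long nextFreeAt = System.nanoTime();

        SimpleRateLimiter(double permitsPerSecond) {
            this.intervalNanos = (long) (1e9 / permitsPerSecond);
        }

        synchronized void acquire() throws InterruptedException {
            long now = System.nanoTime();
            if (nextFreeAt > now)
                TimeUnit.NANOSECONDS.sleep(nextFreeAt - now);
            nextFreeAt = Math.max(nextFreeAt, now) + intervalNanos;
        }
    }

    /** Replays n already-read batch blobs asynchronously; returns how many completed. */
    static int replayAll(int n, double batchesPerSecond) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(4);
        SimpleRateLimiter limiter = new SimpleRateLimiter(batchesPerSecond);
        AtomicInteger replayed = new AtomicInteger();
        List<Future<?>> futures = new ArrayList<>();
        for (int i = 0; i < n; i++) {
            limiter.acquire();                                   // throttle initiation, not completion
            futures.add(pool.submit(replayed::incrementAndGet)); // stand-in for replayBatch(blob)
        }
        for (Future<?> f : futures)
            f.get();                                             // wait before deleting batchlog rows
        pool.shutdown();
        return replayed.get();
    }

    public static void main(String[] args) throws Exception {
        System.out.println("replayed " + replayAll(10, 1000.0) + " batches");
    }
}
{code}

The key point is that the blob is read once in the scan loop and handed straight to the replay task, so no second random read per batch is needed.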
Other stuff, in no particular order:
- making the table COMPACT STORAGE limits our flexibility wrt future batchlog schema changes, so -1 on that
- we should probably rate-limit batchlog replay w/ RateLimiter
- +1 on moving forceBatchlogReplay() to batchlogTasks as well (this was an omission from CASSANDRA-6079, ninja-committed it in 7e057f504613e68082a76642983d353f3f0400fb)
- +1 on running cleanup() on startup
- -1 on using writeTime from the UUID for TTL calculation (the time can actually jump, but uuids will always increase, and that's not what we want for TTL calc)

In general: I like some of the suggested changes, and would like to see the ones that are possible w/out the schema change implemented first (and many of them can be). I'm strongly against altering the batchlog schema unless the benchmarks can clearly prove that the version with the partitioned schema is significantly better than what we could come up with without altering the schema. We should avoid any potentially brittle/breaking extra migration code on the already slow-ish startup.

Could you give it a try, [~m0nstermind]? Namely:
- replaying several mutations read in replayAllFailedBatches() simultaneously instead of 1-by-1
- avoiding the random read by passing the read blob to replayBatch()
- measuring the effect of purging the replayed batch from the memtable (when not read from the disk)

If this gives us most of the win of a version with the altered schema, then I'll be satisfied with just those changes. If benchmarks say that we have a lot of extra relative and absolute efficiency to gain from the schema change, then I won't argue with the data.

> More efficient BatchlogManager
> ------------------------------
>
>                 Key: CASSANDRA-6134
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-6134
>             Project: Cassandra
>          Issue Type: Improvement
>            Reporter: Oleg Anastasyev
>            Priority: Minor
>         Attachments: BatchlogManager.txt
>
>
> As we discussed earlier in CASSANDRA-6079 this is the new BatchlogManager.
> It stores batch records in
> {code}
> CREATE TABLE batchlog (
>   id_partition int,
>   id timeuuid,
>   data blob,
>   PRIMARY KEY (id_partition, id)
> ) WITH COMPACT STORAGE AND
>   CLUSTERING ORDER BY (id DESC)
> {code}
> where id_partition is the minute-since-epoch of the id uuid.
> So when it scans for batches to replay, it scans within a single partition for
> a slice of ids from the last processed date till now minus the write timeout.
> So no full batchlog CF scan and no flood of random reads on a normal cycle.
> Other improvements:
> 1. It runs every 1/2 of the write timeout and replays all batches written
> within 0.9 * write timeout from now. This way we ensure that batched updates
> will be replayed by the moment the client times out from the coordinator.
> 2. It submits all mutations from a single batch in parallel (like
> StorageProxy does). The old implementation played them one-by-one, so the
> client can see half-applied batches in the CF for a long time (depending on
> the size of the batch).
> 3. It fixes a subtle racing bug with incorrect hint TTL calculation

--
This message was sent by Atlassian JIRA
(v6.1#6144)
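For reference, the minute-based partition key in the quoted schema could be derived from the timeuuid's embedded timestamp along these lines. This is a hypothetical helper, not code from the attached patch; the fromUnixMillis builder exists only so the conversion can be exercised.

{code}
import java.util.UUID;

// Sketch: id_partition = minutes since the Unix epoch, taken from the
// version-1 (time-based) UUID's embedded timestamp.
public class BatchlogPartitioner {
    // Offset between the UUID epoch (1582-10-15) and the Unix epoch,
    // in 100-nanosecond units.
    private static final long UUID_EPOCH_OFFSET = 0x01b21dd213814000L;

    /** Partition key: minute-since-epoch of a version-1 timeuuid. */
    public static int partitionOf(UUID timeuuid) {
        long hundredNanos = timeuuid.timestamp() - UUID_EPOCH_OFFSET;
        long unixMillis = hundredNanos / 10_000;
        return (int) (unixMillis / 60_000);
    }

    /** Builds a version-1 UUID from a unix millisecond timestamp (illustration only). */
    public static UUID fromUnixMillis(long millis) {
        long ts = millis * 10_000 + UUID_EPOCH_OFFSET;  // 100ns units, UUID epoch
        long msb = (ts << 32)                           // time_low
                 | ((ts >> 16) & 0xFFFF0000L)           // time_mid
                 | 0x1000L                              // version 1
                 | ((ts >> 48) & 0x0FFFL);              // time_hi
        return new UUID(msb, 0x8000000000000000L);      // variant bits in the lsb
    }

    public static void main(String[] args) {
        UUID u = fromUnixMillis(120_000L);  // two minutes after the epoch
        System.out.println(partitionOf(u)); // prints 2
    }
}
{code}

All batches written in the same minute then land in the same partition, which is what makes the single-partition slice scan in the description possible.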