[ https://issues.apache.org/jira/browse/PHOENIX-7846?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Tanuj Khurana updated PHOENIX-7846:
-----------------------------------
    Description: 
Problem:

ReplicationLog maintains a currentBatch which accumulates every successful 
append and clears only on an explicit sync() call. On writer rotation 
mid-batch, replayCurrentBatch() re-appends every record in the batch onto the 
new writer. For workloads with many appends between explicit syncs, the replay 
cost scales linearly with batch size. 
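
In sketch form (the Record type, the writer field, and the simplified
signatures below are illustrative placeholders, not the actual Phoenix code):

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Minimal sketch of the current shape. Record and the LogFileWriter
// stub are placeholders, not the real Phoenix types.
class ReplicationLogSketch {
    interface Record {}
    interface LogFileWriter {
        void append(Record record) throws IOException;
        void sync() throws IOException;
    }

    private final List<Record> currentBatch = new ArrayList<>();
    private LogFileWriter writer;

    void append(Record record) throws IOException {
        writer.append(record);    // may internally flush a full block to HDFS
        currentBatch.add(record); // but the batch only shrinks on explicit sync()
    }

    void sync() throws IOException {
        writer.sync();
        currentBatch.clear();     // today, the only point where the batch is trimmed
    }

    // On rotation, every accumulated record is re-appended to the new writer,
    // even those already durable in completed blocks of the old file.
    private void replayCurrentBatch(LogFileWriter newWriter) throws IOException {
        for (Record record : currentBatch) {
            newWriter.append(record);
        }
    }
}
{code}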

There is a pre-existing implicit durability point: LogFileFormatWriter.append() 
checks the in-memory block size after each append and, when the block hits 
maxBlockSize (default 1 MB), triggers an internal sync() that flushes the block 
to HDFS. Records up to that point are durable. However, this information does 
not propagate back to ReplicationLog.append(), so currentBatch keeps growing 
past these durability points. 
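
A minimal sketch of that durability point, assuming a simple byte counter for
the in-memory block (writeToBlock() and the buffer fields are illustrative;
the 1 MB default and the size check after each append are from the
description above):

{code:java}
import java.io.IOException;

class LogFileFormatWriterSketch {
    interface Record {
        long serializedSize();
    }

    private static final long DEFAULT_MAX_BLOCK_SIZE = 1024 * 1024; // 1 MB
    private final long maxBlockSize = DEFAULT_MAX_BLOCK_SIZE;
    private long currentBlockSize = 0;

    void append(Record record) throws IOException {
        writeToBlock(record);
        currentBlockSize += record.serializedSize();
        if (currentBlockSize >= maxBlockSize) {
            sync();               // flushes the completed block to HDFS
            currentBlockSize = 0; // records so far are durable, but the
        }                         // caller never learns about it
    }

    private void writeToBlock(Record record) throws IOException { /* buffer it */ }

    private void sync() throws IOException { /* flush the block to HDFS */ }
}
{code}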

For example, with a 10k-record batch (1 KB records, 1 MB block size): blocks 
fill every ~1000 records, but currentBatch grows to 10,000. Rotation at record 
9,500 replays all 9,500 records, even though records 1–9,000 are already 
durable in completed blocks on the old writer's file.
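
The arithmetic behind these numbers, using ~1,000 records per block
(1 MB / 1 KB) as in the example:

{code:java}
class ReplayCostExample {
    public static void main(String[] args) {
        long recordsPerBlock = 1_000; // ~1 MB block / 1 KB record
        long rotationPoint = 9_500;   // rotation happens at record 9,500

        long completedBlocks = rotationPoint / recordsPerBlock;  // 9
        long durableRecords = completedBlocks * recordsPerBlock; // 9,000
        long replayWithoutFix = rotationPoint;                   // 9,500
        long replayWithFix = rotationPoint - durableRecords;     // 500

        System.out.printf("durable=%d, replay before=%d, replay after=%d%n",
                durableRecords, replayWithoutFix, replayWithFix);
    }
}
{code}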

Solution:

Change LogFile.Writer.append() to return a boolean indicating whether a 
block-full sync occurred. Propagate this signal through LogFileFormatWriter → 
LogFileWriter → ReplicationLog.append(). When the signal is true, clear 
currentBatch: all records up to this point are durable and do not need replay.
After this change, replay on rotation is proportional to the last partial block 
(bounded by maxBlockSize), not the full inter-sync window. Using the same 
example: rotation at record 9,500 replays only ~500 records instead of 9,500.
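
A sketch of the proposed propagation; only the boolean return on append() and
the clearing of currentBatch reflect the change described above, the
surrounding types are simplified placeholders:

{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

class ReplicationLogAfterSketch {
    interface Record {}
    interface LogFileWriter {
        /** @return true if this append filled a block and triggered an internal sync. */
        boolean append(Record record) throws IOException;
    }

    private final List<Record> currentBatch = new ArrayList<>();
    private LogFileWriter writer;

    void append(Record record) throws IOException {
        boolean blockSynced = writer.append(record); // signal propagated up the chain
        currentBatch.add(record);
        if (blockSynced) {
            // Every record appended so far, including this one, is durable in a
            // completed block; rotation no longer needs to replay any of them.
            currentBatch.clear();
        }
    }
}
{code}

With this in place, replayCurrentBatch() touches at most the records of the
last partial block.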

No change to durability semantics: this only leverages an existing durability 
point that was previously not propagated.


> Bound rotation replay cost for large commit batches
> ---------------------------------------------------
>
>                 Key: PHOENIX-7846
>                 URL: https://issues.apache.org/jira/browse/PHOENIX-7846
>             Project: Phoenix
>          Issue Type: Sub-task
>            Reporter: Tanuj Khurana
>            Assignee: Tanuj Khurana
>            Priority: Major
>



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
