Re: [PR] PHOENIX-7878 CDC perf improvement - skip redundant cell versions on data table scans [phoenix]

via GitHub Wed, 03 Jun 2026 16:38:26 -0700


palashc commented on code in PR #2493:
URL: https://github.com/apache/phoenix/pull/2493#discussion_r3352418501



##########
phoenix-core/src/it/java/org/apache/phoenix/end2end/CDCQueryIT.java:
##########
@@ -434,6 +434,86 @@ public void testSelectCDC() throws Exception {
     }
   }
 
+  /**
+   * Exercises CDC PRE/POST image reconstruction over a row with a deep stack of cell versions
+   * interleaved with column-level nulls, full-row deletes, consecutive deletes and re-inserts. This
+   * specifically stresses the server-side version pruning applied to the data table scan: the PRE
+   * and POST images are recomputed independently via SCN queries on the data table and compared
+   * against the CDC output, so any over-pruning of needed versions surfaces as a mismatch.
+   */
+  @Test
+  public void testSelectCDCPreAndPostImageWithVersionPruning() throws Exception {
+    String cdcName, cdc_sql;
+    String schemaName = getSchemaName();
+    String tableName = getTableOrViewName(schemaName);
+    String datatableName = tableName;
+    try (Connection conn = newConnection()) {
+      createTable(conn,
+        "CREATE TABLE  " + tableName + " (" + (multitenant ? "TENANT_ID CHAR(5) NOT NULL, " : "")
+          + "k INTEGER NOT NULL, v1 INTEGER, v2 INTEGER, v3 INTEGER, B.vb INTEGER, "
+          + "CONSTRAINT PK PRIMARY KEY " + (multitenant ? "(TENANT_ID, k) " : "(k)") + ")",
+        encodingScheme, multitenant, tableSaltBuckets, false, null);
+      if (forView) {
+        String viewName = getTableOrViewName(schemaName);
+        createTable(conn, "CREATE VIEW " + viewName + " AS SELECT * FROM " + tableName,
+          encodingScheme);
+        tableName = viewName;
+      }
+      cdcName = getCDCName();
+      cdc_sql = "CREATE CDC " + cdcName + " ON " + tableName;
+      createCDC(conn, cdc_sql, encodingScheme);
+    }
+
+    String tenantId = multitenant ? "1000" : null;
+    String[] tenantids = { tenantId };
+    if (multitenant) {
+      tenantids = new String[] { tenantId, "2000" };
+    }
+
+    long startTS = System.currentTimeMillis();
+    List<ChangeRow> changes =
+      generateChangesForPrePostImage(startTS, tenantids, tableName, COMMIT_SUCCESS);
+    long currentTime = System.currentTimeMillis();
+    long endTS = changes.get(changes.size() - 1).getTimestamp() + 1;
+    if (endTS > currentTime) {
+      Thread.sleep(endTS - currentTime);
+    }
+
+    Map<String, String> dataColumns = new TreeMap<String, String>() {
+      {
+        put("V1", "INTEGER");
+        put("V2", "INTEGER");
+        put("V3", "INTEGER");
+        put("B.VB", "INTEGER");
+      }
+    };
+    String cdcFullName = SchemaUtil.getTableName(schemaName, cdcName);
+    try (Connection conn = newConnection(tenantId)) {
+      // For debug: uncomment to see the exact results logged to console.
+      dumpCDCResults(conn, cdcName, new TreeMap<String, String>() {

Review Comment:
   Was this meant to be commented out? The comment above it says to uncomment the call for debugging, but the call is live.



##########
phoenix-core-server/src/main/java/org/apache/phoenix/coprocessor/CDCGlobalIndexRegionScanner.java:
##########
@@ -123,21 +128,60 @@ protected Scan prepareDataTableScan(Collection<byte[]> dataRowKeys) throws IOExc
     ) {
       return null;
     }
-    // TODO: Get Timerange from the start row and end row of the index scan object
-    // and set it in the datatable scan object.
-    // if (scan.getStartRow().length == 8) {
-    // startTimeRange = PLong.INSTANCE.getCodec().decodeLong(
-    // scan.getStartRow(), 0, SortOrder.getDefault());
-    // }
-    // if (scan.getStopRow().length == 8) {
-    // stopTimeRange = PLong.INSTANCE.getCodec().decodeLong(
-    // scan.getStopRow(), 0, SortOrder.getDefault());
-    // }
     Scan dataScan = prepareDataTableScan(dataRowKeys, true);
     if (dataScan == null) {
       return null;
     }
-    return CDCUtil.setupScanForCDC(dataScan);
+    CDCUtil.setupScanForCDC(dataScan);
+    Map<ImmutableBytesPtr, long[]> timestampMap = buildDataRowTimestampMap(dataRowKeys);

Review Comment:
   Can we avoid building this timestamp map in every task and precompute it 
beforehand? Then again, it may be fine, since the number of tasks is usually 
small (it depends on the number of regions involved and the number of row keys).
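
   To make the precompute option concrete, here is a rough sketch only, not a claim about how `buildDataRowTimestampMap` works. Everything in it is hypothetical: the class `TimestampMapSketch`, the `Entry` type, both method names, the use of `ByteBuffer` as a stand-in for `ImmutableBytesPtr`, and the assumption that each `long[]` holds a {min, max} timestamp pair. The idea is to build the map once for the whole batch and let each task pick out only the entries for its own row keys.

```java
import java.nio.ByteBuffer;
import java.util.Collection;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch; none of these names exist in Phoenix. */
final class TimestampMapSketch {

  /** Stand-in for one index row: a data row key plus a change timestamp. */
  static final class Entry {
    final byte[] dataRowKey;
    final long timestamp;

    Entry(byte[] dataRowKey, long timestamp) {
      this.dataRowKey = dataRowKey;
      this.timestamp = timestamp;
    }
  }

  /**
   * Built once per batch. Maps each data row key to an assumed
   * {minTimestamp, maxTimestamp} pair over all entries for that key.
   */
  static Map<ByteBuffer, long[]> precompute(Collection<Entry> entries) {
    Map<ByteBuffer, long[]> all = new HashMap<>();
    for (Entry e : entries) {
      // ByteBuffer compares and hashes by content, unlike a raw byte[].
      ByteBuffer key = ByteBuffer.wrap(e.dataRowKey);
      long[] range = all.get(key);
      if (range == null) {
        all.put(key, new long[] { e.timestamp, e.timestamp });
      } else {
        range[0] = Math.min(range[0], e.timestamp);
        range[1] = Math.max(range[1], e.timestamp);
      }
    }
    return all;
  }

  /** Called per task: selects the precomputed ranges for this task's row keys only. */
  static Map<ByteBuffer, long[]> forTask(Map<ByteBuffer, long[]> all,
    Collection<byte[]> taskRowKeys) {
    Map<ByteBuffer, long[]> subset = new HashMap<>();
    for (byte[] rowKey : taskRowKeys) {
      ByteBuffer key = ByteBuffer.wrap(rowKey);
      long[] range = all.get(key);
      if (range != null) {
        subset.put(key, range);
      }
    }
    return subset;
  }
}
```

   Whether this is worth it depends on the trade-off already mentioned: with only a handful of tasks, the per-task build is probably cheap enough that sharing a precomputed map adds more plumbing than it saves.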



