[ https://issues.apache.org/jira/browse/PHOENIX-7672?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Himanshu Gwalani updated PHOENIX-7672:
--------------------------------------
Description: 
As of now, if a replication log file is not closed gracefully and a reader starts reading it before the HDFS lease on the file times out, the reader can read partial data. The replication replay must therefore ensure either that the file is closed or, if it is not, acquire the lease and then validate the header and trailer of the file before processing it.

Pseudo code (by Andrew Purtell)
{code:java}
isClosed = ((LeaseRecoverable) fs).isFileClosed(filePath); // May throw ClassCastException - bad!
if (isClosed) {
    // This will assert that the file has a valid header and trailer.
    reader = createLogReader(filePath); // may throw IOE
} else {
    // Recover lease via custom method using LeaseRecoverable#recoverLease.
    // Wait until we get the lease with retries.
    recoverLease(filePath); // may throw IOE
    if (fs.getFileStatus(filePath).getLen() > 0) { // may throw IOE
        try {
            // This will assert that the file has a valid header and trailer.
            reader = createLogReader(filePath); // may throw IOE
        } catch (IOException e) { // how about MissingTrailerException to give clarity?
            // check some exception details to confirm it was a trailer issue.
            // if not a trailer issue, just rethrow the exception. otherwise,
            // should we continue even though the file is truncated? we are never going
            // to get that truncated data back, whatever it was. ignoring the whole
            // file converts potential data loss into certain data loss.
            LOG.warn("Replication log file {} is missing its trailer, continuing", filePath);
            reader = createLogReader(filePath, false); // may throw another IOE
        }
    } else {
        // Ignore the file.
        LOG.info("Ignoring zero length replication log file {}", filePath);
    }
}
// Clean up. Remove the replication log file at filePath.
{code}
After PHOENIX-7565A, the low level reader will throw appropriate exceptions, and those need to be handled in replication log replay as part of this Jira (along with the acquire-lease logic shown in the pseudo code above).
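As a rough illustration of the recoverLease(filePath) step in the pseudo code, here is a minimal sketch of a retrying lease recovery helper. It assumes the underlying FileSystem implements Hadoop's LeaseRecoverable interface (as DistributedFileSystem does in recent Hadoop releases); the class name, retry count, and retry interval below are placeholders, not existing Phoenix code or configuration.
{code:java}
import java.io.IOException;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.LeaseRecoverable;
import org.apache.hadoop.fs.Path;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

public class ReplicationLogLeaseRecovery {
  private static final Logger LOG = LoggerFactory.getLogger(ReplicationLogLeaseRecovery.class);

  // Placeholder values, not actual Phoenix configuration.
  private static final int MAX_ATTEMPTS = 10;
  private static final long RETRY_INTERVAL_MS = 1000L;

  /**
   * Recover the HDFS lease on an unclosed replication log file, retrying until
   * recovery completes or the attempt budget is exhausted.
   */
  public static void recoverLease(FileSystem fs, Path filePath) throws IOException {
    if (!(fs instanceof LeaseRecoverable)) {
      // Fail with a clear message instead of the bare ClassCastException called out above.
      throw new IOException("FileSystem " + fs.getClass().getName()
          + " does not support lease recovery for " + filePath);
    }
    LeaseRecoverable recoverable = (LeaseRecoverable) fs;
    for (int attempt = 1; attempt <= MAX_ATTEMPTS; attempt++) {
      // recoverLease() returns true once the file is closed and the lease is released.
      if (recoverable.recoverLease(filePath)) {
        LOG.info("Recovered lease on replication log file {} after {} attempt(s)", filePath, attempt);
        return;
      }
      LOG.info("Lease recovery on {} not complete yet (attempt {}/{}), retrying",
          filePath, attempt, MAX_ATTEMPTS);
      try {
        Thread.sleep(RETRY_INTERVAL_MS);
      } catch (InterruptedException ie) {
        Thread.currentThread().interrupt();
        throw new IOException("Interrupted while recovering lease on " + filePath, ie);
      }
    }
    throw new IOException("Could not recover lease on " + filePath
        + " after " + MAX_ATTEMPTS + " attempts");
  }
}
{code}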
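A similar sketch of how the replay path could consume the reader-side exceptions once they exist. LogReader, createLogReader(...) and MissingTrailerException are the names proposed in this Jira, not existing Phoenix APIs, and the method shown is illustrative only.
{code:java}
// Hypothetical sketch: LogReader, createLogReader(Path), createLogReader(Path, boolean)
// and MissingTrailerException are proposed in this Jira, not existing Phoenix APIs.
LogReader openReaderForReplay(FileSystem fs, Path filePath) throws IOException {
  if (fs.getFileStatus(filePath).getLen() == 0) {
    // Nothing was ever written to this file; skip it.
    LOG.info("Ignoring zero length replication log file {}", filePath);
    return null;
  }
  try {
    // Validates both the header and the trailer.
    return createLogReader(filePath);
  } catch (MissingTrailerException e) {
    // The truncated tail is gone either way; skipping the whole file would turn
    // potential data loss into certain data loss, so read what is there.
    LOG.warn("Replication log file {} is missing its trailer, continuing", filePath);
    return createLogReader(filePath, false); // false = skip trailer validation
  }
}
{code}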
was:
As of now, while initializing the ReplicationLogReader, trailer validation is optional ([code reference|https://github.com/apache/phoenix/blob/295848b44600689c626e404fd7a37e84f3c14d02/phoenix-core-server/src/main/java/org/apache/phoenix/replication/log/LogFileFormatReader.java#L58-L77]) and there is no validation of the header (the reader seeks to the first row, which throws an IOException if the header is missing, but does not validate whether the header is as expected). Also, the writer currently writes the header lazily (i.e. on receiving the first mutation for the log file). This can lead to empty (zero length) log files on the target cluster if the RS does not receive any mutations within the log file rotation time (1 min). This would also make it difficult

DOD:
1. The source writer must add the header as soon as the file is created (instead of waiting for a new mutation)
2. While initializing the ReplicationLogReader, it should validate that the file has a valid header and trailer
3. Another initialization method that optionally allows skipping the trailer validation (to deal with scenarios where the RS was not able to close the file successfully)
4. Throw MissingTrailerException / InvalidTrailerException in case of a missing/corrupt trailer (and similar for the header, i.e. MissingHeaderException)


> ReplicationLogReplay should acquire lease on unclosed file before processing them
> ----------------------------------------------------------------------------------
>
> Key: PHOENIX-7672
> URL: https://issues.apache.org/jira/browse/PHOENIX-7672
> Project: Phoenix
> Issue Type: Sub-task
> Reporter: Himanshu Gwalani
> Assignee: Himanshu Gwalani
> Priority: Major
> Fix For: PHOENIX-7562-feature
>


--
This message was sent by Atlassian Jira
(v8.20.10#820010)