oneby-wang commented on PR #4607:
URL: https://github.com/apache/bookkeeper/pull/4607#issuecomment-3228330508

   Hi guys, I think I figured it out. The key point is the following code in this pull request, although setting logfileHeader to zeros is also very important; I'll explain why later.
   ```java
   if (header.ledgersCount == 0) {
       throw new IOException("No ledgers map found in entryLogId " + entryLogId + ", do scan to double confirm");
   }
   ```
   I read the journal redo code and found that BookKeeper's replay creates a new entryLog file instead of reusing the old one after a bookie crash. So the old entryLog file is left with an incorrect logFileHeader (never updated by the BufferedLogChannel.appendLedgersMap() method).
   In DefaultEntryLogger, if the getEntryLogMetadata() method returns an empty entryLogMeta, the GC thread will delete the entryLog file, even though it may still contain ledgers that must not be deleted.
   ```java
   protected void extractMetaFromEntryLogs() throws EntryLogMetadataMapException {
        // ...
        // Read through the entry log file and extract the entry log meta
        EntryLogMetadata entryLogMeta = entryLogger.getEntryLogMetadata(entryLogId, throttler);
        removeIfLedgerNotExists(entryLogMeta);
        if (entryLogMeta.isEmpty()) {
            // This means the entry log is not associated with any active
            // ledgers anymore.
            // We can remove this entry log file now.
            LOG.info("Deleting entryLogId {} as it has no active ledgers!", entryLogId);
            if (removeEntryLog(entryLogId)) {
                gcStats.getReclaimedSpaceViaDeletes().addCount(entryLogMeta.getTotalSize());
            } else {
                gcStats.getReclaimFailedToDelete().inc();
            }
        }
   ```
   How does this happen? Here is the old, unfixed code. If logfileHeader.ledgersMapOffset != 0 (a random value, since logfileHeader is uninitialized), logfileHeader.ledgersCount == 0 (also a random value from the uninitialized header), and the ledgersMapSize read inside the while loop happens to be negative (some lucky value at a random location in the entryLog file, interesting huh), then the while loop is broken out of, and the empty entryLogMeta is returned to the caller. I think we should throw an IOException when ledgersMapSize < 0 instead of breaking out of the while loop.
   ```java
   EntryLogMetadata extractEntryLogMetadataFromIndex(long entryLogId) throws IOException {
       Header header = getHeaderForLogId(entryLogId);

       if (header.version < HEADER_V1) {
           throw new IOException("Old log file header without ledgers map on entryLogId " + entryLogId);
       }

       if (header.ledgersMapOffset == 0L) {
           // The index was not stored in the log file (possibly because the bookie crashed before flushing it)
           throw new IOException("No ledgers map index found on entryLogId " + entryLogId);
       }

       if (LOG.isDebugEnabled()) {
           LOG.debug("Recovering ledgers maps for log {} at offset: {}", entryLogId, header.ledgersMapOffset);
       }

       BufferedReadChannel bc = getChannelForLogId(entryLogId);

       // There can be multiple entries containing the various components of the serialized ledgers map
       long offset = header.ledgersMapOffset;
       EntryLogMetadata meta = new EntryLogMetadata(entryLogId);

       final int maxMapSize = LEDGERS_MAP_HEADER_SIZE + LEDGERS_MAP_ENTRY_SIZE * LEDGERS_MAP_MAX_BATCH_SIZE;
       ByteBuf ledgersMap = allocator.directBuffer(maxMapSize);

       try {
           while (offset < bc.size()) {
               // Read ledgers map size
               sizeBuffer.get().clear();
               bc.read(sizeBuffer.get(), offset);

               int ledgersMapSize = sizeBuffer.get().readInt();
               if (ledgersMapSize <= 0) {
                   break;
               }
               // Read the index into a buffer
               ledgersMap.clear();
               bc.read(ledgersMap, offset + 4, ledgersMapSize);

               // Discard ledgerId and entryId
               long lid = ledgersMap.readLong();
               if (lid != INVALID_LID) {
                   throw new IOException("Cannot deserialize ledgers map from ledger " + lid);
               }

               long entryId = ledgersMap.readLong();
               if (entryId != LEDGERS_MAP_ENTRY_ID) {
                   throw new IOException("Cannot deserialize ledgers map from entryId " + entryId);
               }

               // Read the number of ledgers in the current entry batch
               int ledgersCount = ledgersMap.readInt();

               // Extract all (ledger,size) tuples from buffer
               for (int i = 0; i < ledgersCount; i++) {
                   long ledgerId = ledgersMap.readLong();
                   long size = ledgersMap.readLong();

                   if (LOG.isDebugEnabled()) {
                       LOG.debug("Recovering ledgers maps for log {} -- Found ledger: {} with size: {}",
                               entryLogId, ledgerId, size);
                   }
                   meta.addLedgerSize(ledgerId, size);
               }
               if (ledgersMap.isReadable()) {
                   throw new IOException("Invalid entry size when reading ledgers map");
               }

               // Move to next entry, if any
               offset += ledgersMapSize + 4;
           }
       } catch (IndexOutOfBoundsException e) {
           throw new IOException(e);
       } finally {
           ReferenceCountUtil.release(ledgersMap);
       }

       if (meta.getLedgersMap().size() != header.ledgersCount) {
           throw new IOException("Not all ledgers were found in ledgers map index. expected: " + header.ledgersCount
               + " -- found: " + meta.getLedgersMap().size() + " -- entryLogId: " + entryLogId);
       }

       return meta;
   }
   ```
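   To make the failure mode concrete, here is a minimal standalone simulation (my own illustration, not BookKeeper's actual classes; the buffer contents and offsets are invented): a garbage header points at a bogus offset, the int there happens to be negative, the old break path silently yields empty metadata, and the proposed throw instead forces the caller to fall back to scanning.

   ```java
   import java.io.IOException;
   import java.nio.ByteBuffer;
   import java.util.HashMap;
   import java.util.Map;

   public class EmptyMetaDemo {
       // Old behavior: a non-positive ledgersMapSize breaks the loop, and an
       // empty ledgers map is returned as if it were valid metadata.
       static Map<Long, Long> extractOldBehavior(ByteBuffer log, long ledgersMapOffset) {
           Map<Long, Long> ledgers = new HashMap<>();
           long offset = ledgersMapOffset;
           while (offset < log.capacity()) {
               int ledgersMapSize = log.getInt((int) offset);
               if (ledgersMapSize <= 0) {
                   break; // the bug: caller cannot tell "empty" from "corrupted"
               }
               // ... would deserialize (ledgerId, size) tuples here ...
               offset += ledgersMapSize + 4;
           }
           return ledgers;
       }

       // Proposed behavior: treat a non-positive size as corruption and throw,
       // so getEntryLogMetadata() falls back to scanning the whole entry log.
       static Map<Long, Long> extractProposed(ByteBuffer log, long ledgersMapOffset) throws IOException {
           if (log.getInt((int) ledgersMapOffset) <= 0) {
               throw new IOException("Invalid ledgers map size, fall back to scanning");
           }
           return new HashMap<>();
       }

       public static void main(String[] args) {
           ByteBuffer log = ByteBuffer.allocate(64);
           log.putInt(16, -1);    // the "lucky" negative value at the bogus offset
           long bogusOffset = 16; // garbage, non-zero ledgersMapOffset

           // Old code: empty metadata, so the GC thread would delete the file.
           System.out.println("old returns empty: " + extractOldBehavior(log, bogusOffset).isEmpty());

           boolean threw = false;
           try {
               extractProposed(log, bogusOffset);
           } catch (IOException e) {
               threw = true;
           }
           System.out.println("proposed throws: " + threw);
       }
   }
   ```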
   If the extractEntryLogMetadataFromIndex() method throws a Throwable, the getEntryLogMetadata() method falls back to the extractEntryLogMetadataByScanning() method, which skips the logfileHeader and scans the rest of the entryLog file.
   ```java
   public EntryLogMetadata getEntryLogMetadata(long entryLogId, AbstractLogCompactor.Throttler throttler)
       throws IOException {
       // First try to extract the EntryLogMetadata from the index, if there's no index then fallback to scanning the
       // entry log
       try {
           return extractEntryLogMetadataFromIndex(entryLogId);
       } catch (FileNotFoundException fne) {
           LOG.warn("Cannot find entry log file {}.log : {}", Long.toHexString(entryLogId), fne.getMessage());
           throw fne;
       } catch (Throwable e) {
           LOG.info("Failed to get ledgers map index from: {}.log : {}", entryLogId, e.getMessage());

           // Fall-back to scanning
           return extractEntryLogMetadataByScanning(entryLogId, throttler);
       }
   }
   ```
   I think the key point is that the entryLog file lacks metadata validation, e.g. a CRC check, so we cannot tell whether the ledgers map was completely flushed or was corrupted by a crash.
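   As a sketch of what such validation could look like, here is an example using plain java.util.zip.CRC32 (my own illustration; BookKeeper does not currently store such a checksum, and the field layout here is invented): the writer appends a CRC32 after the serialized ledgers map, and the reader verifies it before trusting the index.

   ```java
   import java.nio.ByteBuffer;
   import java.util.zip.CRC32;

   public class CrcSketch {
       // Writer side: append a CRC32 of the serialized ledgers map on flush.
       static ByteBuffer sealWithCrc(byte[] ledgersMapBytes) {
           CRC32 crc = new CRC32();
           crc.update(ledgersMapBytes);
           ByteBuffer buf = ByteBuffer.allocate(ledgersMapBytes.length + 8);
           buf.put(ledgersMapBytes);
           buf.putLong(crc.getValue());
           buf.flip();
           return buf;
       }

       // Reader side: recompute and compare before trusting the index; a
       // mismatch means the map was never fully flushed or was corrupted.
       static boolean verifyCrc(ByteBuffer sealed) {
           byte[] body = new byte[sealed.remaining() - 8];
           sealed.get(body);
           long stored = sealed.getLong();
           CRC32 crc = new CRC32();
           crc.update(body);
           return crc.getValue() == stored;
       }

       public static void main(String[] args) {
           ByteBuffer sealed = sealWithCrc("ledgers-map".getBytes());
           byte[] raw = new byte[sealed.remaining()];
           sealed.get(raw);
           System.out.println(verifyCrc(ByteBuffer.wrap(raw))); // true

           raw[0] ^= 0x01; // simulate a torn/partial write
           System.out.println(verifyCrc(ByteBuffer.wrap(raw))); // false
       }
   }
   ```

   With such a check, a half-flushed index would fail verification deterministically instead of depending on whatever bytes happen to sit at the stale offset.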
   And I think the header.ledgersCount == 0 validation code should be written after the header.ledgersMapOffset validation (just for readability).
   ```java
   if (header.ledgersMapOffset == 0L) {
       // The index was not stored in the log file (possibly because the bookie crashed before flushing it)
       throw new IOException("No ledgers map index found on entryLogId " + entryLogId);
   }

   if (header.ledgersCount == 0) {
       throw new IOException("No ledgers map found in entryLogId " + entryLogId + ", do scan to double confirm");
   }
   ```
   If the bookie cannot be shut down gracefully, e.g. the bookie instance crashes, the old entryLog file may be deleted. So what can we do on an old-version bookie after a crash? First, do not start the bookie instance directly: back up your data, then use a hex editor to set ledgersMapOffset to zero, and then start the bookie instance. The GC thread will fall back to the extractEntryLogMetadataByScanning() method, which will not delete the entryLog file.
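   For illustration, that fix-up can also be scripted rather than done by hand in a hex editor. This hypothetical helper assumes DefaultEntryLogger's documented header layout (4-byte "BKLO" fingerprint, then a 4-byte version, then the 8-byte ledgersMapOffset at byte offset 8); the class name and the file path it takes are made up, and it must only ever be run against a backup copy of the entry log.

   ```java
   import java.io.RandomAccessFile;

   public class ZeroLedgersMapOffset {
       public static void main(String[] args) throws Exception {
           try (RandomAccessFile f = new RandomAccessFile(args[0], "rw")) {
               f.seek(8);       // assumed position of ledgersMapOffset in the header
               f.writeLong(0L); // 0 forces the fall-back to scanning on next start
           }
       }
   }
   ```

   Usage would be something like `java ZeroLedgersMapOffset /path/to/backup/0.log` for each affected entry log before starting the bookie again.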

