[PR] add output formatting options for DumpLogSegments [kafka]

via GitHub Fri, 14 Feb 2025 17:17:04 -0800


jrmcclurg opened a new pull request, #18910:
URL: https://github.com/apache/kafka/pull/18910


   
   Currently the output of the `DumpLogSegments` tool is quite tricky to parse, 
making it difficult to use as part of disaster-recovery tooling. Here is an 
example output to demonstrate some of the issues:
   ```
   Dumping 2.log
   Log starting offset: 2
   baseOffset: 0 lastOffset: 0 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false deleteHorizonMs: OptionalLong.empty position: 0 CreateTime: 1739569505569 size: 131 magic: 2 compresscodec: none crc: 90285099 isvalid: true
   | offset: 0 CreateTime: 1739569505569 keySize: -1 valueSize: 14 sequence: -1 headerKeys: [myheader,myotherheader,mythird:header] payload: This is a test
   baseOffset: 1 lastOffset: 1 count: 1 baseSequence: -1 lastSequence: -1 producerId: -1 producerEpoch: -1 partitionLeaderEpoch: 0 isTransactional: false isControl: false deleteHorizonMs: OptionalLong.empty position: 131 CreateTime: 1739569599085 size: 149 magic: 2 compresscodec: none crc: 3989822952 isvalid: true
   | offset: 1 CreateTime: 1739569599085 keySize: -1 valueSize: 14 sequence: -1 headerKeys: [myheader,myotherheader,mythird:header,fourth,header] payload: This is a test
   ```
   Note that the header keys and values actually stored are as follows (the first record carries only the first three):
   - `myheader` -> `yes`
   - `myotherheader` -> `no`
   - `mythird:header` -> `ok`
   - `fourth,header` -> `wow`
   
   Key issues:
   1. The printed fields are _space_-separated, but values such as the payload 
can themselves contain spaces, so a context-sensitive parser is needed 
(normally I would just split on a comma or newline to parse such fields).
   2. Header keys that contain commas cause ambiguity in the output (e.g., in 
the second record of the above printout, it looks like there are 5 keys rather 
than 4).
   3. Values for the header keys are not shown in the output.
   4. There is no clear marker for where one output record ends and the next 
begins. In the above case, `baseOffset` marks the beginning of a new record, 
but when dumping indexes etc., different field names would need to be matched.
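
To make issue 2 concrete, here is a minimal Python sketch of what a naive consumer sees. It does not call the tool; it only rebuilds the bracketed `headerKeys` string from the second record of the example and splits it on commas. The variable names are illustrative only.

```python
# Header keys as stored in the second record of the example above.
stored_keys = ["myheader", "myotherheader", "mythird:header", "fourth,header"]

# The bracketed, comma-joined form that appears in the current output.
printed = "[" + ",".join(stored_keys) + "]"

# A naive consumer strips the brackets and splits on commas.
naive_keys = printed.strip("[]").split(",")

print(printed)
print(len(stored_keys), "stored keys,", len(naive_keys), "keys after splitting")
```

The key `fourth,header` is split in two, so the consumer cannot recover the original four keys from the printed text alone.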
   
   I have added four command-line options to address these issues:
   ```
   /opt/kafka/bin/kafka-run-class.sh kafka.tools.DumpLogSegments --deep-iteration --print-data-log --files 2.log --field-sep "; " --entry-caption "ENTRY\n" --record-caption "\nRECORD\n" --print-key-values
   ```
   This gives the following output for the above example:
   ```
   Dumping 2.log
   Log starting offset: 2
   ENTRY
   baseOffset: 0; lastOffset: 0; count: 1; baseSequence: -1; lastSequence: -1; producerId: -1; producerEpoch: -1; partitionLeaderEpoch: 0; isTransactional: false; isControl: false; deleteHorizonMs: OptionalLong.empty; position: 0; CreateTime: 1739569505569; size: 131; magic: 2; compresscodec: none; crc: 90285099; isvalid: true
   RECORD
   offset: 0; CreateTime: 1739569505569; keySize: -1; valueSize: 14; sequence: -1; numHeaders: 3; headerKey(8): myheader; headerVal(3): yes; headerKey(13): myotherheader; headerVal(2): no; headerKey(14): mythird:header; headerVal(2): ok; payload: This is a test
   ENTRY
   baseOffset: 1; lastOffset: 1; count: 1; baseSequence: -1; lastSequence: -1; producerId: -1; producerEpoch: -1; partitionLeaderEpoch: 0; isTransactional: false; isControl: false; deleteHorizonMs: OptionalLong.empty; position: 131; CreateTime: 1739569599085; size: 149; magic: 2; compresscodec: none; crc: 3989822952; isvalid: true
   RECORD
   offset: 1; CreateTime: 1739569599085; keySize: -1; valueSize: 14; sequence: -1; numHeaders: 4; headerKey(8): myheader; headerVal(3): yes; headerKey(13): myotherheader; headerVal(2): no; headerKey(14): mythird:header; headerVal(2): ok; headerKey(13): fourth,header; headerVal(3): wow; payload: This is a test
   ```
   Now the fields are semicolon-separated (any string can be used as a 
separator), and header keys/values are printed along with their lengths, 
allowing easy parsing.
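
As an illustration of how the length prefixes help, here is a minimal Python sketch of a consumer for one `RECORD` line. It is not part of this PR. The helper `parse_record` is hypothetical, and it rests on assumptions not stated above: that the number in `headerKey(n)` / `headerVal(n)` counts characters (it could equally be bytes, which would differ for non-ASCII data), that `payload` is always the last field, and that the separator is the `"; "` used in the example command.

```python
import re

FIELD_SEP = "; "  # the value passed to --field-sep in the example command

# A field name with an optional length, e.g. "offset: " or "headerKey(8): ".
NAME_RE = re.compile(r"(\w+)(?:\((\d+)\))?: ")

def parse_record(line, sep=FIELD_SEP):
    """Hypothetical helper: split one RECORD line into (fields, headers)."""
    fields, headers = {}, []
    pos, pending_key = 0, None
    while pos < len(line):
        m = NAME_RE.match(line, pos)
        if m is None:
            raise ValueError(f"expected a field name at position {pos}")
        name, length = m.group(1), m.group(2)
        start = m.end()
        if length is not None:
            # Length-prefixed value: take exactly that many characters,
            # so separators inside the value are harmless.
            end = start + int(length)
        elif name == "payload":
            # Assumption: payload is the last field and runs to end of line.
            end = len(line)
        else:
            end = line.find(sep, start)
            if end == -1:
                end = len(line)
        value = line[start:end]
        if name == "headerKey":
            pending_key = value
        elif name == "headerVal":
            headers.append((pending_key, value))
        else:
            fields[name] = value
        # Skip the separator that follows the value, if there is one.
        pos = end + len(sep) if line.startswith(sep, end) else end
    return fields, headers

# The second RECORD line from the example output above.
sample = (
    "offset: 1; CreateTime: 1739569599085; keySize: -1; valueSize: 14; "
    "sequence: -1; numHeaders: 4; headerKey(8): myheader; headerVal(3): yes; "
    "headerKey(13): myotherheader; headerVal(2): no; "
    "headerKey(14): mythird:header; headerVal(2): ok; "
    "headerKey(13): fourth,header; headerVal(3): wow; payload: This is a test"
)
fields, headers = parse_record(sample)
print(fields["offset"], fields["numHeaders"], headers)
```

Because header keys and values are read by length rather than by searching for the separator, a key such as `fourth,header`, or even one containing `"; "`, comes back intact.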
   
   
   
   The default values of the new command-line arguments are set to preserve the 
current functionality, so no existing tests should be affected.
   
   ### Committer Checklist (excluded from commit message)
   - [ ] Verify design and implementation 
   - [ ] Verify test coverage and CI build status
   - [ ] Verify documentation (including upgrade notes)
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

