lurongjiang commented on PR #1044:
URL: https://github.com/apache/poi/pull/1044#issuecomment-4210591787

   
   ### Key Changes:
   
   1. **Added Configurable Tolerance Mode** 
(`org.apache.poi.poifs.allowCorruptBlocks`)
      - Modified `ByteArrayBackedDataSource.java` and 
`FileBackedDataSource.java`
      - Default behavior: **Strict mode** - throws `IndexOutOfBoundsException` 
when encountering blocks beyond EOF
      - Optional tolerant mode: Set system property to `true` to allow reading 
corrupt files with missing blocks
      - The system property is read at runtime on each check rather than 
cached, so it can be toggled without restarting the JVM
   
   2. **Improved Test Coverage**
      - Updated `TestHWPFParser.java` with comprehensive tests:
        - `testDocRead()`: Verifies file can be read in tolerant mode with 
actual text content validation
        - `testDocReadStrictMode()`: Verifies strict mode properly rejects 
corrupt files
        - `testWpsDocByFs()`: Tests file system-based reading with content 
validation
        - `testOffice97_2003DocRead()`: Validates normal document reading
      - All tests now verify actual text content (not just non-null), checking:
        - Text is not null
        - Text is not empty
        - Text is not blank (after trimming)
      - Added `getRootCause()` helper method with depth limit (20 levels) for 
robust exception chain analysis
   
   3. **Clear Error Messages**
      - Exception messages guide users on how to enable tolerant mode if needed
      - Example: "Position X is beyond EOF (Y). Set system property 
'org.apache.poi.poifs.allowCorruptBlocks' to true to allow reading corrupt 
files with missing blocks."
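
As a rough illustration of the strict/tolerant check described in items 1 and 3
(not the actual diff), the logic could look like the sketch below. The class
`TolerantReadSketch` and its `read` method are hypothetical names, and returning
a zero-filled block in tolerant mode is an assumption made for the example; only
the property name, the exception type and the message wording come from the
description above.

```java
import java.nio.ByteBuffer;

// Hypothetical stand-in for a byte-array-backed data source; not POI code.
class TolerantReadSketch {
    // Property name taken from the PR description.
    static final String ALLOW_CORRUPT = "org.apache.poi.poifs.allowCorruptBlocks";

    static ByteBuffer read(byte[] data, int length, long position) {
        if (position >= data.length) {
            // Boolean.getBoolean looks the property up on every call,
            // so the mode can change while the JVM is running.
            if (!Boolean.getBoolean(ALLOW_CORRUPT)) {
                throw new IndexOutOfBoundsException(
                    "Position " + position + " is beyond EOF (" + data.length + "). "
                    + "Set system property '" + ALLOW_CORRUPT + "' to true "
                    + "to allow reading corrupt files with missing blocks.");
            }
            // Assumption for this sketch: tolerant mode yields a zero-filled block.
            return ByteBuffer.allocate(length);
        }
        // Normal case: expose at most the bytes that actually exist.
        int toRead = (int) Math.min(length, data.length - position);
        return ByteBuffer.wrap(data, (int) position, toRead);
    }
}
```

With the property unset, `TolerantReadSketch.read(new byte[4], 2, 10)` throws;
with it set to `true`, the same call returns a 2-byte buffer.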
   
   ### Design Rationale:
   
   - **Fail-fast by default**: Aligns with Apache POI's principle of strict 
validation
   - **Opt-in tolerance**: Users who need to handle damaged documents can 
enable it via system property
   - **No API changes**: Backward compatible, uses standard Java system 
properties
   - **Follows existing patterns**: Similar to other POI configuration options 
like `org.apache.poi.ss.ignoreMissingFontSystem`
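
Because the switch is an ordinary system property, it can be set on the command
line with `-Dorg.apache.poi.poifs.allowCorruptBlocks=true` or from code. A
minimal sketch of scoping it to a single operation follows; `TolerantModeScope`
and `withTolerantMode` are hypothetical names for this example, not part of POI.

```java
import java.util.function.Supplier;

// Hypothetical helper; only the property name comes from the PR description.
class TolerantModeScope {
    static final String ALLOW_CORRUPT = "org.apache.poi.poifs.allowCorruptBlocks";

    // Runs the action with tolerant mode enabled, then restores the old value.
    static <T> T withTolerantMode(Supplier<T> action) {
        String previous = System.getProperty(ALLOW_CORRUPT);
        System.setProperty(ALLOW_CORRUPT, "true");
        try {
            return action.get();
        } finally {
            if (previous == null) {
                System.clearProperty(ALLOW_CORRUPT);
            } else {
                System.setProperty(ALLOW_CORRUPT, previous);
            }
        }
    }
}
```

System properties are JVM-wide, so other threads reading files during the
scoped call would also see tolerant mode.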
   
   ### Testing:
   
   All tests pass, covering both strict and tolerant modes. The implementation 
was also run against `issue_1041.doc`, the problematic file that previously 
caused failures.
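
For readers who want a picture of the test helpers listed under Improved Test
Coverage, here is a minimal sketch of a depth-limited `getRootCause()` and of
the three text checks. The class name `TestHelpersSketch` and the method
`hasRealText` are hypothetical, and the bodies are one plausible reading of
the description rather than the code in `TestHWPFParser.java`.

```java
// Hypothetical helpers; the depth limit of 20 is taken from the PR description.
class TestHelpersSketch {
    static final int MAX_DEPTH = 20;

    // Walks the cause chain, stopping at the root, at a self-referencing
    // cause, or after MAX_DEPTH steps, whichever comes first.
    static Throwable getRootCause(Throwable t) {
        if (t == null) {
            return null;
        }
        Throwable current = t;
        int depth = 0;
        while (current.getCause() != null
                && current.getCause() != current
                && depth < MAX_DEPTH) {
            current = current.getCause();
            depth++;
        }
        return current;
    }

    // Mirrors the three checks: not null, not empty, not blank after trimming.
    static boolean hasRealText(String text) {
        return text != null && !text.isEmpty() && !text.trim().isEmpty();
    }
}
```

The depth limit keeps the loop bounded even if a cause chain is cyclic or
unusually deep.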
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

