Re: [PR] 4x-reg-sax-fixes [tika]

via GitHub Fri, 17 Apr 2026 05:13:05 -0700


utafrali commented on code in PR #2773:
URL: https://github.com/apache/tika/pull/2773#discussion_r3100148783



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java:
##########
@@ -142,24 +142,55 @@ protected void buildXHTML(XHTMLContentHandler xhtml)
             throws SAXException, IOException {
         OPCPackage container = opcPackage;
 
-        XSSFSharedStringsShim stringsShim;
+        XSSFSharedStringsShim stringsShim = null;
         XSSFReader.SheetIterator iter;
         XSSFReader xssfReader;
-        XSSFStylesShim stylesShim;
+        XSSFStylesShim stylesShim = null;
         try {
             xssfReader = new XSSFReader(container);
-            stylesShim = new XSSFStylesShim(xssfReader.getStylesData(), 
parseContext);
-
             iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
+        } catch (OpenXML4JException | RuntimeException e) {
+            throw new IOException(e);
+        }
+        // Styles and shared strings are optional — if either part is missing 
or
+        // unreadable, log to metadata and continue with degraded extraction.
+        try {
+            stylesShim = new XSSFStylesShim(xssfReader.getStylesData(), 
parseContext);
+        } catch (Exception e) {
+            metadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
+                    org.apache.tika.utils.ExceptionUtils.getStackTrace(e));
+        }
+        try {
             stringsShim = new 
XSSFSharedStringsShim(xssfReader.getSharedStringsData(),
                     config.isConcatenatePhoneticRuns(), parseContext);
-        } catch (OpenXML4JException | TikaException e) {
-            throw new IOException(e);
+        } catch (Exception e) {
+            metadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
+                    org.apache.tika.utils.ExceptionUtils.getStackTrace(e));
         }
-        while (iter.hasNext()) {
+        while (true) {
+            try {
+                if (!iter.hasNext()) {
+                    break;
+                }
+            } catch (RuntimeException e) {
+                metadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
+                        org.apache.tika.utils.ExceptionUtils.getStackTrace(e));
+                break;
+            }
             SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(config, 
xhtml);
             PackagePart sheetPart = null;
-            try (InputStream stream = iter.next()) {
+            InputStream nextStream;
+            try {
+                nextStream = iter.next();
+            } catch (RuntimeException e) {
+                // POI can throw POIXMLException for missing sheet parts (e.g.,
+                // truncated workbook references a sheet that isn't in the 
zip).
+                // Skip the sheet but keep what we've already extracted.
+                metadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
+                        org.apache.tika.utils.ExceptionUtils.getStackTrace(e));

Review Comment:
   When `iter.next()` throws a `RuntimeException`, the `continue` returns to 
the top of the loop where `iter.hasNext()` is called again. If POI's iterator 
leaves `hasNext()` returning `true` even after the failed `next()` call (e.g., 
the internal index didn't advance), this becomes an infinite loop, spinning 
through the same broken sheet repeatedly.
   
   Using `break` instead of `continue` is the safer choice here. If `next()` 
failed, the iterator state is undefined and abandoning iteration is preferable 
to risking a hang. Any sheets already processed are preserved in the output.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/ooxml/OPCPackageDetector.java:
##########
@@ -281,7 +281,23 @@ public MediaType detect(ZipFile zipFile, TikaInputStream 
stream) throws IOExcept
             //no need to close zipEntrySource because it
             //only closes the underlying zipFile, not any other resources
             //as of this writing.... :'(
-            return null;
+            //fall through to [Content_Types].xml fallback below
+        }
+        // POI may have failed (caught above) OR returned null because the
+        // rels were malformed and POI silently produced an empty relationship
+        // collection. Either way, fall back to parsing [Content_Types].xml
+        // directly — same approach as the streaming detector.
+        if (type == null) {
+            try (InputStream contentTypesStream =
+                         zipEntrySource.getInputStream(
+                                 
zipEntrySource.getEntry("[Content_Types].xml"))) {

Review Comment:
   **NPE risk:** `zipEntrySource.getEntry("[Content_Types].xml")` returns 
`null` when the entry is absent. Passing that `null` to `getInputStream()` will 
throw a `NullPointerException`, which the `catch (IOException ignore)` block 
below won't catch, so the exception propagates up and defeats the entire 
fallback.
   
   Fix: assign the entry first, check it before opening the stream:
   ```java
   ZipArchiveEntry ctEntry = zipEntrySource.getEntry("[Content_Types].xml");
   if (ctEntry != null) {
       try (InputStream contentTypesStream = 
zipEntrySource.getInputStream(ctEntry)) {
           type = parseOOXMLContentTypes(contentTypesStream);
       } catch (IOException ignore) {
           //swallow
       }
   }
   ```



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java:
##########
@@ -142,24 +142,55 @@ protected void buildXHTML(XHTMLContentHandler xhtml)
             throws SAXException, IOException {
         OPCPackage container = opcPackage;
 
-        XSSFSharedStringsShim stringsShim;
+        XSSFSharedStringsShim stringsShim = null;
         XSSFReader.SheetIterator iter;
         XSSFReader xssfReader;
-        XSSFStylesShim stylesShim;
+        XSSFStylesShim stylesShim = null;
         try {
             xssfReader = new XSSFReader(container);
-            stylesShim = new XSSFStylesShim(xssfReader.getStylesData(), 
parseContext);
-
             iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
+        } catch (OpenXML4JException | RuntimeException e) {
+            throw new IOException(e);
+        }
+        // Styles and shared strings are optional — if either part is missing 
or
+        // unreadable, log to metadata and continue with degraded extraction.
+        try {
+            stylesShim = new XSSFStylesShim(xssfReader.getStylesData(), 
parseContext);
+        } catch (Exception e) {
+            metadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
+                    org.apache.tika.utils.ExceptionUtils.getStackTrace(e));

Review Comment:
   The new catch blocks use fully-qualified names 
(`org.apache.tika.utils.ExceptionUtils`, 
`org.apache.tika.exception.WriteLimitReachedException`) while the rest of the 
file uses short names via imports. This pattern appears three times across the 
two new catch blocks. Add the missing imports at the top of the file and use 
the short names, which matches the existing code style.



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/ZipParser.java:
##########
@@ -374,12 +375,22 @@ private void parseStreamEntries(ZipArchiveInputStream 
zis, Metadata metadata,
             throws TikaException, IOException, SAXException {
 
         try {
-            ArchiveEntry entry = zis.getNextEntry();
-            while (entry != null) {
+            ArchiveEntry entry;
+            while (true) {
+                try {
+                    entry = zis.getNextEntry();
+                } catch (java.util.zip.ZipException ze) {

Review Comment:
   No test resource exercises the `ZipException` path added here. The same 
applies to the Excel and Word robustness fixes in this PR. Consider adding a 
few malformed test files (a truncated ZIP, a ZIP-based XLSX with a missing 
sheet part, a DOCX with corrupt body XML) under `src/test/resources` and 
corresponding test cases. Without them, future regressions in this 
error-handling logic won't be caught automatically.



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Re: [PR] 4x-reg-sax-fixes [tika]

Reply via email to