utafrali commented on code in PR #2773:
URL: https://github.com/apache/tika/pull/2773#discussion_r3100148783
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java:
##########
@@ -142,24 +142,55 @@ protected void buildXHTML(XHTMLContentHandler xhtml)
throws SAXException, IOException {
OPCPackage container = opcPackage;
- XSSFSharedStringsShim stringsShim;
+ XSSFSharedStringsShim stringsShim = null;
XSSFReader.SheetIterator iter;
XSSFReader xssfReader;
- XSSFStylesShim stylesShim;
+ XSSFStylesShim stylesShim = null;
try {
xssfReader = new XSSFReader(container);
- stylesShim = new XSSFStylesShim(xssfReader.getStylesData(),
parseContext);
-
iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
+ } catch (OpenXML4JException | RuntimeException e) {
+ throw new IOException(e);
+ }
+ // Styles and shared strings are optional — if either part is missing
or
+ // unreadable, log to metadata and continue with degraded extraction.
+ try {
+ stylesShim = new XSSFStylesShim(xssfReader.getStylesData(),
parseContext);
+ } catch (Exception e) {
+ metadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
+ org.apache.tika.utils.ExceptionUtils.getStackTrace(e));
+ }
+ try {
stringsShim = new
XSSFSharedStringsShim(xssfReader.getSharedStringsData(),
config.isConcatenatePhoneticRuns(), parseContext);
- } catch (OpenXML4JException | TikaException e) {
- throw new IOException(e);
+ } catch (Exception e) {
+ metadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
+ org.apache.tika.utils.ExceptionUtils.getStackTrace(e));
}
- while (iter.hasNext()) {
+ while (true) {
+ try {
+ if (!iter.hasNext()) {
+ break;
+ }
+ } catch (RuntimeException e) {
+ metadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
+ org.apache.tika.utils.ExceptionUtils.getStackTrace(e));
+ break;
+ }
SheetTextAsHTML sheetExtractor = new SheetTextAsHTML(config,
xhtml);
PackagePart sheetPart = null;
- try (InputStream stream = iter.next()) {
+ InputStream nextStream;
+ try {
+ nextStream = iter.next();
+ } catch (RuntimeException e) {
+ // POI can throw POIXMLException for missing sheet parts (e.g.,
+ // truncated workbook references a sheet that isn't in the
zip).
+ // Skip the sheet but keep what we've already extracted.
+ metadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
+ org.apache.tika.utils.ExceptionUtils.getStackTrace(e));
Review Comment:
When `iter.next()` throws a `RuntimeException`, the `continue` returns to
the top of the loop where `iter.hasNext()` is called again. If POI's iterator
leaves `hasNext()` returning `true` even after the failed `next()` call (e.g.,
the internal index didn't advance), this becomes an infinite loop, spinning
through the same broken sheet repeatedly.
Using `break` instead of `continue` is the safer choice here. If `next()`
failed, the iterator state is undefined and abandoning iteration is preferable
to risking a hang. Any sheets already processed are preserved in the output.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/detect/microsoft/ooxml/OPCPackageDetector.java:
##########
@@ -281,7 +281,23 @@ public MediaType detect(ZipFile zipFile, TikaInputStream
stream) throws IOExcept
//no need to close zipEntrySource because it
//only closes the underlying zipFile, not any other resources
//as of this writing.... :'(
- return null;
+ //fall through to [Content_Types].xml fallback below
+ }
+ // POI may have failed (caught above) OR returned null because the
+ // rels were malformed and POI silently produced an empty relationship
+ // collection. Either way, fall back to parsing [Content_Types].xml
+ // directly — same approach as the streaming detector.
+ if (type == null) {
+ try (InputStream contentTypesStream =
+ zipEntrySource.getInputStream(
+
zipEntrySource.getEntry("[Content_Types].xml"))) {
Review Comment:
**NPE risk:** `zipEntrySource.getEntry("[Content_Types].xml")` returns
`null` when the entry is absent. Passing that `null` to `getInputStream()` will
throw a `NullPointerException`, which the `catch (IOException ignore)` block
below won't catch, so the exception propagates up and defeats the entire
fallback.
Fix: assign the entry first, check it before opening the stream:
```java
ZipArchiveEntry ctEntry = zipEntrySource.getEntry("[Content_Types].xml");
if (ctEntry != null) {
try (InputStream contentTypesStream =
zipEntrySource.getInputStream(ctEntry)) {
type = parseOOXMLContentTypes(contentTypesStream);
} catch (IOException ignore) {
//swallow
}
}
```
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/XSSFExcelExtractorDecorator.java:
##########
@@ -142,24 +142,55 @@ protected void buildXHTML(XHTMLContentHandler xhtml)
throws SAXException, IOException {
OPCPackage container = opcPackage;
- XSSFSharedStringsShim stringsShim;
+ XSSFSharedStringsShim stringsShim = null;
XSSFReader.SheetIterator iter;
XSSFReader xssfReader;
- XSSFStylesShim stylesShim;
+ XSSFStylesShim stylesShim = null;
try {
xssfReader = new XSSFReader(container);
- stylesShim = new XSSFStylesShim(xssfReader.getStylesData(),
parseContext);
-
iter = (XSSFReader.SheetIterator) xssfReader.getSheetsData();
+ } catch (OpenXML4JException | RuntimeException e) {
+ throw new IOException(e);
+ }
+ // Styles and shared strings are optional — if either part is missing
or
+ // unreadable, log to metadata and continue with degraded extraction.
+ try {
+ stylesShim = new XSSFStylesShim(xssfReader.getStylesData(),
parseContext);
+ } catch (Exception e) {
+ metadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
+ org.apache.tika.utils.ExceptionUtils.getStackTrace(e));
Review Comment:
The new catch blocks use fully-qualified names
(`org.apache.tika.utils.ExceptionUtils`,
`org.apache.tika.exception.WriteLimitReachedException`) while the rest of the
file uses short names via imports. This pattern appears three times across the
two new catch blocks. Add the missing imports at the top of the file and use
the short names, which matches the existing code style.
##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-pkg-module/src/main/java/org/apache/tika/parser/pkg/ZipParser.java:
##########
@@ -374,12 +375,22 @@ private void parseStreamEntries(ZipArchiveInputStream
zis, Metadata metadata,
throws TikaException, IOException, SAXException {
try {
- ArchiveEntry entry = zis.getNextEntry();
- while (entry != null) {
+ ArchiveEntry entry;
+ while (true) {
+ try {
+ entry = zis.getNextEntry();
+ } catch (java.util.zip.ZipException ze) {
Review Comment:
No test resource exercises the `ZipException` path added here. The same
applies to the Excel and Word robustness fixes in this PR. Consider adding a
few malformed test files (a truncated ZIP, a ZIP-based XLSX with a missing
sheet part, a DOCX with corrupt body XML) under `src/test/resources` and
corresponding test cases. Without them, future regressions in this
error-handling logic won't be caught automatically.
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]