Copilot commented on code in PR #2817:
URL: https://github.com/apache/tika/pull/2817#discussion_r3263442705


##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/ooxml/AbstractOOXMLExtractor.java:
##########
@@ -672,6 +672,18 @@ void handleGeneralTextContainingPart(String contentType, 
String xhtmlClassLabel,
                     } catch (IOException | TikaException e) {
                         
parentMetadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
                                 ExceptionUtils.getStackTrace(e));
+                    } catch (SAXException e) {
+                        // Don't let a per-part SAX failure cancel the rest of
+                        // the loop -- and crucially, don't let it propagate
+                        // past the matching </div> emitted after this loop.
+                        // The contentHandler's internal bookkeeping (e.g.,
+                        // OOXMLTikaBodyPartHandler) may still have 
<p>/<td>/etc.
+                        // open after the partial parse; that cleanup needs a
+                        // handle to the body handler, which this generic 
method
+                        // doesn't have. Tracked for a follow-up.
+                        WriteLimitReachedException.throwIfWriteLimitReached(e);
+                        
parentMetadata.add(TikaCoreProperties.TIKA_META_EXCEPTION_WARNING,
+                                ExceptionUtils.getStackTrace(e));

Review Comment:
   In this SAXException catch you continue the loop but don’t (and can’t) 
repair the downstream ContentHandler’s element stack. If the embedded parse 
aborted after emitting <p>/<td>/etc., the later endElement("div") will be 
cross-nested or will throw, and the captured XHTML can still end up malformed. 
Consider either (a) rethrowing SAXException here to avoid emitting a mismatched 
closing </div>, or (b) changing this helper to take a handler that can be 
drained/closed on SAXException (e.g., pass an OOXMLTikaBodyPartHandler 
reference alongside the ContentHandler, or wrap contentHandler in a decorator 
that tracks opens and can auto-close before leaving the catch).
   



##########
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-cad-module/src/main/java/org/apache/tika/parser/prt/PRTParser.java:
##########
@@ -77,53 +77,61 @@ public void parse(TikaInputStream tis, ContentHandler 
handler, Metadata metadata
                       ParseContext context) throws IOException, SAXException, 
TikaException {
 
         XHTMLContentHandler xhtml = new XHTMLContentHandler(handler, metadata, 
context);
-        Last5 l5 = new Last5();
-        int read;
+        xhtml.startDocument();
+        // try/finally so endDocument fires even if a header read, characters
+        // emit, or recursive helper throws 
IOException/SAXException/TikaException
+        // mid-parse. Without this the captured XHTML would be left 
unterminated.
+        try {
+            Last5 l5 = new Last5();
+            int read;
 
-        // Try to get the creation date, which is YYYYMMDDhhmm
-        byte[] header = new byte[30];
-        IOUtils.readFully(tis, header);
-        byte[] date = new byte[12];
-        IOUtils.readFully(tis, date);
+            // Try to get the creation date, which is YYYYMMDDhhmm
+            byte[] header = new byte[30];
+            IOUtils.readFully(tis, header);
+            byte[] date = new byte[12];
+            IOUtils.readFully(tis, date);
 
-        String dateStr = new String(date, US_ASCII);
-        if (dateStr.startsWith("19") || dateStr.startsWith("20")) {
-            String formattedDate = dateStr.substring(0, 4) + "-" + 
dateStr.substring(4, 6) + "-" +
-                    dateStr.substring(6, 8) + "T" + dateStr.substring(8, 10) + 
":" +
-                    dateStr.substring(10, 12) + ":00";
-            metadata.set(TikaCoreProperties.CREATED, formattedDate);
-            // TODO Metadata.DATE is used as modified, should it be here?
-            metadata.set(TikaCoreProperties.CREATED, formattedDate);
-        }
-        metadata.set(Metadata.CONTENT_TYPE, PRT_MIME_TYPE);
+            String dateStr = new String(date, US_ASCII);
+            if (dateStr.startsWith("19") || dateStr.startsWith("20")) {
+                String formattedDate = dateStr.substring(0, 4) + "-" + 
dateStr.substring(4, 6) + "-" +
+                        dateStr.substring(6, 8) + "T" + dateStr.substring(8, 
10) + ":" +
+                        dateStr.substring(10, 12) + ":00";
+                metadata.set(TikaCoreProperties.CREATED, formattedDate);
+                // TODO Metadata.DATE is used as modified, should it be here?
+                metadata.set(TikaCoreProperties.CREATED, formattedDate);

Review Comment:
   These two consecutive calls both set TikaCoreProperties.CREATED to the same 
formattedDate value, which is redundant and likely a copy/paste error. If the 
intent was to set a separate modified/date property, update the second 
assignment accordingly; otherwise remove the duplicate set() to avoid confusion.
   



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]

Reply via email to