[ https://issues.apache.org/jira/browse/TIKA-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689400#comment-17689400 ]
Hudson commented on TIKA-3972:
------------------------------

FAILURE: Integrated in Jenkins build Tika » tika-main-jdk8 #1024 (See [https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/1024/])
TIKA-3972 -- fix closing <a> elements when there are also style elements (tallison: [https://github.com/apache/tika/commit/4f599dfa3d72c724a846356bf867db45f221170a])
* (edit) CHANGES.txt
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/rtf/RTFParserTest.java
* (edit) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/rtf/TextExtractor.java
* (add) tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testRTFHyperlinkAndStyles.rtf

> Parsing RTF sample with hyperlink and ToXMLContentHandler returns malformed XHTML from toString method call
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3972
>                 URL: https://issues.apache.org/jira/browse/TIKA-3972
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.7.0
>         Environment: Tested with Java 8 (Eclipse Temurin) and Tika 2.7.0 on Windows 11.
>            Reporter: Martin Honnen
>            Priority: Major
>              Labels: RTFParser, rtf
>         Attachments: hyperlink.rtf
>
>
> I am exploring Tika for RTF to X(HT)ML parsing and have run into a problem
> with an RTF document that contains a hyperlink. When I use a ContentHandler
> created with ToXMLContentHandler and call the toString() method on the
> handler, it returns a malformed X(HT)ML document in which the opening `<a>`
> tag is not properly closed.
> I have attached the relevant RTF sample document.
> The output I get is
> ```
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" />
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.microsoft.rtf.RTFParser" />
> <meta name="Content-Type" content="application/rtf" />
> <title></title>
> </head>
> <body><p />
> <p />
> <p> 10”Flour Tortilla</p>
> <p> Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips\\Dip, Caesar.doc">Dip, Caesar.doc</b><b /></b></p>
> <p><b /> Ripped Romaine</p>
> <p> Blackened Salmon julienne</p>
> <p> Shaved Red Onion</p>
> <p> Julienne Tomato</p>
> <p> Grated Parmesan</p>
> <p> Blackening spice: <a href="..\\..\\SPICE\\Blackening Spice.doc">Blackening Spice.doc</a></p>
> <p />
> <p>Method</p>
> <p>Procedure Text </p>
> <p />
> <p />
> </body></html>
> ```
> where the part `<p> Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips\\Dip, Caesar.doc">Dip, Caesar.doc</b><b /></b></p>`
> is flawed as the `<a href>` is not closed.
>

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
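
For readers who want to see why the quoted paragraph counts as malformed, here is a minimal sketch that feeds two paragraphs from the reported output to the JDK's built-in XML parser. It does not call Tika and is not the test added in the commit; the class name `WellFormedCheck`, the method `isWellFormed`, and both constants are hypothetical names introduced only for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

// Hypothetical helper (not part of Tika): checks XML well-formedness with the JDK parser.
class WellFormedCheck {

    // Paragraph quoted in the report as flawed: <a> is opened but never closed.
    static final String REPORTED_BROKEN =
        "<p> Caesar <b><i>DIP</i>: <a href=\"..\\\\..\\\\SAUCES\\\\Dips\\\\Dip, Caesar.doc\">"
        + "Dip, Caesar.doc</b><b /></b></p>";

    // Second hyperlink paragraph from the same output, where <a> is closed correctly.
    static final String REPORTED_OK =
        "<p> Blackening spice: <a href=\"..\\\\..\\\\SPICE\\\\Blackening Spice.doc\">"
        + "Blackening Spice.doc</a></p>";

    // Returns true if the string parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            DocumentBuilder builder = factory.newDocumentBuilder();
            // Rethrow parse errors instead of letting the default handler print to stderr.
            builder.setErrorHandler(new ErrorHandler() {
                @Override
                public void warning(SAXParseException e) {
                    // Warnings do not affect well-formedness.
                }

                @Override
                public void error(SAXParseException e) throws SAXException {
                    throw e;
                }

                @Override
                public void fatalError(SAXParseException e) throws SAXException {
                    throw e;
                }
            });
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Mismatched or unclosed tags end up here.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(isWellFormed(REPORTED_BROKEN)); // prints false
        System.out.println(isWellFormed(REPORTED_OK));     // prints true
    }
}
```

The same check could be applied to the full string returned by the handler's toString() call, which is how the problem described in the report would surface in practice.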