[ 
https://issues.apache.org/jira/browse/TIKA-3972?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17689400#comment-17689400
 ] 

Hudson commented on TIKA-3972:
------------------------------

FAILURE: Integrated in Jenkins build Tika » tika-main-jdk8 #1024 (See 
[https://ci-builds.apache.org/job/Tika/job/tika-main-jdk8/1024/])
TIKA-3972 -- fix closing <a> elements when there are also style elements 
(tallison: 
[https://github.com/apache/tika/commit/4f599dfa3d72c724a846356bf867db45f221170a])
* (edit) CHANGES.txt
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/java/org/apache/tika/parser/microsoft/rtf/RTFParserTest.java
* (edit) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/main/java/org/apache/tika/parser/microsoft/rtf/TextExtractor.java
* (add) 
tika-parsers/tika-parsers-standard/tika-parsers-standard-modules/tika-parser-microsoft-module/src/test/resources/test-documents/testRTFHyperlinkAndStyles.rtf


> Parsing RTF sample with hyperlink and ToXMLContentHandler returns malformed 
> XHTML from toString method call
> -----------------------------------------------------------------------------------------------------------
>
>                 Key: TIKA-3972
>                 URL: https://issues.apache.org/jira/browse/TIKA-3972
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.7.0
>         Environment: Tested with Java 8 (Eclipse Temurin) and Tika 2.7.0 on 
> Windows 11.
>            Reporter: Martin Honnen
>            Priority: Major
>              Labels: RTFParser, rtf
>         Attachments: hyperlink.rtf
>
>
> I am exploring Tika for RTF to X(HT)ML parsing and have run into a problem 
> with an RTF document that contains a hyperlink. When I use a ContentHandler 
> created with ToXMLContentHandler and call the toString() method on the 
> handler, it returns a malformed X(HT)ML document in which the opening `<a>` 
> tag is not properly closed.
> I have attached the relevant RTF sample document. The output I get is
> ```
> <html xmlns="http://www.w3.org/1999/xhtml">
> <head>
> <meta name="X-TIKA:Parsed-By" content="org.apache.tika.parser.DefaultParser" 
> />
> <meta name="X-TIKA:Parsed-By" 
> content="org.apache.tika.parser.microsoft.rtf.RTFParser" />
> <meta name="Content-Type" content="application/rtf" />
> <title></title>
> </head>
> <body><p />
> <p />
> <p>    10”Flour Tortilla</p>
> <p>    Caesar <b><i>DIP</i>: <a href="..\\..\\SAUCES\\Dips\\Dip, 
> Caesar.doc">Dip, Caesar.doc</b><b /></b></p>
> <p><b />    Ripped Romaine</p>
> <p>    Blackened Salmon julienne</p>
> <p>    Shaved Red Onion</p>
> <p>    Julienne Tomato</p>
> <p>    Grated Parmesan</p>
> <p>    Blackening spice: <a href="..\\..\\SPICE\\Blackening 
> Spice.doc">Blackening Spice.doc</a></p>
> <p />
> <p>Method</p>
> <p>Procedure Text </p>
> <p />
> <p />
> </body></html>
> ```
> where the part `<p>    Caesar <b><i>DIP</i>: <a 
> href="..\\..\\SAUCES\\Dips\\Dip, Caesar.doc">Dip, Caesar.doc</b><b 
> /></b></p>` is flawed as the `<a href>` is not closed.
>  
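
One way to confirm the reported malformation without Tika on the classpath is to feed the two hyperlink paragraphs from the output above to the JDK's own XML parser. The sketch below is only an illustration: the class name WellFormedCheck and its helper isWellFormed are hypothetical, and the href values are shortened stand-ins for the paths in the report. It is not taken from the Tika test suite.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

public class WellFormedCheck {

    // Shape of the flawed paragraph from the report: </b> arrives while <a> is still open.
    // The href is a shortened stand-in, not the original path.
    static final String BROKEN =
            "<p>Caesar <b><i>DIP</i>: <a href=\"Dip.doc\">Dip, Caesar.doc</b><b /></b></p>";

    // Shape of the second hyperlink paragraph from the report, where <a> is closed.
    static final String CLOSED =
            "<p>Blackening spice: <a href=\"Spice.doc\">Blackening Spice.doc</a></p>";

    // Returns true if the JDK XML parser accepts the string as well-formed XML.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Any well-formedness violation surfaces as a SAXException.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("broken: " + isWellFormed(BROKEN));
        System.out.println("closed: " + isWellFormed(CLOSED));
    }
}
```

The same helper could be applied to the full string returned by the handler's toString() call to check an entire document rather than a single paragraph.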



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
