[ https://issues.apache.org/jira/browse/TIKA-2599?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16667821#comment-16667821 ]

ASF GitHub Bot commented on TIKA-2599:
--------------------------------------

dameikle closed pull request #254: TIKA-2599: Fixed closing of styles around 
Hyperlinks. Contributed by Ronan O'Sullivan.
URL: https://github.com/apache/tika/pull/254
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
index 30bd4bb969..6f7d3785bd 100644
--- a/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
+++ b/tika-parsers/src/main/java/org/apache/tika/parser/microsoft/WordExtractor.java
@@ -528,8 +528,8 @@ private int handleSpecialCharacterRuns(Paragraph p, int index, boolean skipStyli
                     url = text.substring(start, end);
                 }
 
-                xhtml.startElement("a", "href", url);
                 closeStyleElements(skipStyling, xhtml);
+                xhtml.startElement("a", "href", url);
                 for (CharacterRun cr : texts) {
                     handleCharacterRun(cr, skipStyling, xhtml);
                 }
diff --git a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
index 7456ac409e..d2c38a42d5 100644
--- a/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
+++ b/tika-parsers/src/test/java/org/apache/tika/parser/microsoft/WordParserTest.java
@@ -560,6 +560,15 @@ public void testBoldHyperlink() throws Exception {
         assertContains("<a href=\"http://tika.apache.org/\"><b><u>hyper</u></b><u> link</u></a>; bold" , xml);
     }
 
+    @Test
+    public void testHyperlinkSurroundedByItalics() throws Exception {
+        //TIKA-2599
+        String xml = getXML("testWORD_italicsSurroundingHyperlink.doc").xml;
+        xml = xml.replaceAll("\\s+", " ");
+        assertContains("<body><p><i>Italic Test before link </i><a href=\"http://www.google.com\"><b><i>" +
+                "<u>hyperlink italics</u></i></b></a><i> Italic text after hyperlink</i></p>", xml);
+    }
+
     @Test
     public void testMacros() throws  Exception {
 
diff --git a/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc b/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc
new file mode 100644
index 0000000000..24edb8f718
Binary files /dev/null and b/tika-parsers/src/test/resources/test-documents/testWord_italicsSurroundingHyperlink.doc differ


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


> Hyperlink surrounded by Italics not closed Properly
> ---------------------------------------------------
>
>                 Key: TIKA-2599
>                 URL: https://issues.apache.org/jira/browse/TIKA-2599
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 1.14, 1.15, 1.16, 1.17
>         Environment: Any
>            Reporter: Ronan O'Sullivan
>            Assignee: Dave Meikle
>            Priority: Minor
>             Fix For: 1.20
>
>         Attachments: diff-TIKA-2599.txt, 
> testWord_italicsSurroundingHyperlink.doc
>
>
> If a Word document contains a hyperlink surrounded by italicized text, the 
> resulting XHTML is:
>  
> <p><i>Italic Test before link <a href="http://www.google.com"><b><i><u>hyperlink italics</u></i></b></a><i> Italic text after hyperlink</i></p>
>  
> The opening italics tag is not closed before the hyperlink starts, which is not valid XHTML.
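
The well-formedness problem described above can be reproduced outside Tika. The following is only a minimal sketch: WellFormedCheck and its members are hypothetical names that are not part of Tika, and it uses the JDK's built-in XML parser. The "before" string is the output quoted in the issue, with the anchor written as a normal opening tag (an assumption), and the "after" string is the fragment asserted by the new test, minus the leading <body>.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper for illustration only; not part of Tika.
final class WellFormedCheck {

    // Output reported in TIKA-2599: the leading <i> is still open when <a> starts,
    // so </p> arrives while an <i> remains unclosed.
    static final String BEFORE_FIX =
            "<p><i>Italic Test before link <a href=\"http://www.google.com\">"
            + "<b><i><u>hyperlink italics</u></i></b></a>"
            + "<i> Italic text after hyperlink</i></p>";

    // Fragment expected by testHyperlinkSurroundedByItalics: </i> precedes <a>.
    static final String AFTER_FIX =
            "<p><i>Italic Test before link </i><a href=\"http://www.google.com\">"
            + "<b><i><u>hyperlink italics</u></i></b></a>"
            + "<i> Italic text after hyperlink</i></p>";

    // Returns true if the fragment parses as well-formed XML, false otherwise.
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder =
                    DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // Mismatched or unclosed tags surface as a SAXParseException.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("before fix: " + isWellFormed(BEFORE_FIX)); // false
        System.out.println("after fix:  " + isWellFormed(AFTER_FIX));  // true
    }
}
```

This only checks well-formedness, not XHTML validity against a DTD, but it is enough to show why closing the style elements before opening the anchor matters.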



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
