TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag
--------------------------------------------------------------------
Key: TIKA-692
URL: https://issues.apache.org/jira/browse/TIKA-692
Project: Tika
Issue Type: Bug
Components: parser
Reporter: Michael McCandless
Priority: Minor
Fix For: 1.0

[Note: spinoff from the tika-dev thread "Issue in text extraction in Solr / Tika" on Aug 19 2011, by nirnaydewan]

When parsing a Word doc in which some contiguous text is bold, TikaCLI -x or -h produces different output depending on how the user applied the bolding to different parts of the text in Word. Sometimes it generates output like this:

{noformat}
<p>F<b>oob</b>a<b>r</b>
</p>
{noformat}

and other times like this (extra newline & 2 adjacent bold sections):

{noformat}
<p>F<b>oo</b>
<b>b</b>a<b>r</b>
</p>
{noformat}

The extra newline in the second example causes browsers (I tried Firefox, Safari, Chrome), JTidy and Tika itself to (incorrectly) insert a space when rendering/extracting text, breaking up the word.

While this might be technically correct/OK (i.e., XML whitespace rules might treat the whitespace after the </b> within a <p> as non-significant, so that it should be ignored), I think we should still fix Tika to not insert newlines, if we can.
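To make the effect concrete, here is a minimal Python sketch (not Tika code) that approximates what a browser or text extractor does with the two outputs: it drops the tags and collapses each run of whitespace, newlines included, into a single space. The names `TextExtractor` and `visible_text` are hypothetical, and the sample strings assume the newline sits directly after the `</b>` as described in this report.

```python
import re
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collects character data and ignores all tags."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        # Called for every text node, including whitespace-only ones.
        self.chunks.append(data)


def visible_text(xhtml: str) -> str:
    """Approximate the rendered text of a small XHTML fragment."""
    parser = TextExtractor()
    parser.feed(xhtml)
    parser.close()
    # Collapse each whitespace run (including newlines) to one space,
    # roughly as browsers do for inline content, then trim the ends.
    return re.sub(r"\s+", " ", "".join(parser.chunks)).strip()


# Assumed reconstruction of the two outputs from the report.
first = "<p>F<b>oob</b>a<b>r</b>\n</p>"
second = "<p>F<b>oo</b>\n<b>b</b>a<b>r</b>\n</p>"

print(visible_text(first))   # Foobar
print(visible_text(second))  # Foo bar
```

The newline between the two adjacent bold runs survives as a space inside the word, while the newline before `</p>` is harmless because it falls at the end of the paragraph.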