TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag
--------------------------------------------------------------------

                 Key: TIKA-692
                 URL: https://issues.apache.org/jira/browse/TIKA-692
             Project: Tika
          Issue Type: Bug
          Components: parser
            Reporter: Michael McCandless
            Priority: Minor
             Fix For: 1.0



[Note: spinoff from the tika-dev thread "Issue in text extraction in
Solr / Tika" on Aug 19 2011, by nirnaydewan]

When parsing a Word doc where some contiguous text is bolded, due to
differences in how the user had bolded different parts of the text
with Word, TikaCLI -x or -h will sometimes generate output like this:

{noformat}
<p>F<b>oob</b>a<b>r</b>
</p>
{noformat}

and other times like this (extra newline & 2 adjacent bold sections):

{noformat}
<p>F<b>oo</b>
<b>b</b>a<b>r</b>
</p>
{noformat}

The extra newline in the second example causes browsers (I tried
Firefox, Safari, Chrome), JTidy and Tika itself to (incorrectly)
insert a space when rendering/extracting text, breaking up the word.
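
A minimal sketch of the effect, using only the JDK's XML parser. This
is not Tika's or any browser's actual code; the class WhitespaceDemo
and the helper visibleText are hypothetical names, and collapsing runs
of whitespace to a single space is only an approximation of what HTML
renderers do for inline content:

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

class WhitespaceDemo {

    // Hypothetical helper: parse a well-formed XHTML fragment and return
    // its text with each run of whitespace collapsed to one space and the
    // ends trimmed, approximating how inline text is rendered.
    static String visibleText(String xhtml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xhtml)));
            String raw = doc.getDocumentElement().getTextContent();
            return raw.replaceAll("\\s+", " ").trim();
        } catch (Exception e) {
            throw new IllegalArgumentException("could not parse fragment", e);
        }
    }

    public static void main(String[] args) {
        // The two outputs quoted in this issue.
        String first = "<p>F<b>oob</b>a<b>r</b>\n</p>";
        String second = "<p>F<b>oo</b>\n<b>b</b>a<b>r</b>\n</p>";

        System.out.println(visibleText(first));  // prints "Foobar"
        System.out.println(visibleText(second)); // prints "Foo bar"
    }
}
```

The newline between the two adjacent <b> elements is an ordinary text
node inside the <p>, so any consumer that collapses whitespace rather
than dropping it ends up with a space in the middle of the word.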

While this output might be technically correct/OK (i.e., XML whitespace
rules might treat the newline after the </b> within a <p> as
non-significant space that should be ignored), I think we should still
fix Tika to not insert newlines, if we can.



        
