[
https://issues.apache.org/jira/browse/TIKA-692?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Michael McCandless updated TIKA-692:
------------------------------------
Attachment: TIKA-692.patch
OK, I found the source of the issue: the XML/XHTML transform handler
(pulled from SAXTransformerFactory) is inserting this newline, but
(apparently) it only does so if you close a tag and immediately
re-open the same tag.
I'm not sure why it does that... but I was able to work around it, in
WordExtractor, by closing the b (or s or i) tag only if the next
character run doesn't also have that tag (or if closing is required
due to nesting).
I.e., it basically coalesces two adjacent character runs
<b>text1</b><b>text2</b> into a single <b>text1text2</b>.
With this change the new tests (and all existing tests) pass, and I no
longer see an extra space inserted when viewing the XHTML/HTML output
in browsers.
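For illustration only, the coalescing idea could be sketched roughly as
below. This is not the code in the attached patch: the RunCoalescer
class, its Run type and the toXhtml method are hypothetical names made
up for this example, it handles only b and i, and it assumes the run
text contains no characters that need XML escaping.

```java
import java.util.List;

public class RunCoalescer {

    /** Hypothetical stand-in for a Word character run; not a Tika or POI class. */
    public static final class Run {
        final String text;
        final boolean bold;
        final boolean italic;

        public Run(String text, boolean bold, boolean italic) {
            this.text = text;
            this.bold = bold;
            this.italic = italic;
        }
    }

    /**
     * Emits inline markup for the runs, leaving a tag open across adjacent
     * runs that share it, so that </b><b> is never written back to back.
     * Assumption: run text needs no XML escaping.
     */
    public static String toXhtml(List<Run> runs) {
        StringBuilder out = new StringBuilder();
        boolean boldOpen = false;
        boolean italicOpen = false;

        for (Run run : runs) {
            // <i> is nested inside <b> here, so any change in bold state
            // forces <i> to be closed first, even if the next run is italic.
            if (italicOpen && (!run.italic || run.bold != boldOpen)) {
                out.append("</i>");
                italicOpen = false;
            }
            // Close <b> only when the next run is not bold.
            if (boldOpen && !run.bold) {
                out.append("</b>");
                boldOpen = false;
            }
            // Open whatever the next run needs that is not already open.
            if (!boldOpen && run.bold) {
                out.append("<b>");
                boldOpen = true;
            }
            if (!italicOpen && run.italic) {
                out.append("<i>");
                italicOpen = true;
            }
            out.append(run.text);
        }

        // Close anything still open, innermost tag first.
        if (italicOpen) {
            out.append("</i>");
        }
        if (boldOpen) {
            out.append("</b>");
        }
        return out.toString();
    }
}
```

With the runs "F" (plain), "oo" (bold), "b" (bold), "a" (plain) and
"r" (bold), this sketch produces F<b>oob</b>a<b>r</b>, i.e. the two
adjacent bold runs share a single b element.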
> TikaCLI -x or -h on a Word doc sometimes adds newline after </b> tag
> --------------------------------------------------------------------
>
> Key: TIKA-692
> URL: https://issues.apache.org/jira/browse/TIKA-692
> Project: Tika
> Issue Type: Bug
> Components: parser
> Reporter: Michael McCandless
> Priority: Minor
> Fix For: 1.0
>
> Attachments: TIKA-692.patch, TIKA-692.patch,
> testWORD_bold_character_runs.doc, testWORD_bold_character_runs2.doc
>
>
> [Note: spinoff from the tika-dev thread "Issue in text extraction in
> Solr / Tika" on Aug 19 2011, by nirnaydewan]
> When parsing a Word doc where some contiguous text is bolded, due to
> differences in how the user had bolded different parts of the text
> with Word, TikaCLI -x or -h will sometimes generate output like this:
> {noformat}
> <p>F<b>oob</b>a<b>r</b>
> </p>
> {noformat}
> and other times like this (extra newline & 2 adjacent bold sections):
> {noformat}
> <p>F<b>oo</b>
> <b>b</b>a<b>r</b>
> </p>
> {noformat}
> The extra newline in the second example causes browsers (I tried
> Firefox, Safari, Chrome), JTidy and Tika itself to (incorrectly)
> insert a space when rendering/extracting text, breaking up the word.
> While this might be technically correct/OK (i.e., XML whitespace rules
> might say that non-significant space after the </b> within a <p>
> should be ignored), I think we should still fix Tika to not insert
> newlines, if we can.