[ https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914481#comment-16914481 ]
Ken Krugler commented on TIKA-2928:
-----------------------------------

Hi [~Sargent_D] - thanks for trying this out!

I'm going to bump the priority of the issue re switching to JSoup, as this is another sign that it (hopefully) will work better.

> Less than sign within tag boundaries considered as start of a new tag.
> ----------------------------------------------------------------------
>
>                 Key: TIKA-2928
>                 URL: https://issues.apache.org/jira/browse/TIKA-2928
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser, server
>    Affects Versions: 1.22
>            Reporter: Desmond David
>            Priority: Minor
>
> I have been parsing some (somewhat non-standard) HTML documents with Tika
> and have observed that if a document contains a less-than sign (<) as part
> of a tag's body, Tika parses it as the start of a new tag and omits the
> rest of the text from the final document, up to the point where the next
> newline is emitted.
> For example, consider the following HTML snippet:
>
> {code:html}
> <tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure
> </td></tr><tr ><td ></td></tr><tr ><td > ENZYMES & BILIRUBIN</td></tr>{code}
> The result is:
> {code:java}
> GFR
> ENZYMES & BILIRUBIN
> {code}
> Here, the rest of the content after the first `GFR` is omitted. Based on
> this observation, I think the `<60` and its subsequent characters are being
> interpreted as part of a tag and are therefore ignored. Then, at some
> point, `</td></tr>` is encountered, which short-circuits the execution and
> starts processing the next line.
> This behaviour was observed using both the Tika App and the Tika Server.
> I think the expected behaviour is that all text within data tags (p, td,
> etc.) is treated as raw text, or at least that Tika can be configured to
> do so.
>

--
This message was sent by Atlassian Jira
(v8.3.2#803003)
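
To make the expected behaviour in the report concrete, here is a minimal sketch in Python. It is neither Tika's parser nor JSoup; it only uses the standard library's `html.parser`, which treats `<` as the start of a tag only when it is followed by a letter, `/`, `!` or `?`, and otherwise hands it over as ordinary text. The names `CellTextExtractor` and `extract_cells` are made up for this illustration.

```python
from html.parser import HTMLParser


class CellTextExtractor(HTMLParser):
    """Hypothetical helper: collects the text of each non-empty <td> cell."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.cells = []    # finished cell texts, in document order
        self._buffer = []  # text chunks seen since the last </td>

    def handle_data(self, data):
        # Called for ordinary text, including a "<" that does not open a tag
        # (for example the "<" in "GFR<60").
        self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "td":
            # Join the chunks and collapse runs of whitespace.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.cells.append(text)
            self._buffer = []


def extract_cells(html):
    """Return the text of every non-empty <td> cell in the given HTML."""
    parser = CellTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.cells


# The snippet from the issue description.
SNIPPET = (
    "<tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure "
    "</td></tr><tr ><td ></td></tr><tr ><td > ENZYMES & BILIRUBIN</td></tr>"
)

if __name__ == "__main__":
    for cell in extract_cells(SNIPPET):
        print(cell)
    # prints:
    # GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure
    # ENZYMES & BILIRUBIN
```

With this rule the text after the first `GFR` survives, which is the output the reporter expected rather than the truncated `GFR` line shown in the issue.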