[ 
https://issues.apache.org/jira/browse/TIKA-2928?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16914481#comment-16914481
 ] 

Ken Krugler commented on TIKA-2928:
-----------------------------------

Hi [~Sargent_D] - thanks for trying this out! I'm going to bump the priority of 
the issue re switching to JSoup, as this is another sign that it (hopefully) 
will work better.

> Less than sign within tag boundaries considered as start of a new tag.
> ----------------------------------------------------------------------
>
>                 Key: TIKA-2928
>                 URL: https://issues.apache.org/jira/browse/TIKA-2928
>             Project: Tika
>          Issue Type: Improvement
>          Components: parser, server
>    Affects Versions: 1.22
>            Reporter: Desmond David
>            Priority: Minor
>
> I have been attempting to parse some (somewhat non-standard) HTML 
> documents using Tika, and I have observed that if the document contains a 
> less-than sign (<) as part of a tag's body, Tika parses it as the start of a 
> new tag and omits the rest of the text from the final document, up to the 
> point where the next newline would be emitted.
> For example, consider the following HTML snippet:
>  
> {code:html}
> <tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure 
> </td></tr><tr ><td ></td></tr><tr ><td > ENZYMES & BILIRUBIN</td></tr>{code}
> The result is:
> {code:java}
> GFR
> ENZYMES & BILIRUBIN
> {code}
> Here, the rest of the content after the first `GFR` is omitted. Based on 
> this observation, I think the `<60` and its subsequent characters are being 
> interpreted as part of a tag and are therefore ignored. Then at some point 
> `</td></tr>` is encountered, which short-circuits the execution and starts 
> processing the next line.
> This behaviour was observed using both the Tika App and the Tika Server.
> I think the expected behaviour is that all text within data tags (p, td, 
> etc.) should be treated as raw text, or at least that Tika should be 
> configurable to do so.
>  
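
For illustration only, the behaviour requested above can be sketched in plain Java. This is not Tika's or JSoup's code: `LenientTextExtractor`, `extractText` and `startsMarkup` are hypothetical names made up for this example. The sketch assumes the rule used by the HTML5 tokenizer, where a `<` starts markup only if it is followed by an ASCII letter, `/`, `!` or `?`; any other `<` is kept as literal text. It skips entity decoding and does not handle `>` inside comments or attribute values.

```java
final class LenientTextExtractor {

    // Hypothetical helper, not part of Tika: returns the text of an HTML
    // fragment, treating '<' as markup only when it could start a tag.
    static String extractText(String html) {
        StringBuilder text = new StringBuilder();
        int i = 0;
        while (i < html.length()) {
            char c = html.charAt(i);
            if (c == '<' && startsMarkup(html, i + 1)) {
                int close = html.indexOf('>', i + 1);
                if (close < 0) {
                    break; // unterminated tag: drop the remainder
                }
                text.append(' '); // a tag boundary separates words
                i = close + 1;
            } else {
                text.append(c); // includes a literal '<', as in "GFR<60"
                i++;
            }
        }
        // Collapse the whitespace left behind by the removed tags.
        return text.toString().trim().replaceAll("\\s+", " ");
    }

    // True if the character at 'pos' can follow '<' in a start tag, end tag,
    // comment/doctype, or processing instruction.
    private static boolean startsMarkup(String html, int pos) {
        if (pos >= html.length()) {
            return false;
        }
        char next = html.charAt(pos);
        boolean asciiLetter = (next >= 'a' && next <= 'z') || (next >= 'A' && next <= 'Z');
        return asciiLetter || next == '/' || next == '!' || next == '?';
    }

    public static void main(String[] args) {
        // The snippet from the issue description, joined onto one line.
        String html = "<tr ><td > GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure "
                + "</td></tr><tr ><td ></td></tr><tr ><td > ENZYMES & BILIRUBIN</td></tr>";
        System.out.println(extractText(html));
    }
}
```

With the snippet from the issue, this sketch keeps `GFR<60 = Chronic Kidney Disease, GFR<15 = Kidney Failure` instead of stopping after the first `GFR`.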



--
This message was sent by Atlassian Jira
(v8.3.2#803003)
