[jira] [Comment Edited] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

Tim Allison (Jira) Thu, 14 Jul 2022 13:30:20 -0700


    [ 
https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566979#comment-17566979
 ]


Tim Allison edited comment on TIKA-3814 at 7/14/22 8:29 PM:
------------------------------------------------------------

I'm sorry for our team's delay.  I haven't looked at the relevant specs 
recently.  Do you happen to know if we're breaking a portion of the spec on 
HTML parsing here?

If we convert {{\r\n}} to spaces, is that the solution or do we need to delete 
them?  I'm deeply worried that the latter will cause serious problems for word 
spacing on HTML that wasn't put together well.
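
To make the difference between the two options concrete, here is a minimal sketch in plain Java. It does not touch Tika at all; the class name, method names and sample string are made up for illustration.

```java
// Illustration only: plain JDK, no Tika classes involved.
final class LineBreakDemo {

    // Option 1: replace each run of CR/LF characters with a single space.
    static String toSpaces(String text) {
        return text.replaceAll("[\\r\\n]+", " ");
    }

    // Option 2: delete CR/LF characters outright.
    static String delete(String text) {
        return text.replaceAll("[\\r\\n]+", "");
    }

    public static void main(String[] args) {
        // Hypothetical text node in which a line break is the only
        // separator between two words.
        String text = "quick\r\nbrown";
        System.out.println(toSpaces(text)); // quick brown
        System.out.println(delete(text));   // quickbrown
    }
}
```

Deleting fuses the two words whenever the line break was the only whitespace between them, which is the word-spacing problem described above.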

Do we need to do this only between {{<span/> <p/> <text/>}}, or do we need to 
do this for everything within {{<body>.*</body>}} tags?

If you're running Tika programmatically, you can easily enough subclass the 
BodyContentHandler to do what you need generally.  I worry about second-order 
effects for non-HTML files, though.
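
As a rough sketch of that kind of local workaround, assuming the goal is to turn CR and LF into spaces: the class below is hypothetical and is written against the JDK's SAX API only, so it wraps a downstream handler instead of subclassing. The same characters() override could presumably live in a BodyContentHandler subclass.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical helper, not part of Tika. XMLFilterImpl forwards every SAX
// event to the configured ContentHandler, so only characters() is overridden.
final class LineBreakToSpaceFilter extends XMLFilterImpl {

    LineBreakToSpaceFilter(ContentHandler downstream) {
        setContentHandler(downstream);
    }

    @Override
    public void characters(char[] ch, int start, int length) throws SAXException {
        // Work on a copy so the caller's buffer is left untouched.
        char[] copy = new char[length];
        for (int i = 0; i < length; i++) {
            char c = ch[start + i];
            // Each CR or LF becomes one space, so "\r\n" yields two spaces.
            copy[i] = (c == '\r' || c == '\n') ? ' ' : c;
        }
        super.characters(copy, 0, length);
    }
}
```

With Tika on the classpath, one would keep a reference to the BodyContentHandler, pass new LineBreakToSpaceFilter(thatHandler) to parser.parse(...), and read the text from the BodyContentHandler afterwards. Note that this rewrites line breaks for every file type the parser handles, not just HTML.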

I'm going to drop this back to minor because we haven't heard complaints about 
this before, and this is not a change in behavior (if I'm wrong, let me know).  
I recognize that it is painful personally for you and your customers.


was (Author: talli...@mitre.org):
I'm sorry for our team's delay.  I haven't looked at the relevant specs 
recently.  Do you happen to know if we're breaking a portion of the spec on 
HTML parsing here?

If we convert {{\r\n}} to spaces, is that the solution or do we need to delete 
them?  I'm deeply worried that the latter will cause serious problems for word 
spacing on HTML that wasn't put together well.

Do we need to do this only between {{<span/> <p/> <text/>}}, or do we need to 
do this for everything within {{<body>.*</body>}} tags?

If you're running Tika programmatically, you can easily enough subclass the 
BodyContentHandler to do what you need generally.  I worry about second order 
effects for non-html files, though.

Given that we haven't heard complaints about this before, my inclination would 
be to fix it locally via that solution.

> Extracted text from HTML file does not exclude newline chars from body
> ----------------------------------------------------------------------
>
>                 Key: TIKA-3814
>                 URL: https://issues.apache.org/jira/browse/TIKA-3814
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.0
>            Reporter: Sai Konuri
>            Priority: Critical
>         Attachments: bug.html, image-2022-07-06-19-08-30-437.png, 
> image-2022-07-06-19-09-54-534.png
>
>
> When there is a newline character ('\n') within the text of a <span>, <p>, 
> <text>, etc., the extracted text does not exclude those newlines. 
> A sample HTML file is attached.
>  
> {*}Expected{*}:
> !image-2022-07-06-19-08-30-437.png!
>  
> {*}Actual{*}: 
> !image-2022-07-06-19-09-54-534.png!
>  
>  
> This is the code I am using to extract the text of the HTML file: 
> {code:java}
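> import java.io.InputStream;
> import org.apache.tika.metadata.Metadata;
> import org.apache.tika.parser.AutoDetectParser;
> import org.apache.tika.sax.BodyContentHandler;
>
> // Parse bug.html from the classpath and print the extracted body text.
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> try (InputStream stream = this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
>     parser.parse(stream, handler, metadata);
>     System.out.println(handler);
> } {code}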



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
