[ https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566979#comment-17566979 ]
Tim Allison edited comment on TIKA-3814 at 7/14/22 8:29 PM:
------------------------------------------------------------

I'm sorry for our team's delay.

I haven't looked at the relevant specs recently. Do you happen to know if we're breaking a portion of the spec on HTML parsing here?

If we convert {{\r\n}} to spaces, is that the solution or do we need to delete them? I'm deeply worried that the latter will cause serious problems for word spacing on HTML that wasn't put together well.

Do we need to do this only between {{<span/> <p/> <text/>}}, or do we need to do this for everything within {{<body>.*</body>}} tags?

If you're running Tika programmatically, you can easily enough subclass the BodyContentHandler to do what you need generally. I worry about second-order effects for non-HTML files, though.

I'm going to drop this back to minor because we haven't heard complaints about this before, and this is not a change in behavior (if I'm wrong, let me know). I recognize that it is painful personally for you and your customers.

was (Author: talli...@mitre.org):

I'm sorry for our team's delay.

I haven't looked at the relevant specs recently. Do you happen to know if we're breaking a portion of the spec on HTML parsing here?

If we convert {{\r\n}} to spaces, is that the solution or do we need to delete them? I'm deeply worried that the latter will cause serious problems for word spacing on HTML that wasn't put together well.

Do we need to do this only between {{<span/> <p/> <text/>}}, or do we need to do this for everything within {{<body>.*</body>}} tags?

If you're running Tika programmatically, you can easily enough subclass the BodyContentHandler to do what you need generally. I worry about second-order effects for non-HTML files, though.

Given that we haven't heard complaints about this before, my inclination would be to fix it locally via that solution.

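To make the "convert {{\r\n}} to spaces" option concrete, here is a minimal sketch of the kind of handler override suggested above. It is not Tika code: {{NewlineCollapsingHandler}} is a hypothetical name, and it extends the JDK's {{DefaultHandler}} so that it runs without any Tika dependency. The assumption is that the same {{characters}} override could live in a subclass of BodyContentHandler, passing the rewritten buffer on to {{super.characters}}. It replaces each run of CR/LF with a single space rather than deleting it, which avoids gluing words together.

```java
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical handler (not part of Tika): collects character data and
// replaces every run of '\r' / '\n' with a single space.
class NewlineCollapsingHandler extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();
    // Kept across calls, because a SAX parser may split "\r\n" over two chunks.
    private boolean lastWasLineBreak = false;

    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (c == '\r' || c == '\n') {
                // Emit one space for the whole run of line-break characters.
                if (!lastWasLineBreak) {
                    text.append(' ');
                }
                lastWasLineBreak = true;
            } else {
                text.append(c);
                lastWasLineBreak = false;
            }
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }
}
```

Note that this only touches {{characters}} events; anything a parser reports through {{ignorableWhitespace}} is left alone, so line breaks a parser adds between block elements would not be affected by this sketch.
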
> Extracted text from HTML file does not exclude newline chars from body
> ----------------------------------------------------------------------
>
>                 Key: TIKA-3814
>                 URL: https://issues.apache.org/jira/browse/TIKA-3814
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.0
>            Reporter: Sai Konuri
>            Priority: Critical
>         Attachments: bug.html, image-2022-07-06-19-08-30-437.png, image-2022-07-06-19-09-54-534.png
>
>
> When there is a newline character ('\n') within the text of a <span>,<p>,<text>, etc, the text that is extracted is not excluding those newlines.
> A sample html file is attached.
>
> {*}Expected{*}:
> !image-2022-07-06-19-08-30-437.png!
>
> {*}Actual{*}:
> !image-2022-07-06-19-09-54-534.png!
>
>
> This is the code I am using to extract the text of the HTML file:
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> try (InputStream stream = this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
>     parser.parse(stream, handler, metadata);
>     System.out.println(handler);
> }
> {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)