[ https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Sai Konuri updated TIKA-3814:
-----------------------------
    Priority: Critical  (was: Trivial)

> Extracted text from HTML file does not exclude newline chars from body
> ----------------------------------------------------------------------
>
>                 Key: TIKA-3814
>                 URL: https://issues.apache.org/jira/browse/TIKA-3814
>             Project: Tika
>          Issue Type: Bug
>          Components: parser
>    Affects Versions: 2.3.0
>            Reporter: Sai Konuri
>            Priority: Critical
>         Attachments: bug.html, image-2022-07-06-19-08-30-437.png,
> image-2022-07-06-19-09-54-534.png
>
>
> When there is a newline character ('\n') within the text of a
> <span>, <p>, <text>, etc., the text that is extracted is not excluding those
> newlines.
> A sample html file is attached.
>
> {*}Expected{*}:
> !image-2022-07-06-19-08-30-437.png!
>
> {*}Actual{*}:
> !image-2022-07-06-19-09-54-534.png!
>
>
> This is the code I am using to extract the text of the HTML file:
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> try (InputStream stream =
>         this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
>     parser.parse(stream, handler, metadata);
>     System.out.println(handler);
> } {code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
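
A possible client-side workaround for the behavior reported above is to post-process the extracted string, for example the result of handler.toString(). The following is only a minimal sketch using the Java standard library. WhitespaceNormalizer and collapseInline are hypothetical names, not part of Tika, and the sketch assumes that the extracted text uses '\n' line endings and that a blank line separates block-level elements, which may not hold for every document.

```java
import java.util.ArrayList;
import java.util.List;

public class WhitespaceNormalizer {

    // Hypothetical helper (not a Tika API): joins lines that belong to the
    // same block with a single space and keeps one '\n' between blocks.
    public static String collapseInline(String extracted) {
        List<String> blocks = new ArrayList<>();
        // Assumption: a blank line (possibly holding spaces or tabs) marks a block boundary.
        for (String block : extracted.split("\\n[ \\t]*\\n\\s*")) {
            // Collapse every run of whitespace, including single newlines, into one space.
            String collapsed = block.replaceAll("\\s+", " ").trim();
            if (!collapsed.isEmpty()) {
                blocks.add(collapsed);
            }
        }
        return String.join("\n", blocks);
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from an HTML file whose source wraps inside a <span>.
        String extracted = "First line of a\nspan that wraps.\n\nSecond\n   paragraph.";
        System.out.println(collapseInline(extracted));
    }
}
```

This only changes the string after extraction and does not alter how the parser or BodyContentHandler treats whitespace. It also flattens content where line breaks are significant, such as <pre> blocks, so it is not a general fix.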