Sai Konuri created TIKA-3814:
--------------------------------

Summary: Extracted text from HTML file does not exclude newline chars from body
Key: TIKA-3814
URL: https://issues.apache.org/jira/browse/TIKA-3814
Project: Tika
Issue Type: Bug
Components: parser
Affects Versions: 2.3.0
Reporter: Sai Konuri
Attachments: bug.html, image-2022-07-06-19-08-30-437.png, image-2022-07-06-19-09-54-534.png

When there is a newline character ('\n') within the text of a <span>, <p>, <text>, etc., the extracted text still contains those newlines instead of excluding them. A sample HTML file is attached.

{*}Expected{*}:
!image-2022-07-06-19-08-30-437.png!

{*}Actual{*}:
!image-2022-07-06-19-09-54-534.png!

This is the code I am using to extract the text of the HTML file:

{code:java}
import java.io.InputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.sax.BodyContentHandler;

// Inside an instance method; bug.html is on the classpath.
AutoDetectParser parser = new AutoDetectParser();
BodyContentHandler handler = new BodyContentHandler();
Metadata metadata = new Metadata();
try (InputStream stream = this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
    parser.parse(stream, handler, metadata);
    // Prints the extracted body text, which still contains the embedded '\n' characters.
    System.out.println(handler);
}
{code}

--
This message was sent by Atlassian Jira
(v8.20.10#820010)
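
To make the expected behavior in the report concrete, here is a minimal sketch of collapsing whitespace in already-extracted text, the way a browser collapses a newline inside a <span> or <p> into a single space. This is an assumption about what "excluding newlines" should look like, not Tika's behavior or a fix for the parser. The class `WhitespaceCollapse` and its method `collapse` are hypothetical names and use only the Java standard library.

```java
import java.util.regex.Pattern;

public class WhitespaceCollapse {
    // Matches any run of whitespace, including '\n', '\r', and '\t'.
    private static final Pattern WHITESPACE_RUN = Pattern.compile("\\s+");

    /** Replaces each whitespace run with a single space and trims both ends. */
    public static String collapse(String extracted) {
        return WHITESPACE_RUN.matcher(extracted).replaceAll(" ").trim();
    }

    public static void main(String[] args) {
        // Stand-in for text taken from handler.toString() in the snippet above.
        String extracted = "Hello\nworld,   this is\n\ta test";
        System.out.println(collapse(extracted)); // prints "Hello world, this is a test"
    }
}
```

Applied to the output of `handler.toString()`, this would also flatten the line breaks between separate block elements, so it only suits cases where paragraph boundaries do not need to be preserved.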