[jira] [Comment Edited] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

2022-07-14 Thread Sai Konuri (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566996#comment-17566996
 ] 

Sai Konuri edited comment on TIKA-3814 at 7/14/22 10:12 PM:


Thanks Nick and Tim! 

 

As suggested, I agree that we are left with two options: 
 # Write a custom handler that skips the line break characters if we are inside 
the text tags (etc.)
 # Replace the "\r\n" with "" before parsing. We tried this, and it is giving 
us the desired results so far, though we still need to test it further.
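
A minimal sketch of the first option, using only the SAX classes that ship with the JDK. The class name `LineBreakSkippingHandler` and the idea of passing the set of "text" element names to the constructor are assumptions made for illustration; the thread does not say which elements should be covered or how the handler would be written.

```java
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/**
 * Hypothetical handler: collects character data and drops '\r' and '\n'
 * while inside any of the configured elements. Not part of Tika.
 */
public class LineBreakSkippingHandler extends DefaultHandler {
    private final Set<String> textElements; // caller-supplied element names (assumption)
    private final StringBuilder text = new StringBuilder();
    private int depth = 0; // how many configured elements are currently open

    public LineBreakSkippingHandler(Set<String> textElements) {
        this.textElements = textElements;
    }

    // Prefer the local name; fall back to the qualified name if it is empty.
    private static String name(String localName, String qName) {
        return (localName == null || localName.isEmpty()) ? qName : localName;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (textElements.contains(name(localName, qName))) {
            depth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (textElements.contains(name(localName, qName))) {
            depth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            // Skip line break characters only while inside a configured element.
            if (depth > 0 && (c == '\r' || c == '\n')) {
                continue;
            }
            text.append(c);
        }
    }

    @Override
    public String toString() {
        return text.toString();
    }
}
```

Since the handler is a plain SAX `ContentHandler`, it could presumably be passed to `parser.parse(stream, handler, metadata)` in place of the `BodyContentHandler`. Unlike `BodyContentHandler`, this sketch collects text from the whole document rather than only the body.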

 


was (Author: JIRAUSER292407):
Thanks Nick and Tim! 

 

As suggested, I agree that we are left with two options: 
 # Write a custom handler that skips the line break characters if we are inside 
the text tags (etc.)
 # We also tried replacing the "\r\n" with "" and that is giving us desirable 
results so far. We need to do some more testing for this however

 

> Extracted text from HTML file does not exclude newline chars from body
> --
>
> Key: TIKA-3814
> URL: https://issues.apache.org/jira/browse/TIKA-3814
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.0
> Reporter: Sai Konuri
> Priority: Minor
> Attachments: bug.html, image-2022-07-06-19-08-30-437.png, 
> image-2022-07-06-19-09-54-534.png
>
>
> When there is a newline character ('\n') within the text of a 
> ,,, etc, the text that is extracted is not excluding those 
> newlines. 
> A sample html file is attached.
>  
> {*}Expected{*}:
> !image-2022-07-06-19-08-30-437.png!
>  
> {*}Actual{*}: 
> !image-2022-07-06-19-09-54-534.png!
>  
>  
> This is the code I am using to extract the text of the HTML file: 
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> try (InputStream stream =
>         this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
>     parser.parse(stream, handler, metadata);
>     System.out.println(handler);
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Comment Edited] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

2022-07-14 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566979#comment-17566979
 ] 

Tim Allison edited comment on TIKA-3814 at 7/14/22 8:29 PM:


I'm sorry for our team's delay.  I haven't looked at the relevant specs 
recently.  Do you happen to know if we're breaking a portion of the spec on HTML 
parsing here?

If we convert {{\r\n}} to spaces, is that the solution or do we need to delete 
them?  I'm deeply worried that the latter will cause serious problems for word 
spacing on HTML that wasn't put together well.
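
To make the word-spacing concern concrete, here is a small sketch comparing the two choices on a raw HTML string before it is handed to the parser. The class name `LineBreakReplacement`, the method names, and the sample markup are made up for illustration; only the `"\r\n"` replacement itself comes from this thread.

```java
public class LineBreakReplacement {

    /** Delete each CRLF pair outright. */
    public static String deleteLineBreaks(String html) {
        return html.replace("\r\n", "");
    }

    /** Turn each CRLF pair into a single space. */
    public static String lineBreaksToSpaces(String html) {
        return html.replace("\r\n", " ");
    }

    public static void main(String[] args) {
        // Hypothetical input where the line break is the only separator between two words.
        String html = "<p>poorly\r\nformatted</p>";
        System.out.println(deleteLineBreaks(html));   // <p>poorlyformatted</p>
        System.out.println(lineBreaksToSpaces(html)); // <p>poorly formatted</p>
    }
}
```

Deleting fuses the two words whenever the line break is the only separator, while converting to a space keeps them apart. Neither variant touches a bare `\n` or `\r`.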

Do we need to do this only between {{  }}, or do we need to 
do this for everything within {{.*  }}, or do we need to 
do this for everything within {{.* 

> Extracted text from HTML file does not exclude newline chars from body
> --
>
> Key: TIKA-3814
> URL: https://issues.apache.org/jira/browse/TIKA-3814
> Project: Tika
> Issue Type: Bug
> Components: parser
> Affects Versions: 2.3.0
> Reporter: Sai Konuri
> Priority: Critical
> Attachments: bug.html, image-2022-07-06-19-08-30-437.png, 
> image-2022-07-06-19-09-54-534.png
>
>
> When there is a newline character ('\n') within the text of a 
> ,,, etc, the text that is extracted is not excluding those 
> newlines. 
> A sample html file is attached.
>  
> {*}Expected{*}:
> !image-2022-07-06-19-08-30-437.png!
>  
> {*}Actual{*}: 
> !image-2022-07-06-19-09-54-534.png!
>  
>  
> This is the code I am using to extract the text of the HTML file: 
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> try (InputStream stream =
>         this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
>     parser.parse(stream, handler, metadata);
>     System.out.println(handler);
> }
> {code}


