[jira] [Commented] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

2022-07-14 Thread Sai Konuri (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566971#comment-17566971
 ] 

Sai Konuri commented on TIKA-3814:
--

This is affecting our customers who use our feature, so I am marking this as Critical. 

> Extracted text from HTML file does not exclude newline chars from body
> --
>
> Key: TIKA-3814
> URL: https://issues.apache.org/jira/browse/TIKA-3814
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Sai Konuri
>Priority: Trivial
> Attachments: bug.html, image-2022-07-06-19-08-30-437.png, 
> image-2022-07-06-19-09-54-534.png
>
>
> When there is a newline character ('\n') within the text of a 
> ,,, etc, the text that is extracted is not excluding those 
> newlines. 
> A sample html file is attached.
>  
> {*}Expected{*}:
> !image-2022-07-06-19-08-30-437.png!
>  
> {*}Actual{*}: 
> !image-2022-07-06-19-09-54-534.png!
>  
>  
> This is the code I am using to extract the text of the HTML file: 
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> try (InputStream stream =
>         this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
>     parser.parse(stream, handler, metadata);
>     System.out.println(handler);
> }
> {code}



--
This message was sent by Atlassian Jira
(v8.20.10#820010)


[jira] [Commented] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

2022-07-14 Thread Tim Allison (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566979#comment-17566979
 ] 

Tim Allison commented on TIKA-3814:
---

I'm sorry for our team's delay.  I haven't looked at the relevant specs 
recently.  Do you happen to know if we're breaking a portion of the spec on HTML 
parsing here?

If we convert {{\r\n}} to spaces, is that the solution or do we need to delete 
them?  I'm deeply worried that the latter will cause serious problems for word 
spacing on HTML that wasn't put together well.
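
To make the word-spacing concern concrete, here is a minimal sketch of the two policies on a plain string. The class and method names ({{LineBreakPolicy}}, {{delete}}, {{toSpace}}) are hypothetical and are not part of Tika; this only illustrates the difference and is not a proposed patch.

```java
import java.util.regex.Pattern;

/** Hypothetical helper comparing two ways of handling line breaks in extracted text. */
final class LineBreakPolicy {
    // Matches one or more consecutive CR or LF characters.
    private static final Pattern LINE_BREAKS = Pattern.compile("[\\r\\n]+");

    private LineBreakPolicy() {
    }

    /** Deletes line breaks outright; words on either side are joined. */
    static String delete(String text) {
        return LINE_BREAKS.matcher(text).replaceAll("");
    }

    /** Replaces each run of line breaks with a single space. */
    static String toSpace(String text) {
        return LINE_BREAKS.matcher(text).replaceAll(" ");
    }
}
```

For the input {{"first line\r\nsecond line"}}, {{delete}} returns {{"first linesecond line"}} while {{toSpace}} returns {{"first line second line"}}. Deleting only keeps words apart when the markup already has a space next to the break.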

Do we need to do this only between {{  }}, or do we need to 
do this for everything within {{.*

> Extracted text from HTML file does not exclude newline chars from body
> --
>
> Key: TIKA-3814
> URL: https://issues.apache.org/jira/browse/TIKA-3814
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Sai Konuri
>Priority: Critical
> Attachments: bug.html, image-2022-07-06-19-08-30-437.png, 
> image-2022-07-06-19-09-54-534.png
>
>
> When there is a newline character ('\n') within the text of a 
> ,,, etc, the text that is extracted is not excluding those 
> newlines. 
> A sample html file is attached.
>  
> {*}Expected{*}:
> !image-2022-07-06-19-08-30-437.png!
>  
> {*}Actual{*}: 
> !image-2022-07-06-19-09-54-534.png!
>  
>  
> This is the code I am using to extract the text of the HTML file: 
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> try (InputStream stream =
>         this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
>     parser.parse(stream, handler, metadata);
>     System.out.println(handler);
> }
> {code}





[jira] [Commented] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

2022-07-14 Thread Nick Burch (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566991#comment-17566991
 ] 

Nick Burch commented on TIKA-3814:
--

I have a feeling that the Text content handler might rely on these coming 
through in the character stream to nicely-ish format the text output?

I do agree that a custom content handler that tracks whether it is inside one of 
the "no breaks wanted" tags, and skips newlines in the character stream if so, 
is likely the best solution here.
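
A rough sketch of what such a handler could look like, using only the SAX classes that ship with the JDK. Everything here is an assumption for illustration: the class name {{NoBreakTextHandler}} is hypothetical, the set of "no breaks wanted" tags is supplied by the caller, matching is done on the SAX local name, and line breaks are turned into a single space rather than dropped so that words do not run together. How this would be combined with {{BodyContentHandler}}, and whether the text output formatting survives, is untested.

```java
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: collects text and collapses line breaks inside selected elements. */
final class NoBreakTextHandler extends DefaultHandler {
    private final Set<String> noBreakTags;          // caller-supplied tag names (an assumption)
    private final StringBuilder out = new StringBuilder();
    private int depth = 0;                          // how many no-break elements are currently open

    NoBreakTextHandler(Set<String> noBreakTags) {
        this.noBreakTags = noBreakTags;
    }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        if (noBreakTags.contains(localName)) {
            depth++;
        }
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        if (noBreakTags.contains(localName) && depth > 0) {
            depth--;
        }
    }

    @Override
    public void characters(char[] ch, int start, int length) {
        for (int i = start; i < start + length; i++) {
            char c = ch[i];
            if (depth > 0 && (c == '\r' || c == '\n')) {
                // Inside a no-break element: emit at most one space per run of breaks.
                if (out.length() > 0 && out.charAt(out.length() - 1) != ' ') {
                    out.append(' ');
                }
            } else {
                // Outside those elements, or an ordinary character: pass through unchanged.
                out.append(c);
            }
        }
    }

    @Override
    public String toString() {
        return out.toString();
    }
}
```

Line breaks that arrive outside the listed elements pass through untouched, so any formatting that relies on them elsewhere in the character stream is left alone.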

> Extracted text from HTML file does not exclude newline chars from body
> --
>
> Key: TIKA-3814
> URL: https://issues.apache.org/jira/browse/TIKA-3814
> Project: Tika
>  Issue Type: Bug
>  Components: parser
>Affects Versions: 2.3.0
>Reporter: Sai Konuri
>Priority: Minor
> Attachments: bug.html, image-2022-07-06-19-08-30-437.png, 
> image-2022-07-06-19-09-54-534.png
>
>
> When there is a newline character ('\n') within the text of a 
> ,,, etc, the text that is extracted is not excluding those 
> newlines. 
> A sample html file is attached.
>  
> {*}Expected{*}:
> !image-2022-07-06-19-08-30-437.png!
>  
> {*}Actual{*}: 
> !image-2022-07-06-19-09-54-534.png!
>  
>  
> This is the code I am using to extract the text of the HTML file: 
> {code:java}
> AutoDetectParser parser = new AutoDetectParser();
> BodyContentHandler handler = new BodyContentHandler();
> Metadata metadata = new Metadata();
> try (InputStream stream =
>         this.getClass().getClassLoader().getResourceAsStream("bug.html")) {
>     parser.parse(stream, handler, metadata);
>     System.out.println(handler);
> }
> {code}





[jira] [Commented] (TIKA-3814) Extracted text from HTML file does not exclude newline chars from body

2022-07-14 Thread Sai Konuri (Jira)


[ 
https://issues.apache.org/jira/browse/TIKA-3814?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17566996#comment-17566996
 ] 

Sai Konuri commented on TIKA-3814:
--

Thanks Nick and Tim! 

 

As suggested, we are left with two options: 
 # Write a custom handler that skips the line break characters when we are inside 
the text tags (etc.)
 # Replace "\r\n" with "". We tried this and it is giving us the desired 
results so far, but we need to test it further.

 



