[jira] [Commented] (CONNECTORS-1660) Patch for MCF HTML extractor connector

2020-12-11 Thread Olivier Tavard (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248054#comment-17248054
 ] 

Olivier Tavard commented on CONNECTORS-1660:


Here is the global patch, which includes the previous one without the log 
statement: [^patch_html_extractor_connector_11_12_2020.txt] 


> Patch for MCF HTML extractor connector
> --
>
> Key: CONNECTORS-1660
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1660
> Project: ManifoldCF
> Issue Type: Improvement
> Components: HTML extractor
> Reporter: Olivier Tavard
> Assignee: Karl Wright
> Priority: Minor
> Fix For: ManifoldCF 2.18
>
> Attachments: patch_html_extractor_connector_02_12_2020.txt, 
> patch_html_extractor_connector_11_12_2020.txt
>
>
> Hello,
> Here is a patch for the HTML extractor connector concerning text 
> extraction with or without HTML stripping: 
> [^patch_html_extractor_connector_02_12_2020.txt]
>  * Extraction of HTML code: I added a whitelist through the Jsoup cleaner to 
> define which HTML elements are allowed, in order to enforce security. In the 
> code I set it to “relaxed”:
> This whitelist allows a full range of text and structural body HTML: a, b, 
> blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, 
> h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, 
> sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul
> (more details here : 
> [https://jsoup.org/apidocs/org/jsoup/safety/Whitelist.html#relaxed()])
> A future improvement would be to add a new parameter to the 
> interface to select which whitelist to use.
>  
>  * Extraction of text with HTML stripping activated: we keep only text 
> nodes, so all HTML is stripped (same as before). The change is that the Jsoup 
> pretty print option is now set to false in order to keep line breaks.
>  
> Best regards
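
The two behaviours described in the issue can be sketched with the Jsoup cleaner as follows. This is a minimal illustration and not the code from the attached patch: the class and method names are hypothetical, and it assumes a Jsoup release that still ships `org.jsoup.safety.Whitelist` (newer releases rename the class to `Safelist`).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.safety.Whitelist; // named Safelist in newer Jsoup releases

// Hypothetical helper class for illustration; not taken from the patch.
class HtmlExtractionSketch {

    // Keep HTML, but only the elements allowed by the "relaxed" whitelist.
    static String keepAllowedHtml(String html) {
        return Jsoup.clean(html, Whitelist.relaxed());
    }

    // Strip all HTML and keep only text nodes. Pretty printing is disabled
    // so that line breaks inside the text are not normalised away.
    static String stripHtml(String html) {
        Document.OutputSettings settings =
            new Document.OutputSettings().prettyPrint(false);
        return Jsoup.clean(html, "", Whitelist.none(), settings);
    }

    public static void main(String[] args) {
        String html =
            "<p>First line\nsecond line <b>bold</b><script>alert(1)</script></p>";
        System.out.println(keepAllowedHtml(html));
        System.out.println(stripHtml(html));
    }
}
```

Note that `Jsoup.clean` returns HTML-escaped output, so characters such as `&` still appear as entities in the stripped text.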



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CONNECTORS-1660) Patch for MCF HTML extractor connector

2020-12-11 Thread Karl Wright (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248000#comment-17248000
 ] 

Karl Wright commented on CONNECTORS-1660:

Please remove the log statement, since it will dump the entire document and 
will overwhelm the logs.
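
If some diagnostic output is still wanted after the statement is removed, one option is to log a bounded summary instead of the document body. The sketch below is only an illustration: the class, the method names, and the 80-character preview limit are assumptions, and `java.util.logging` stands in for ManifoldCF's own logging facilities to keep the example self-contained.

```java
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical helper for illustration; not part of the connector.
class ExtractionLogSketch {
    private static final Logger LOG =
        Logger.getLogger(ExtractionLogSketch.class.getName());

    // Assumed upper bound on how much content may reach the log.
    private static final int PREVIEW_CHARS = 80;

    // Build a single-line summary: content size plus a short preview.
    static String summarize(String documentId, String content) {
        String preview = content.length() <= PREVIEW_CHARS
            ? content
            : content.substring(0, PREVIEW_CHARS) + "...";
        // Collapse whitespace so one document yields one log line.
        preview = preview.replaceAll("\\s+", " ");
        return "Extracted " + content.length() + " chars from "
            + documentId + ": " + preview;
    }

    // Log only when fine-grained logging is enabled.
    static void logExtraction(String documentId, String content) {
        if (LOG.isLoggable(Level.FINE)) {
            LOG.fine(summarize(documentId, content));
        }
    }

    public static void main(String[] args) {
        System.out.println(summarize("doc-1", "some extracted\ntext"));
    }
}
```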


> Patch for MCF HTML extractor connector
> --
>
> Key: CONNECTORS-1660
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1660
> Project: ManifoldCF
> Issue Type: Improvement
> Components: HTML extractor
> Reporter: Olivier Tavard
> Assignee: Karl Wright
> Priority: Minor
> Fix For: ManifoldCF 2.18
>
> Attachments: patch_html_extractor_connector_02_12_2020.txt


