[ https://issues.apache.org/jira/browse/CONNECTORS-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Karl Wright updated CONNECTORS-1660:
------------------------------------
    Fix Version/s: ManifoldCF 2.18

> Patch for MCF HTML extractor connector
> --------------------------------------
>
>                 Key: CONNECTORS-1660
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1660
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: HTML extractor
>            Reporter: Olivier Tavard
>            Assignee: Karl Wright
>            Priority: Minor
>             Fix For: ManifoldCF 2.18
>
>         Attachments: patch_html_extractor_connector_02_12_2020.txt
>
>
> Hello,
> Here is a patch for the HTML extractor connector regarding text
> extraction with or without HTML stripping:
> [^patch_html_extractor_connector_02_12_2020.txt]
> * Extraction of HTML code: I added a whitelist through the Jsoup cleaner to
> define which HTML elements are allowed, in order to reinforce security. In
> the code it is set to "relaxed":
> This whitelist allows a full range of text and structural body HTML: a, b,
> blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1,
> h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong,
> sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul
> (more details here:
> [https://jsoup.org/apidocs/org/jsoup/safety/Whitelist.html#relaxed()])
> A future improvement would be to add a new parameter to the interface to
> select which whitelist to use.
>
> * Extraction of text with HTML stripping activated: only text nodes are
> kept, so all HTML is stripped (same behavior as before). The change is that
> the Jsoup pretty-print option is now set to false in order to preserve line
> breaks.
>
> Best regards

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
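
The two extraction modes described in the issue can be illustrated with a minimal sketch that uses only the Java standard library. It is not the attached patch and it does not call Jsoup: the class `AllowlistSketch`, the methods `keepAllowed` and `stripAll`, and the regex-based tag matching are all hypothetical stand-ins, and the allowed set is only a small subset of the "relaxed" list quoted above. A regex is not a safe HTML sanitizer, so this only shows the idea of filtering by element name and of stripping tags without reflowing line breaks.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class AllowlistSketch {
    // Assumption: a small illustrative subset of the "relaxed" element list.
    static final Set<String> ALLOWED =
        new HashSet<>(Arrays.asList("a", "b", "br", "div", "em", "li", "p", "ul"));

    // Matches an opening or closing tag and captures its element name.
    static final Pattern TAG = Pattern.compile("</?([a-zA-Z][a-zA-Z0-9]*)[^>]*>");

    // Mode 1 (HTML kept): drop tags whose element name is not allowed,
    // keeping the text between them. Unlike a real cleaner, attributes
    // on allowed tags are left untouched here.
    static String keepAllowed(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String name = m.group(1).toLowerCase();
            String replacement = ALLOWED.contains(name) ? m.group() : "";
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    // Mode 2 (HTML stripped): remove every tag and leave the remaining
    // text, including its line breaks, exactly as it was.
    static String stripAll(String html) {
        return TAG.matcher(html).replaceAll("");
    }

    public static void main(String[] args) {
        String html = "<p>Hi <font color=\"red\">there</font></p>\n<p>Second line</p>";
        System.out.println(keepAllowed(html));
        System.out.println(stripAll(html));
    }
}
```

In the connector itself, both behaviors are delegated to Jsoup (the cleaner with the relaxed whitelist, and output with pretty print disabled), which also filters attributes and handles malformed markup that this sketch does not.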