[
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16833628#comment-16833628
]
Olivier Tavard commented on CONNECTORS-1500:
--------------------------------------------
Hi,
I would like to add a new patch for the HTML extractor connector. It handles
the case where the enclosing tag chosen by the user is not present in a crawled
page: the connector falls back to the body tag in this case. The current version
of the code does not handle this case, which can cause a null pointer exception
while processing the document.
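
To illustrate the fallback described above, here is a minimal sketch of the idea only. It is not the attached patch, which is Java code built on Jsoup. The sketch uses Python's standard html.parser module, matches on a plain tag name rather than a full selector (an assumption), and the names TagTextExtractor and extract_text are hypothetical.

```python
from html.parser import HTMLParser


class TagTextExtractor(HTMLParser):
    """Collects the text inside the first occurrence of one tag name."""

    def __init__(self, tag_name):
        super().__init__()
        self.tag_name = tag_name.lower()
        self.depth = 0       # nesting depth inside the target tag
        self.found = False   # True once the target tag has been seen
        self.done = False    # True once the first occurrence has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        # Only count tags with the target name, so void tags like <br> are ignored.
        if tag == self.tag_name and not self.done:
            self.found = True
            self.depth += 1

    def handle_endtag(self, tag):
        if tag == self.tag_name and self.depth > 0:
            self.depth -= 1
            if self.depth == 0:
                self.done = True

    def handle_data(self, data):
        if self.depth > 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, enclosing_tag, fallback_tag="body"):
    """Return the text of enclosing_tag, or of fallback_tag if it is absent."""
    for tag in (enclosing_tag, fallback_tag):
        parser = TagTextExtractor(tag)
        parser.feed(html)
        parser.close()
        if parser.found:
            return " ".join(parser.chunks)
    # Neither tag exists: return an empty string instead of failing.
    return ""


page = "<html><body><p>Hello</p><article>Main text</article></body></html>"
print(extract_text(page, "article"))  # tag present: only its text is used
print(extract_text(page, "section"))  # tag absent: falls back to <body>
```

The point of the design is that a missing enclosing tag yields the body text (or an empty string if there is no body either) rather than a null reference being passed on.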
Thanks,
Olivier
[^patch_HTML_extractor_connector_05_06_19.txt]
> HTML Extractor transformation connector contribution
> ----------------------------------------------------
>
> Key: CONNECTORS-1500
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1500
> Project: ManifoldCF
> Issue Type: Improvement
> Affects Versions: ManifoldCF 2.9.1
> Reporter: Olivier Tavard
> Assignee: Karl Wright
> Priority: Minor
> Fix For: ManifoldCF 2.10
>
> Attachments: fix_englobing_tag_selection.txt, global_patch.txt,
> html_extractor_transformation_connector.txt,
> patch_HTML_extractor_connector_05_06_19.txt,
> patch_html_extractor_08_14_18.txt, patch_html_extractor_fix_logs_08_10_18.txt
>
>
> Hi,
> I developed a transformation connector based on Jsoup. The goal of this code
> is simply to choose an enclosing tag in an HTML document for text
> extraction. Inside this tag, the connector lets you remove the subparts
> that you do not want: for example, all the tags corresponding to declared
> types or to specific attribute names.
> The code is under the Apache License 2.0 and is attached.
> It needs some work, including code refactoring, renaming classes, and unit
> tests, which I will be able to do if you are interested in the code.
> The documentation is here:
> https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector
>
> It does not use any libraries other than the ones already present in the MCF
> project. It is based on the Jsoup library in the lib folder.
> Best regards,
> Olivier