Hello MCF community,

I developed a transformation connector based on Jsoup. The goal of this code is simply to choose an encompassing tag in an HTML document for text extraction. Inside this tag, the connector lets you remove the subparts that you do not want: for example, all the tags corresponding to declared types or to specific attribute names.

I would like to know whether this could interest you. The code is under the Apache License, Version 2.0, and I have integrated it into our enterprise search solution (Datafari). This morning I also integrated the code into a fork of the MCF project on GitHub. It obviously still needs some work, including code refactoring, class renaming, and unit tests, which I will be able to do if you are interested in the code.

The code is here: https://github.com/otavard/manifoldcf/tree/htmlextractorconnector (commits: https://github.com/otavard/manifoldcf/commits/htmlextractorconnector)

And the documentation is here: https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector
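
To illustrate the idea of keeping only the text inside one encompassing tag while dropping unwanted subparts, here is a minimal sketch. It is not the connector's code: it uses Python's standard `html.parser` instead of Jsoup, filters on tag names only (not attributes), and assumes properly nested HTML. The names `ScopedTextExtractor` and `extract_text` are hypothetical.

```python
from html.parser import HTMLParser

# Void elements never get a closing tag, so they must not affect nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class ScopedTextExtractor(HTMLParser):
    """Collect text inside the first `root_tag` element, skipping excluded tags."""

    def __init__(self, root_tag, excluded_tags=()):
        super().__init__(convert_charrefs=True)
        self.root_tag = root_tag
        self.excluded_tags = set(excluded_tags)
        self.root_depth = 0   # > 0 while inside the encompassing tag
        self.skip_depth = 0   # > 0 while inside an excluded subtree
        self.done = False     # True once the first root element has closed
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.done or tag in VOID_TAGS:
            return
        if self.root_depth == 0:
            # Still outside the encompassing tag: only its opening matters.
            if tag == self.root_tag:
                self.root_depth = 1
            return
        self.root_depth += 1
        if self.skip_depth > 0 or tag in self.excluded_tags:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if self.done or tag in VOID_TAGS or self.root_depth == 0:
            return
        if self.skip_depth > 0:
            self.skip_depth -= 1
        self.root_depth -= 1
        if self.root_depth == 0:
            self.done = True

    def handle_data(self, data):
        # Keep text only when inside the root and outside any excluded subtree.
        if self.root_depth > 0 and self.skip_depth == 0 and data.strip():
            self.chunks.append(data.strip())


def extract_text(html, root_tag, excluded_tags=()):
    parser = ScopedTextExtractor(root_tag, excluded_tags)
    parser.feed(html)
    parser.close()
    return " ".join(parser.chunks)
```

For example, calling `extract_text(page, "div", ("script", "footer"))` would keep the text of the first `div` in `page` and ignore anything inside its `script` and `footer` descendants. A DOM-based library such as Jsoup is more tolerant of malformed markup than this depth-counting approach.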

Best regards,
Olivier TAVARD
