[jira] [Commented] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2019-05-06 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833691#comment-16833691
 ] 

Karl Wright commented on CONNECTORS-1500:
-----------------------------------------

Hi [~olivierfl], please open a new ticket for further changes to shipping 
connectors.  Thanks!



> HTML Extractor transformation connector contribution
> 
>
> Key: CONNECTORS-1500
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1500
> Project: ManifoldCF
> Issue Type: Improvement
> Affects Versions: ManifoldCF 2.9.1
> Reporter: Olivier Tavard
> Assignee: Karl Wright
> Priority: Minor
> Fix For: ManifoldCF 2.10
>
> Attachments: fix_englobing_tag_selection.txt, global_patch.txt, 
> html_extractor_transformation_connector.txt, 
> patch_HTML_extractor_connector_05_06_19.txt, 
> patch_html_extractor_08_14_18.txt, patch_html_extractor_fix_logs_08_10_18.txt
>
>
> Hi,
> I developed a transformation connector based on Jsoup. Its goal is simply to 
> select an encompassing tag in an HTML document for text extraction. Inside 
> this tag, the connector lets you remove the subparts that you do not want: 
> for example, all the tags matching declared types or specific attribute 
> names.
> The code is under the Apache License 2.0 and is attached.
> It still needs some work, including code refactoring, class renaming, and 
> unit tests, which I can do if you are interested in the code.
> The documentation is here:
> [https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector]
>  
> It does not use any libraries other than those already present in the MCF 
> project. It is based on the Jsoup library in the lib folder.
> Best regards,
> Olivier



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2019-05-06 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833628#comment-16833628
 ] 

Olivier Tavard commented on CONNECTORS-1500:


Hi,

I would like to add a new patch for the HTML extractor connector. It handles 
the case where the englobing tag chosen by the user is not present in a crawled 
page: in that case, it falls back to the body tag. The current version of the 
code does not handle this case, which can cause a null pointer exception on the 
document.

Thanks,

Olivier

[^patch_HTML_extractor_connector_05_06_19.txt]
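
The fallback described above can be illustrated with a minimal sketch. This is 
not the attached patch and it does not use Jsoup: it relies only on Python's 
standard html.parser module, it assumes well-formed HTML, and it supports only 
"tag" and "tag#id" selectors. The names RootTextCollector, select_text and 
extract_with_fallback are hypothetical and exist only for this illustration.

```python
from html.parser import HTMLParser


class RootTextCollector(HTMLParser):
    """Collects the text inside the first element matching 'tag' or 'tag#id'."""

    def __init__(self, selector):
        super().__init__()
        self.name, _, self.elem_id = selector.partition("#")
        self.depth = 0       # > 0 while we are inside the matched element
        self.found = False   # True once the selector has matched
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if self.depth:
            # Count nested elements with the same name so we know when the
            # matched element itself is closed.
            if tag == self.name:
                self.depth += 1
        elif not self.found and tag == self.name:
            if not self.elem_id or dict(attrs).get("id") == self.elem_id:
                self.found = True
                self.depth = 1

    def handle_endtag(self, tag):
        if self.depth and tag == self.name:
            self.depth -= 1

    def handle_data(self, data):
        if self.depth:
            self.chunks.append(data)


def select_text(html, selector):
    """Return the normalized text of the first match, or None if no match."""
    parser = RootTextCollector(selector)
    parser.feed(html)
    parser.close()
    if not parser.found:
        return None
    return " ".join(" ".join(parser.chunks).split())


def extract_with_fallback(html, englobing):
    """Use the englobing tag when present, otherwise fall back to <body>."""
    text = select_text(html, englobing)
    if text is None:
        # The englobing tag is absent from this page: use the body instead
        # of failing on a missing element.
        text = select_text(html, "body")
    return text or ""


page = "<html><body><p>Hello</p><div id='other'>World</div></body></html>"
print(extract_with_fallback(page, "div#content"))  # prints: Hello World
```

The point of the sketch is the explicit check for a missing match before any 
text is read from it, which is the situation the patch is said to guard 
against.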



[jira] [Commented] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2018-08-14 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579906#comment-16579906
 ] 

Olivier Tavard commented on CONNECTORS-1500:


Hello,

[~kwri...@metacarta.com] I saw that you made some modifications to the code; 
thanks for that. I made a new patch that fixes logging in the JsoupProcessing 
class so that it uses the connector logger, as you did for other parts of the 
code.

This patch replaces the previous one.

Thank you,

Olivier

[^patch_html_extractor_08_14_18.txt]



[jira] [Commented] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2018-08-10 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575855#comment-16575855
 ] 

Olivier Tavard commented on CONNECTORS-1500:


Hi,

I made a minor patch that fixes the log levels of the messages displayed by the 
connector and deletes some of them. Could you please integrate it into trunk?

Thanks,

Olivier

[^patch_html_extractor_fix_logs_08_10_18.txt]



[jira] [Commented] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2018-03-17 Thread Olivier Tavard (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403778#comment-16403778
 ] 

Olivier Tavard commented on CONNECTORS-1500:


Hello,
  
First, I have attached a patch that fixes an issue with the selection of the 
englobing tag.

To answer your question, here is a usage example.
Let's say that we want to crawl the MCF documentation page. We do not want the 
extracted text to include the menu on the left of the web page, the text in the 
h3 headers, or any of the links in the page.
To do that in MCF, we first add a Web repository connector with standard 
parameters. Then we add a job that uses this Web repository connector and the 
HTML extractor transformation connector.
The seed is: 
[https://manifoldcf.apache.org/release/release-2.9.1/en_US/end-user-documentation.html]
In the HTML extractor tab, the configuration is:
*englobing tag* : div#content
*html extractor tags to remove* : h3, a, div#menu

The transformation connector then extracts the text inside the englobing tag 
_div id="content"_. It deletes all the text contained in the _h3_ tags, the 
_a_ tags, and the _div id="menu"_ section. It also keeps all the meta tags 
from the header, which are accessible with this syntax: jsoup_meta_name.
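
The configuration above can be illustrated with a small sketch of the same 
idea: keep only the text inside the englobing element, minus the text of the 
elements listed for removal. This is not the connector's code and it does not 
use Jsoup: it relies only on Python's standard html.parser module and supports 
only "tag" and "tag#id" selectors. The sample page is invented, the names 
EnglobingTextExtractor, matches and extract_text are hypothetical, and meta tag 
handling is left out.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they are not pushed on the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


def matches(selector, tag, attrs):
    """Match a 'tag' or 'tag#id' selector (the only forms supported here)."""
    name, _, elem_id = selector.partition("#")
    return tag == name and (not elem_id or dict(attrs).get("id") == elem_id)


class EnglobingTextExtractor(HTMLParser):
    """Hypothetical helper: keeps text found inside the englobing element."""

    def __init__(self, englobing, to_remove):
        super().__init__()
        self.englobing = englobing
        self.to_remove = to_remove
        self.stack = []    # (tag, state) per open element: out, keep or skip
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        parent = self.stack[-1][1] if self.stack else "out"
        if parent == "skip":
            state = "skip"      # everything under a removed element is removed
        elif parent == "keep":
            removed = any(matches(s, tag, attrs) for s in self.to_remove)
            state = "skip" if removed else "keep"
        else:
            state = "keep" if matches(self.englobing, tag, attrs) else "out"
        self.stack.append((tag, state))

    def handle_endtag(self, tag):
        # Pop back to the nearest open element with this name, if any.
        for i in range(len(self.stack) - 1, -1, -1):
            if self.stack[i][0] == tag:
                del self.stack[i:]
                break

    def handle_data(self, data):
        if self.stack and self.stack[-1][1] == "keep":
            self.chunks.append(data)


def extract_text(html, englobing, to_remove):
    """Return the whitespace-normalized text kept by the extractor."""
    parser = EnglobingTextExtractor(englobing, to_remove)
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.chunks).split())


# Invented sample page, not the real MCF documentation page.
page = ('<html><body><div id="menu">Site menu</div>'
        '<div id="content"><h3>Title</h3>'
        '<p>Keep this <a href="#">a link</a> text.</p>'
        '<div id="menu">Inner menu</div><p>And this.</p></div>'
        '<p>Footer</p></body></html>')
print(extract_text(page, "div#content", ["h3", "a", "div#menu"]))
# prints: Keep this text. And this.
```

In this sketch the removal selectors apply only inside the englobing element, 
which mirrors the order described above: first select the englobing tag, then 
delete the unwanted parts within it.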



[jira] [Commented] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2018-03-16 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402226#comment-16402226
 ] 

Karl Wright commented on CONNECTORS-1500:
-----------------------------------------

Classes build and the connector runs, at least so far as the UI seems to work.

I'd very much like an example to test with, including corresponding 
specification information.  




[jira] [Commented] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2018-03-16 Thread Karl Wright (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16402157#comment-16402157
 ] 

Karl Wright commented on CONNECTORS-1500:
-----------------------------------------

The contribution has been committed to 
https://svn.apache.org/repos/asf/manifoldcf/branches/CONNECTORS-1500.  I'm 
working on getting classes to build now.

