[jira] [Commented] (CONNECTORS-1492) GSOC: Add support for Docker

2021-01-04 Thread Olivier Tavard (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17258211#comment-17258211
 ] 

Olivier Tavard commented on CONNECTORS-1492:


I am not quite sure what you mean, so I'll guess that it is about the 
“flavors” of MCF that could be offered.
I think we can provide multiple Docker Compose files for that: for example, 
a development environment with HSQLDB and Jetty/Tomcat, and a more robust one 
with PostgreSQL, ZooKeeper and Jetty/Tomcat.
Am I in line with your suggestion?

> GSOC: Add support for Docker
>
> Key: CONNECTORS-1492
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1492
> Project: ManifoldCF
> Issue Type: New Feature
> Reporter: Piergiorgio Lucidi
> Assignee: Piergiorgio Lucidi
> Priority: Major
> Labels: devops, docker, gsoc2018
> Original Estimate: 240h
> Remaining Estimate: 240h
>
> This is a project idea for [Google Summer of 
> Code|https://summerofcode.withgoogle.com/] (GSOC).
> To discuss this or other ideas with your potential mentor from the Apache 
> ManifoldCF project, sign up and post to the dev@manifoldcf.apache.org list, 
> including "[GSOC]" in the subject. You may also comment on this Jira issue if 
> you have created an account. 
> We would like to adopt Docker to provide ready-to-use images with a 
> preconfigured architecture stack for ManifoldCF. This will include ManifoldCF 
> itself but also the related database, which can be MySQL, PostgreSQL and so on.
> This will help developers work with a complete ManifoldCF installation and 
> put it into production.
> You will be involved in the following tasks and will learn how to:
>  * Write Dockerfiles
>  * Write Docker Compose files
>  * Implement unit tests
>  * Build all the integration tests
>  * Write the documentation for the new component
> We have complete documentation about ManifoldCF:
> [https://manifoldcf.apache.org/release/release-2.9.1/en_US/concepts.html]
> Take a look at our book to better understand the framework and how to extend 
> it in different ways:
> [https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs]
>  
> Prospective GSOC mentor: 
> [piergior...@apache.org|mailto:piergior...@apache.org]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)


[jira] [Commented] (CONNECTORS-1660) Patch for MCF HTML extractor connector

2020-12-11 Thread Olivier Tavard (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248054#comment-17248054
 ] 

Olivier Tavard commented on CONNECTORS-1660:


Here is the global patch, which includes the previous one without the log 
statement: [^patch_html_extractor_connector_11_12_2020.txt] 


> Patch for MCF HTML extractor connector
> --
>
> Key: CONNECTORS-1660
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1660
> Project: ManifoldCF
> Issue Type: Improvement
> Components: HTML extractor
> Reporter: Olivier Tavard
> Assignee: Karl Wright
> Priority: Minor
> Fix For: ManifoldCF 2.18
>
> Attachments: patch_html_extractor_connector_02_12_2020.txt, 
> patch_html_extractor_connector_11_12_2020.txt
>
>
> Hello,
> Here is a patch for the HTML extractor connector regarding text 
> extraction with or without HTML stripping: 
> [^patch_html_extractor_connector_02_12_2020.txt]
>  * Extraction of HTML code: I added a whitelist through the Jsoup cleaner to 
> define which HTML elements are allowed, in order to enforce security. In the 
> code I set it to “relaxed”:
> This whitelist allows a full range of text and structural body HTML: a, b, 
> blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, 
> h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, 
> sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul
> (more details here: 
> [https://jsoup.org/apidocs/org/jsoup/safety/Whitelist.html#relaxed()])
> A future improvement would be to add a new parameter to the 
> interface to select which whitelist to use.
>  
>  * Extraction of text with HTML stripping activated: we keep only text nodes; 
> all HTML is stripped (same as before). The change is that the Jsoup 
> pretty-print option is now set to false to preserve line breaks.
>  
> Best regards





[jira] [Updated] (CONNECTORS-1660) Patch for MCF HTML extractor connector

2020-12-11 Thread Olivier Tavard (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1660?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Tavard updated CONNECTORS-1660:
---
Attachment: patch_html_extractor_connector_11_12_2020.txt



[jira] [Updated] (CONNECTORS-1653) Solr ingester connector contribution

2020-12-11 Thread Olivier Tavard (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Tavard updated CONNECTORS-1653:
---
Attachment: patch_solr_ingester_connector_11_12_2020.txt

> Solr ingester connector contribution
>
> Key: CONNECTORS-1653
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1653
> Project: ManifoldCF
> Issue Type: New Feature
> Reporter: Olivier Tavard
> Assignee: Karl Wright
> Priority: Minor
> Fix For: ManifoldCF 2.18
>
> Attachments: patch_solr_ingester_connector_02_12_2020.txt, 
> patch_solr_ingester_connector_03_12_2020.txt, 
> patch_solr_ingester_connector_11_12_2020.txt, 
> solr_ingester_connector_patch.txt
>
>
> Hi,
> We developed a new repository connector for crawling data from Solr, and we 
> would like to contribute it to MCF by releasing the code under the Apache 
> License v2.
> The goal of this connector is to crawl Solr instances and manage the crawl in 
> MCF rather than using DIH, for instance.
> To do so, we send requests to Solr and handle the large number of 
> results using the cursorMark. The Solr fields must be stored in order to 
> be retrieved.
> By the way, we do not use any additional libraries; all the dependencies are 
> already in MCF. So far we have tested it with Solr 7 and 8.
> The documentation is here: 
> https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/673742849/Solr+ingester+crawler+connector
> The code is attached.
> Best regards,
> Olivier Tavard





[jira] [Commented] (CONNECTORS-1653) Solr ingester connector contribution

2020-12-11 Thread Olivier Tavard (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17248041#comment-17248041
 ] 

Olivier Tavard commented on CONNECTORS-1653:


Here is a new patch that includes the previous ones: 
[^patch_solr_ingester_connector_11_12_2020.txt]
I had written this declaration: 
{code:java}
private final static String defaultAuthorityDenyToken = "DEAD_AUTHORITY";
{code}
because the same declaration also appears in the code of many other connectors: 
Generic, Null, JDBC, Dropbox, RSS, Web, SharePoint, etc.
Anyway, I changed the code as you asked so that it uses the constant from the 
superclass, which is:
{code:java}
public static final String GLOBAL_DENY_TOKEN = "DEAD_AUTHORITY";
{code}
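
For readers following the thread, the change described above can be sketched roughly as follows. The class names BaseConnectorSketch and SolrIngesterSketch are hypothetical stand-ins, not the actual ManifoldCF classes; only the constant name and value come from the comment above.

```java
// Hypothetical stand-ins for the connector class hierarchy; these are not
// the actual ManifoldCF classes.
abstract class BaseConnectorSketch {
    // Mirrors the superclass constant quoted in the comment above.
    public static final String GLOBAL_DENY_TOKEN = "DEAD_AUTHORITY";
}

class SolrIngesterSketch extends BaseConnectorSketch {
    // No private copy of the token: the inherited constant is used directly,
    // so the value is defined in one place only.
    static String[] denyTokens() {
        return new String[] { GLOBAL_DENY_TOKEN };
    }
}
```

With a single definition in the superclass, a later change to the token value would not need to be repeated in each connector.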



[jira] [Commented] (CONNECTORS-1492) GSOC: Add support for Docker

2020-12-09 Thread Olivier Tavard (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17246564#comment-17246564
 ] 

Olivier Tavard commented on CONNECTORS-1492:


Hi,

I'm trying once more to revive this thread to see if there is any update on 
this topic.
I can work on the subject if nobody is currently working on it. Considering the 
popularity of Docker, we think it is quite an important topic for MCF.
Regarding Karl's last comment, I don't think there is any legal obstacle 
to providing Dockerfile/Docker Compose files for the dockerization of MCF, 
except for the JCIFS-NG library. Indeed, the Apache Software Foundation page 
[https://www.apache.org/legal/resolved.html#category-a] explicitly states 
that PostgreSQL does not pose any issue. This means that we would only have to 
give up on a fully functional Windows file connector OOTB.

Any opinions?

 



[jira] [Commented] (CONNECTORS-1653) Solr ingester connector contribution

2020-12-02 Thread Olivier Tavard (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242801#comment-17242801
 ] 

Olivier Tavard commented on CONNECTORS-1653:


Hello,

 

This patch includes a fix for the document deny token, which was not set 
to the right value.

It includes the previous patch that I sent yesterday, so you can directly 
integrate this one: [^patch_solr_ingester_connector_03_12_2020.txt]

 

Thanks

> Solr ingester connector contribution
> 
>
> Key: CONNECTORS-1653
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1653
> Project: ManifoldCF
>  Issue Type: New Feature
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.18
>
> Attachments: patch_solr_ingester_connector_02_12_2020.txt, 
> patch_solr_ingester_connector_03_12_2020.txt, 
> solr_ingester_connector_patch.txt


[jira] [Updated] (CONNECTORS-1653) Solr ingester connector contribution

2020-12-02 Thread Olivier Tavard (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Tavard updated CONNECTORS-1653:
---
Attachment: patch_solr_ingester_connector_03_12_2020.txt



[jira] [Created] (CONNECTORS-1660) Patch for MCF HTML extractor connector

2020-12-02 Thread Olivier Tavard (Jira)
Olivier Tavard created CONNECTORS-1660:
--

 Summary: Patch for MCF HTML extractor connector
 Key: CONNECTORS-1660
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1660
 Project: ManifoldCF
  Issue Type: Improvement
  Components: HTML extractor
Reporter: Olivier Tavard
 Attachments: patch_html_extractor_connector_02_12_2020.txt

Hello,

Here is a patch for the HTML extractor connector regarding text extraction 
with or without HTML stripping: 
[^patch_html_extractor_connector_02_12_2020.txt]
 * Extraction of HTML code: I added a whitelist through the Jsoup cleaner to 
define which HTML elements are allowed, in order to enforce security. In the 
code I set it to “relaxed”:

This whitelist allows a full range of text and structural body HTML: a, b, 
blockquote, br, caption, cite, code, col, colgroup, dd, div, dl, dt, em, h1, 
h2, h3, h4, h5, h6, i, img, li, ol, p, pre, q, small, span, strike, strong, 
sub, sup, table, tbody, td, tfoot, th, thead, tr, u, ul

(more details here : 
[https://jsoup.org/apidocs/org/jsoup/safety/Whitelist.html#relaxed()])

A future improvement would be to add a new parameter to the 
interface to select which whitelist to use.

 
 * Extraction of text with HTML stripping activated: we keep only text nodes; 
all HTML is stripped (same as before). The change is that the Jsoup 
pretty-print option is now set to false to preserve line breaks.
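
To illustrate the two modes described above without depending on the patch itself, here is a minimal standard-library sketch of the allow-list idea. It is not Jsoup and not the code from the attachment: the class AllowListSketch, its small ALLOWED set and the regex are assumptions made for illustration only. A regex is not a safe HTML sanitizer, and attributes on allowed tags pass through unfiltered here, which a real cleaner would not permit.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustration only: a regex is not a safe HTML sanitizer, and this is not Jsoup.
class AllowListSketch {
    // A small, hypothetical subset of the "relaxed" element list quoted above.
    static final Set<String> ALLOWED =
            new HashSet<>(Arrays.asList("a", "b", "p", "br", "em"));

    // Matches an opening or closing tag and captures the element name.
    private static final Pattern TAG =
            Pattern.compile("</?([A-Za-z][A-Za-z0-9]*)[^>]*>");

    // HTML mode: drops every tag whose element name is not allowed; text is kept.
    static String keepAllowed(String html) {
        Matcher m = TAG.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            boolean ok = ALLOWED.contains(m.group(1).toLowerCase(Locale.ROOT));
            m.appendReplacement(out, ok ? Matcher.quoteReplacement(m.group()) : "");
        }
        m.appendTail(out);
        return out.toString();
    }

    // Stripping mode: removes all tags and leaves line breaks in the text untouched.
    static String stripAll(String html) {
        return TAG.matcher(html).replaceAll("");
    }
}
```

In this sketch, keepAllowed corresponds to the first bullet (only listed elements survive) and stripAll to the second (text only, with the original line breaks preserved rather than reflowed).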

 

Best regards





[jira] [Commented] (CONNECTORS-1653) Solr ingester connector contribution

2020-12-02 Thread Olivier Tavard (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17242393#comment-17242393
 ] 

Olivier Tavard commented on CONNECTORS-1653:


Hi [~kwri...@metacarta.com] 

Here is a patch to better manage incremental indexing: 
[^patch_solr_ingester_connector_02_12_2020.txt].

Could you integrate it, please?

 

Thanks

 

> Solr ingester connector contribution
> 
>
> Key: CONNECTORS-1653
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1653
> Project: ManifoldCF
>  Issue Type: New Feature
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.18
>
> Attachments: patch_solr_ingester_connector_02_12_2020.txt, 
> solr_ingester_connector_patch.txt


[jira] [Updated] (CONNECTORS-1653) Solr ingester connector contribution

2020-12-02 Thread Olivier Tavard (Jira)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Tavard updated CONNECTORS-1653:
---
Attachment: patch_solr_ingester_connector_02_12_2020.txt



[jira] [Commented] (CONNECTORS-1653) Solr ingester connector contribution

2020-11-21 Thread Olivier Tavard (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236664#comment-17236664
 ] 

Olivier Tavard commented on CONNECTORS-1653:


I tested the trunk branch directly because the code has been incorporated into 
it, and the CONNECTORS-1653 branch still contains the bug that you fixed in the 
commit named "Add missing Jetty JSP jar so crawler UI works in the examples".

Anyway, the Solr repository connector works as expected: I indexed some 
example data from a Solr Docker container with the gettingstarted collection, 
and the indexing went fine, so I think the integration is OK.

Regarding the documentation, tell me if there is something I can do. 
If I remember correctly, the MCF website source is versioned somewhere, so tell 
me if I can propose a patch adding documentation for the Solr 
ingester connector, because it will be difficult for MCF users to use it 
without any documentation.

Thanks 

> Solr ingester connector contribution
> 
>
> Key: CONNECTORS-1653
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1653
> Project: ManifoldCF
>  Issue Type: New Feature
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.18
>
> Attachments: solr_ingester_connector_patch.txt


[jira] [Commented] (CONNECTORS-1653) Solr ingester connector contribution

2020-11-21 Thread Olivier Tavard (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17236655#comment-17236655
 ] 

Olivier Tavard commented on CONNECTORS-1653:


Hi [~kwri...@metacarta.com]

Sure thing. I will build the branch today or tomorrow and let you know.

> Solr ingester connector contribution
> 
>
> Key: CONNECTORS-1653
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1653
> Project: ManifoldCF
>  Issue Type: New Feature
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Attachments: solr_ingester_connector_patch.txt


[jira] [Commented] (CONNECTORS-1653) Solr ingester connector contribution

2020-10-15 Thread Olivier Tavard (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17214976#comment-17214976
 ] 

Olivier Tavard commented on CONNECTORS-1653:


No, it is not relevant, sorry about that. It only needs the solr-solrj*.jar 
mentioned earlier in the file.



[jira] [Commented] (CONNECTORS-1653) Solr ingester connector contribution

2020-10-14 Thread Olivier Tavard (Jira)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1653?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17213828#comment-17213828
 ] 

Olivier Tavard commented on CONNECTORS-1653:


Not sure if you had the time yet to look at my contribution, but I'm available 
if you need some explanations about the code or the documentation.


> Solr ingester connector contribution
> 
>
> Key: CONNECTORS-1653
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1653
> Project: ManifoldCF
>  Issue Type: New Feature
>Reporter: Olivier Tavard
>Priority: Minor
> Attachments: solr_ingester_connector_patch.txt


[jira] [Created] (CONNECTORS-1653) Solr ingester connector contribution

2020-09-21 Thread Olivier Tavard (Jira)
Olivier Tavard created CONNECTORS-1653:
--

 Summary: Solr ingester connector contribution
 Key: CONNECTORS-1653
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1653
 Project: ManifoldCF
  Issue Type: New Feature
Reporter: Olivier Tavard
 Attachments: solr_ingester_connector_patch.txt

Hi,

We developed a new repository connector for crawling data from Solr, and we 
would like to contribute it to MCF by releasing the code under the Apache 
License v2.

The goal of this connector is to crawl Solr instances and manage the crawl in 
MCF rather than using DIH, for instance.
To do so, we send requests to Solr and handle the large number of results 
using the cursorMark. The Solr fields must be stored in order to be 
retrieved.
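
As a rough illustration of the cursor-based paging mentioned above, here is a minimal sketch using an in-memory list as a hypothetical stand-in for a Solr index. It does not use SolrJ and is not the connector's code; the class CursorPagingSketch, its methods and the offset-encoded cursor are assumptions. The only behaviour borrowed from Solr is the convention that paging starts with the cursor "*" and ends when the returned cursor equals the one that was sent.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical in-memory stand-in for a search index; this is not SolrJ.
class CursorPagingSketch {
    static final String START = "*";   // initial cursor value

    // One page of results plus the cursor to send with the next request.
    static final class Page {
        final List<String> docs;
        final String nextCursor;

        Page(List<String> docs, String nextCursor) {
            this.docs = docs;
            this.nextCursor = nextCursor;
        }
    }

    // Simulated server side: the cursor encodes the offset of the next document.
    static Page fetch(List<String> index, String cursor, int rows) {
        int from = START.equals(cursor) ? 0 : Integer.parseInt(cursor);
        int to = Math.min(from + rows, index.size());
        List<String> docs = new ArrayList<>(index.subList(from, to));
        // When nothing is left, the same cursor comes back unchanged.
        String next = docs.isEmpty() ? cursor : String.valueOf(to);
        return new Page(docs, next);
    }

    // Client side: keep requesting pages until the cursor stops changing.
    static List<String> crawlAll(List<String> index, int rows) {
        List<String> seen = new ArrayList<>();
        String cursor = START;
        while (true) {
            Page page = fetch(index, cursor, rows);
            seen.addAll(page.docs);
            if (page.nextCursor.equals(cursor)) {
                break;
            }
            cursor = page.nextCursor;
        }
        return seen;
    }
}
```

The point of the pattern is that the client never holds more than one page in flight, which is what makes very large result sets manageable.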

By the way, we do not use any additional libraries; all the dependencies are 
already in MCF. So far we have tested it with Solr 7 and 8.

The documentation is here : 
https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/673742849/Solr+ingester+crawler+connector

The code is attached.

Best regards,

Olivier Tavard






[jira] [Created] (CONNECTORS-1614) UI bug on parameters deletion on Generic Connector

2019-07-17 Thread Olivier Tavard (JIRA)
Olivier Tavard created CONNECTORS-1614:
--

 Summary: UI bug on parameters deletion on Generic Connector
 Key: CONNECTORS-1614
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1614
 Project: ManifoldCF
  Issue Type: Bug
Affects Versions: ManifoldCF 2.13
Reporter: Olivier Tavard
 Attachments: patch_generic_connector.txt

Hi,
There is a bug in the job UI of the Generic Connector.
To reproduce: 
- Add a Generic repository connector
- Add a job related to this connector
- Click on the Parameters tab
- Add some parameters
- Click on the Delete button next to the new parameters: it does not work.
Please find the patch attached.

Best regards,

Olivier Tavard



--
This message was sent by Atlassian JIRA
(v7.6.14#76016)


[jira] [Commented] (CONNECTORS-1492) GSOC: Add support for Docker

2019-06-20 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16868387#comment-16868387
 ] 

Olivier Tavard commented on CONNECTORS-1492:


Hello,

Have you managed to move forward on this topic since last year?

Is the MCF Docker development still on the roadmap, or is it too uncertain for 
legal reasons?

Thanks,

Olivier




--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (CONNECTORS-1610) handle error 500 in WindowsShare repository connector

2019-05-28 Thread Olivier Tavard (JIRA)
Olivier Tavard created CONNECTORS-1610:
--

 Summary: handle error 500 in WindowsShare repository connector 
 Key: CONNECTORS-1610
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1610
 Project: ManifoldCF
  Issue Type: Bug
Reporter: Olivier Tavard


Hi,
 
I have a question regarding error 500 in the WindowsShare repository connector.
 
I recently noticed a problem with a particular file whose metadata contains 
non-ASCII characters. My pipeline in MCF basically consists of the embedded 
Tika, and the data is sent to Solr.
 
For this particular file (an AutoCAD file, by the way), Solr returns an 
error 500. This happens after the embedded Tika in MCF has extracted the 
content and metadata and sent them to Solr.
 
The job does not stop, and the file is sent to Solr many times; Solr responds 
with the same error each time.
The detail of the error in Solr is:
null:org.apache.commons.fileupload.FileUploadException: Header section has more than 10240 bytes (maybe it is not properly terminated)
at org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:362)
at org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:115)
 
 
In the MCF simple history, I can see that the same file is retried endlessly 
(see below) and the job is still running.
Would it be possible to change this behavior so that the file is skipped in 
this case, or at least so that the job stops after a certain number of retries?
 
PS: I sent an email to the dev mailing list twice, but the emails never 
showed up, which is why I created this issue directly.
 
Thanks,
 
Olivier
 
 
{code:java}
27/05/19 14:24:48 document ingest (DatafariSolrNoTika) 
file:/x.x.x.x/testfiler0...
.dwg
500 34 369 Error from server at http://127.0.0.1:8983/solr/FileShare: Expected 
mime type application/octet-stream but got application/json. { "error":{ 
"msg":"Header section has more than 10240 bytes (maybe it is not properly 
terminated)", "trace":"org.apache.commons.fileupload.FileUploadException: 
Header section has more than 10240 bytes (maybe it is not properly 
terminated)\n\tat 
org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:362)\n\tat
 
org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:115)\n\tat
 
org.apache.solr.servlet.SolrRequestParsers$MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:602)\n\tat
 
org.apache.solr.servlet.SolrRequestParsers$StandardRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:784)\n\tat
 org.apache.solr.servlet.So     27/05/19 14:24:47 extract [Tika] 
file:/x.x.x.x/testfiler0...
.dwg
OK 34 74 
27/05/19 14:23:45 document ingest (DatafariSolrNoTika) 
file:/x.x.x.x/testfiler0...
.dwg
500 34 393 Error from server at http://127.0.0.1:8983/solr/FileShare: Expected 
mime type application/octet-stream but got application/json. { "error":{ 
"msg":"Header section has more than 10240 bytes (maybe it is not properly 
terminated)", "trace":"org.apache.commons.fileupload.FileUploadException: 
Header section has more than 10240 bytes (maybe it is not properly 
terminated)\n\tat 
org.apache.commons.fileupload.FileUploadBase.parseRequest(FileUploadBase.java:362)\n\tat
 
org.apache.commons.fileupload.servlet.ServletFileUpload.parseRequest(ServletFileUpload.java:115)\n\tat
 
org.apache.solr.servlet.SolrRequestParsers$MultipartRequestParser.parseParamsAndFillStreams(SolrRequestParsers.java:602)\n\tat
 org.apache.solr.servlet{code}
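
The behaviour asked for above (skip the document, or give up, after a bounded number of failed attempts) could be sketched roughly as follows. This is only an illustration and not ManifoldCF code: the class `RetryBudget`, the enum `Decision` and the idea of counting failures per document identifier are all assumptions made for the example.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch, not ManifoldCF API: counts failed attempts per document
// and tells the caller to stop retrying once a fixed budget is spent.
final class RetryBudget {
    enum Decision { RETRY, SKIP_DOCUMENT }

    private final int maxAttempts;
    private final Map<String, Integer> failures = new HashMap<>();

    RetryBudget(int maxAttempts) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        this.maxAttempts = maxAttempts;
    }

    // Record one failed attempt (for example an HTTP 500 from the output side)
    // and report what the caller should do next for this document.
    Decision onFailure(String documentId) {
        int count = failures.merge(documentId, 1, Integer::sum);
        return count >= maxAttempts ? Decision.SKIP_DOCUMENT : Decision.RETRY;
    }
}
```

With a budget of 3, the first two failures for a given identifier would yield `RETRY` and the third `SKIP_DOCUMENT`; whether skipping the file or stopping the job is the right reaction is left to the caller.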
 





[jira] [Updated] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2019-05-06 Thread Olivier Tavard (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Tavard updated CONNECTORS-1500:
---
Attachment: patch_HTML_extractor_connector_05_06_19.txt

> HTML Extractor transformation connector contribution
> 
>
> Key: CONNECTORS-1500
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1500
> Project: ManifoldCF
>  Issue Type: Improvement
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.10
>
> Attachments: fix_englobing_tag_selection.txt, global_patch.txt, 
> html_extractor_transformation_connector.txt, 
> patch_HTML_extractor_connector_05_06_19.txt, 
> patch_html_extractor_08_14_18.txt, patch_html_extractor_fix_logs_08_10_18.txt
>
>
> Hi,
> I developed a transformation connector based on Jsoup. The goal of this code 
> is to simply choose an encompassing tag in a HTML document for text 
> extracting. And inside this tag, this connector allows you to remove subparts 
> that you do no want : all the tags corresponding to declared types or 
> specific attribute tag names for example.
> The code is in Apache V2 licence  and it is in attachment.
> It needs some work including code refactoring, renaming classes, unit tests 
> that I will be able to do if you are interested by the code.
> The documentation is here :
> [https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector]
>  
> It does not use additional libraries that the ones already present in MCF 
> project. It is based on Jsoup library on lib folder.
> Best regards,
> Olivier





[jira] [Commented] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2019-05-06 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16833628#comment-16833628
 ] 

Olivier Tavard commented on CONNECTORS-1500:


Hi,

I would like to add a new patch for the HTML extractor connector. It handles 
the case where the englobing tag chosen by the user is not present in a crawled 
page: in that case, it falls back to the body tag. The current version of the 
code does not handle this case, which can cause a null pointer exception on the 
document.
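
The fallback described above can be illustrated with a small, self-contained sketch. It is not the patch itself and does not use Jsoup: `EnglobingTagFallback`, `resolve` and the `tagsInPage` list (standing in for whatever the HTML parser found in the page) are hypothetical names chosen for the example.

```java
import java.util.List;

// Hypothetical sketch, not the connector's code: picks the tag to extract from,
// falling back to "body" when the configured tag is absent from the page.
final class EnglobingTagFallback {
    static final String DEFAULT_TAG = "body";

    // tagsInPage stands in for the tag names the HTML parser found in the document.
    static String resolve(String configuredTag, List<String> tagsInPage) {
        if (configuredTag == null || !tagsInPage.contains(configuredTag)) {
            return DEFAULT_TAG; // avoids working on a missing (null) element later
        }
        return configuredTag;
    }
}
```

For a page containing only `html`, `head`, `body` and `div`, a configured tag of `article` would resolve to `body`, while `div` would be kept as is.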

Thanks,

Olivier

[^patch_HTML_extractor_connector_05_06_19.txt]

> HTML Extractor transformation connector contribution
> 
>
> Key: CONNECTORS-1500
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1500
> Project: ManifoldCF
>  Issue Type: Improvement
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.10
>
> Attachments: fix_englobing_tag_selection.txt, global_patch.txt, 
> html_extractor_transformation_connector.txt, 
> patch_HTML_extractor_connector_05_06_19.txt, 
> patch_html_extractor_08_14_18.txt, patch_html_extractor_fix_logs_08_10_18.txt
>
>
> Hi,
> I developed a transformation connector based on Jsoup. The goal of this code 
> is to simply choose an encompassing tag in a HTML document for text 
> extracting. And inside this tag, this connector allows you to remove subparts 
> that you do no want : all the tags corresponding to declared types or 
> specific attribute tag names for example.
> The code is in Apache V2 licence  and it is in attachment.
> It needs some work including code refactoring, renaming classes, unit tests 
> that I will be able to do if you are interested by the code.
> The documentation is here :
> [https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector]
>  
> It does not use additional libraries that the ones already present in MCF 
> project. It is based on Jsoup library on lib folder.
> Best regards,
> Olivier





[jira] [Created] (CONNECTORS-1547) No activity record for for excluded documents in WebCrawlerConnector

2018-10-17 Thread Olivier Tavard (JIRA)
Olivier Tavard created CONNECTORS-1547:
--

 Summary: No activity record for for excluded documents in 
WebCrawlerConnector
 Key: CONNECTORS-1547
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1547
 Project: ManifoldCF
  Issue Type: Bug
  Components: Web connector
Reporter: Olivier Tavard
 Attachments: manifoldcf_local_files.log, manifoldcf_web.log, 
simple_history_files.jpg, simple_history_web.jpg

Hi,

I noticed that no activity record is logged for documents excluded by the 
Document Filter transformation connector when the WebCrawler connector is used.

To reproduce the issue on an out-of-the-box MCF:

- Null output connector
- Web repository connector
- Job: a DocumentFilter is added which only accepts application/msword 
(doc/docx) documents

The simple history does not mention the excluded documents (except for HTML 
documents). They have a fetch activity and that is all (see 
simple_history_web.jpg).
The excluded documents are only visible in the MCF log (with DEBUG verbosity 
on connectors):
{code:java}
Removing url 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png' because it had the wrong content type ('image/png'){code}
(see manifoldcf_local_files.log)

The related code is in WebcrawlerConnector.java l.904 :
{code:java}
fetchStatus.contextMessage = "it had the wrong content type ('"+contentType+"')";
fetchStatus.resultSignal = RESULT_NO_DOCUMENT;
activityResultCode = null;{code}
The activityResultCode is null.

 

 

If we configure the same job with a Local File system connector and the same 
Document Filter transformation connector, the simple history mentions all the 
excluded documents (see simple_history_files.jpg), and the code sets a 
specific error code and logs an activity record (class FileConnector l. 415):
{code:java}
if (!activities.checkMimeTypeIndexable(mimeType))
{
  errorCode = activities.EXCLUDED_MIMETYPE;
  errorDesc = "Excluded because mime type ('"+mimeType+"')";
  Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because mime type ('"+mimeType+"') was excluded by output connector.");
  activities.noDocument(documentIdentifier,versionString);
  continue;
}{code}
 

So I think the Web Crawler connector should behave the same way as the 
FileConnector and explicitly record every document excluded by the user.
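
To make the request concrete, here is a minimal sketch of a rejection path that always leaves a history entry with an explicit result code instead of a null one. It is not the WebcrawlerConnector code: the class `ExclusionHistory`, its `accept` method and the string value given to `EXCLUDED_MIMETYPE` are assumptions for illustration, loosely modelled on the FileConnector snippet above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch, not the WebcrawlerConnector code: every rejected
// document produces a history record carrying an explicit result code.
final class ExclusionHistory {
    // Assumed value; only the constant name appears in the FileConnector snippet.
    static final String EXCLUDED_MIMETYPE = "EXCLUDEDMIMETYPE";

    private final List<String> records = new ArrayList<>();

    // Returns true when the document may be indexed; otherwise records why not.
    boolean accept(String url, String contentType, List<String> allowedTypes) {
        if (allowedTypes.contains(contentType)) {
            return true;
        }
        records.add(url + " " + EXCLUDED_MIMETYPE
            + " Excluded because mime type ('" + contentType + "')");
        return false;
    }

    // Read-only view of what would appear in the history.
    List<String> records() {
        return new ArrayList<>(records);
    }
}
```

With such a path, an excluded image/png URL would show up in the history with its result code and description rather than only in the DEBUG log.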

 

Best regards,

Olivier





[jira] [Commented] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2018-08-14 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16579906#comment-16579906
 ] 

Olivier Tavard commented on CONNECTORS-1500:


Hello,

[~kwri...@metacarta.com] I saw that you made some modifications to the code: 
thanks for that. I made a new patch that fixes logging in the JsoupProcessing 
class so that it uses the connector logger, as you did for the other parts of 
the code.

This patch replaces the previous one.

Thank you,

Olivier

[^patch_html_extractor_08_14_18.txt]

> HTML Extractor transformation connector contribution
> 
>
> Key: CONNECTORS-1500
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1500
> Project: ManifoldCF
>  Issue Type: Improvement
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.10
>
> Attachments: fix_englobing_tag_selection.txt, global_patch.txt, 
> html_extractor_transformation_connector.txt, 
> patch_html_extractor_08_14_18.txt, patch_html_extractor_fix_logs_08_10_18.txt
>
>
> Hi,
> I developed a transformation connector based on Jsoup. The goal of this code 
> is to simply choose an encompassing tag in a HTML document for text 
> extracting. And inside this tag, this connector allows you to remove subparts 
> that you do no want : all the tags corresponding to declared types or 
> specific attribute tag names for example.
> The code is in Apache V2 licence  and it is in attachment.
> It needs some work including code refactoring, renaming classes, unit tests 
> that I will be able to do if you are interested by the code.
> The documentation is here :
> [https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector]
>  
> It does not use additional libraries that the ones already present in MCF 
> project. It is based on Jsoup library on lib folder.
> Best regards,
> Olivier





[jira] [Updated] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2018-08-14 Thread Olivier Tavard (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Tavard updated CONNECTORS-1500:
---
Attachment: patch_html_extractor_08_14_18.txt

> HTML Extractor transformation connector contribution
> 
>
> Key: CONNECTORS-1500
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1500
> Project: ManifoldCF
>  Issue Type: Improvement
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.10
>
> Attachments: fix_englobing_tag_selection.txt, global_patch.txt, 
> html_extractor_transformation_connector.txt, 
> patch_html_extractor_08_14_18.txt, patch_html_extractor_fix_logs_08_10_18.txt
>
>
> Hi,
> I developed a transformation connector based on Jsoup. The goal of this code 
> is to simply choose an encompassing tag in a HTML document for text 
> extracting. And inside this tag, this connector allows you to remove subparts 
> that you do no want : all the tags corresponding to declared types or 
> specific attribute tag names for example.
> The code is in Apache V2 licence  and it is in attachment.
> It needs some work including code refactoring, renaming classes, unit tests 
> that I will be able to do if you are interested by the code.
> The documentation is here :
> [https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector]
>  
> It does not use additional libraries that the ones already present in MCF 
> project. It is based on Jsoup library on lib folder.
> Best regards,
> Olivier





[jira] [Commented] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2018-08-10 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575855#comment-16575855
 ] 

Olivier Tavard commented on CONNECTORS-1500:


Hi,

I made a minor patch that fixes the log levels of the messages emitted by the 
connector and removes some of them. Could you please integrate it into trunk?

Thanks,

Olivier

[^patch_html_extractor_fix_logs_08_10_18.txt]

> HTML Extractor transformation connector contribution
> 
>
> Key: CONNECTORS-1500
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1500
> Project: ManifoldCF
>  Issue Type: Improvement
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.10
>
> Attachments: fix_englobing_tag_selection.txt, global_patch.txt, 
> html_extractor_transformation_connector.txt, 
> patch_html_extractor_fix_logs_08_10_18.txt
>
>
> Hi,
> I developed a transformation connector based on Jsoup. The goal of this code 
> is to simply choose an encompassing tag in a HTML document for text 
> extracting. And inside this tag, this connector allows you to remove subparts 
> that you do no want : all the tags corresponding to declared types or 
> specific attribute tag names for example.
> The code is in Apache V2 licence  and it is in attachment.
> It needs some work including code refactoring, renaming classes, unit tests 
> that I will be able to do if you are interested by the code.
> The documentation is here :
> [https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector]
>  
> It does not use additional libraries that the ones already present in MCF 
> project. It is based on Jsoup library on lib folder.
> Best regards,
> Olivier





[jira] [Updated] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2018-08-10 Thread Olivier Tavard (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Tavard updated CONNECTORS-1500:
---
Attachment: patch_html_extractor_fix_logs_08_10_18.txt

> HTML Extractor transformation connector contribution
> 
>
> Key: CONNECTORS-1500
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1500
> Project: ManifoldCF
>  Issue Type: Improvement
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.10
>
> Attachments: fix_englobing_tag_selection.txt, global_patch.txt, 
> html_extractor_transformation_connector.txt, 
> patch_html_extractor_fix_logs_08_10_18.txt
>
>
> Hi,
> I developed a transformation connector based on Jsoup. The goal of this code 
> is to simply choose an encompassing tag in a HTML document for text 
> extracting. And inside this tag, this connector allows you to remove subparts 
> that you do no want : all the tags corresponding to declared types or 
> specific attribute tag names for example.
> The code is in Apache V2 licence  and it is in attachment.
> It needs some work including code refactoring, renaming classes, unit tests 
> that I will be able to do if you are interested by the code.
> The documentation is here :
> [https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector]
>  
> It does not use additional libraries that the ones already present in MCF 
> project. It is based on Jsoup library on lib folder.
> Best regards,
> Olivier





[jira] [Commented] (CONNECTORS-1523) HTML Extractor transformation connector - "No englobing tag specified"

2018-08-10 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575850#comment-16575850
 ] 

Olivier Tavard commented on CONNECTORS-1523:


Hello,

In fact the connector does two jobs: it extracts the part of the HTML document 
that you want, using the englobing tag and the filters to remove, and it also 
extracts the metadata from the "meta" tags and from some other tags such as the 
title (the complete list is in the JsoupProcessing class).

For the englobing tag, it only picks the first one, as the 
sp.includeFilters.get(0) argument shows on line 153 of the HtmlExtractor class:
metadataExtracted = JsoupProcessing.extractTextAndMetadataHtmlDocument(document.getBinaryStream(), sp.includeFilters.get(0), sp.excludeFilters, sp.striphtml);
 
 

> HTML Extractor transformation connector - "No englobing tag specified"
> --
>
> Key: CONNECTORS-1523
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1523
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.10
>Reporter: Steph van Schalkwyk
>Priority: Major
>
> When adding Englobing tag to HTML Extractor transformation, Englobing tag is 
> not persisted. 
> Can add on config screen in job edit, but value is not persisted.
> Results in "No englobing tag specified".





[jira] [Commented] (CONNECTORS-1523) HTML Extractor transformation connector - "No englobing tag specified"

2018-08-10 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575830#comment-16575830
 ] 

Olivier Tavard commented on CONNECTORS-1523:


Hello,

OK thanks for the tests.

For your question, Steph: be aware that you can have only one englobing tag. In 
the UI you can select multiple tags in the 'englobing tag' menu, but only 
the first one is taken into account (body by default). The code will need a fix 
so that the UI reflects this.

In the "tags to remove" section, on the other hand, you can have multiple tags, 
and all of them are taken into account.

The goal of the code is to choose one englobing tag, i.e. the part that you want 
to index; you can then add multiple filters to keep only the part of this 
englobing section that interests you.
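
The rule described above (one englobing tag, any number of tags to remove) can be summarised in a short sketch. It is not the connector's own specification class: `ExtractionSpec`, `englobingTag` and `tagsToRemove` are hypothetical names, and only the "first include filter wins, body by default" behaviour is taken from the explanation.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical sketch: one englobing tag (the first configured one) and
// many removal filters (all of them are honoured).
final class ExtractionSpec {
    private final List<String> includeFilters;
    private final List<String> excludeFilters;

    ExtractionSpec(List<String> includeFilters, List<String> excludeFilters) {
        this.includeFilters = new ArrayList<>(includeFilters);
        this.excludeFilters = new ArrayList<>(excludeFilters);
    }

    // Only the first include filter is used; "body" is the default.
    String englobingTag() {
        return includeFilters.isEmpty() ? "body" : includeFilters.get(0);
    }

    // Every exclude filter is applied inside the englobing section.
    List<String> tagsToRemove() {
        return Collections.unmodifiableList(excludeFilters);
    }
}
```

A specification built with include filters `article, div` and exclude filters `nav, footer` would therefore extract from `article` only, with both `nav` and `footer` removed.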

 

> HTML Extractor transformation connector - "No englobing tag specified"
> --
>
> Key: CONNECTORS-1523
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1523
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.10
>Reporter: Steph van Schalkwyk
>Priority: Major
>
> When adding Englobing tag to HTML Extractor transformation, Englobing tag is 
> not persisted. 
> Can add on config screen in job edit, but value is not persisted.
> Results in "No englobing tag specified".





[jira] [Commented] (CONNECTORS-1523) HTML Extractor transformation connector - "No englobing tag specified"

2018-08-09 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575098#comment-16575098
 ] 

Olivier Tavard commented on CONNECTORS-1523:


No, it was not available for 2.10; it was just the first version of the code.

If you want more documentation about it, you can go here: 
[https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector]

Could you give me more details about your question? Do you mean that you want to 
select more than one englobing tag?

 

> HTML Extractor transformation connector - "No englobing tag specified"
> --
>
> Key: CONNECTORS-1523
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1523
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.10
>Reporter: Steph van Schalkwyk
>Priority: Major
>
> When adding Englobing tag to HTML Extractor transformation, Englobing tag is 
> not persisted. 
> Can add on config screen in job edit, but value is not persisted.
> Results in "No englobing tag specified".





[jira] [Commented] (CONNECTORS-1523) HTML Extractor transformation connector - "No englobing tag specified"

2018-08-09 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575082#comment-16575082
 ] 

Olivier Tavard commented on CONNECTORS-1523:


Hello,

I checked, and Karl has already included the patch (r1831269), so I hope that 
the code will be included in the next release of MCF.

 

 

> HTML Extractor transformation connector - "No englobing tag specified"
> --
>
> Key: CONNECTORS-1523
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1523
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.10
>Reporter: Steph van Schalkwyk
>Priority: Major
>
> When adding Englobing tag to HTML Extractor transformation, Englobing tag is 
> not persisted. 
> Can add on config screen in job edit, but value is not persisted.
> Results in "No englobing tag specified".





[jira] [Commented] (CONNECTORS-1523) HTML Extractor transformation connector - "No englobing tag specified"

2018-08-09 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1523?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16575050#comment-16575050
 ] 

Olivier Tavard commented on CONNECTORS-1523:


Hello,

I submitted a patch for that.
See https://issues.apache.org/jira/browse/CONNECTORS-1500.
I do not know whether the patch has been added to the MCF code.


> HTML Extractor transformation connector - "No englobing tag specified"
> --
>
> Key: CONNECTORS-1523
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1523
> Project: ManifoldCF
>  Issue Type: Bug
>Affects Versions: ManifoldCF 2.10
>Reporter: Steph van Schalkwyk
>Priority: Major
>
> When adding Englobing tag to HTML Extractor transformation, Englobing tag is 
> not persisted. 
> Can add on config screen in job edit, but value is not persisted.
> Results in "No englobing tag specified".





[jira] [Commented] (CONNECTORS-1492) GSOC: Add support for Docker

2018-08-07 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16571295#comment-16571295
 ] 

Olivier Tavard commented on CONNECTORS-1492:


Hello,

Is this task on the MCF roadmap yet? I mean, do you already have a release date 
planned for it?

Anyway, if I can help you with this, please tell me.

Thanks,

Best regards,

Olivier

> GSOC: Add support for Docker
> 
>
> Key: CONNECTORS-1492
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1492
> Project: ManifoldCF
>  Issue Type: New Feature
>Reporter: Piergiorgio Lucidi
>Assignee: Piergiorgio Lucidi
>Priority: Major
>  Labels: devops, docker, gsoc2018
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> This is a project idea for [Google Summer of 
> Code|https://summerofcode.withgoogle.com/] (GSOC).
> To discuss this or other ideas with your potential mentor from the Apache 
> ManifoldCF project, sign up and post to the dev@manifoldcf.apache.org list, 
> including "[GSOC]" in the subject. You may also comment on this Jira issue if 
> you have created an account. 
> We would like to adopt Docker to provide ready to use images with 
> preconfigured architecture stack for ManifoldCF. This will include ManifoldCF 
> itself but also the related database that can be MySQL, PostgreSQL and so on.
> This will help developers to work and put in production a complete ManifoldCF 
> installation.
> You will be involved in the development of the following tasks, you will 
> learn how to:
>  * Write Docker files
>  * Write Docker Compose files
>  * Implement unit tests
>  * Build all the integration tests
>  * Write the documentation for new component
> We have a complete documentation about ManifioldCF:
> [https://manifoldcf.apache.org/release/release-2.9.1/en_US/concepts.html]
> Take a look at our book to understand better the framework and how to extend 
> it in different ways:
> [https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs]
>  
> Prospective GSOC mentor: 
> [piergior...@apache.org|mailto:piergior...@apache.org]





[jira] [Commented] (CONNECTORS-1492) GSOC: Add support for Docker

2018-06-13 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16511154#comment-16511154
 ] 

Olivier Tavard commented on CONNECTORS-1492:


Hello [~piergiorgioluc...@gmail.com],

Thanks for the status update. Too bad that there was no proposal for this topic 
:(

 

> GSOC: Add support for Docker
> 
>
> Key: CONNECTORS-1492
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1492
> Project: ManifoldCF
>  Issue Type: New Feature
>Reporter: Piergiorgio Lucidi
>Assignee: Piergiorgio Lucidi
>Priority: Major
>  Labels: devops, docker, gsoc2018
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> This is a project idea for [Google Summer of 
> Code|https://summerofcode.withgoogle.com/] (GSOC).
> To discuss this or other ideas with your potential mentor from the Apache 
> ManifoldCF project, sign up and post to the dev@manifoldcf.apache.org list, 
> including "[GSOC]" in the subject. You may also comment on this Jira issue if 
> you have created an account. 
> We would like to adopt Docker to provide ready to use images with 
> preconfigured architecture stack for ManifoldCF. This will include ManifoldCF 
> itself but also the related database that can be MySQL, PostgreSQL and so on.
> This will help developers to work and put in production a complete ManifoldCF 
> installation.
> You will be involved in the development of the following tasks, you will 
> learn how to:
>  * Write Docker files
>  * Write Docker Compose files
>  * Implement unit tests
>  * Build all the integration tests
>  * Write the documentation for new component
> We have a complete documentation about ManifioldCF:
> [https://manifoldcf.apache.org/release/release-2.9.1/en_US/concepts.html]
> Take a look at our book to understand better the framework and how to extend 
> it in different ways:
> [https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs]
>  
> Prospective GSOC mentor: 
> [piergior...@apache.org|mailto:piergior...@apache.org]





[jira] [Commented] (CONNECTORS-1492) GSOC: Add support for Docker

2018-06-06 Thread Olivier Tavard (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1492?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16502934#comment-16502934
 ] 

Olivier Tavard commented on CONNECTORS-1492:


Hi,

Is it possible to have an update on this issue, please? I mean, has this project 
finally been accepted by GSOC?

Thanks !

> GSOC: Add support for Docker
> 
>
> Key: CONNECTORS-1492
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1492
> Project: ManifoldCF
>  Issue Type: New Feature
>Reporter: Piergiorgio Lucidi
>Assignee: Piergiorgio Lucidi
>Priority: Major
>  Labels: devops, docker, gsoc2018
>   Original Estimate: 240h
>  Remaining Estimate: 240h
>
> This is a project idea for [Google Summer of 
> Code|https://summerofcode.withgoogle.com/] (GSOC).
> To discuss this or other ideas with your potential mentor from the Apache 
> ManifoldCF project, sign up and post to the dev@manifoldcf.apache.org list, 
> including "[GSOC]" in the subject. You may also comment on this Jira issue if 
> you have created an account. 
> We would like to adopt Docker to provide ready-to-use images with a 
> preconfigured architecture stack for ManifoldCF. This will include ManifoldCF 
> itself but also the related database, which can be MySQL, PostgreSQL and so on.
> This will help developers work on and put into production a complete ManifoldCF 
> installation.
> You will be involved in the development of the following tasks, and you will 
> learn how to:
>  * Write Docker files
>  * Write Docker Compose files
>  * Implement unit tests
>  * Build all the integration tests
>  * Write the documentation for the new component
> We have complete documentation about ManifoldCF:
> [https://manifoldcf.apache.org/release/release-2.9.1/en_US/concepts.html]
> Take a look at our book to better understand the framework and how to extend 
> it in different ways:
> [https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs]
>  
> Prospective GSOC mentor: 
> [piergior...@apache.org|mailto:piergior...@apache.org]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2018-05-09 Thread Olivier Tavard (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Tavard updated CONNECTORS-1500:
---
Attachment: global_patch.txt

> HTML Extractor transformation connector contribution
> 
>
> Key: CONNECTORS-1500
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1500
> Project: ManifoldCF
>  Issue Type: Improvement
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Fix For: ManifoldCF 2.10
>
> Attachments: fix_englobing_tag_selection.txt, global_patch.txt, 
> html_extractor_transformation_connector.txt
>
>
> Hi,
> I developed a transformation connector based on Jsoup. The goal of this code 
> is simply to choose an encompassing tag in an HTML document for text 
> extraction. Inside this tag, the connector lets you remove subparts 
> that you do not want: all the tags corresponding to declared types or 
> specific attribute tag names, for example.
> The code is under the Apache License 2.0 and is attached.
> It needs some work, including code refactoring, renaming classes and unit tests, 
> which I will be able to do if you are interested in the code.
> The documentation is here:
> [https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector]
>  
> It does not use any libraries other than the ones already present in the MCF 
> project. It is based on the Jsoup library in the lib folder.
> Best regards,
> Olivier



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2018-03-17 Thread Olivier Tavard (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16403778#comment-16403778
 ] 

Olivier Tavard commented on CONNECTORS-1500:


Hello,
  
 First, I have attached a patch that fixes an issue with the selection of the 
englobing tag.
  
 To answer your question, let me give you an example of use:
 Let's say that we want to crawl the documentation page of MCF. We do not want 
the extracted text to include the menu on the left of the web page, the text in 
the h3 headers, or any of the links in the page.
 To do that in MCF, we first add a Web repository connector 
with standard parameters. Then we add a job using this Web repository connector 
and the HTML extractor transformation connector.
 The seed is: 
[https://manifoldcf.apache.org/release/release-2.9.1/en_US/end-user-documentation.html]
 In the HTML extractor tab, the configuration will be:
*englobing tag* : div#content
*html extractor tags to remove* : h3, a, div#menu

So the transformation connector will extract the text in the englobing tag 
_div id="content"_. Then it will delete all the text included in the _h3_ tags, 
the _a_ tags and the _div id="menu"_ section. It also keeps all the 
meta tags from the header, which will be accessible with this syntax: 
jsoup_meta_name.
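
To make the behaviour above concrete, here is a minimal sketch of the same select-then-prune logic. It is not the attached connector code: it uses the JDK's XML parser as a stand-in for Jsoup, so it only handles well-formed markup, and the class name, method names and the simplified "tag" / "tag#id" selector matching are all assumptions made for illustration. It also leaves out the meta tag handling.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

public class HtmlExtractorSketch {

    // Hypothetical, simplified selector: either "tag" or "tag#id".
    static boolean matches(Element e, String selector) {
        String[] parts = selector.trim().split("#", 2);
        if (!e.getTagName().equalsIgnoreCase(parts[0])) return false;
        return parts.length == 1 || parts[1].equals(e.getAttribute("id"));
    }

    // Depth-first search for the first element matching the selector.
    static Element findFirst(Element root, String selector) {
        if (matches(root, selector)) return root;
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n instanceof Element) {
                Element hit = findFirst((Element) n, selector);
                if (hit != null) return hit;
            }
        }
        return null;
    }

    // Collects descendants of root that match any selector, without
    // descending into an element that is already marked for removal.
    static void collect(Element root, List<String> selectors, List<Element> out) {
        NodeList children = root.getChildNodes();
        for (int i = 0; i < children.getLength(); i++) {
            Node n = children.item(i);
            if (n instanceof Element) {
                Element e = (Element) n;
                if (selectors.stream().anyMatch(s -> matches(e, s))) out.add(e);
                else collect(e, selectors, out);
            }
        }
    }

    // Keeps only the text inside the englobing tag, minus the removed tags.
    public static String extract(String xhtml, String englobingTag,
                                 List<String> tagsToRemove) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        Element scope = findFirst(doc.getDocumentElement(), englobingTag);
        if (scope == null) return "";
        List<Element> toRemove = new ArrayList<>();
        collect(scope, tagsToRemove, toRemove);
        for (Element e : toRemove) e.getParentNode().removeChild(e);
        return scope.getTextContent().replaceAll("\\s+", " ").trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id=\"menu\">Site menu</div>"
                + "<div id=\"content\"><h3>Heading</h3>"
                + "<p>Body text <a href=\"#\">link</a> here.</p>"
                + "<div id=\"menu\">Inner menu</div></div></body></html>";
        // Same settings as the example configuration above.
        System.out.println(extract(page, "div#content",
                Arrays.asList("h3", "a", "div#menu")));
    }
}
```

With the sample page in main, only the paragraph text inside div#content survives; the header, the link text and both menu sections are dropped.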

> HTML Extractor transformation connector contribution
> 
>
> Key: CONNECTORS-1500
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1500
> Project: ManifoldCF
>  Issue Type: Improvement
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Attachments: fix_englobing_tag_selection.txt, 
> html_extractor_transformation_connector.txt
>
>
> Hi,
> I developed a transformation connector based on Jsoup. The goal of this code 
> is simply to choose an encompassing tag in an HTML document for text 
> extraction. Inside this tag, the connector lets you remove subparts 
> that you do not want: all the tags corresponding to declared types or 
> specific attribute tag names, for example.
> The code is under the Apache License 2.0 and is attached.
> It needs some work, including code refactoring, renaming classes and unit tests, 
> which I will be able to do if you are interested in the code.
> The documentation is here:
> [https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector]
>  
> It does not use any libraries other than the ones already present in the MCF 
> project. It is based on the Jsoup library in the lib folder.
> Best regards,
> Olivier



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2018-03-17 Thread Olivier Tavard (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1500?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Tavard updated CONNECTORS-1500:
---
Attachment: fix_englobing_tag_selection.txt

> HTML Extractor transformation connector contribution
> 
>
> Key: CONNECTORS-1500
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1500
> Project: ManifoldCF
>  Issue Type: Improvement
>Affects Versions: ManifoldCF 2.9.1
>Reporter: Olivier Tavard
>Assignee: Karl Wright
>Priority: Minor
> Attachments: fix_englobing_tag_selection.txt, 
> html_extractor_transformation_connector.txt
>
>
> Hi,
> I developed a transformation connector based on Jsoup. The goal of this code 
> is simply to choose an encompassing tag in an HTML document for text 
> extraction. Inside this tag, the connector lets you remove subparts 
> that you do not want: all the tags corresponding to declared types or 
> specific attribute tag names, for example.
> The code is under the Apache License 2.0 and is attached.
> It needs some work, including code refactoring, renaming classes and unit tests, 
> which I will be able to do if you are interested in the code.
> The documentation is here:
> [https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector]
>  
> It does not use any libraries other than the ones already present in the MCF 
> project. It is based on the Jsoup library in the lib folder.
> Best regards,
> Olivier



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Created] (CONNECTORS-1500) HTML Extractor transformation connector contribution

2018-03-15 Thread Olivier Tavard (JIRA)
Olivier Tavard created CONNECTORS-1500:
--

 Summary: HTML Extractor transformation connector contribution
 Key: CONNECTORS-1500
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1500
 Project: ManifoldCF
  Issue Type: Improvement
Affects Versions: ManifoldCF 2.9.1
Reporter: Olivier Tavard
 Attachments: html_extractor_transformation_connector.txt

Hi,

I developed a transformation connector based on Jsoup. The goal of this code is 
simply to choose an encompassing tag in an HTML document for text extraction. 
Inside this tag, the connector lets you remove subparts that you do 
not want: all the tags corresponding to declared types or specific attribute 
tag names, for example.
The code is under the Apache License 2.0 and is attached.

It needs some work, including code refactoring, renaming classes and unit tests, 
which I will be able to do if you are interested in the code.
The documentation is here:

[https://datafari.atlassian.net/wiki/spaces/DATAFARI/pages/237240321/HTML+Extractor+Transformation+connector]

It does not use any libraries other than the ones already present in the MCF 
project. It is based on the Jsoup library in the lib folder.

Best regards,

Olivier



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1337) Version of Tika dependencies

2016-09-05 Thread Olivier Tavard (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1337?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Tavard updated CONNECTORS-1337:
---
Attachment: CONNECTORS-1337.patch

> Version of Tika dependencies
> 
>
> Key: CONNECTORS-1337
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1337
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Build
>Affects Versions: ManifoldCF 2.5
>Reporter: Olivier Tavard
> Attachments: CONNECTORS-1337.patch
>
>
> Hello,
> It seems that some dependencies of Tika 1.13 are not at the recommended 
> version in MCF.
> I mean the Apache POI libs in particular. I checked the Maven repository: 
> http://mvnrepository.com/artifact/org.apache.tika/tika-parsers/1.13
> The POI dependencies are at version 3.15-beta1: poi, poi-ooxml and 
> poi-scratchpad. In MCF 2.5, the POI libs are at version 3.13.
> I noticed this because I had the same issue as the one mentioned here with 
> MCF 2.4: https://issues.apache.org/jira/browse/CONNECTORS-1311
> After upgrading to MCF 2.5 I still had the same issue. After updating the POI 
> libs, I do not have the issue anymore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CONNECTORS-1337) Version of Tika dependencies

2016-09-05 Thread Olivier Tavard (JIRA)
Olivier Tavard created CONNECTORS-1337:
--

 Summary: Version of Tika dependencies
 Key: CONNECTORS-1337
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1337
 Project: ManifoldCF
  Issue Type: Bug
  Components: Build
Affects Versions: ManifoldCF 2.5
Reporter: Olivier Tavard


Hello,

It seems that some dependencies of Tika 1.13 are not at the recommended version 
in MCF.
I mean the Apache POI libs in particular. I checked the Maven repository: 
http://mvnrepository.com/artifact/org.apache.tika/tika-parsers/1.13
The POI dependencies are at version 3.15-beta1: poi, poi-ooxml and poi-scratchpad. 
In MCF 2.5, the POI libs are at version 3.13.
I noticed this because I had the same issue as the one mentioned here with MCF 2.4: 
https://issues.apache.org/jira/browse/CONNECTORS-1311
After upgrading to MCF 2.5 I still had the same issue. After updating the POI 
libs, I do not have the issue anymore.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Updated] (CONNECTORS-1322) Metadata "FROM" not present in Email connector

2016-06-10 Thread Olivier Tavard (JIRA)

 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Olivier Tavard updated CONNECTORS-1322:
---
Attachment: CONNECTORS-1322.patch

> Metadata "FROM" not present in Email connector
> --
>
> Key: CONNECTORS-1322
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1322
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Email connector
>Affects Versions: ManifoldCF 2.4
>Reporter: Olivier Tavard
>Priority: Minor
> Attachments: CONNECTORS-1322.patch
>
>
> The "FROM" metadata is missing from the metadata that the email connector 
> extracts from the mails, unlike "TO", "SUBJECT", etc.
> The bug is caused by an incorrect variable name in the EmailConnector class. 
> I fixed it in the code; please see the attached patch.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CONNECTORS-1322) Metadata "FROM" not present in Email connector

2016-06-10 Thread Olivier Tavard (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1322?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15324328#comment-15324328
 ] 

Olivier Tavard commented on CONNECTORS-1322:


Sorry, I could not find how to upload the patch; here is the link: 
http://www.francelabs.com/files/CONNECTORS-1322.patch

> Metadata "FROM" not present in Email connector
> --
>
> Key: CONNECTORS-1322
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1322
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Email connector
>Affects Versions: ManifoldCF 2.4
>Reporter: Olivier Tavard
>Priority: Minor
>
> The "FROM" metadata is missing from the metadata that the email connector 
> extracts from the mails, unlike "TO", "SUBJECT", etc.
> The bug is caused by an incorrect variable name in the EmailConnector class. 
> I fixed it in the code; please see the attached patch.
> Thanks



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Created] (CONNECTORS-1322) Metadata "FROM" not present in Email connector

2016-06-10 Thread Olivier Tavard (JIRA)
Olivier Tavard created CONNECTORS-1322:
--

 Summary: Metadata "FROM" not present in Email connector
 Key: CONNECTORS-1322
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1322
 Project: ManifoldCF
  Issue Type: Bug
  Components: Email connector
Affects Versions: ManifoldCF 2.4
Reporter: Olivier Tavard
Priority: Minor


The "FROM" metadata is missing from the metadata that the email connector 
extracts from the mails, unlike "TO", "SUBJECT", etc.
The bug is caused by an incorrect variable name in the EmailConnector class. 
I fixed it in the code; please see the attached patch.
Thanks




--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


[jira] [Commented] (CONNECTORS-1281) Bug for crawling documents from Microsoft SQL database

2016-02-24 Thread Olivier Tavard (JIRA)

[ 
https://issues.apache.org/jira/browse/CONNECTORS-1281?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15165646#comment-15165646
 ] 

Olivier Tavard commented on CONNECTORS-1281:


Hi Karl,

Thank you very much for your quick answer and for your patch. I applied it and 
all is fine now.


> Bug for crawling documents from Microsoft SQL database 
> ---
>
> Key: CONNECTORS-1281
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1281
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: JDBC connector
>Affects Versions: ManifoldCF 2.2, ManifoldCF 2.3
>Reporter: Olivier Tavard
>Assignee: Karl Wright
> Fix For: ManifoldCF 2.4
>
> Attachments: CONNECTORS-1281.patch
>
>
> Hi,
> I have an issue crawling documents from a Microsoft SQL database.
> The jtds-1.2.4.jar is present in my MCF installation (MCF built from 
> sources, with jtds.jar added to lib-proprietary by the ant script).
> After crawling about 200 documents, the job stops doing anything and the 
> following error is logged for each document crawled:
> {code:xml}
> FATAL 2016-02-24 18:27:47,078 (Startup thread) - Error tossed: null
> java.lang.AbstractMethodError
>     at net.sourceforge.jtds.jdbc.ConnectionJDBC2.isValid(ConnectionJDBC2.java:2589)
>     at org.apache.manifoldcf.core.jdbcpool.ConnectionPool.getConnection(ConnectionPool.java:92)
>     at org.apache.manifoldcf.jdbc.JDBCConnectionFactory.getConnection(JDBCConnectionFactory.java:132)
>     at org.apache.manifoldcf.jdbc.JDBCConnection$PreparedStatementQueryThread.run(JDBCConnection.java:1389)
> {code}
> It seems that there is an issue with the validation of the connection between 
> MCF and jTDS: 
> http://stackoverflow.com/questions/26404283/java-hibernate-with-sql-server-2012-not-working
> What is the best way to fix it? Do we need to implement our own isValid() 
> method and run a simple select in it?
> By the way, we do not have this issue with MCF 2.0.1.
> Thanks
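
For readers wondering what the "own isValid() with a simple select" idea from the description could look like, here is a rough sketch. It is not the CONNECTORS-1281.patch attached to the issue, whose contents are not shown in this thread. It assumes that a trivial query such as SELECT 1 is acceptable to the database, and the class name, helper names and the proxy-based stand-in for an old driver are hypothetical. Connection.isValid(int) was only added in JDBC 4.0, so the sketch falls back to the query when the driver raises AbstractMethodError, as in the trace above.

```java
import java.lang.reflect.Proxy;
import java.sql.Connection;
import java.sql.SQLException;
import java.sql.Statement;

public class ConnectionValidationSketch {

    // Returns true if the connection looks usable. If the driver does not
    // support Connection.isValid(int), fall back to a trivial query.
    public static boolean isUsable(Connection c, int timeoutSeconds) {
        try {
            return c.isValid(timeoutSeconds);
        } catch (AbstractMethodError e) {
            return probeWithQuery(c, timeoutSeconds);
        } catch (SQLException e) {
            return false;
        }
    }

    // Assumption: "SELECT 1" is valid SQL for the target database.
    static boolean probeWithQuery(Connection c, int timeoutSeconds) {
        try (Statement st = c.createStatement()) {
            st.setQueryTimeout(timeoutSeconds);
            st.execute("SELECT 1");
            return true;
        } catch (SQLException e) {
            return false;
        }
    }

    // Hypothetical stand-in for an old driver: isValid() raises
    // AbstractMethodError, and the probe query succeeds or fails on demand.
    static Connection legacyConnection(boolean queryWorks) {
        ClassLoader loader = ConnectionValidationSketch.class.getClassLoader();
        Statement st = (Statement) Proxy.newProxyInstance(
                loader, new Class<?>[] {Statement.class},
                (proxy, method, args) -> {
                    if (method.getName().equals("execute")) {
                        if (!queryWorks) throw new SQLException("connection is dead");
                        return Boolean.TRUE;
                    }
                    return null; // setQueryTimeout() and close() return void
                });
        return (Connection) Proxy.newProxyInstance(
                loader, new Class<?>[] {Connection.class},
                (proxy, method, args) -> {
                    if (method.getName().equals("isValid")) throw new AbstractMethodError();
                    if (method.getName().equals("createStatement")) return st;
                    return null;
                });
    }

    public static void main(String[] args) {
        System.out.println(isUsable(legacyConnection(true), 5));  // prints "true"
        System.out.println(isUsable(legacyConnection(false), 5)); // prints "false"
    }
}
```

A pool could call a check like isUsable() wherever it currently calls isValid() directly, so that drivers without JDBC 4.0 support are still validated instead of failing the whole job.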



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)