[jira] [Commented] (CONNECTORS-1740) Solr 9 output connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17773278#comment-17773278 ] Julien Massiera commented on CONNECTORS-1740: - Hello, Thank you for your email. Mr Massiera left his position on 14 April 2023 and this email address will soon be deactivated. For any question about France Labs or ongoing projects, please contact cedric.ulmer att francelabs.com > Solr 9 output connector > --- > > Key: CONNECTORS-1740 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1740 > Project: ManifoldCF > Issue Type: Improvement > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.23 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Attachments: CONNECTORS-1740.patch > > > The current Solr output connector is not compatible with Solr 9.x > We need to update the connector with SolrJ 9 and make sure that the custom > code (multipart post requests, basic/preemptive auth) is still required, and, > in case it is, port it ! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] (CONNECTORS-1740) Solr 9 output connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1740 ] Julien Massiera deleted comment on CONNECTORS-1740: - was (Author: julienfl): Hello, Thank you for your email. Mr Massiera left his position on 14 April 2023 and this email address will soon be deactivated. For any question about France Labs or ongoing projects, please contact cedric.ulmer att francelabs.com > Solr 9 output connector > --- > > Key: CONNECTORS-1740 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1740 > Project: ManifoldCF > Issue Type: Improvement > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.23 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Attachments: CONNECTORS-1740.patch > > > The current Solr output connector is not compatible with Solr 9.x > We need to update the connector with SolrJ 9 and make sure that the custom > code (multipart post requests, basic/preemptive auth) is still required, and, > in case it is, port it ! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (CONNECTORS-1740) Solr 9 output connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17712355#comment-17712355 ] Julien Massiera commented on CONNECTORS-1740: - No, I did not try with an older ZooKeeper version; I used the one recommended for Solr 9. We can indeed test this! > Solr 9 output connector > --- > > Key: CONNECTORS-1740 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1740 > Project: ManifoldCF > Issue Type: Improvement > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.23 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > > The current Solr output connector is not compatible with Solr 9.x > We need to update the connector with SolrJ 9 and make sure that the custom > code (multipart post requests, basic/preemptive auth) is still required, and, > in case it is, port it ! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (CONNECTORS-1740) Solr 9 output connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17711425#comment-17711425 ] Julien Massiera commented on CONNECTORS-1740: - As of r1909097 on the branch, the Solr output connector has been fixed and multipart is now applied to all POST requests. The unit tests of the framework core (ZooKeeper tests) and of the Solr connector are still broken. I would really appreciate help fixing the tests! > Solr 9 output connector > --- > > Key: CONNECTORS-1740 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1740 > Project: ManifoldCF > Issue Type: Improvement > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.23 >Reporter: Julien Massiera >Priority: Major > > The current Solr output connector is not compatible with Solr 9.x > We need to update the connector with SolrJ 9 and make sure that the custom > code (multipart post requests, basic/preemptive auth) is still required, and, > in case it is, port it ! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (CONNECTORS-1742) Handle CSV in JDBC connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1742. - Fix Version/s: ManifoldCF next Resolution: Fixed r1905864 > Handle CSV in JDBC connector > > > Key: CONNECTORS-1742 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1742 > Project: ManifoldCF > Issue Type: Improvement > Components: JDBC connector >Affects Versions: ManifoldCF 2.23 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF next > > > A JDBC CSV driver exists [https://github.com/jprante/jdbc-driver-csv] and can > be useful to crawl big CSV files. We should add the possibility to use it in > the JDBC connector -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (CONNECTORS-1742) Handle CSV in JDBC connector
Julien Massiera created CONNECTORS-1742: --- Summary: Handle CSV in JDBC connector Key: CONNECTORS-1742 URL: https://issues.apache.org/jira/browse/CONNECTORS-1742 Project: ManifoldCF Issue Type: Improvement Components: JDBC connector Affects Versions: ManifoldCF 2.23 Reporter: Julien Massiera Assignee: Julien Massiera A JDBC CSV driver exists [https://github.com/jprante/jdbc-driver-csv] and can be useful to crawl big CSV files. We should add the possibility to use it in the JDBC connector -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Commented] (CONNECTORS-1740) Solr 9 output connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1740?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17643980#comment-17643980 ] Julien Massiera commented on CONNECTORS-1740: - I created a CONNECTORS-1740 branch. I updated the SolrJ version to 9 and ZooKeeper to 3.8.0. Some problems appear: * The tests of the framework core are broken because of ZooKeeper * The tests of the Solr connector are broken * The updated Solr output connector works with Solr 9 and older versions, but I have not ported the basic/preemptive auth or the custom multipart POST request code. After some tests, it appears that they are still required, because some documents with a lot of metadata trigger errors during the ingest phase. Unfortunately I currently don't have more time to spend on these issues and I would appreciate any help solving them! > Solr 9 output connector > --- > > Key: CONNECTORS-1740 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1740 > Project: ManifoldCF > Issue Type: Improvement > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.23 >Reporter: Julien Massiera >Priority: Major > > The current Solr output connector is not compatible with Solr 9.x > We need to update the connector with SolrJ 9 and make sure that the custom > code (multipart post requests, basic/preemptive auth) is still required, and, > in case it is, port it ! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (CONNECTORS-1740) Solr 9 output connector
Julien Massiera created CONNECTORS-1740: --- Summary: Solr 9 output connector Key: CONNECTORS-1740 URL: https://issues.apache.org/jira/browse/CONNECTORS-1740 Project: ManifoldCF Issue Type: Improvement Components: Lucene/SOLR connector Affects Versions: ManifoldCF 2.23 Reporter: Julien Massiera The current Solr output connector is not compatible with Solr 9.x We need to update the connector with SolrJ 9 and make sure that the custom code (multipart post requests, basic/preemptive auth) is still required, and, in case it is, port it ! -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (CONNECTORS-1736) LDAP Mapper: attribute condition
[ https://issues.apache.org/jira/browse/CONNECTORS-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera updated CONNECTORS-1736: Component/s: LDAP Mapper (was: LDAP authority) > LDAP Mapper: attribute condition > > > Key: CONNECTORS-1736 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1736 > Project: ManifoldCF > Issue Type: Improvement > Components: LDAP Mapper >Affects Versions: ManifoldCF 2.23 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.24 > > > Sometimes, the user mapping may depend on a specific attribute value. It > would be good to provide a way to configure a mapping condition, based on an > LDAP attribute matching a regexp, that will determine the final mapping to > perform. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (CONNECTORS-1736) LDAP Mapper: attribute condition
[ https://issues.apache.org/jira/browse/CONNECTORS-1736?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1736. - Fix Version/s: ManifoldCF 2.24 Resolution: Fixed r1904380 > LDAP Mapper: attribute condition > > > Key: CONNECTORS-1736 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1736 > Project: ManifoldCF > Issue Type: Improvement > Components: LDAP authority >Affects Versions: ManifoldCF 2.23 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.24 > > > Sometimes, the user mapping may depend on a specific attribute value. It > would be good to provide a way to configure a mapping condition, based on an > LDAP attribute matching a regexp, that will determine the final mapping to > perform. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (CONNECTORS-1736) LDAP Mapper: attribute condition
Julien Massiera created CONNECTORS-1736: --- Summary: LDAP Mapper: attribute condition Key: CONNECTORS-1736 URL: https://issues.apache.org/jira/browse/CONNECTORS-1736 Project: ManifoldCF Issue Type: Improvement Components: LDAP authority Affects Versions: ManifoldCF 2.23 Reporter: Julien Massiera Assignee: Julien Massiera Sometimes, the user mapping may depend on a specific attribute value. It would be good to provide a way to configure a mapping condition, based on an LDAP attribute matching a regexp, that will determine the final mapping to perform. -- This message was sent by Atlassian Jira (v8.20.10#820010)
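The mapping condition described in CONNECTORS-1736 could look roughly like the following. This is only a sketch of the idea, not the code committed in r1904380: the class name, the constructor parameters, and the choice of a full-string regexp match are all assumptions made for illustration, and the LDAP entry is represented as a plain map of attribute names to values.

```java
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical sketch: pick the user mapping according to an LDAP attribute value. */
final class ConditionalLdapMapping {
  private final String conditionAttribute;   // attribute whose value is tested (assumed name)
  private final Pattern conditionPattern;    // regexp the value must match
  private final String attributeIfMatch;     // attribute used as mapped name when the condition holds
  private final String attributeOtherwise;   // attribute used as mapped name otherwise

  ConditionalLdapMapping(String conditionAttribute, String conditionRegexp,
                         String attributeIfMatch, String attributeOtherwise) {
    this.conditionAttribute = conditionAttribute;
    this.conditionPattern = Pattern.compile(conditionRegexp);
    this.attributeIfMatch = attributeIfMatch;
    this.attributeOtherwise = attributeOtherwise;
  }

  /** Returns the mapped user name, or null if the chosen attribute is absent. */
  String mapUser(Map<String, String> ldapAttributes) {
    String value = ldapAttributes.get(conditionAttribute);
    // Assumption: the whole attribute value must match the regexp.
    boolean conditionHolds = value != null && conditionPattern.matcher(value).matches();
    return ldapAttributes.get(conditionHolds ? attributeIfMatch : attributeOtherwise);
  }
}
```

With a condition such as `mail` matching `.*@example\.com`, internal users would be mapped through one attribute and all other users through another.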
[jira] [Updated] (CONNECTORS-1734) Add space and user details in error logs of Confluence authority
[ https://issues.apache.org/jira/browse/CONNECTORS-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera updated CONNECTORS-1734: Fix Version/s: ManifoldCF 2.24 (was: ManifoldCF next) > Add space and user details in error logs of Confluence authority > > > Key: CONNECTORS-1734 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1734 > Project: ManifoldCF > Issue Type: Improvement > Components: Confluence connector >Affects Versions: ManifoldCF 2.22 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.24 > > > Currently in the Confluence authority connector, when an error occurs when > retrieving user permissions, we generate an error log that does not specify > the user and the space concerned. It would be better to put them in the log > for debugging purposes -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Updated] (CONNECTORS-1733) TikaServiceRmeta does not properly handle unknown tika exceptions
[ https://issues.apache.org/jira/browse/CONNECTORS-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera updated CONNECTORS-1733: Fix Version/s: ManifoldCF 2.24 (was: ManifoldCF next) > TikaServiceRmeta does not properly handle unknown tika exceptions > - > > Key: CONNECTORS-1733 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1733 > Project: ManifoldCF > Issue Type: Bug > Components: Tika service connector >Affects Versions: ManifoldCF 2.22 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.24 > > > With the introduction of new exception formats in Tika 2.0, the > TikaServiceRmeta connector does not correctly handle some of them, resulting > in metadata and content extraction issues for some files -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (CONNECTORS-1735) TikaServiceRmeta does not properly handle embedded resources
[ https://issues.apache.org/jira/browse/CONNECTORS-1735?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1735. - Resolution: Fixed r1904280 > TikaServiceRmeta does not properly handle embedded resources > > > Key: CONNECTORS-1735 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1735 > Project: ManifoldCF > Issue Type: Bug > Components: Tika service connector >Affects Versions: ManifoldCF 2.23 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.24 > > > Currently when a file processed by Tika contains embedded resources, the > TikaServiceRmeta connector simply ignores the embedded resources. > The connector should at least add the extracted content of embedded resources > to the main document content if the "Extract archives content" option is > enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (CONNECTORS-1735) TikaServiceRmeta does not properly handle embedded resources
Julien Massiera created CONNECTORS-1735: --- Summary: TikaServiceRmeta does not properly handle embedded resources Key: CONNECTORS-1735 URL: https://issues.apache.org/jira/browse/CONNECTORS-1735 Project: ManifoldCF Issue Type: Bug Components: Tika service connector Affects Versions: ManifoldCF 2.23 Reporter: Julien Massiera Assignee: Julien Massiera Fix For: ManifoldCF 2.24 Currently when a file processed by Tika contains embedded resources, the TikaServiceRmeta connector simply ignores the embedded resources. The connector should at least add the extracted content of embedded resources to the main document content if the "Extract archives content" option is enabled. -- This message was sent by Atlassian Jira (v8.20.10#820010)
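As a rough illustration of the behaviour requested in CONNECTORS-1735: the Tika Server `rmeta` endpoint returns a list of metadata objects, the first for the container document and the following ones for embedded resources, each carrying its extracted text under the `X-TIKA:content` key. The sketch below appends the embedded content to the main content when the option is enabled. The class and method names are hypothetical, the parsed response is assumed to already be available as a list of maps, and this is not the code from r1904280.

```java
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: merge embedded-resource content into the main document content. */
final class RmetaContentMerger {
  static final String CONTENT_KEY = "X-TIKA:content";

  /**
   * @param rmetaEntries parsed rmeta response: entry 0 is the main document,
   *                     the remaining entries are embedded resources
   * @param extractArchivesContent stand-in for the "Extract archives content" option
   */
  static String mergeContent(List<Map<String, String>> rmetaEntries, boolean extractArchivesContent) {
    StringBuilder merged = new StringBuilder();
    if (rmetaEntries.isEmpty()) {
      return "";
    }
    append(merged, rmetaEntries.get(0).get(CONTENT_KEY));
    if (extractArchivesContent) {
      // Embedded resources are no longer ignored: their text follows the main text.
      for (Map<String, String> embedded : rmetaEntries.subList(1, rmetaEntries.size())) {
        append(merged, embedded.get(CONTENT_KEY));
      }
    }
    return merged.toString();
  }

  private static void append(StringBuilder target, String content) {
    if (content == null || content.isEmpty()) {
      return;
    }
    if (target.length() > 0) {
      target.append('\n');
    }
    target.append(content);
  }
}
```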
[jira] [Resolved] (CONNECTORS-1734) Add space and user details in error logs of Confluence authority
[ https://issues.apache.org/jira/browse/CONNECTORS-1734?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1734. - Fix Version/s: ManifoldCF next Resolution: Fixed r1904267 > Add space and user details in error logs of Confluence authority > > > Key: CONNECTORS-1734 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1734 > Project: ManifoldCF > Issue Type: Improvement > Components: Confluence connector >Affects Versions: ManifoldCF 2.22 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF next > > > Currently in the Confluence authority connector, when an error occurs when > retrieving user permissions, we generate an error log that does not specify > the user and the space concerned. It would be better to put them in the log > for debugging purposes -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (CONNECTORS-1734) Add space and user details in error logs of Confluence authority
Julien Massiera created CONNECTORS-1734: --- Summary: Add space and user details in error logs of Confluence authority Key: CONNECTORS-1734 URL: https://issues.apache.org/jira/browse/CONNECTORS-1734 Project: ManifoldCF Issue Type: Improvement Components: Confluence connector Affects Versions: ManifoldCF 2.22 Reporter: Julien Massiera Assignee: Julien Massiera Currently in the Confluence authority connector, when an error occurs when retrieving user permissions, we generate an error log that does not specify the user and the space concerned. It would be better to put them in the log for debugging purposes -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (CONNECTORS-1733) TikaServiceRmeta does not properly handle unknown tika exceptions
[ https://issues.apache.org/jira/browse/CONNECTORS-1733?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1733. - Fix Version/s: ManifoldCF next Resolution: Fixed r1904264 > TikaServiceRmeta does not properly handle unknown tika exceptions > - > > Key: CONNECTORS-1733 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1733 > Project: ManifoldCF > Issue Type: Bug > Components: Tika service connector >Affects Versions: ManifoldCF 2.22 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF next > > > With the introduction of new exception formats in Tika 2.0, the > TikaServiceRmeta connector does not correctly handle some of them, resulting > in metadata and content extraction issues for some files -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (CONNECTORS-1733) TikaServiceRmeta does not properly handle unknown tika exceptions
Julien Massiera created CONNECTORS-1733: --- Summary: TikaServiceRmeta does not properly handle unknown tika exceptions Key: CONNECTORS-1733 URL: https://issues.apache.org/jira/browse/CONNECTORS-1733 Project: ManifoldCF Issue Type: Bug Components: Tika service connector Affects Versions: ManifoldCF 2.22 Reporter: Julien Massiera Assignee: Julien Massiera With the introduction of new exception formats in Tika 2.0, the TikaServiceRmeta connector does not correctly handle some of them, resulting in metadata and content extraction issues for some files -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (CONNECTORS-1732) Github mirror out of sync
[ https://issues.apache.org/jira/browse/CONNECTORS-1732?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1732. - Resolution: Invalid wrong place for that issue > Github mirror out of sync > - > > Key: CONNECTORS-1732 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1732 > Project: ManifoldCF > Issue Type: Bug > Components: Build >Affects Versions: ManifoldCF 2.22 >Reporter: Julien Massiera >Priority: Major > > The Github mirror seems out of sync with the SVN repo since June 2022 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (CONNECTORS-1732) Github mirror out of sync
Julien Massiera created CONNECTORS-1732: --- Summary: Github mirror out of sync Key: CONNECTORS-1732 URL: https://issues.apache.org/jira/browse/CONNECTORS-1732 Project: ManifoldCF Issue Type: Bug Components: Build Affects Versions: ManifoldCF 2.22 Reporter: Julien Massiera The Github mirror seems out of sync with the SVN repo since June 2022 -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Resolved] (CONNECTORS-1721) Confluence v6 does not distinguish 404 errors
[ https://issues.apache.org/jira/browse/CONNECTORS-1721?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1721. - Fix Version/s: ManifoldCF 2.23 Resolution: Fixed r1902854 > Confluence v6 does not distinguish 404 errors > - > > Key: CONNECTORS-1721 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1721 > Project: ManifoldCF > Issue Type: Improvement > Components: Confluence connector >Affects Versions: ManifoldCF 2.22 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.23 > > > The ConfluenceV6 connector does not distinguish 404 errors from others. This is > a problem for the authority, because a 404 error corresponds to a > "user not found" response rather than a "dead authority". > The connector must handle 404 errors correctly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (CONNECTORS-1721) Confluence v6 does not distinguish 404 errors
Julien Massiera created CONNECTORS-1721: --- Summary: Confluence v6 does not distinguish 404 errors Key: CONNECTORS-1721 URL: https://issues.apache.org/jira/browse/CONNECTORS-1721 Project: ManifoldCF Issue Type: Improvement Components: Confluence connector Affects Versions: ManifoldCF 2.22 Reporter: Julien Massiera Assignee: Julien Massiera The ConfluenceV6 connector does not distinguish 404 errors from others. This is a problem for the authority, because a 404 error corresponds to a "user not found" response rather than a "dead authority". The connector must handle 404 errors correctly. -- This message was sent by Atlassian Jira (v8.20.10#820010)
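The distinction asked for in CONNECTORS-1721 can be sketched as a small status-to-outcome mapping. The enum values below are hypothetical stand-ins for whatever responses the authority actually returns, and the handling of non-404 errors is an assumption; this is not the change made in r1902854.

```java
import java.net.HttpURLConnection;

/** Hypothetical sketch: tell "user not found" apart from an unreachable authority. */
final class AuthorityStatusMapper {
  /** Illustrative outcomes only; not the connector's real response types. */
  enum Outcome { OK, USER_NOT_FOUND, UNREACHABLE }

  static Outcome fromHttpStatus(int status) {
    if (status >= 200 && status < 300) {
      return Outcome.OK;
    }
    if (status == HttpURLConnection.HTTP_NOT_FOUND) {
      // A 404 means the user is unknown to Confluence, not that the authority is dead.
      return Outcome.USER_NOT_FOUND;
    }
    // Assumption: every other error is treated as the authority being unavailable.
    return Outcome.UNREACHABLE;
  }
}
```

Keeping the 404 case separate means an unknown user is not reported as an outage of the whole authority.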
[jira] [Resolved] (CONNECTORS-1719) Handle MariaDB in JDBC connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1719?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1719. - Fix Version/s: ManifoldCF 2.23 Resolution: Fixed r1902537 > Handle MariaDB in JDBC connector > > > Key: CONNECTORS-1719 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1719 > Project: ManifoldCF > Issue Type: Improvement > Components: JDBC connector >Affects Versions: ManifoldCF 2.22 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.23 > > > Currently the JDBC connector does not officially handle MariaDB databases. > It may work with MariaDB databases up to v2.x using the MySQL type because > before v3, MariaDB was compatible with the MySQL JDBC driver and provider > name, but this is no longer the case. > -- This message was sent by Atlassian Jira (v8.20.10#820010)
[jira] [Created] (CONNECTORS-1719) Handle MariaDB in JDBC connector
Julien Massiera created CONNECTORS-1719: --- Summary: Handle MariaDB in JDBC connector Key: CONNECTORS-1719 URL: https://issues.apache.org/jira/browse/CONNECTORS-1719 Project: ManifoldCF Issue Type: Improvement Components: JDBC connector Affects Versions: ManifoldCF 2.22 Reporter: Julien Massiera Assignee: Julien Massiera Currently the JDBC connector does not officially handle MariaDB databases. It may work with MariaDB databases up to v2.x using the MySQL type because before v3, MariaDB was compatible with the MySQL JDBC driver and provider name, but this is no longer the case. -- This message was sent by Atlassian Jira (v8.20.10#820010)
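To make the driver and provider-name difference in CONNECTORS-1719 concrete, here is a sketch of a separate MariaDB provider entry next to the MySQL one. The enum and the `buildUrl` helper are hypothetical and do not reflect how the JDBC connector is actually organised; the driver class names and URL prefixes are the ones commonly documented for MySQL Connector/J 8 and MariaDB Connector/J, and should be checked against the driver versions actually shipped.

```java
/** Hypothetical sketch: treat MariaDB as its own JDBC provider instead of reusing MySQL. */
final class JdbcProviderSketch {
  enum Provider {
    // Driver class and URL prefix are assumptions to verify against the bundled drivers.
    MYSQL("com.mysql.cj.jdbc.Driver", "jdbc:mysql://"),
    MARIADB("org.mariadb.jdbc.Driver", "jdbc:mariadb://");

    final String driverClassName;
    final String urlPrefix;

    Provider(String driverClassName, String urlPrefix) {
      this.driverClassName = driverClassName;
      this.urlPrefix = urlPrefix;
    }
  }

  /** Builds a connection URL of the form prefix + host:port/database. */
  static String buildUrl(Provider provider, String host, int port, String database) {
    return provider.urlPrefix + host + ":" + port + "/" + database;
  }
}
```

For example, `buildUrl(Provider.MARIADB, "db.example.org", 3306, "crawl")` yields a `jdbc:mariadb://` URL rather than a `jdbc:mysql://` one.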
[jira] [Commented] (CONNECTORS-1667) New Tika Service Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17551454#comment-17551454 ] Julien Massiera commented on CONNECTORS-1667: - Hi [~cguzel], no, the Tika service connector does not correctly handle Tika Server 2.x, indeed because of the metadata keys. You should consider using the tika-service-rmeta-connector instead, which is better in terms of performance and stability and has been updated to be compatible with the latest version of Tika Server (see CONNECTORS-1703). By the way, I am currently only maintaining that version of the Tika service connector because, as you said, the maintenance cost is very limited, and having an external Tika instead of an embedded one is more reliable. > New Tika Service Connector > -- > > Key: CONNECTORS-1667 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1667 > Project: ManifoldCF > Issue Type: New Feature > Components: Tika service connector >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.20 > > > The current Tika Service Connector exploits the '/unpack/all' endpoint of a > Tika Server. This endpoint is not optimal to only extract document's metadata > and content. We should develop a new connector based on the 'rmeta' endpoint > which is more suited for our needs. -- This message was sent by Atlassian Jira (v8.20.7#820007)
[jira] [Created] (CONNECTORS-1712) Broken Velocity UI
Julien Massiera created CONNECTORS-1712: --- Summary: Broken Velocity UI Key: CONNECTORS-1712 URL: https://issues.apache.org/jira/browse/CONNECTORS-1712 Project: ManifoldCF Issue Type: Bug Components: API Affects Versions: ManifoldCF 2.22 Reporter: Julien Massiera In the mcf-crawler-ui, we cannot enter edit mode for any connector because of a problem with Velocity. The following error appears in the logs:
{code:java}
java.lang.NoSuchMethodError: 'void org.apache.velocity.app.VelocityEngine.setExtendedProperties(org.apache.commons.collections.ExtendedProperties)'
	at org.apache.manifoldcf.core.i18n.Messages.createVelocityEngine(Messages.java:62) ~[mcf-core.jar:?]
	at org.apache.manifoldcf.ui.i18n.Messages.outputResourceWithVelocity(Messages.java:132) ~[mcf-ui-core.jar:?]
	at com.francelabs.datafari.connectors.share.Messages.outputResourceWithVelocity(Messages.java:111) ~[?:?]
	at com.francelabs.datafari.connectors.share.SharedDriveConnector.outputSpecificationHeader(SharedDriveConnector.java:2829) ~[?:?]
	at org.apache.jsp.editjob_jsp._jspService(editjob_jsp.java:977) ~[mcf-crawler-ui.jar:?]
	at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70) ~[jasper.jar:9.0.56]
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:764) ~[servlet-api.jar:4.0.FR]
	at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:466) ~[jasper.jar:9.0.56]
	at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:379) ~[jasper.jar:9.0.56]
	at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:327) ~[jasper.jar:9.0.56]
	at javax.servlet.http.HttpServlet.service(HttpServlet.java:764) ~[servlet-api.jar:4.0.FR]
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:227) ~[catalina.jar:9.0.56]
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[catalina.jar:9.0.56]
	at org.apache.tomcat.websocket.server.WsFilter.doFilter(WsFilter.java:53) ~[tomcat-websocket.jar:9.0.56]
	at org.apache.catalina.core.ApplicationFilterChain.internalDoFilter(ApplicationFilterChain.java:189) ~[catalina.jar:9.0.56]
	at org.apache.catalina.core.ApplicationFilterChain.doFilter(ApplicationFilterChain.java:162) ~[catalina.jar:9.0.56]
	at org.apache.catalina.core.StandardWrapperValve.invoke(StandardWrapperValve.java:197) [catalina.jar:9.0.56]
	at org.apache.catalina.core.StandardContextValve.invoke(StandardContextValve.java:97) [catalina.jar:9.0.56]
	at org.apache.catalina.authenticator.AuthenticatorBase.invoke(AuthenticatorBase.java:540) [catalina.jar:9.0.56]
	at org.apache.catalina.core.StandardHostValve.invoke(StandardHostValve.java:135) [catalina.jar:9.0.56]
	at org.apache.catalina.valves.ErrorReportValve.invoke(ErrorReportValve.java:92) [catalina.jar:9.0.56]
	at org.apache.catalina.valves.AbstractAccessLogValve.invoke(AbstractAccessLogValve.java:687) [catalina.jar:9.0.56]
	at org.apache.catalina.core.StandardEngineValve.invoke(StandardEngineValve.java:78) [catalina.jar:9.0.56]
	at org.apache.catalina.connector.CoyoteAdapter.service(CoyoteAdapter.java:357) [catalina.jar:9.0.56]
	at org.apache.coyote.ajp.AjpProcessor.service(AjpProcessor.java:433) [tomcat-coyote.jar:9.0.56]
	at org.apache.coyote.AbstractProcessorLight.process(AbstractProcessorLight.java:65) [tomcat-coyote.jar:9.0.56]
	at org.apache.coyote.AbstractProtocol$ConnectionHandler.process(AbstractProtocol.java:895) [tomcat-coyote.jar:9.0.56]
	at org.apache.tomcat.util.net.NioEndpoint$SocketProcessor.doRun(NioEndpoint.java:1732) [tomcat-coyote.jar:9.0.56]
	at org.apache.tomcat.util.net.SocketProcessorBase.run(SocketProcessorBase.java:49) [tomcat-coyote.jar:9.0.56]
	at org.apache.tomcat.util.threads.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1191) [tomcat-util.jar:9.0.56]
	at org.apache.tomcat.util.threads.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:659) [tomcat-util.jar:9.0.56]
	at org.apache.tomcat.util.threads.TaskThread$WrappingRunnable.run(TaskThread.java:61) [tomcat-util.jar:9.0.56]
	at java.lang.Thread.run(Thread.java:829) [?:?]
{code}
After some investigation, this seems related to the update of the velocity lib from velocity-1.7 to velocity-engine-core-2.3. The first problem found is that the old velocity version libs and their dependencies are still present in the MCF build for the MCF Agent AND the mcf-crawler-ui. The libs concerned are: - commons-collections-3.2.2.jar - commons-lang-2.6.jar - velocity-1.7.jar The second problem is that with the new velocity version, the way to set properties on the engine has changed. So the code in the org.apache.manifoldcf.core.i18n.Messages.createVelocityEngine method
[jira] [Commented] (CONNECTORS-1707) LiveLink Connector Ant build broken
[ https://issues.apache.org/jira/browse/CONNECTORS-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528777#comment-17528777 ]

Julien Massiera commented on CONNECTORS-1707:
-
Thanks [~kwri...@metacarta.com], indeed I forgot to clean, but I needed to run a global "ant clean" because "ant clean-core-deps" alone was not enough. Now I can build everything, but I still have a test failure on cmisoutput. I will send a mail to the dev mailing list on the subject.

> LiveLink Connector Ant build broken
> ---
>
> Key: CONNECTORS-1707
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1707
> Project: ManifoldCF
> Issue Type: Bug
> Components: LiveLink connector
> Reporter: Piergiorgio Lucidi
> Priority: Major
>
> Trying to build the LiveLink connector executing Ant returns an error. Using Maven everything is correctly compiled.
> The cause is related to the WSDL generation, the Ant process is failing but it seems to return a success outcome even if we have the following error executing ant classcreate-wsdls:
>
> {code:java}
> WSDLToJava Error: org.apache.cxf.bus.extension.ExtensionException: Could not load extension class org.apache.cxf.common.util.ASMHelperImpl.{code}
>
> Below is the entire output of the Ant build:
>
> {code:java}
> pjlucidi@MBP-Pj csws $ant
> Buildfile: /Users/pjlucidi/workspaces/manifoldcf/manifoldcf-trunk/connectors/csws/build.xml
> calculate-condition:
> precompile-warn:
> precompile-check:
> has-RMI-check:
> compile-interface:
> jar-interface:
> has-stubs-check:
> has-proprietary-materials-check:
> build-stubs-check:
> compile-stubs:
> compile-implementation:
> setup-rmic:
> rmic-build-all:
> compile-rmic:
> jar-rmistub:
> lib-rmi:
> classcreate-wsdls:
> classcreate-wsdl-cxf:
> [java] SLF4J: Class path contains multiple SLF4J bindings.
> [java] SLF4J: Found binding in [jar:file:/Users/pjlucidi/workspaces/manifoldcf/manifoldcf-trunk/dist/lib/slf4j-simple-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> [java] SLF4J: Found binding in [jar:file:/Users/pjlucidi/workspaces/manifoldcf/manifoldcf-trunk/dist/lib/slf4j-simple-1.7.36.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> [java] SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> [java] SLF4J: Actual binding is of type [org.slf4j.impl.SimpleLoggerFactory]
> [java] WARNING: An illegal reflective access operation has occurred
> [java] WARNING: Illegal reflective access by com.sun.xml.bind.v2.runtime.reflect.opt.Injector (file:/Users/pjlucidi/workspaces/manifoldcf/manifoldcf-trunk/dist/connector-common-lib/jaxb-impl-2.3.0.jar) to method java.lang.ClassLoader.defineClass(java.lang.String,byte[],int,int)
> [java] WARNING: Please consider reporting this to the maintainers of com.sun.xml.bind.v2.runtime.reflect.opt.Injector
> [java] WARNING: Use --illegal-access=warn to enable warnings of further illegal reflective access operations
> [java] WARNING: All illegal access operations will be denied in a future release
> [java] [main] INFO org.apache.cxf.tools.wsdlto.core.PluginLoader - Replaced default databinding source
> [java] [main] INFO org.apache.cxf.tools.wsdlto.core.PluginLoader - Replaced default databinding domsource
> [java] [main] INFO org.apache.cxf.tools.wsdlto.core.PluginLoader - Replaced default databinding staxsource
> [java] [main] INFO org.apache.cxf.tools.wsdlto.core.PluginLoader - Replaced default databinding saxsource
> [java] [main] INFO org.apache.cxf.tools.wsdlto.core.PluginLoader - Replaced default databinding jaxb
> [java] [main] INFO org.apache.cxf.tools.wsdlto.core.PluginLoader - Replaced default frontend jaxws
> [java] [main] INFO org.apache.cxf.tools.wsdlto.core.PluginLoader - Replaced default frontend jaxws21
> [java] [main] INFO org.apache.cxf.tools.wsdlto.core.PluginLoader - Replaced default frontend cxf
> [java] [main] WARN org.apache.velocity.deprecation - configuration key 'class.resource.loader.class' has been deprecated in favor of 'resource.loader.class.class'
> [java] [main] WARN org.apache.velocity.deprecation - configuration key 'resource.loader' has been deprecated in favor of 'resource.loaders'
> [java]
> [java] WSDLToJava Error: org.apache.cxf.bus.extension.ExtensionException: Could not load extension class org.apache.cxf.common.util.ASMHelperImpl.
> [java]
> [java] Java Result: 1
> classcreate-wsdl-cxf:
> [java] SLF4J: Class path contains multiple SLF4J bindings.
> [java] SLF4J: Found binding in [jar:file:/Users/pjlucidi/workspaces/manifoldcf/manifoldcf-trunk/dist/lib/slf4j-simple-1.7.25.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> [java] SLF4J: Found binding in
[jira] [Commented] (CONNECTORS-1707) LiveLink Connector Ant build broken
[ https://issues.apache.org/jira/browse/CONNECTORS-1707?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17528686#comment-17528686 ]

Julien Massiera commented on CONNECTORS-1707:
-
[~kwri...@metacarta.com], I see that WSDLToJava includes the CXF jars from the connector-common-lib directory of the dist folder. I checked it, and in my case it contains two versions of each CXF lib: 3.3.1 and 3.5.0. This is surely causing trouble.

> LiveLink Connector Ant build broken
> ---
>
> Key: CONNECTORS-1707
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1707
> Project: ManifoldCF
> Issue Type: Bug
> Components: LiveLink connector
> Reporter: Piergiorgio Lucidi
> Priority: Major
>
> Trying to build the LiveLink connector executing Ant returns an error. Using Maven everything is correctly compiled.
> The cause is related to the WSDL generation, the Ant process is failing but it seems to return a success outcome even if we have the following error executing ant classcreate-wsdls:
>
> {code:java}
> WSDLToJava Error: org.apache.cxf.bus.extension.ExtensionException: Could not load extension class org.apache.cxf.common.util.ASMHelperImpl.{code}
[jira] [Resolved] (CONNECTORS-1704) Confluence v6: rename project name
[ https://issues.apache.org/jira/browse/CONNECTORS-1704?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1704. - Resolution: Fixed r1899863 > Confluence v6: rename project name > -- > > Key: CONNECTORS-1704 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1704 > Project: ManifoldCF > Issue Type: Task > Components: Confluence connector >Affects Versions: ManifoldCF 2.21 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF next > > > The final jar name of the confluence v6 connector contains a space because > the project name in the build.xml file is "confluence v6". > Having spaces in filenames is not a good practice so it would be better to > rename the project name to avoid that -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (CONNECTORS-1704) Confluence v6: rename project name
Julien Massiera created CONNECTORS-1704: --- Summary: Confluence v6: rename project name Key: CONNECTORS-1704 URL: https://issues.apache.org/jira/browse/CONNECTORS-1704 Project: ManifoldCF Issue Type: Task Components: Confluence connector Affects Versions: ManifoldCF 2.21 Reporter: Julien Massiera Assignee: Julien Massiera Fix For: ManifoldCF next The final jar name of the confluence v6 connector contains a space because the project name in the build.xml file is "confluence v6". Having spaces in filenames is not a good practice so it would be better to rename the project name to avoid that -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (CONNECTORS-1703) TikaServiceRmeta: update to handle 2.4.0 changes
[ https://issues.apache.org/jira/browse/CONNECTORS-1703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1703. - Resolution: Fixed r1899815 > TikaServiceRmeta: update to handle 2.4.0 changes > > > Key: CONNECTORS-1703 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1703 > Project: ManifoldCF > Issue Type: Improvement > Components: Tika service connector >Affects Versions: ManifoldCF 2.21 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF next > > > Tika 2.4.0 introduces a new warn message to indicate that metadata have been > truncated. The connector can be updated to consider this warn and specify it > in the doc process description -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (CONNECTORS-1703) TikaServiceRmeta: update to handle 2.4.0 changes
Julien Massiera created CONNECTORS-1703: --- Summary: TikaServiceRmeta: update to handle 2.4.0 changes Key: CONNECTORS-1703 URL: https://issues.apache.org/jira/browse/CONNECTORS-1703 Project: ManifoldCF Issue Type: Improvement Components: Tika service connector Affects Versions: ManifoldCF 2.21 Reporter: Julien Massiera Assignee: Julien Massiera Fix For: ManifoldCF next Tika 2.4.0 introduces a new warning message indicating that metadata has been truncated. The connector can be updated to detect this warning and report it in the document process description -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (CONNECTORS-1701) Add date info on OOM Error in WorkerThread
[ https://issues.apache.org/jira/browse/CONNECTORS-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera updated CONNECTORS-1701: Component/s: Framework crawler agent (was: Framework agents process) > Add date info on OOM Error in WorkerThread > -- > > Key: CONNECTORS-1701 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1701 > Project: ManifoldCF > Issue Type: Improvement > Components: Framework crawler agent >Affects Versions: ManifoldCF 2.21 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF next > > > When an OOM Error occurs, the timestamp/date of the error can be very useful > for investigations. Since it is currently not present in the output, it is > worth adding it -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (CONNECTORS-1701) Add date info on OOM Error in WorkerThread
[ https://issues.apache.org/jira/browse/CONNECTORS-1701?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1701. - Fix Version/s: ManifoldCF next Resolution: Fixed r1898966 > Add date info on OOM Error in WorkerThread > -- > > Key: CONNECTORS-1701 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1701 > Project: ManifoldCF > Issue Type: Improvement > Components: Framework agents process >Affects Versions: ManifoldCF 2.21 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF next > > > When an OOM Error occurs, the timestamp/date of the error can be very useful > for investigations. Since it is currently not present in the output, it is > worth adding it -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (CONNECTORS-1701) Add date info on OOM Error in WorkerThread
Julien Massiera created CONNECTORS-1701: --- Summary: Add date info on OOM Error in WorkerThread Key: CONNECTORS-1701 URL: https://issues.apache.org/jira/browse/CONNECTORS-1701 Project: ManifoldCF Issue Type: Improvement Components: Framework agents process Affects Versions: ManifoldCF 2.21 Reporter: Julien Massiera Assignee: Julien Massiera When an OOM Error occurs, the timestamp/date of the error can be very useful for investigations. Since it is currently not present in the output, it is worth adding it -- This message was sent by Atlassian Jira (v8.20.1#820001)
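A minimal Java sketch of the idea in CONNECTORS-1701, assuming the error is caught at the point where the worker thread already handles it. The class name `OomReport` and the method `describe` are hypothetical names used for illustration, not the actual WorkerThread code. The sketch only shows how the reported message could be prefixed with an ISO-8601 UTC timestamp.

```java
import java.time.Instant;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for illustration; not part of ManifoldCF.
final class OomReport {

    /** Builds the line to print when an OutOfMemoryError is caught, prefixed with a UTC timestamp. */
    static String describe(Instant when, OutOfMemoryError error) {
        return DateTimeFormatter.ISO_INSTANT.format(when)
            + " agents process ran out of memory: " + error.getMessage();
    }

    public static void main(String[] args) {
        try {
            // Simulated failure; a real OOM would come from the JVM itself.
            throw new OutOfMemoryError("Java heap space");
        } catch (OutOfMemoryError e) {
            System.err.println(describe(Instant.now(), e));
        }
    }
}
```

Passing the `Instant` in as a parameter, rather than reading the clock inside `describe`, keeps the formatting testable.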
[jira] [Resolved] (CONNECTORS-1700) TikaServiceRmeta: Add options to filter out metadata based on size
[ https://issues.apache.org/jira/browse/CONNECTORS-1700?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1700. - Fix Version/s: ManifoldCF next Resolution: Fixed r1898949 > TikaServiceRmeta: Add options to filter out metadata based on size > -- > > Key: CONNECTORS-1700 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1700 > Project: ManifoldCF > Issue Type: Improvement > Components: Tika service connector >Affects Versions: ManifoldCF 2.21 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF next > > > Some files may contain abnormally big metadata (several MB, be it for the > metadata values, but also for the total amount of metadata) that can be > problematic concerning the memory consumption of the connector. > To avoid this, we can provide job configuration options for the > TikaServiceRmetaConnector to set limits on both metadata values and global > amount of metadata, and exclude metadata that exceed the limits -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (CONNECTORS-1700) TikaServiceRmeta: Add options to filter out metadata based on size
Julien Massiera created CONNECTORS-1700: --- Summary: TikaServiceRmeta: Add options to filter out metadata based on size Key: CONNECTORS-1700 URL: https://issues.apache.org/jira/browse/CONNECTORS-1700 Project: ManifoldCF Issue Type: Improvement Components: Tika service connector Affects Versions: ManifoldCF 2.21 Reporter: Julien Massiera Assignee: Julien Massiera Some files may contain abnormally large metadata (several MB, whether in individual metadata values or in the total amount of metadata), which can be problematic for the memory consumption of the connector. To avoid this, we can provide job configuration options for the TikaServiceRmetaConnector to set limits on both individual metadata values and the total amount of metadata, and exclude metadata that exceeds the limits -- This message was sent by Atlassian Jira (v8.20.1#820001)
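The two limits described in CONNECTORS-1700 could be sketched in Java as below. This is an illustration under stated assumptions, not the connector's implementation: `MetadataSizeFilter`, `filter`, and both parameter names are hypothetical, sizes are measured as UTF-8 bytes of the value only, and an entry that would push the running total over the global limit is simply skipped.

```java
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical helper for illustration; not the TikaServiceRmetaConnector code.
final class MetadataSizeFilter {

    /**
     * Returns the entries whose value fits within maxValueBytes and whose
     * cumulative size stays within maxTotalBytes, in the original order.
     */
    static Map<String, String> filter(Map<String, String> metadata,
                                      int maxValueBytes, int maxTotalBytes) {
        Map<String, String> kept = new LinkedHashMap<>();
        int total = 0;
        for (Map.Entry<String, String> entry : metadata.entrySet()) {
            int size = entry.getValue().getBytes(StandardCharsets.UTF_8).length;
            if (size > maxValueBytes) {
                continue; // this single value is too large
            }
            if (total + size > maxTotalBytes) {
                continue; // keeping it would exceed the global budget
            }
            kept.put(entry.getKey(), entry.getValue());
            total += size;
        }
        return kept;
    }
}
```

Skipping oversized entries, rather than truncating them, matches the "exclude metadata that exceed the limits" wording of the issue. Whether later, smaller entries should still be admitted once the budget has been hit is a design choice the issue does not specify.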
[jira] [Commented] (CONNECTORS-1665) WebConnector: Add activity records for excluded URLs
[ https://issues.apache.org/jira/browse/CONNECTORS-1665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480957#comment-17480957 ] Julien Massiera commented on CONNECTORS-1665: - r1897405 > WebConnector: Add activity records for excluded URLs > - > > Key: CONNECTORS-1665 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1665 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.18 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Trivial > Fix For: ManifoldCF 2.19 > > Attachments: patch-CONNECTORS-1665 > > > It would be interesting to add activity records in the WebConnector to keep > track of excluded URLs that match an exclude filter -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (CONNECTORS-1665) WebConnector: Add activity records for excluded URLs
[ https://issues.apache.org/jira/browse/CONNECTORS-1665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17480170#comment-17480170 ] Julien Massiera commented on CONNECTORS-1665: - The patch has never been reviewed. [~kwri...@metacarta.com], can you please take a look and tell me whether it can be integrated into trunk? > WebConnector: Add activity records for excluded URLs > - > > Key: CONNECTORS-1665 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1665 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.18 >Reporter: Julien Massiera >Priority: Trivial > Fix For: ManifoldCF 2.19 > > Attachments: patch-CONNECTORS-1665 > > > It would be interesting to add activity records in the WebConnector to keep > track of excluded URLs that match an exclude filter -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (CONNECTORS-1692) LDAP Mapper Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1692?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17472956#comment-17472956 ] Julien Massiera commented on CONNECTORS-1692: - r1896917 of branch CONNECTORS-1692 > LDAP Mapper Connector > - > > Key: CONNECTORS-1692 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1692 > Project: ManifoldCF > Issue Type: New Feature > Components: LDAP authority >Affects Versions: ManifoldCF 2.21 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > > Sometimes one need to be able to map an LDAP user id to a specific attribute. > So it would be good to develop an LDAP Mapper for this purpose -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (CONNECTORS-1692) LDAP Mapper Connector
Julien Massiera created CONNECTORS-1692: --- Summary: LDAP Mapper Connector Key: CONNECTORS-1692 URL: https://issues.apache.org/jira/browse/CONNECTORS-1692 Project: ManifoldCF Issue Type: New Feature Components: LDAP authority Affects Versions: ManifoldCF 2.21 Reporter: Julien Massiera Assignee: Julien Massiera Sometimes one needs to be able to map an LDAP user id to a specific attribute, so it would be good to develop an LDAP Mapper for this purpose -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (CONNECTORS-1667) New Tika Service Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1667. - Fix Version/s: ManifoldCF 2.20 Resolution: Fixed > New Tika Service Connector > -- > > Key: CONNECTORS-1667 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1667 > Project: ManifoldCF > Issue Type: New Feature > Components: Tika service connector >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.20 > > > The current Tika Service Connector exploits the '/unpack/all' endpoint of a > Tika Server. This endpoint is not optimal to only extract document's metadata > and content. We should develop a new connector based on the 'rmeta' endpoint > which is more suited for our needs. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (CONNECTORS-1686) Solr Ingester: issues with CursorMark
[ https://issues.apache.org/jira/browse/CONNECTORS-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17460228#comment-17460228 ] Julien Massiera commented on CONNECTORS-1686: - r1896007 > Solr Ingester: issues with CursorMark > - > > Key: CONNECTORS-1686 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1686 > Project: ManifoldCF > Issue Type: Bug > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.20 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.21 > > > The Solr Ingester connector may have some issues with the > response.getNextCursorMark() method when processing requests responses. > Indeed, sometimes the response contains errors and/or is malformed, and this > method raises an exception that is currently not handled. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (CONNECTORS-1686) Solr Ingester: issues with CursorMark
[ https://issues.apache.org/jira/browse/CONNECTORS-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1686. - Resolution: Fixed > Solr Ingester: issues with CursorMark > - > > Key: CONNECTORS-1686 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1686 > Project: ManifoldCF > Issue Type: Bug > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.20 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.21 > > > The Solr Ingester connector may have some issues with the > response.getNextCursorMark() method when processing requests responses. > Indeed, sometimes the response contains errors and/or is malformed, and this > method raises an exception that is currently not handled. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (CONNECTORS-1688) Solr Ingester: ensure single valued date field
[ https://issues.apache.org/jira/browse/CONNECTORS-1688?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1688. - Fix Version/s: ManifoldCF 2.21 Resolution: Fixed r1895960 > Solr Ingester: ensure single valued date field > -- > > Key: CONNECTORS-1688 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1688 > Project: ManifoldCF > Issue Type: Improvement > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.20 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.21 > > > In the Solr Ingester connector, the date field is supposed to be single > valued, but there is no check in the code that it is the case, which can be > problematic for the output. > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (CONNECTORS-1688) Solr Ingester: ensure single valued date field
Julien Massiera created CONNECTORS-1688: --- Summary: Solr Ingester: ensure single valued date field Key: CONNECTORS-1688 URL: https://issues.apache.org/jira/browse/CONNECTORS-1688 Project: ManifoldCF Issue Type: Improvement Components: Lucene/SOLR connector Affects Versions: ManifoldCF 2.20 Reporter: Julien Massiera Assignee: Julien Massiera In the Solr Ingester connector, the date field is supposed to be single valued, but there is no check in the code that it is the case, which can be problematic for the output. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (CONNECTORS-1687) Solr Ingester: support more field types in field mappings
[ https://issues.apache.org/jira/browse/CONNECTORS-1687?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1687. - Fix Version/s: ManifoldCF 2.21 Resolution: Fixed r1895959 > Solr Ingester: support more field types in field mappings > - > > Key: CONNECTORS-1687 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1687 > Project: ManifoldCF > Issue Type: Improvement > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.20 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.21 > > > Currently the Solr Ingester connector can only handle string type fields when > configuring the "Field mappings" parameter in a job configuration. > It would be good to also support at least long, int, and date types -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (CONNECTORS-1687) Solr Ingester: support more field types in field mappings
Julien Massiera created CONNECTORS-1687: --- Summary: Solr Ingester: support more field types in field mappings Key: CONNECTORS-1687 URL: https://issues.apache.org/jira/browse/CONNECTORS-1687 Project: ManifoldCF Issue Type: Improvement Components: Lucene/SOLR connector Affects Versions: ManifoldCF 2.20 Reporter: Julien Massiera Assignee: Julien Massiera Currently the Solr Ingester connector can only handle string type fields when configuring the "Field mappings" parameter in a job configuration. It would be good to also support at least long, int, and date types -- This message was sent by Atlassian Jira (v8.20.1#820001)
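As an illustration of the request in CONNECTORS-1687, the Java sketch below converts a raw string value according to a declared field type. All names (`FieldValueConverter`, `FieldType`, `convert`) are hypothetical, and the assumption that dates arrive as ISO-8601 instants such as `2021-12-08T10:15:30Z` is mine, not something the issue states.

```java
import java.time.Instant;
import java.util.Date;

// Hypothetical helper for illustration; not the Solr Ingester code.
final class FieldValueConverter {

    enum FieldType { STRING, INT, LONG, DATE }

    /**
     * Converts a raw string to the Java type matching the declared field type.
     * Throws NumberFormatException or DateTimeParseException on invalid input.
     */
    static Object convert(String raw, FieldType type) {
        switch (type) {
            case INT:
                return Integer.valueOf(raw.trim());
            case LONG:
                return Long.valueOf(raw.trim());
            case DATE:
                // Assumes an ISO-8601 instant in UTC.
                return Date.from(Instant.parse(raw.trim()));
            default:
                return raw;
        }
    }
}
```

Letting the parse exceptions propagate leaves it to the caller to decide whether an unparsable value should fail the document or fall back to the string form.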
[jira] [Resolved] (CONNECTORS-1686) Solr Ingester: issues with CursorMark
[ https://issues.apache.org/jira/browse/CONNECTORS-1686?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1686. - Fix Version/s: ManifoldCF 2.21 Resolution: Fixed r1895958 > Solr Ingester: issues with CursorMark > - > > Key: CONNECTORS-1686 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1686 > Project: ManifoldCF > Issue Type: Bug > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.20 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.21 > > > The Solr Ingester connector may have some issues with the > response.getNextCursorMark() method when processing requests responses. > Indeed, sometimes the response contains errors and/or is malformed, and this > method raises an exception that is currently not handled. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (CONNECTORS-1686) Solr Ingester: issues with CursorMark
Julien Massiera created CONNECTORS-1686: --- Summary: Solr Ingester: issues with CursorMark Key: CONNECTORS-1686 URL: https://issues.apache.org/jira/browse/CONNECTORS-1686 Project: ManifoldCF Issue Type: Bug Components: Lucene/SOLR connector Affects Versions: ManifoldCF 2.20 Reporter: Julien Massiera Assignee: Julien Massiera The Solr Ingester connector may run into issues with the response.getNextCursorMark() method when processing request responses. Sometimes the response contains errors and/or is malformed, and this method then raises an exception that is currently not handled. -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (CONNECTORS-1681) TikaServiceRmeta: recordActivity can cause Database exception
[ https://issues.apache.org/jira/browse/CONNECTORS-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera updated CONNECTORS-1681: Description: Some files containing non UTF8 characters can cause Tika to trigger an exception describing the parsing problem. As the TikaServiceRmeta connector creates an activity record for any Tika exception containing its description (and so that contains the non UTF8 char in those cases), it causes an SQL exception when MCF tries to insert the activity record in the Database: {code:java} ERROR 2021-11-24T13:37:00,121 (Worker thread '41') - MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Worker thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: SQLException doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00 {code} So to avoid this, we need to remove those problematic chars from the exception description before recording the activity was: Some files containing non ASCII characters can cause Tika to trigger an exception describing the parsing problem. 
As the TikaServiceRmeta connector creates an activity record for any Tika exception containing its description (and so that contains the non ASCII char in those cases), it causes an SQL exception when MCF tries to insert the activity record in Postgres: {code:java} ERROR 2021-11-24T13:37:00,121 (Worker thread '41') - MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Worker thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: SQLException doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00 {code} So to avoid this, we need to remove any non ASCII chars from the exception description before recording the activity > TikaServiceRmeta: recordActivity can cause Database exception > - > > Key: CONNECTORS-1681 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1681 > Project: ManifoldCF > Issue Type: Bug > Components: Tika service connector >Affects Versions: ManifoldCF 2.20 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.21 > > > Some files containing non UTF8 characters can cause Tika to trigger an > exception describing the parsing problem. 
> As the TikaServiceRmeta connector creates an activity record for any Tika > exception containing its description (and so that contains the non UTF8 char > in those cases), it causes an SQL exception when MCF tries to insert the > activity record in the Database: > {code:java} > ERROR 2021-11-24T13:37:00,121 (Worker thread '41') - > MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Worker thread aborting and > restarting due to database connection reset: Database exception: SQLException > doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00 > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database > exception: SQLException doing query (22021): ERROR: invalid byte sequence for > encoding "UTF8": 0x00 {code} > So to avoid this, we need to remove those problematic chars from the > exception description before recording the activity > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (CONNECTORS-1681) TikaServiceRmeta: recordActivity can cause Database exception
[ https://issues.apache.org/jira/browse/CONNECTORS-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17448766#comment-17448766 ] Julien Massiera commented on CONNECTORS-1681: - Indeed [~kwri...@metacarta.com], it is the description of my issue that is wrong. I decided to remove all non-ASCII chars, not just non-UTF8 chars, because the error description that the TikaServiceRmeta connector writes to the activity record is only there to be readable and to give a general idea of what went wrong during the Tika processing phase. So I wanted to be sure that the activity record only contains "standard" chars. Even if we lose some of them, the accurate exception is still available in the log file. Are you ok with that? > TikaServiceRmeta: recordActivity can cause Database exception > - > > Key: CONNECTORS-1681 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1681 > Project: ManifoldCF > Issue Type: Bug > Components: Tika service connector >Affects Versions: ManifoldCF 2.20 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.21 > > > Some files containing non ASCII characters can cause Tika to trigger an > exception describing the parsing problem. 
> As the TikaServiceRmeta connector creates an activity record for any Tika > exception containing its description (and so that contains the non ASCII char > in those cases), it causes an SQL exception when MCF tries to insert the > activity record in Postgres: > {code:java} > ERROR 2021-11-24T13:37:00,121 (Worker thread '41') - > MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Worker thread aborting and > restarting due to database connection reset: Database exception: SQLException > doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00 > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database > exception: SQLException doing query (22021): ERROR: invalid byte sequence for > encoding "UTF8": 0x00 {code} > So to avoid this, we need to remove any non ASCII chars from the exception > description before recording the activity > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Resolved] (CONNECTORS-1681) TikaServiceRmeta: recordActivity can cause Database exception
[ https://issues.apache.org/jira/browse/CONNECTORS-1681?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1681. - Fix Version/s: ManifoldCF 2.21 Resolution: Fixed r1895299 > TikaServiceRmeta: recordActivity can cause Database exception > - > > Key: CONNECTORS-1681 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1681 > Project: ManifoldCF > Issue Type: Bug > Components: Tika service connector >Affects Versions: ManifoldCF 2.20 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.21 > > > Some files containing non ASCII characters can cause Tika to trigger an > exception describing the parsing problem. > As the TikaServiceRmeta connector creates an activity record for any Tika > exception containing its description (and so that contains the non ASCII char > in those cases), it causes an SQL exception when MCF tries to insert the > activity record in Postgres: > {code:java} > ERROR 2021-11-24T13:37:00,121 (Worker thread '41') - > MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Worker thread aborting and > restarting due to database connection reset: Database exception: SQLException > doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00 > org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database > exception: SQLException doing query (22021): ERROR: invalid byte sequence for > encoding "UTF8": 0x00 {code} > So to avoid this, we need to remove any non ASCII chars from the exception > description before recording the activity > -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (CONNECTORS-1681) TikaServiceRmeta: recordActivity can cause Database exception
Julien Massiera created CONNECTORS-1681: --- Summary: TikaServiceRmeta: recordActivity can cause Database exception Key: CONNECTORS-1681 URL: https://issues.apache.org/jira/browse/CONNECTORS-1681 Project: ManifoldCF Issue Type: Bug Components: Tika service connector Affects Versions: ManifoldCF 2.20 Reporter: Julien Massiera Assignee: Julien Massiera Some files containing non ASCII characters can cause Tika to trigger an exception describing the parsing problem. As the TikaServiceRmeta connector creates an activity record for any Tika exception containing its description (and so that contains the non ASCII char in those cases), it causes an SQL exception when MCF tries to insert the activity record in Postgres: {code:java} ERROR 2021-11-24T13:37:00,121 (Worker thread '41') - MCF|MCF-agent|apache.manifoldcf.crawlerthreads|Worker thread aborting and restarting due to database connection reset: Database exception: SQLException doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00 org.apache.manifoldcf.core.interfaces.ManifoldCFException: Database exception: SQLException doing query (22021): ERROR: invalid byte sequence for encoding "UTF8": 0x00 {code} So to avoid this, we need to remove any non ASCII chars from the exception description before recording the activity -- This message was sent by Atlassian Jira (v8.20.1#820001)
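The clean-up described in this issue can be sketched as follows. This is only an illustration, not the committed fix: the class and method names are hypothetical, and the choice to keep printable ASCII while turning tabs and line breaks into spaces is an assumption. The point is that the NUL character (0x00) reported in the log never reaches the database.

```java
/** Hypothetical helper; not part of the ManifoldCF code base. */
final class ActivityDescriptionSanitizer {
  private ActivityDescriptionSanitizer() {}

  /**
   * Returns a copy of the description that contains only printable ASCII
   * characters (0x20 to 0x7E). Tabs and line breaks become spaces; every
   * other character, including NUL (0x00), is dropped.
   */
  static String toPrintableAscii(String description) {
    if (description == null) {
      return "";
    }
    StringBuilder sb = new StringBuilder(description.length());
    for (int i = 0; i < description.length(); i++) {
      char c = description.charAt(i);
      if (c >= 0x20 && c <= 0x7E) {
        sb.append(c);
      } else if (c == '\t' || c == '\n' || c == '\r') {
        sb.append(' ');
      }
      // anything else (control chars, non-ASCII) is silently removed
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    // The NUL and the accented character are removed before recording.
    System.out.println(toPrintableAscii("Parse error\u0000 in caf\u00e9.doc"));
  }
}
```

The full exception, with its original characters, would still be written to the log file; only the activity record receives the reduced text.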
[jira] [Commented] (CONNECTORS-1679) HTML Extractor: output has escaped entities
[ https://issues.apache.org/jira/browse/CONNECTORS-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17446405#comment-17446405 ] Julien Massiera commented on CONNECTORS-1679: - r1895172 > HTML Extractor: output has escaped entities > --- > > Key: CONNECTORS-1679 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1679 > Project: ManifoldCF > Issue Type: Bug > Components: HTML extractor >Affects Versions: ManifoldCF 2.20 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.21 > > Attachments: patch-CONNECTORS-1679.txt > > > The output of the HTML extractor is generated with escaped entities (eg '&' > becomes '& amp ;'), which is not the wanted behavior as we want this > connector to extract text from HTML as it is -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Commented] (CONNECTORS-1679) HTML Extractor: output has escaped entities
[ https://issues.apache.org/jira/browse/CONNECTORS-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17446104#comment-17446104 ] Julien Massiera commented on CONNECTORS-1679: - Patch submitted > HTML Extractor: output has escaped entities > --- > > Key: CONNECTORS-1679 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1679 > Project: ManifoldCF > Issue Type: Bug > Components: HTML extractor >Affects Versions: ManifoldCF 2.20 >Reporter: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.21 > > Attachments: patch-CONNECTORS-1679.txt > > > The output of the HTML extractor is generated with escaped entities (eg '&' > becomes ''), which is not the wanted behavior as we want this connector > to extract text from HTML as it is -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Updated] (CONNECTORS-1679) HTML Extractor: output has escaped entities
[ https://issues.apache.org/jira/browse/CONNECTORS-1679?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera updated CONNECTORS-1679: Description: The output of the HTML extractor is generated with escaped entities (eg '&' becomes '& amp ;'), which is not the wanted behavior as we want this connector to extract text from HTML as it is (was: The output of the HTML extractor is generated with escaped entities (eg '&' becomes ''), which is not the wanted behavior as we want this connector to extract text from HTML as it is) > HTML Extractor: output has escaped entities > --- > > Key: CONNECTORS-1679 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1679 > Project: ManifoldCF > Issue Type: Bug > Components: HTML extractor >Affects Versions: ManifoldCF 2.20 >Reporter: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.21 > > Attachments: patch-CONNECTORS-1679.txt > > > The output of the HTML extractor is generated with escaped entities (eg '&' > becomes '& amp ;'), which is not the wanted behavior as we want this > connector to extract text from HTML as it is -- This message was sent by Atlassian Jira (v8.20.1#820001)
[jira] [Created] (CONNECTORS-1679) HTML Extractor: output has escaped entities
Julien Massiera created CONNECTORS-1679: --- Summary: HTML Extractor: output has escaped entities Key: CONNECTORS-1679 URL: https://issues.apache.org/jira/browse/CONNECTORS-1679 Project: ManifoldCF Issue Type: Bug Components: HTML extractor Affects Versions: ManifoldCF 2.20 Reporter: Julien Massiera The output of the HTML extractor is generated with escaped entities (e.g. '&' becomes '&amp;'), which is not the desired behavior, since this connector should extract text from HTML as it is -- This message was sent by Atlassian Jira (v8.20.1#820001)
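To make the symptom concrete, here is a minimal sketch of what turning escaped entities back into plain text looks like. It is not the patch attached to the issue, which may well fix the problem at the HTML serialization level instead. The class name is hypothetical and only five common entities are handled.

```java
/** Hypothetical illustration; not the attached patch. */
final class EntityUnescapeSketch {
  private EntityUnescapeSketch() {}

  /** Decodes a handful of common HTML entities in a single pass. */
  static String unescapeBasicEntities(String text) {
    return text
        .replace("&lt;", "<")
        .replace("&gt;", ">")
        .replace("&quot;", "\"")
        .replace("&#39;", "'")
        // "&amp;" is decoded last so that "&amp;lt;" yields "&lt;", not "<"
        .replace("&amp;", "&");
  }

  public static void main(String[] args) {
    System.out.println(unescapeBasicEntities("Tom &amp; Jerry"));
  }
}
```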
[jira] [Resolved] (CONNECTORS-1678) Confluence v6 - Configuration of retry interval and retry numbers on exceptions
[ https://issues.apache.org/jira/browse/CONNECTORS-1678?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1678. - Resolution: Fixed r1894485 > Confluence v6 - Configuration of retry interval and retry numbers on > exceptions > --- > > Key: CONNECTORS-1678 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1678 > Project: ManifoldCF > Issue Type: Improvement > Components: Confluence connector >Affects Versions: ManifoldCF 2.20 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > > Currently the value of the retry interval in ms, and the number of retries > when exceptions occur are hardcoded and can be inappropriate depending on the > Confluence performances. > These values should be configurable in the connector's configuration -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (CONNECTORS-1678) Confluence v6 - Configuration of retry interval and retry numbers on exceptions
Julien Massiera created CONNECTORS-1678: --- Summary: Confluence v6 - Configuration of retry interval and retry numbers on exceptions Key: CONNECTORS-1678 URL: https://issues.apache.org/jira/browse/CONNECTORS-1678 Project: ManifoldCF Issue Type: Improvement Components: Confluence connector Affects Versions: ManifoldCF 2.20 Reporter: Julien Massiera Assignee: Julien Massiera Currently the retry interval (in ms) and the number of retries applied when exceptions occur are hardcoded, and they can be inappropriate depending on the performance of the Confluence instance. These values should be configurable in the connector's configuration -- This message was sent by Atlassian Jira (v8.3.4#803005)
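The request above amounts to replacing two hardcoded constants with values read from the connection configuration. A minimal sketch of such a retry loop follows. The class, its constructor parameters, and the use of `Callable` are all assumptions made for illustration; the actual Confluence connector code is organized differently.

```java
import java.util.concurrent.Callable;

/** Hypothetical retry helper; names and structure are assumptions. */
final class RetryPolicySketch {
  private final long retryIntervalMs; // would come from the connector configuration
  private final int maxRetries;       // would come from the connector configuration

  RetryPolicySketch(long retryIntervalMs, int maxRetries) {
    if (retryIntervalMs < 0 || maxRetries < 0) {
      throw new IllegalArgumentException("retry settings must not be negative");
    }
    this.retryIntervalMs = retryIntervalMs;
    this.maxRetries = maxRetries;
  }

  /** Runs the action once, then retries up to maxRetries times on exceptions. */
  <T> T call(Callable<T> action) throws Exception {
    Exception last = null;
    for (int attempt = 0; attempt <= maxRetries; attempt++) {
      try {
        return action.call();
      } catch (Exception e) {
        last = e;
        if (attempt < maxRetries) {
          Thread.sleep(retryIntervalMs); // wait before the next attempt
        }
      }
    }
    throw last; // every attempt failed
  }
}
```

With a slow Confluence instance an administrator could then raise the interval or the retry count without rebuilding the connector.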
[jira] [Commented] (CONNECTORS-1677) Confluence v6 does not crawl empty pages and their children
[ https://issues.apache.org/jira/browse/CONNECTORS-1677?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17432959#comment-17432959 ] Julien Massiera commented on CONNECTORS-1677: - r1894475 > Confluence v6 does not crawl empty pages and their children > --- > > Key: CONNECTORS-1677 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1677 > Project: ManifoldCF > Issue Type: Bug > Components: Confluence connector >Affects Versions: ManifoldCF 2.20 >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > > The confluence v6 connector does not crawl empty pages and thus does not > crawl their children. Originally it was indicated that it was the only way to > detect deleted pages, but currently this is not the case and the connector > model is set to "full" anyway so it makes no sense -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (CONNECTORS-1677) Confluence v6 does not crawl empty pages and their children
Julien Massiera created CONNECTORS-1677: --- Summary: Confluence v6 does not crawl empty pages and their children Key: CONNECTORS-1677 URL: https://issues.apache.org/jira/browse/CONNECTORS-1677 Project: ManifoldCF Issue Type: Bug Components: Confluence connector Affects Versions: ManifoldCF 2.20 Reporter: Julien Massiera Assignee: Julien Massiera The confluence v6 connector does not crawl empty pages and thus does not crawl their children. Originally it was indicated that it was the only way to detect deleted pages, but currently this is not the case and the connector model is set to "full" anyway so it makes no sense -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (CONNECTORS-1675) Unable to delete Mapping Connections via JSON API
Julien Massiera created CONNECTORS-1675: --- Summary: Unable to delete Mapping Connections via JSON API Key: CONNECTORS-1675 URL: https://issues.apache.org/jira/browse/CONNECTORS-1675 Project: ManifoldCF Issue Type: Bug Components: API Affects Versions: ManifoldCF 2.20 Reporter: Julien Massiera The DELETE action via the JSON API mappingconnections/__ does not seem to work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CONNECTORS-1671) Solr output connector behavior on some exceptions
[ https://issues.apache.org/jira/browse/CONNECTORS-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17411817#comment-17411817 ] Julien Massiera commented on CONNECTORS-1671: - So, [~kwri...@metacarta.com] any news about integrating this patch ? Let me know if my explanations are not clear, I'm at your disposal > Solr output connector behavior on some exceptions > - > > Key: CONNECTORS-1671 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1671 > Project: ManifoldCF > Issue Type: Improvement > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.19 >Reporter: Julien Massiera >Priority: Major > Fix For: ManifoldCF next > > Attachments: patch-CONNECTORS-1671.txt > > > In the « handleIOException » method of the « HttpPoster » class, the unknown > case triggers a job failure despite the exception can only concern the > document/action itself and not a problem with a potential "Solr down" issue > (all "Solr down" issues are handled upstream) > Same thing in the « handleSolrServerException » method -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CONNECTORS-1671) Solr output connector behavior on some exceptions
[ https://issues.apache.org/jira/browse/CONNECTORS-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390958#comment-17390958 ] Julien Massiera commented on CONNECTORS-1671: - [~kwri...@metacarta.com], it is a runtime exception that occurs on the Solr side, not on the connector side, and it is a cause (child) of a SolrServerException! Furthermore, I did not create the handleRuntimeException method; it was already there. I am just using it in another place where such an exception can happen, for the same reason described in the method Javadoc:
{code:java}
/** Handle a SolrServerException.
 * These exceptions seem to be catch-all exceptions having to do with misconfiguration or
 * underlying IO exceptions, or request parsing exceptions.
 * If this method doesn't throw an exception, it means that the exception should be interpreted
 * as meaning that the document or action is illegal and should not be repeated.
 */
{code}
I have just updated the Javadoc to add "or request parsing exceptions" because this is what happens. > Solr output connector behavior on some exceptions > - > > Key: CONNECTORS-1671 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1671 > Project: ManifoldCF > Issue Type: Improvement > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.19 >Reporter: Julien Massiera >Priority: Major > Fix For: ManifoldCF next > > Attachments: patch-CONNECTORS-1671.txt > > > In the « handleIOException » method of the « HttpPoster » class, the unknown > case triggers a job failure despite the exception can only concern the > document/action itself and not a problem with a potential "Solr down" issue > (all "Solr down" issues are handled upstream) > Same thing in the « handleSolrServerException » method -- This message was sent by Atlassian Jira (v8.3.4#803005)
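The situation discussed in this comment, a runtime exception that arrives wrapped as the cause of another exception, can be detected by walking the cause chain. The sketch below is not the content of the attached patch. The class and method names are hypothetical, and a plain `Exception` stands in for SolrServerException so that the example compiles without SolrJ.

```java
/** Hypothetical illustration; not the code of patch-CONNECTORS-1671. */
final class CauseChainSketch {
  private CauseChainSketch() {}

  /**
   * Returns the first RuntimeException found among the causes of the given
   * throwable, or null if there is none. The depth limit guards against
   * cyclic cause chains.
   */
  static RuntimeException findRuntimeCause(Throwable wrapper) {
    int depth = 0;
    for (Throwable c = wrapper.getCause(); c != null && depth < 32; c = c.getCause(), depth++) {
      if (c instanceof RuntimeException) {
        return (RuntimeException) c;
      }
    }
    return null;
  }

  public static void main(String[] args) {
    Exception wrapper = new Exception("server error", new IllegalStateException("bad request"));
    RuntimeException cause = findRuntimeCause(wrapper);
    System.out.println(cause == null ? "no runtime cause" : cause.getMessage());
  }
}
```

A handler could use such a check to treat the failure as specific to the document or action rather than as a reason to abort the job.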
[jira] [Commented] (CONNECTORS-1671) Solr output connector behavior on some exceptions
[ https://issues.apache.org/jira/browse/CONNECTORS-1671?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17390553#comment-17390553 ] Julien Massiera commented on CONNECTORS-1671: - [~kwri...@metacarta.com], is the patch ok for you ? > Solr output connector behavior on some exceptions > - > > Key: CONNECTORS-1671 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1671 > Project: ManifoldCF > Issue Type: Improvement > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.19 >Reporter: Julien Massiera >Priority: Major > Fix For: ManifoldCF next > > Attachments: patch-CONNECTORS-1671.txt > > > In the « handleIOException » method of the « HttpPoster » class, the unknown > case triggers a job failure despite the exception can only concern the > document/action itself and not a problem with a potential "Solr down" issue > (all "Solr down" issues are handled upstream) > Same thing in the « handleSolrServerException » method -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (CONNECTORS-1661) Admin UI does not handle UTF8 passwords
[ https://issues.apache.org/jira/browse/CONNECTORS-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera updated CONNECTORS-1661: Comment: was deleted (was: Hello, I am currently out of office and will be back on Monday, February 22, 2021. For any question, please write to cedric [dot] Ulmer [att] francelabs [dot] com. Regards, Julien Massiera) > Admin UI does not handle UTF8 passwords > --- > > Key: CONNECTORS-1661 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1661 > Project: ManifoldCF > Issue Type: Bug > Components: API >Affects Versions: ManifoldCF 2.17 >Reporter: Julien Massiera >Assignee: Kishore Kumar >Priority: Critical > Fix For: ManifoldCF 2.19 > > Attachments: patch-CONNECTORS-1661.txt > > > Setting UTF-8 non alphanumeric characters in the password for the admin user > does not work when obfuscating the password and setting it through the > org.apache.manifoldcf.login.password.obfuscated parameter of the > properties.xml file. > Alphanumeric characters work well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Comment Edited] (CONNECTORS-1667) New Tika Service Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339141#comment-17339141 ] Julien Massiera edited comment on CONNECTORS-1667 at 5/4/21, 4:50 PM: -- R1889497 on branch CONNECTORS-1667 was (Author: julienfl): R1889497 on branch CONNECTORS-1667 > New Tika Service Connector > -- > > Key: CONNECTORS-1667 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1667 > Project: ManifoldCF > Issue Type: New Feature > Components: Tika service connector >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > > The current Tika Service Connector exploits the '/unpack/all' endpoint of a > Tika Server. This endpoint is not optimal to only extract document's metadata > and content. We should develop a new connector based on the 'rmeta' endpoint > which is more suited for our needs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CONNECTORS-1667) New Tika Service Connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1667?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17339141#comment-17339141 ] Julien Massiera commented on CONNECTORS-1667: - R1889497 on branch CONNECTORS-1667 > New Tika Service Connector > -- > > Key: CONNECTORS-1667 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1667 > Project: ManifoldCF > Issue Type: New Feature > Components: Tika service connector >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > > The current Tika Service Connector exploits the '/unpack/all' endpoint of a > Tika Server. This endpoint is not optimal to only extract document's metadata > and content. We should develop a new connector based on the 'rmeta' endpoint > which is more suited for our needs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (CONNECTORS-1667) New Tika Service Connector
Julien Massiera created CONNECTORS-1667: --- Summary: New Tika Service Connector Key: CONNECTORS-1667 URL: https://issues.apache.org/jira/browse/CONNECTORS-1667 Project: ManifoldCF Issue Type: New Feature Components: Tika service connector Reporter: Julien Massiera Assignee: Julien Massiera The current Tika Service Connector exploits the '/unpack/all' endpoint of a Tika Server. This endpoint is not optimal to only extract document's metadata and content. We should develop a new connector based on the 'rmeta' endpoint which is more suited for our needs. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CONNECTORS-1665) WebConnector: Add activity records for excluded URLs
[ https://issues.apache.org/jira/browse/CONNECTORS-1665?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17298820#comment-17298820 ] Julien Massiera commented on CONNECTORS-1665: - Patch available > WebConnector: Add activity records for excluded URLs > - > > Key: CONNECTORS-1665 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1665 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.18 >Reporter: Julien Massiera >Priority: Trivial > Fix For: ManifoldCF 2.19 > > Attachments: patch-CONNECTORS-1665 > > > It would be interesting to add activity records in the WebConnector to keep > track of excluded URLs that match an exclude filter -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (CONNECTORS-1665) WebConnector: Add activity records for excluded URLs
Julien Massiera created CONNECTORS-1665: --- Summary: WebConnector: Add activity records for excluded URLs Key: CONNECTORS-1665 URL: https://issues.apache.org/jira/browse/CONNECTORS-1665 Project: ManifoldCF Issue Type: Improvement Components: Web connector Affects Versions: ManifoldCF 2.18 Reporter: Julien Massiera It would be interesting to add activity records in the WebConnector to keep track of excluded URLs that match an exclude filter -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Issue Comment Deleted] (CONNECTORS-1656) HTML extractor produces invalid XML
[ https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera updated CONNECTORS-1656: Comment: was deleted (was: Hello, I am currently out of office and will be back on Monday, February 22, 2021. For any question, please write to cedric [dot] Ulmer [att] francelabs [dot] com. Regards, Julien Massiera) > HTML extractor produces invalid XML > --- > > Key: CONNECTORS-1656 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1656 > Project: ManifoldCF > Issue Type: Bug > Components: HTML extractor >Affects Versions: ManifoldCF 2.17 >Reporter: Julien Massiera >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.19 > > Attachments: patch-CONNECTORS-1656 > > > The HTML extractor connector produces valid HTML doc (when the 'Strip HTML' > option is disabled) but invalid XML (some tags like img do not have closing > tag), and in some cases it is problematic. For example, when Tika is used > behind, it processes the document as an XML document and most of the time a > parse exception is raised. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CONNECTORS-1656) HTML extractor produces invalid XML
[ https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289516#comment-17289516 ] Julien Massiera commented on CONNECTORS-1656: - Hello, I am currently out of office and will be back on Monday, February 22, 2021. For any question, please write to cedric [dot] Ulmer [att] francelabs [dot] com. Regards, Julien Massiera > HTML extractor produces invalid XML > --- > > Key: CONNECTORS-1656 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1656 > Project: ManifoldCF > Issue Type: Bug > Components: HTML extractor >Affects Versions: ManifoldCF 2.17 >Reporter: Julien Massiera >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.19 > > Attachments: patch-CONNECTORS-1656 > > > The HTML extractor connector produces valid HTML doc (when the 'Strip HTML' > option is disabled) but invalid XML (some tags like img do not have closing > tag), and in some cases it is problematic. For example, when Tika is used > behind, it processes the document as an XML document and most of the time a > parse exception is raised. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CONNECTORS-1661) Admin UI does not handle UTF8 passwords
[ https://issues.apache.org/jira/browse/CONNECTORS-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17289559#comment-17289559 ] Julien Massiera commented on CONNECTORS-1661: - Hello, I am currently out of office and will be back on Monday, February 22, 2021. For any question, please write to cedric [dot] Ulmer [att] francelabs [dot] com. Regards, Julien Massiera > Admin UI does not handle UTF8 passwords > --- > > Key: CONNECTORS-1661 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1661 > Project: ManifoldCF > Issue Type: Bug > Components: API >Affects Versions: ManifoldCF 2.17 >Reporter: Julien Massiera >Assignee: Kishore Kumar >Priority: Critical > Fix For: ManifoldCF 2.19 > > Attachments: patch-CONNECTORS-1661.txt > > > Setting UTF-8 non alphanumeric characters in the password for the admin user > does not work when obfuscating the password and setting it through the > org.apache.manifoldcf.login.password.obfuscated parameter of the > properties.xml file. > Alphanumeric characters work well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CONNECTORS-1656) HTML extractor produces invalid XML
[ https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283832#comment-17283832 ] Julien Massiera commented on CONNECTORS-1656: - [~kwri...@metacarta.com], is the patch ok ? > HTML extractor produces invalid XML > --- > > Key: CONNECTORS-1656 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1656 > Project: ManifoldCF > Issue Type: Bug > Components: HTML extractor >Affects Versions: ManifoldCF 2.17 >Reporter: Julien Massiera >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF next > > Attachments: patch-CONNECTORS-1656 > > > The HTML extractor connector produces valid HTML doc (when the 'Strip HTML' > option is disabled) but invalid XML (some tags like img do not have closing > tag), and in some cases it is problematic. For example, when Tika is used > behind, it processes the document as an XML document and most of the time a > parse exception is raised. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CONNECTORS-1661) Admin UI does not handle UTF8 passwords
[ https://issues.apache.org/jira/browse/CONNECTORS-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17283831#comment-17283831 ] Julien Massiera commented on CONNECTORS-1661: - [~kwri...@metacarta.com] and [~kishorekumar], is the patch ok for you ? > Admin UI does not handle UTF8 passwords > --- > > Key: CONNECTORS-1661 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1661 > Project: ManifoldCF > Issue Type: Bug > Components: API >Affects Versions: ManifoldCF 2.17 >Reporter: Julien Massiera >Assignee: Kishore Kumar >Priority: Critical > Fix For: ManifoldCF 2.19 > > Attachments: patch-CONNECTORS-1661.txt > > > Setting UTF-8 non alphanumeric characters in the password for the admin user > does not work when obfuscating the password and setting it through the > org.apache.manifoldcf.login.password.obfuscated parameter of the > properties.xml file. > Alphanumeric characters work well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (CONNECTORS-1662) JIRA connector - NullPointerException after getCharSet method
Julien Massiera created CONNECTORS-1662: --- Summary: JIRA connector - NullPointerException after getCharSet method Key: CONNECTORS-1662 URL: https://issues.apache.org/jira/browse/CONNECTORS-1662 Project: ManifoldCF Issue Type: Bug Components: JIRA connector Affects Versions: ManifoldCF 2.17 Reporter: Julien Massiera Sometimes the following exception is triggered on some documents during a crawl:
{code:java}
Error tossed: charset
java.lang.NullPointerException: charset
 at java.io.InputStreamReader.<init>(InputStreamReader.java:115) ~[?:?]
 at org.apache.manifoldcf.crawler.connectors.jira.JiraSession.convertToString(JiraSession.java:183) ~[?:?]
 at org.apache.manifoldcf.crawler.connectors.jira.JiraSession.getRest(JiraSession.java:237) ~[?:?]
 at org.apache.manifoldcf.crawler.connectors.jira.JiraSession.getIssue(JiraSession.java:317) ~[?:?]
 at org.apache.manifoldcf.crawler.connectors.jira.JiraRepositoryConnector$GetIssueThread.run(JiraRepositoryConnector.java:1409) ~[?:?]
{code}
After investigation, it appears that the getCharSet method of the JiraSession class may return a null charset, either when the charset is null (there is no null check) or when an UnsupportedCharsetException occurs -- This message was sent by Atlassian Jira (v8.3.4#803005)
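One way to avoid handing a null charset to `InputStreamReader` is to resolve the charset name defensively. The following is a minimal sketch under stated assumptions: the class and method names are hypothetical, and falling back to UTF-8 is a choice made here for illustration, not necessarily what the JIRA connector does or should do.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.StandardCharsets;
import java.nio.charset.UnsupportedCharsetException;

/** Hypothetical helper; not the actual JiraSession.getCharSet method. */
final class CharsetFallbackSketch {
  private CharsetFallbackSketch() {}

  /**
   * Resolves a charset name, never returning null. A missing, malformed,
   * or unsupported name falls back to UTF-8 (an assumed default).
   */
  static Charset resolveCharset(String name) {
    if (name == null || name.trim().isEmpty()) {
      return StandardCharsets.UTF_8;
    }
    try {
      return Charset.forName(name.trim());
    } catch (IllegalCharsetNameException | UnsupportedCharsetException e) {
      return StandardCharsets.UTF_8;
    }
  }

  public static void main(String[] args) {
    System.out.println(resolveCharset(null));
    System.out.println(resolveCharset("ISO-8859-1"));
  }
}
```

Because the result is never null, a caller such as `new InputStreamReader(stream, resolveCharset(name))` cannot hit the NullPointerException shown in the trace.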
[jira] [Commented] (CONNECTORS-1661) Admin UI does not handle UTF8 passwords
[ https://issues.apache.org/jira/browse/CONNECTORS-1661?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17272131#comment-17272131 ] Julien Massiera commented on CONNECTORS-1661: - Hi [~kishorekumar], did you make progress on this issue ? Or are you still in need of additional information ? > Admin UI does not handle UTF8 passwords > --- > > Key: CONNECTORS-1661 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1661 > Project: ManifoldCF > Issue Type: Bug > Components: API >Affects Versions: ManifoldCF 2.17 >Reporter: Julien Massiera >Assignee: Kishore Kumar >Priority: Critical > > Setting UTF-8 non alphanumeric characters in the password for the admin user > does not work when obfuscating the password and setting it through the > org.apache.manifoldcf.login.password.obfuscated parameter of the > properties.xml file. > Alphanumeric characters work well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (CONNECTORS-1661) Admin UI does not handle UTF8 passwords
Julien Massiera created CONNECTORS-1661: --- Summary: Admin UI does not handle UTF8 passwords Key: CONNECTORS-1661 URL: https://issues.apache.org/jira/browse/CONNECTORS-1661 Project: ManifoldCF Issue Type: Bug Components: API Affects Versions: ManifoldCF 2.17 Reporter: Julien Massiera Setting UTF-8 non alphanumeric characters in the password for the admin user does not work when obfuscating the password and setting it through the org.apache.manifoldcf.login.password.obfuscated parameter of the properties.xml file. Alphanumeric characters work well. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CONNECTORS-1657) Web connector - Handle sitemap instruction in robot.txt
[ https://issues.apache.org/jira/browse/CONNECTORS-1657?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17219079#comment-17219079 ] Julien Massiera commented on CONNECTORS-1657: - Yes, it is a warning in the log but an ERROR in the simple history. We should at least change the return code of the activity, don't you agree? > Web connector - Handle sitemap instruction in robot.txt > --- > > Key: CONNECTORS-1657 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1657 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.17 >Reporter: Julien Massiera >Priority: Major > > Currently the web connector does not understand a robots.txt file that > points to a sitemap. For example, for the site https://www.persee.fr, the > simple history shows the following error: > Unknown robots.txt line: 'Sitemap: https://www.persee.fr/sitemap.xml' > -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (CONNECTORS-1657) Web connector - Handle sitemap instruction in robot.txt
Julien Massiera created CONNECTORS-1657: --- Summary: Web connector - Handle sitemap instruction in robot.txt Key: CONNECTORS-1657 URL: https://issues.apache.org/jira/browse/CONNECTORS-1657 Project: ManifoldCF Issue Type: Improvement Components: Web connector Affects Versions: ManifoldCF 2.17 Reporter: Julien Massiera Currently the web connector does not understand a robots.txt file that points to a sitemap. For example, for the site https://www.persee.fr, the simple history shows the following error: Unknown robots.txt line: 'Sitemap: https://www.persee.fr/sitemap.xml' -- This message was sent by Atlassian Jira (v8.3.4#803005)
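To illustrate what recognizing the Sitemap line could involve, here is a minimal sketch that collects sitemap URLs from the text of a robots.txt file. It is not the web connector's parser: the class and method names are hypothetical, and it assumes that field names are case-insensitive and that a '#' starts a comment.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of the ManifoldCF web connector.
final class RobotsSitemapSketch {
    private RobotsSitemapSketch() {}

    /** Returns the values of all "Sitemap:" lines, in file order. */
    static List<String> sitemapUrls(String robotsTxt) {
        List<String> urls = new ArrayList<>();
        for (String rawLine : robotsTxt.split("\\r?\\n")) {
            // Assumption: everything after '#' is a comment.
            int hash = rawLine.indexOf('#');
            String line = (hash >= 0 ? rawLine.substring(0, hash) : rawLine).trim();
            // Split on the first colon only, so the URL's own "https:" is kept.
            int colon = line.indexOf(':');
            if (colon <= 0) {
                continue;
            }
            String field = line.substring(0, colon).trim().toLowerCase(Locale.ROOT);
            String value = line.substring(colon + 1).trim();
            if (field.equals("sitemap") && !value.isEmpty()) {
                urls.add(value);
            }
        }
        return urls;
    }
}
```

A parser that handles the field this way, even if it does nothing with the URLs yet, would no longer need to report the line as unknown.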
[jira] [Commented] (CONNECTORS-1656) HTML extractor produces invalid XML
[ https://issues.apache.org/jira/browse/CONNECTORS-1656?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17218358#comment-17218358 ] Julien Massiera commented on CONNECTORS-1656: - Hi [~kwri...@metacarta.com], The document produced identifies itself as XHTML. But even if it was HTML, the default HTML parser of Tika uses SAX to parse documents. Here is the configuration of the Tika HTML parser (default configuration): HtmlParser Class: org.apache.tika.parser.html.HtmlParser Mime Types: text/html application/vnd.wap.xhtml+xm application/x-asp application/xhtml+xml So as it handles html and xhtml, the processed files have to be XML valid anyway > HTML extractor produces invalid XML > --- > > Key: CONNECTORS-1656 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1656 > Project: ManifoldCF > Issue Type: Bug > Components: HTML extractor >Affects Versions: ManifoldCF 2.17 >Reporter: Julien Massiera >Assignee: Karl Wright >Priority: Major > > The HTML extractor connector produces valid HTML doc (when the 'Strip HTML' > option is disabled) but invalid XML (some tags like img do not have closing > tag), and in some cases it is problematic. For example, when Tika is used > behind, it processes the document as an XML document and most of the time a > parse exception is raised. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (CONNECTORS-1656) HTML extractor produces invalid XML
Julien Massiera created CONNECTORS-1656: --- Summary: HTML extractor produces invalid XML Key: CONNECTORS-1656 URL: https://issues.apache.org/jira/browse/CONNECTORS-1656 Project: ManifoldCF Issue Type: Bug Components: HTML extractor Affects Versions: ManifoldCF 2.17 Reporter: Julien Massiera The HTML extractor connector produces a valid HTML document (when the 'Strip HTML' option is disabled) but invalid XML (some tags, such as img, have no closing tag), which is problematic in some cases. For example, when Tika is used downstream, it processes the document as an XML document, and most of the time a parse exception is raised. -- This message was sent by Atlassian Jira (v8.3.4#803005)
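The difference between valid HTML and well-formed XML described above can be reproduced with the JDK's own XML parser. The sketch below is only an illustration: the class name is hypothetical, and it uses the standard JAXP DocumentBuilder rather than Tika.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper used only to demonstrate the problem.
final class WellFormedCheck {
    private WellFormedCheck() {}

    /** Returns true if the markup parses as well-formed XML. */
    static boolean isWellFormedXml(String markup) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new ByteArrayInputStream(markup.getBytes(StandardCharsets.UTF_8)));
            return true;
        } catch (SAXException e) {
            // Covers SAXParseException, e.g. an element that is never closed.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

Here, an img tag without a closing tag inside a paragraph is rejected, while the same tag written as a self-closing element is accepted, which is the behaviour an XML-based consumer would exhibit.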
[jira] [Commented] (CONNECTORS-1655) Web connector - UnsupportedEncodingException utf-8
[ https://issues.apache.org/jira/browse/CONNECTORS-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215375#comment-17215375 ] Julien Massiera commented on CONNECTORS-1655: - Thanks for the fix ! > Web connector - UnsupportedEncodingException utf-8 > -- > > Key: CONNECTORS-1655 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1655 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.17 >Reporter: Julien Massiera >Assignee: Karl Wright >Priority: Critical > Fix For: ManifoldCF 2.18 > > > When crawling some sites (for instance this one: > [http://www.antibes-juanlespins.com/] ) the job manages to index some > documents, but the stops with the following error code: > Error: IO error: utf-8; filename=rseventspro_rss20_56.xml > Here is one the MCF stacktrace: > Exception tossed: IO error: utf-8; filename=rseventspro_rss20_56.xml > org.apache.manifoldcf.core.interfaces.ManifoldCFException: IO error: utf-8; > filename=rseventspro_rss20_56.xml > at > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4203) > ~[?:?] > at > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:3855) > ~[?:?] > at > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:746) > ~[?:?] > at > org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) > [mcf-pull-agent.jar:?] > Caused by: java.io.UnsupportedEncodingException: utf-8; > filename=rseventspro_rss20_56.xml > at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:71) > ~[?:1.8.0_212] > at java.io.InputStreamReader.(InputStreamReader.java:100) ~[?:1.8.0_212] > at > org.apache.manifoldcf.connectorcommon.fuzzyml.DecodingByteReceiver.dealWithBytes(DecodingByteReceiver.java:47) > ~[?:?] 
> at > org.apache.manifoldcf.connectorcommon.fuzzyml.BOMEncodingDetector.dealWithRemainder(BOMEncodingDetector.java:250) > ~[?:?] > at > org.apache.manifoldcf.connectorcommon.fuzzyml.SingleByteReceiver.dealWithBytes(SingleByteReceiver.java:52) > ~[?:?] > at > org.apache.manifoldcf.connectorcommon.fuzzyml.Parser.parseWithCharsetDetection(Parser.java:74) > ~[?:?] > at > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4174) > ~[?:?] > ... 3 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CONNECTORS-1655) Web connector - UnsupportedEncodingException utf-8
[ https://issues.apache.org/jira/browse/CONNECTORS-1655?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17215238#comment-17215238 ] Julien Massiera commented on CONNECTORS-1655: - Hi [~kwri...@metacarta.com], I am using offical OpenJDK 11 installed from the Debian repo: openjdk version "11.0.8" 2020-07-14 OpenJDK Runtime Environment 18.9 (build 11.0.8+10) OpenJDK 64-Bit Server VM 18.9 (build 11.0.8+10, mixed mode) > Web connector - UnsupportedEncodingException utf-8 > -- > > Key: CONNECTORS-1655 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1655 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.17 >Reporter: Julien Massiera >Priority: Critical > > When crawling some sites (for instance this one: > [http://www.antibes-juanlespins.com/] ) the job manages to index some > documents, but the stops with the following error code: > Error: IO error: utf-8; filename=rseventspro_rss20_56.xml > Here is one the MCF stacktrace: > Exception tossed: IO error: utf-8; filename=rseventspro_rss20_56.xml > org.apache.manifoldcf.core.interfaces.ManifoldCFException: IO error: utf-8; > filename=rseventspro_rss20_56.xml > at > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4203) > ~[?:?] > at > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:3855) > ~[?:?] > at > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:746) > ~[?:?] > at > org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) > [mcf-pull-agent.jar:?] 
> Caused by: java.io.UnsupportedEncodingException: utf-8; > filename=rseventspro_rss20_56.xml > at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:71) > ~[?:1.8.0_212] > at java.io.InputStreamReader.(InputStreamReader.java:100) ~[?:1.8.0_212] > at > org.apache.manifoldcf.connectorcommon.fuzzyml.DecodingByteReceiver.dealWithBytes(DecodingByteReceiver.java:47) > ~[?:?] > at > org.apache.manifoldcf.connectorcommon.fuzzyml.BOMEncodingDetector.dealWithRemainder(BOMEncodingDetector.java:250) > ~[?:?] > at > org.apache.manifoldcf.connectorcommon.fuzzyml.SingleByteReceiver.dealWithBytes(SingleByteReceiver.java:52) > ~[?:?] > at > org.apache.manifoldcf.connectorcommon.fuzzyml.Parser.parseWithCharsetDetection(Parser.java:74) > ~[?:?] > at > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4174) > ~[?:?] > ... 3 more -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Created] (CONNECTORS-1655) Web connector - UnsupportedEncodingException utf-8
Julien Massiera created CONNECTORS-1655: --- Summary: Web connector - UnsupportedEncodingException utf-8 Key: CONNECTORS-1655 URL: https://issues.apache.org/jira/browse/CONNECTORS-1655 Project: ManifoldCF Issue Type: Bug Components: Web connector Affects Versions: ManifoldCF 2.17 Reporter: Julien Massiera When crawling some sites (for instance this one: [http://www.antibes-juanlespins.com/] ), the job manages to index some documents, but then stops with the following error:
Error: IO error: utf-8; filename=rseventspro_rss20_56.xml
Here is the MCF stack trace:
Exception tossed: IO error: utf-8; filename=rseventspro_rss20_56.xml
org.apache.manifoldcf.core.interfaces.ManifoldCFException: IO error: utf-8; filename=rseventspro_rss20_56.xml
    at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4203) ~[?:?]
    at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.extractLinks(WebcrawlerConnector.java:3855) ~[?:?]
    at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.processDocuments(WebcrawlerConnector.java:746) ~[?:?]
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:399) [mcf-pull-agent.jar:?]
Caused by: java.io.UnsupportedEncodingException: utf-8; filename=rseventspro_rss20_56.xml
    at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:71) ~[?:1.8.0_212]
    at java.io.InputStreamReader.<init>(InputStreamReader.java:100) ~[?:1.8.0_212]
    at org.apache.manifoldcf.connectorcommon.fuzzyml.DecodingByteReceiver.dealWithBytes(DecodingByteReceiver.java:47) ~[?:?]
    at org.apache.manifoldcf.connectorcommon.fuzzyml.BOMEncodingDetector.dealWithRemainder(BOMEncodingDetector.java:250) ~[?:?]
    at org.apache.manifoldcf.connectorcommon.fuzzyml.SingleByteReceiver.dealWithBytes(SingleByteReceiver.java:52) ~[?:?]
    at org.apache.manifoldcf.connectorcommon.fuzzyml.Parser.parseWithCharsetDetection(Parser.java:74) ~[?:?]
    at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.handleXML(WebcrawlerConnector.java:4174) ~[?:?]
    ... 3 more
-- This message was sent by Atlassian Jira (v8.3.4#803005)
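The exception message suggests that the string passed as the encoding name was "utf-8; filename=rseventspro_rss20_56.xml", that is, a charset value with a further parameter still attached. The sketch below shows one way to isolate the charset parameter from a Content-Type-like value. It is an assumption-laden illustration, not the fix applied in ManifoldCF 2.18: the class and method names are hypothetical, and the sample header in the usage note is invented to match the error message.

```java
// Hypothetical helper, not part of ManifoldCF.
final class CharsetParamSketch {
    private static final String PREFIX = "charset=";

    private CharsetParamSketch() {}

    /**
     * Returns the value of the charset parameter of a Content-Type-like
     * header value, or null if there is none. Parameters that follow the
     * charset (such as "filename=...") are not included in the result.
     */
    static String charsetName(String contentType) {
        if (contentType == null) {
            return null;
        }
        for (String part : contentType.split(";")) {
            String p = part.trim();
            // Case-insensitive match on the parameter name.
            if (p.regionMatches(true, 0, PREFIX, 0, PREFIX.length())) {
                String value = p.substring(PREFIX.length()).trim();
                // Strip optional surrounding double quotes.
                if (value.length() >= 2 && value.startsWith("\"") && value.endsWith("\"")) {
                    value = value.substring(1, value.length() - 1);
                }
                return value.isEmpty() ? null : value;
            }
        }
        return null;
    }
}
```

For an assumed header value such as "application/rss+xml; charset=utf-8; filename=rseventspro_rss20_56.xml", this returns only "utf-8", which is a name that InputStreamReader accepts.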
[jira] [Commented] (CONNECTORS-1105) Add maven delivery targets to poms
[ https://issues.apache.org/jira/browse/CONNECTORS-1105?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17143855#comment-17143855 ] Julien Massiera commented on CONNECTORS-1105: - [~kwri...@metacarta.com], [~schuch] any news about this ticket ? I am really interested to have at least the MCF jars pushed to the maven central repo > Add maven delivery targets to poms > -- > > Key: CONNECTORS-1105 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1105 > Project: ManifoldCF > Issue Type: Improvement > Components: Build >Affects Versions: ManifoldCF 1.8 >Reporter: Karl Wright >Assignee: Markus Schuch >Priority: Major > Fix For: ManifoldCF next > > > We've been asked to deliver mcf jars and wars to maven central repository by > some developers. This ticket represents that work. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (CONNECTORS-1637) New Confluence connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1637. - Fix Version/s: ManifoldCF 2.16 Resolution: Fixed > New Confluence connector > > > Key: CONNECTORS-1637 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1637 > Project: ManifoldCF > Issue Type: New Feature > Components: Confluence connector >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > Fix For: ManifoldCF 2.16 > > > We need to address 3 main issues of the current Confluence connector : > - it does not correctly implements the security > - it has performance problems when handling a huge dataset > - it generates a version string for documents that is not sufficient to > detect all changes > To resolve some of these issues, the connector has to use the new confluence > API which is available from the v6. For that reason we need to release a new > connector. -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Resolved] (CONNECTORS-1645) Identical login regex rules bug
[ https://issues.apache.org/jira/browse/CONNECTORS-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julien Massiera resolved CONNECTORS-1645. - Fix Version/s: ManifoldCF 2.17 Resolution: Fixed > Identical login regex rules bug > --- > > Key: CONNECTORS-1645 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1645 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: Julien Massiera >Assignee: Karl Wright >Priority: Critical > Fix For: ManifoldCF 2.17 > > > If a login sequence implies the same URL for different login types (ex: form > and redirect), you can't configure the same regex for each of them otherwise > they will override each other and only the last configured one will be > considered by the login sequence. > Currently the only workaround is to make a different regex for each login > type that matches the same URL -- This message was sent by Atlassian Jira (v8.3.4#803005)
[jira] [Commented] (CONNECTORS-1645) Identical login regex rules bug
[ https://issues.apache.org/jira/browse/CONNECTORS-1645?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17130915#comment-17130915 ] Julien Massiera commented on CONNECTORS-1645: - [~kwri...@metacarta.com] the method "findNextOne" in the LoginParameterIterator class has a problem: it always returns the same currentOne, so the "hasNext" method always returns "true". This results in an endless loop, and I suppose it explains why the unit tests of this connector never end. > Identical login regex rules bug > --- > > Key: CONNECTORS-1645 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1645 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: Julien Massiera >Assignee: Karl Wright >Priority: Critical > > If a login sequence involves the same URL for different login types (e.g. form > and redirect), you can't configure the same regex for each of them; otherwise > they will override each other and only the last configured one will be > considered by the login sequence. > Currently the only workaround is to write a different regex for each login > type that matches the same URL -- This message was sent by Atlassian Jira (v8.3.4#803005)
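For reference, a look-ahead iterator avoids the endless loop described in this comment by clearing its cached element each time next() returns it. The sketch below is a generic illustration under that assumption; it borrows the findNextOne name from the comment but is not the LoginParameterIterator code, and the FilteringIterator class is hypothetical.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.function.Predicate;

// Hypothetical generic iterator, not the connector's LoginParameterIterator.
final class FilteringIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private final Predicate<T> accept;
    private T nextOne;          // cached look-ahead element
    private boolean hasNextOne; // true only while nextOne is unconsumed

    FilteringIterator(Iterator<T> source, Predicate<T> accept) {
        this.source = source;
        this.accept = accept;
    }

    /** Advances the source until an accepted element is cached or the source ends. */
    private void findNextOne() {
        while (!hasNextOne && source.hasNext()) {
            T candidate = source.next();
            if (accept.test(candidate)) {
                nextOne = candidate;
                hasNextOne = true;
            }
        }
    }

    @Override
    public boolean hasNext() {
        findNextOne();
        return hasNextOne;
    }

    @Override
    public T next() {
        findNextOne();
        if (!hasNextOne) {
            throw new NoSuchElementException();
        }
        T result = nextOne;
        // Consume the cached element so that the next call advances;
        // omitting this step is what makes hasNext() return true forever.
        nextOne = null;
        hasNextOne = false;
        return result;
    }
}
```

A loop of the form while (it.hasNext()) { it.next(); } terminates with this class because every call to next() forces findNextOne to pull a fresh element from the source.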
[jira] [Created] (CONNECTORS-1645) Identical login regex rules bug
Julien Massiera created CONNECTORS-1645: --- Summary: Identical login regex rules bug Key: CONNECTORS-1645 URL: https://issues.apache.org/jira/browse/CONNECTORS-1645 Project: ManifoldCF Issue Type: Bug Components: Web connector Affects Versions: ManifoldCF 2.12 Reporter: Julien Massiera If a login sequence involves the same URL for different login types (e.g. form and redirect), you can't configure the same regex for each of them; otherwise they will override each other and only the last configured one will be considered by the login sequence. Currently the only workaround is to write a different regex for each login type that matches the same URL -- This message was sent by Atlassian Jira (v8.3.4#803005)
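The overriding behaviour described above is what one would see if rules were stored in a map keyed by the regex alone. The sketch below only illustrates that general effect and one way around it; it is an assumption, not a description of how the web connector stores its login rules, and all names in it are hypothetical.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration; each rule is given as {loginType, regex}.
final class LoginRuleKeys {
    private LoginRuleKeys() {}

    /** Keyed by regex alone: a later rule with the same regex replaces the earlier one. */
    static Map<String, String> byRegexOnly(String[][] rules) {
        Map<String, String> byRegex = new LinkedHashMap<>();
        for (String[] rule : rules) {
            byRegex.put(rule[1], rule[0]);
        }
        return byRegex;
    }

    /** Keyed by (loginType, regex): rules sharing a regex but not a type are both kept. */
    static Map<List<String>, String> byTypeAndRegex(String[][] rules) {
        Map<List<String>, String> byPair = new LinkedHashMap<>();
        for (String[] rule : rules) {
            byPair.put(Arrays.asList(rule[0], rule[1]), rule[0]);
        }
        return byPair;
    }
}
```

With a form rule and a redirect rule that share one regex, the first map ends up holding only the redirect rule, while the second keeps both.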
[jira] [Commented] (CONNECTORS-1637) New Confluence connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1637?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17053598#comment-17053598 ] Julien Massiera commented on CONNECTORS-1637: - [~kwri...@metacarta.com] I fixed the ant build. The branch is done ! > New Confluence connector > > > Key: CONNECTORS-1637 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1637 > Project: ManifoldCF > Issue Type: New Feature > Components: Confluence connector >Reporter: Julien Massiera >Assignee: Julien Massiera >Priority: Major > > We need to address 3 main issues of the current Confluence connector : > - it does not correctly implements the security > - it has performance problems when handling a huge dataset > - it generates a version string for documents that is not sufficient to > detect all changes > To resolve some of these issues, the connector has to use the new confluence > API which is available from the v6. For that reason we need to release a new > connector. -- This message was sent by Atlassian Jira (v8.3.4#803005)