[jira] [Resolved] (CONNECTORS-1373) Metadata mapping
[ https://issues.apache.org/jira/browse/CONNECTORS-1373?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-1373.
    Resolution: Won't Fix

> Metadata mapping
>
> Key: CONNECTORS-1373
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1373
> Project: ManifoldCF
> Issue Type: Task
> Components: CMIS Output Connector
> Reporter: Piergiorgio Lucidi
> Assignee: Piergiorgio Lucidi
> Priority: Major
> Fix For: ManifoldCF 2.12
>
> We need to add the metadata mapping to allow users to migrate not only the
> content but also the properties.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1529) Add "url" output element to ES Output Connector (required when used with the Web Repository Connector)
[ https://issues.apache.org/jira/browse/CONNECTORS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-1529.
    Resolution: Won't Fix

Not a good idea; this was fixed instead by adding a new canonicalization capability to the web connector.

> Add "url" output element to ES Output Connector (required when used with the
> Web Repository Connector)
>
> Key: CONNECTORS-1529
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1529
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Elastic Search connector
> Affects Versions: ManifoldCF 2.10
> Reporter: Steph van Schalkwyk
> Assignee: Steph van Schalkwyk
> Priority: Major
> Fix For: ManifoldCF 2.12
> Attachments: elasticsearch.patch, image-2018-09-06-10-28-45-008.png
>
> Add "url" (a copy of the _id field) to the ES output.
> ES no longer supports copying from _id (copy-to) in the schema.
> As per !image-2018-09-06-10-28-45-008.png!
[jira] [Resolved] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation
[ https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-1552.
    Resolution: Fixed

> Apache ManifoldCF Elastic Connector for Basic Authorisation
>
> Key: CONNECTORS-1552
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1552
> Project: ManifoldCF
> Issue Type: Improvement
> Components: Elastic Search connector
> Affects Versions: ManifoldCF 2.10
> Reporter: Krishna Agrawal
> Assignee: Karl Wright
> Priority: Major
> Fix For: ManifoldCF 2.12
> Attachments: screenshot-1.png
>
> We are using Apache ManifoldCF to connect to Elasticsearch. As our Elastic
> server is behind a protected URL, there is no way we are able to connect from
> the Admin console. If we remove the authentication, the connector works well,
> but we want to access it by passing a username and password.
> Please guide us so that we can complete our setup.
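The basic-authorization support requested here amounts to sending an `Authorization: Basic <base64(user:pass)>` header on every request to the protected Elasticsearch endpoint. A minimal sketch of building that header follows; the credentials are hypothetical and this illustrates the HTTP mechanism only, not the connector's actual code:

```python
import base64

def basic_auth_header(username: str, password: str) -> dict:
    """Build the HTTP Basic Authorization header that a
    password-protected Elasticsearch endpoint expects."""
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return {"Authorization": f"Basic {token}"}

# Hypothetical credentials, for illustration only:
headers = basic_auth_header("esuser", "espass")
print(headers["Authorization"])  # Basic ZXN1c2VyOmVzcGFzcw==
```

Any HTTP client can then attach these headers to requests against the protected cluster.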
[jira] [Assigned] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation
[ https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright reassigned CONNECTORS-1552:
    Assignee: Karl Wright (was: Steph van Schalkwyk)
[jira] [Resolved] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright resolved CONNECTORS-1562.
    Resolution: Fixed
    Fix Version/s: ManifoldCF 2.12

r1849001 | kwright | 2018-12-15 12:47:31 -0500 (Sat, 15 Dec 2018) | 1 line
Final fix for CONNECTORS-1562.
r1849000 | kwright | 2018-12-15 12:02:07 -0500 (Sat, 15 Dec 2018) | 1 line
More debugging and refactoring
r1848999 | kwright | 2018-12-15 09:29:23 -0500 (Sat, 15 Dec 2018) | 1 line
Log all delete dependencies that we record, and do more refactoring
r1848992 | kwright | 2018-12-15 07:56:23 -0500 (Sat, 15 Dec 2018) | 1 line
More minor refactoring of HopCount module
r1848991 | kwright | 2018-12-15 07:46:16 -0500 (Sat, 15 Dec 2018) | 1 line
Minor refactoring to bring code off of the Java 1.4 world
r1848981 | kwright | 2018-12-15 03:23:57 -0500 (Sat, 15 Dec 2018) | 1 line
Improve hopcount logging further, this time on the query side
r1848911 | kwright | 2018-12-14 00:58:42 -0500 (Fri, 14 Dec 2018) | 1 line
Improve hopcount logging and commenting

> Documents unreachable due to hopcount are not considered unreachable on
> cleanup pass
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
> Issue Type: Bug
> Components: Elastic Search connector, Web connector
> Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elastic output connector
> Job crawls website input and outputs content to Elasticsearch
> Reporter: Tim Steenbeke
> Assignee: Karl Wright
> Priority: Critical
> Labels: starter
> Fix For: ManifoldCF 2.12
> Attachments: manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
> Original Estimate: 4h
> Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the
> changed seeds. I update my job to change the seed map and rerun it, or use
> the scheduler to keep it running even after updating it. After the rerun, the
> unreachable documents don't get deleted; it only adds documents when they can
> be reached.
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ]

Karl Wright commented on CONNECTORS-1562:

It's a bit more complicated than I originally thought. Once the job has been run, the hopcount state is corrupted, because there's an existing distance that never gets invalidated, and that can never be fixed so long as the job hangs around. So I will need to capture a run from scratch with database debugging on, in order to see exactly what dependencies are getting recorded, and why the query meant to pick them up and invalidate them on subsequent runs is missing the records inserted during the seeding process.
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722182#comment-16722182 ]

Karl Wright commented on CONNECTORS-1562:

I think I determined what the problem is: no delete dependencies are being recorded for seeds. That means we never invalidate the initial hopcount answers, which explains why this problem seems confined to seeds and nothing else. A fix should be straightforward, but I will also want to construct an integration test that exercises it before a final commit is done.
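The diagnosis above (cached hopcount answers for seeds are never invalidated, because no delete dependency is recorded for them) can be illustrated with a toy model. This is a sketch of the idea only, not ManifoldCF's actual HopCount code; all names here are made up for illustration:

```python
# Toy model: cached hopcount answers are invalidated via recorded
# "delete dependencies". If no dependency is recorded for a seed,
# removing the seed never invalidates its cached distance.

class HopcountCache:
    def __init__(self):
        self.distance = {}      # doc -> cached hop distance
        self.delete_deps = {}   # doc -> docs whose removal invalidates it

    def record(self, doc, dist, depends_on=()):
        self.distance[doc] = dist
        self.delete_deps[doc] = set(depends_on)

    def remove(self, removed_doc):
        # Invalidate every cached answer that depended on the removed doc.
        for doc, deps in list(self.delete_deps.items()):
            if removed_doc in deps:
                del self.distance[doc]
                del self.delete_deps[doc]

cache = HopcountCache()
# Buggy behaviour: the seed is recorded with no delete dependency.
cache.record("seed", 0)
cache.record("child", 1, depends_on=("seed",))
cache.remove("seed")
print("seed" in cache.distance)   # True: a stale answer survives (the bug)
# Fixed behaviour: record a delete dependency for the seed as well.
cache.record("seed", 0, depends_on=("seed",))
cache.remove("seed")
print("seed" in cache.distance)   # False: the seed's answer is invalidated
```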
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722124#comment-16722124 ]

Karl Wright commented on CONNECTORS-1562:

Analysis: On the reduced pass, some documents had a 'link' hopcount of 1, but they all had a 'redirect' hopcount of 0. The hopcount computation queue, furthermore, was processed but always found empty, which means that no "invalid" markers were ever detected. This argues that the seeding phase did not operate as expected as far as marking hopcount rows as invalid.
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-1562:
    Attachment: manifoldcf.log.reduced
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16722066#comment-16722066 ]

Karl Wright commented on CONNECTORS-1562:

Tonight's research was inconclusive because the logging was not adequate for hopcount querying. I've added better logging and run the example again, recollecting the reduced-phase log as before. This has been attached but not yet examined.
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-1562:
    Attachment: (was: manifoldcf.log.reduced)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16721311#comment-16721311 ]

Karl Wright commented on CONNECTORS-1562:

I had a look at the startup thread portion of the dump and found that the queries all made sense. I'll have to spot-check a hopcount inquiry to see if any of the hopcount invalidations were, in fact, actually logged. If not, then the culprit is almost certainly the management of the hopdeletedeps table, in that it doesn't have the rows in it that we expect.
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-1562:
    Attachment: (was: manifoldcf.log.reduced)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720982#comment-16720982 ]

Karl Wright commented on CONNECTORS-1562:

Attached the "reduced" step with query logging. Analysis will take some time. The entire startup log chunk is here (and it contains the seeding part, which is what we're interested in):

{code}
DEBUG 2018-12-14T01:07:42,367 (Startup thread) - Requested query: [UPDATE jobqueue SET needpriorityprocessid=NULL,needpriority=? WHERE (jobid=? AND status=? OR jobid=? AND status=?)]
DEBUG 2018-12-14T01:07:42,368 (Thread-688) - Actual query: [SET SCHEMA PUBLIC]
DEBUG 2018-12-14T01:07:42,368 (Thread-688) - Done actual query (0ms): [SET SCHEMA PUBLIC]
DEBUG 2018-12-14T01:07:42,369 (Thread-689) - Actual query: [UPDATE jobqueue SET needpriorityprocessid=NULL,needpriority=? WHERE (jobid=? AND status=? OR jobid=? AND status=?)]
DEBUG 2018-12-14T01:07:42,369 (Thread-689) - Parameter 0: 'T'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) - Parameter 1: '1544121003866'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) - Parameter 2: 'P'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) - Parameter 3: '1544121003866'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) - Parameter 4: 'G'
DEBUG 2018-12-14T01:07:42,369 (Thread-689) - Done actual query (0ms): [UPDATE jobqueue SET needpriorityprocessid=NULL,needpriority=? WHERE (jobid=? AND status=? OR jobid=? AND status=?)]
DEBUG 2018-12-14T01:07:42,369 (Startup thread) - Beginning transaction of type 2
DEBUG 2018-12-14T01:07:42,369 (Startup thread) - Marking for delete for job 1544121003866 all hopcount document references from table jobqueue t99 matching t99.status IN (?,?)
DEBUG 2018-12-14T01:07:42,370 (Startup thread) - Requested query: [UPDATE hopcount SET distance=?,deathmark=? WHERE id IN(SELECT t0.ownerid FROM hopdeletedeps t0,jobqueue t99,intrinsiclink t1 WHERE t0.jobid=? AND (t1.jobid=? AND t1.parentidhash=t0.parentidhash AND t1.linktype=t0.linktype AND t1.childidhash=t0.childidhash) AND (t99.jobid=? AND t99.dochash=t0.childidhash) AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,370 (Thread-690) - Actual query: [SET SCHEMA PUBLIC]
DEBUG 2018-12-14T01:07:42,370 (Thread-690) - Done actual query (0ms): [SET SCHEMA PUBLIC]
DEBUG 2018-12-14T01:07:42,371 (Thread-691) - Actual query: [UPDATE hopcount SET distance=?,deathmark=? WHERE id IN(SELECT t0.ownerid FROM hopdeletedeps t0,jobqueue t99,intrinsiclink t1 WHERE t0.jobid=? AND (t1.jobid=? AND t1.parentidhash=t0.parentidhash AND t1.linktype=t0.linktype AND t1.childidhash=t0.childidhash) AND (t99.jobid=? AND t99.dochash=t0.childidhash) AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,371 (Thread-691) - Parameter 0: '-1'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) - Parameter 1: 'D'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) - Parameter 2: '1544121003866'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) - Parameter 3: '1544121003866'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) - Parameter 4: '1544121003866'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) - Parameter 5: 'P'
DEBUG 2018-12-14T01:07:42,371 (Thread-691) - Parameter 6: 'H'
DEBUG 2018-12-14T01:07:42,389 (Thread-691) - Done actual query (18ms): [UPDATE hopcount SET distance=?,deathmark=? WHERE id IN(SELECT t0.ownerid FROM hopdeletedeps t0,jobqueue t99,intrinsiclink t1 WHERE t0.jobid=? AND (t1.jobid=? AND t1.parentidhash=t0.parentidhash AND t1.linktype=t0.linktype AND t1.childidhash=t0.childidhash) AND (t99.jobid=? AND t99.dochash=t0.childidhash) AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,390 (Startup thread) - Done setting hopcount rows for job 1544121003866 to initial distances
DEBUG 2018-12-14T01:07:42,390 (Startup thread) - Requested query: [DELETE FROM intrinsiclink WHERE EXISTS(SELECT 'x' FROM jobqueue t99 WHERE (t99.jobid=? AND t99.dochash=intrinsiclink.childidhash) AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,390 (Thread-692) - Actual query: [DELETE FROM intrinsiclink WHERE EXISTS(SELECT 'x' FROM jobqueue t99 WHERE (t99.jobid=? AND t99.dochash=intrinsiclink.childidhash) AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,391 (Thread-692) - Parameter 0: '1544121003866'
DEBUG 2018-12-14T01:07:42,391 (Thread-692) - Parameter 1: 'P'
DEBUG 2018-12-14T01:07:42,391 (Thread-692) - Parameter 2: 'H'
DEBUG 2018-12-14T01:07:42,407 (Thread-692) - Done actual query (17ms): [DELETE FROM intrinsiclink WHERE EXISTS(SELECT 'x' FROM jobqueue t99 WHERE (t99.jobid=? AND t99.dochash=intrinsiclink.childidhash) AND t99.status IN (?,?))]
DEBUG 2018-12-14T01:07:42,408 (Startup thread) - Requested query: [DELETE FROM hopdeletedeps WHERE ownerid IN(SELECT id FROM hopcount WHERE (jobid=? AND deathmark=?))]
DEBUG 2018-12-14T01:07:42,408 (Thread-693) - Actual query: [DELETE FROM hopdeletedeps WHERE ownerid IN(SELECT id FROM hopcount WHERE (jobid=? AND deathmark=?))]
DEBUG 2018-12-14T01:07:42,409
{code}
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-1562:
    Attachment: manifoldcf.log.reduced
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720171#comment-16720171 ]

Karl Wright commented on CONNECTORS-1563:

If you need me to debug your Solr setup, you're going to need to wait a couple of weeks, I'm afraid. I'm extremely behind and I honestly have been working 20-hour days. You're on your own for now.

> SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream
> must have > 0 bytes
>
> Key: CONNECTORS-1563
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1563
> Project: ManifoldCF
> Issue Type: Task
> Components: Lucene/SOLR connector
> Reporter: Sneha
> Assignee: Karl Wright
> Priority: Major
> Attachments: managed-schema, solrconfig.xml
>
> I am encountering this problem:
> I have checked the "Use the Extract Update Handler:" parameter, and I am
> getting an error on Solr, i.e. null:org.apache.solr.common.SolrException:
> org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0
> bytes
> If I ignore the Tika exception, my documents get indexed but don't have a
> content field on Solr.
> I am using Solr 7.3.1 and ManifoldCF 2.8.1.
> I am using Solr Cell and hence have not configured an external Tika extractor
> in the ManifoldCF pipeline.
> Please help me with this problem. Thanks in advance.
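The ZeroByteFileException itself is Tika refusing an empty input stream. Independent of the Solr-side configuration, one common workaround is to skip zero-length documents before they reach the extracting handler. A minimal sketch of that guard; the document set here is made up for illustration, and this is not connector code:

```python
# Skip zero-byte documents before routing them through Tika extraction;
# Tika raises ZeroByteFileException on an empty input stream.

def should_extract(content: bytes) -> bool:
    """Only route documents with non-empty content through extraction."""
    return len(content) > 0

# Hypothetical crawled documents:
docs = {"a.pdf": b"%PDF-1.4 ...", "empty.txt": b""}
extractable = [name for name, body in docs.items() if should_extract(body)]
print(extractable)  # ['a.pdf']
```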
[jira] [Commented] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16720061#comment-16720061 ]

Karl Wright commented on CONNECTORS-1563:

That argues that your Solr configuration is not correct, because this was tested thoroughly during the last release cycle.
[jira] [Assigned] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright reassigned CONNECTORS-1563:
    Assignee: Karl Wright
[jira] [Updated] (CONNECTORS-1563) SolrException: org.apache.tika.exception.ZeroByteFileException: InputStream must have > 0 bytes
[ https://issues.apache.org/jira/browse/CONNECTORS-1563?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wright updated CONNECTORS-1563:
    Component/s: (was: Solr 7.x component)
                 Lucene/SOLR connector
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16719945#comment-16719945 ]

Karl Wright commented on CONNECTORS-1562:

I went through the invalidation logic, which was last changed in 2012 as part of ticket CONNECTORS-501. The fix for that ticket did not seem related to the current problem, at least. I improved the logging and documentation in this area, but have not yet found any logical errors. The next step is therefore to repeat the experiment with database debugging also enabled, so I can see the queries too. I can probably start by doing this only during the reduce phase, but if those queries look good, I'd have to also do the initial phase.
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718615#comment-16718615 ] Karl Wright commented on CONNECTORS-1562: - What's immediately obvious is that *no* invalidation of computed hopcounts is taking place at all on the second phase. The seeding goes through the right steps, but no computed hopcounts are invalidated -- either that, or they're not queried for on the second pass. If the invalidation query actually fires, then either the query is wrong or the data kept in the invalidation table is wrong.
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16718604#comment-16718604 ] Karl Wright commented on CONNECTORS-1562: - I was trivially able to verify that the hopcount system is giving incorrect answers for the documents that should be removed. I turned on hopcount debugging and made three log dumps for the example job I described: init, reduced, and cleanup. These are dumps from the initial crawl, the crawl with the reduced seeds, and the final crawl with no seeds, attached to the ticket for further analysis.
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1562: Attachment: manifoldcf.log.reduced manifoldcf.log.init manifoldcf.log.cleanup
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717278#comment-16717278 ] Karl Wright commented on CONNECTORS-1562: - [~SteenTi] The issue was reopened many hours ago. As I stated, however, it is a very complex issue and may require significant framework changes to fix. For this reason it cannot happen quickly. I estimate *at best* two weeks, and possibly a month or more; certainly not something you should count on tomorrow. Furthermore, I continue to advise against your general approach. If you have a site map page, why can't you simply have *one* seed pointing at that site map, no hopcount filtering, and an exclusion list to remove pages you don't want indexed? That's how the connector is designed to work. In that model, URLs that are removed from the site, or put into the exclusion list, *will* be deleted from the index. If the customer's demands are rigid and they want a crawler where they simply load up the queue with URLs, you always have the option of constructing an RSS feed or developing a custom connector. RSS feeds don't follow links in listed documents at all, and they would seem to have everything else you need.
[jira] [Commented] (LUCENE-8587) Self comparison bug in GeoComplexPolygon.equals method
[ https://issues.apache.org/jira/browse/LUCENE-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717159#comment-16717159 ] Karl Wright commented on LUCENE-8587: - Thinking about it, it seems safest to me to serialize and deserialize all five GeoPoint values -- lat, lon, x, y, z. If that's done then no modifications would be needed to GeoStandardCircle and GeoExactCircle, and we wouldn't need to guess at whether it's all going to work. The downside is that the serialized size is going to grow by a factor of 2 -- but that may not be horrible. > Self comparison bug in GeoComplexPolygon.equals method > -- > > Key: LUCENE-8587 > URL: https://issues.apache.org/jira/browse/LUCENE-8587 > Project: Lucene - Core > Issue Type: Bug > Components: modules/spatial3d >Affects Versions: 7.1 >Reporter: Zsolt Gyulavari >Assignee: Karl Wright >Priority: Major > Attachments: LUCENE-8587.patch > > > GeoComplexPolygon.equals method checks equality with own testPoint1 field > instead of the other.testPoint1. -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
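The five-value idea above can be sketched as follows. This is a hedged illustration only: `PointCodec` is a hypothetical name, and plain `java.io.DataOutputStream`/`DataInputStream` stand in for Lucene's `SerializableObject.writeDouble`/`readDouble` helpers, so treat it as a sketch of the proposal, not the actual patch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: serialize all five GeoPoint values (lat, lon, x, y, z) so that
// deserialization can restore both representations exactly, with no
// recomputation. Names and stream helpers are illustrative, not Lucene's.
final class PointCodec {
    static byte[] write(double lat, double lon, double x, double y, double z) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeDouble(lat);
            out.writeDouble(lon);
            out.writeDouble(x);
            out.writeDouble(y);
            out.writeDouble(z);
        }
        // 5 doubles = 40 bytes, versus 16 for (lat, lon) alone -- the
        // "factor of 2" size growth mentioned in the comment.
        return bytes.toByteArray();
    }

    static double[] read(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            return new double[] {
                in.readDouble(), in.readDouble(), in.readDouble(),
                in.readDouble(), in.readDouble()
            };
        }
    }
}
```

Because both the angular and Cartesian values round-trip bit-for-bit, GeoStandardCircle and GeoExactCircle would see exactly the pre-serialization lat/lon values, which is the point of the proposal.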
[jira] [Commented] (LUCENE-8587) Self comparison bug in GeoComplexPolygon.equals method
[ https://issues.apache.org/jira/browse/LUCENE-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717068#comment-16717068 ] Karl Wright commented on LUCENE-8587: - It appears GeoStandardCircle and GeoExactCircle require lat/lon as arguments, so in order to make this work I'd need to make some changes there as well, including adding constructors that accept GeoPoints. I'm also a bit queasy about the fact that after deserialization the point methods getLatitude() and getLongitude() will return different values than they did before serialization. I don't see any obvious place where this might blow up, but it will take more analysis to be sure.
[jira] [Commented] (LUCENE-8587) Self comparison bug in GeoComplexPolygon.equals method
[ https://issues.apache.org/jira/browse/LUCENE-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717035#comment-16717035 ] Karl Wright commented on LUCENE-8587: - What I'd like to do is change the GeoPoint serialization and deserialization to save the (x,y,z) tuples rather than the (lat,lon) ones:
{code}
@Override
public void write(final OutputStream outputStream) throws IOException {
  SerializableObject.writeDouble(outputStream, x);
  SerializableObject.writeDouble(outputStream, y);
  SerializableObject.writeDouble(outputStream, z);
}
{code}
and
{code}
public GeoPoint(final PlanetModel planetModel, final InputStream inputStream) throws IOException {
  // Note: this relies on left-to-right parameter evaluation order!! Much code depends on that, though, and
  // it is apparently in the Java spec: https://stackoverflow.com/questions/2201688/order-of-execution-of-parameters-guarantees-in-java
  this(planetModel,
    SerializableObject.readDouble(inputStream),
    SerializableObject.readDouble(inputStream),
    SerializableObject.readDouble(inputStream));
}
{code}
This is not a backwards-compatible change, however, so we could make it only in master and not pull it up to the 7.x and 6.x branches. [~ivera], what do you think?
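For illustration, a round trip under the (x,y,z)-only scheme might look like the sketch below. It is an assumption-laden toy: `XyzCodec` is a hypothetical name, the planet model is reduced to a unit sphere, and plain `DataOutputStream`/`DataInputStream` stand in for `SerializableObject`. The point it demonstrates is the step where lat/lon are *recomputed* from (x,y,z) on read, which is exactly where the values could differ from the pre-serialization ones on a non-spherical planet model.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;

// Sketch: serialize only the Cartesian (x, y, z) tuple, then recompute
// lat/lon on deserialization. Unit-sphere math only; illustrative names.
final class XyzCodec {
    static byte[] write(double x, double y, double z) throws IOException {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeDouble(x);
            out.writeDouble(y);
            out.writeDouble(z);
        }
        return bytes.toByteArray();
    }

    // Returns {latitude, longitude} recomputed from (x, y, z) on a unit
    // sphere. This recomputation is the source of the concern above: the
    // recovered angles need not be bit-identical to the originals.
    static double[] readLatLon(byte[] data) throws IOException {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            double x = in.readDouble();
            double y = in.readDouble();
            double z = in.readDouble();
            return new double[] { Math.asin(z), Math.atan2(y, x) };
        }
    }
}
```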
[jira] [Commented] (LUCENE-8587) Self comparison bug in GeoComplexPolygon.equals method
[ https://issues.apache.org/jira/browse/LUCENE-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717020#comment-16717020 ] Karl Wright commented on LUCENE-8587: - Ok, you're right, this is more complex. We cannot do without the test point and the in/out-of-set boolean, even though moving these around might produce exactly the same polygon. On the other hand, blaming the serialization of the test point also seems odd, since it's basically preserved from the constructor in whatever form it was given. Perhaps the serialization/deserialization of the GeoPoint needs to change. Let me examine that next.
[jira] [Commented] (LUCENE-8587) Self comparison bug in GeoComplexPolygon.equals method
[ https://issues.apache.org/jira/browse/LUCENE-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16717004#comment-16717004 ] Karl Wright commented on LUCENE-8587: - {quote} Maybe we should build the point here using the equivalent [lat, lon] {quote} [~ivera] No, that makes no sense. Polygons are never constructed using (x,y,z) coordinates; they are always constructed using lat/lon points and a planet model. If the lat/lons are the same, you won't get different (x,y,z) points, period. So something else is being done wrong, and I think the problem is probably the random-number-generator construction of the test point. The test point should *not* be included in the equals computation for that reason. I will commit a fix.
[jira] [Assigned] (LUCENE-8587) Self comparison bug in GeoComplexPolygon.equals method
[ https://issues.apache.org/jira/browse/LUCENE-8587?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned LUCENE-8587: --- Assignee: Karl Wright
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716994#comment-16716994 ] Karl Wright commented on CONNECTORS-1562: - By default, unless you select otherwise, the site pages you crawl are limited to the domains present in the seeds. So I think you can simply disable hopcount entirely if you have an exclusion list and leave the domain restriction in place.
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16716411#comment-16716411 ] Karl Wright commented on CONNECTORS-1562: - I tried this out using a small number of the specific seeds provided. I started with the following:
{code}
https://www.uantwerpen.be/en/
https://www.uantwerpen.be/en/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/convention-halls/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/convention-halls/hof-van-liere/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/university-club
https://www.uantwerpen.be/en/about-uantwerp/facts-figures
{code}
This generated seven ingestions. I then more-or-less randomly removed a few seeds, leaving this:
{code}
https://www.uantwerpen.be/en/
https://www.uantwerpen.be/en/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/about-uantwerp/
https://www.uantwerpen.be/en/about-uantwerp/catering-conventionhalls/university-club
https://www.uantwerpen.be/en/about-uantwerp/facts-figures
{code}
Rerunning produced zero deletions and a refetch of all seven previously ingested documents, with no new ingestions. Finally, I removed all the seeds and ran it again. A deletion was logged for every indexed document. My quick analysis of what is happening here is this:
- ManifoldCF keeps grave markers around for hopcount tracking. Hopcount tracking in MCF is extremely complex, and much care is taken to avoid miscalculating the number of hops to a document, no matter what order documents are processed in. In order to make that work, documents cannot be deleted from the queue just because their hopcount is too large; instead, quite a number of documents are put in the queue and may or may not be fetched, depending on whether they wind up with a low enough hopcount.
- The document deletion phase removes unreachable documents, but documents that simply have too great a hopcount, yet otherwise are in the queue, are not precisely unreachable.
In other words, the cleanup phase of a job seems to interact badly with documents that are reachable but just have too great a hopcount; these documents seem to be overlooked for cleanup, and will ONLY be cleaned up when they become truly unreachable. This is not intended behavior. However, it's also a behavior change in a very complex part of the software, and will therefore require great care to correct without breaking something. Because it is not something simple, you should expect me to require a couple of weeks of elapsed time to come up with the right fix. Furthermore, it is still true that this model is not one that I'd recommend for crawling a web site. The web connector is not designed to operate with hundreds of thousands of seeds; hundreds, maybe, or thousands on a bad day, but trying to control exactly what MCF indexes by fiddling with the seed list is not what it was designed for.
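The gap described above can be illustrated with a toy sketch. Everything here is hypothetical (this is not ManifoldCF code, and ManifoldCF's real queue model is far more involved): the point is only that a cleanup pass treating link-unreachability as the sole deletion criterion will skip documents that are still linked but exceed the hop limit.

```java
import java.util.List;

// Toy model of the cleanup gap: documents over the hop limit are still
// "reachable" by links, so a cleanup pass keyed only on reachability
// never deletes them. All names are illustrative, not ManifoldCF's.
final class CleanupSketch {
    record Doc(String url, int hopcount, boolean reachable) {}

    // Observed (buggy) behavior: only truly unreachable documents are deleted.
    static List<String> cleanupNaive(List<Doc> queue) {
        return queue.stream()
            .filter(d -> !d.reachable())
            .map(Doc::url)
            .toList();
    }

    // Intended behavior: a document over the hop limit should be treated as
    // unreachable for cleanup purposes, even though links still point at it.
    static List<String> cleanupIntended(List<Doc> queue, int maxHops) {
        return queue.stream()
            .filter(d -> !d.reachable() || d.hopcount() > maxHops)
            .map(Doc::url)
            .toList();
    }
}
```

In this model, a document that was a seed, lost its seed status, and now sits many links deep would be deleted by `cleanupIntended` but silently retained by `cleanupNaive`, matching the "zero deletions" result of the reduced-seed experiment.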
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1562: Summary: Documents unreachable due to hopcount are not considered unreachable on cleanup pass (was: Document removal Elastic)
[jira] [Reopened] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reopened CONNECTORS-1562: -
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714835#comment-16714835 ] Karl Wright commented on CONNECTORS-1562: - Hi [~SteenTi], you are in essence making a seed list that is intended to be the entire list of all URLs that are crawled, and using hopcount filtering to try to make sure no links are followed. You are then removing individual seeds and expecting the individual URLs to be removed from the index. This is a usage model that is not well tested (because of the hopcount involvement), so I can well believe it doesn't do exactly what you'd expect. We do not generally recommend this model because the seed list may well wind up being huge. If there's no way you can create an index page of some kind, then you might be stuck with it, but bear in mind that the Web Connector is not designed to support this model. If this is nevertheless the model you intend to operate under, I will reopen the ticket and try to reproduce the problem, but it will not be looked at until next weekend at the earliest, as this is not my day job and this is not a supported model.
[jira] [Comment Edited] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714595#comment-16714595 ] Karl Wright edited comment on CONNECTORS-1562 at 12/10/18 11:58 AM: [~SteenTi], good that the scheduler is working as expected. {quote} Next I edited the seeds and deleted some links and let the job run scheduled again. There were 0 deletions and the Simple History also showed 0 deletion messages. {quote} The scheduler doesn't have any impact on the way a job runs, unless you tell it to do a "minimal" run rather than a "complete" one. There's a pulldown for every schedule record you create that lets you decide which it's going to be. What is selected for your schedule record? Also, were you able to see deletions when you followed my steps above?
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714595#comment-16714595 ] Karl Wright commented on CONNECTORS-1562: - [~SteenTi], good that the scheduler is working as expected. {quote} Next I edited the seeds and deleted some links and let the job run scheduled again. There were 0 deletions and the Simple History also showed 0 deletion messages. {quote} The scheduler doesn't have any impact on the way a job runs, unless you tell it to do a "minimal" run rather than a "complete" one. There's a pulldown for every schedule record you create that lets you decide which it's going to be. What is selected for your schedule record? Also, were you able to see deletions when you followed my steps above?
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711866#comment-16711866 ] Karl Wright commented on CONNECTORS-1562: - Hi [~SteenTi], the only thing I have not been able to verify is whether the ES connector is working properly or not. What I'd like you to do is set up your sample job in such a way that it is small enough to crawl in a small amount of time -- and use the Null output connector rather than the ES one. Please then make sure you know how to execute the web crawl jobs, and make sure you see the same things I saw above. Once you get to that point, we can verify whether or not ES is doing the right thing. Thanks again.
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711871#comment-16711871 ] Karl Wright commented on CONNECTORS-1562: - [~DonaldVdD], please see above.
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711862#comment-16711862 ] Karl Wright commented on CONNECTORS-1562: - Next, I modified the job as follows:
- Added the "http://manifoldcf.apache.org" url to the seeds again
- Went to the "Schedule" tab
- Created a schedule record that had the 48-minute value and no other minute value, and clicked the "Add" button for schedule records
- Clicked on the "Connection" tab and selected the "Start when schedule window starts" option
- Clicked "Save"
- Went to the Job Status page and refreshed until 1:48 PM
- Saw that the job started at 1:48 PM
I conclude that the scheduler works properly too.
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711846#comment-16711846 ] Karl Wright commented on CONNECTORS-1562: - I just did a test run as follows:
(1) Created a web repository connection (using all defaults except the required email address)
(2) Created a null output connection (again, all defaults)
(3) Created a job that used these two connections, using a maximum link count of 2 and no maximum redirection count, plus a seed of "http://manifoldcf.apache.org"
(4) Ran the job manually to completion
(5) Immediately got a Simple History report for the web connection:
{code}
Start Time  Activity  Identifier  Result Code  Bytes  Time  Result Description
12/6/18 1:33:10 PM output notification (Null) OK 0 1
12/6/18 1:33:00 PM job end 1544121003866(test) 0 1
12/6/18 1:32:54 PM document ingest (Null) http://manifoldcf.apache.org/en_US/mail.html OK 11212 1 "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:54 PM process http://manifoldcf.apache.org/en_US/mail.html OK 11212 26
12/6/18 1:32:53 PM fetch http://manifoldcf.apache.org/en_US/mail.html 200 11212 365
12/6/18 1:32:49 PM document ingest (Null) http://manifoldcf.apache.org/en_US/who.html OK 9634 1 "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:49 PM process http://manifoldcf.apache.org/en_US/who.html OK 9634 17
12/6/18 1:32:48 PM fetch http://manifoldcf.apache.org/en_US/who.html 200 9634 339
12/6/18 1:32:44 PM document ingest (Null) http://manifoldcf.apache.org/en_US/release-documentation.html OK 9349 1 "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:44 PM process http://manifoldcf.apache.org/en_US/release-documentation.html OK 9349 10
12/6/18 1:32:43 PM fetch http://manifoldcf.apache.org/en_US/release-documentation.html 200 9349 338
12/6/18 1:32:39 PM document ingest (Null) http://manifoldcf.apache.org/en_US/security.html OK 13725 1 "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:39 PM process http://manifoldcf.apache.org/en_US/security.html OK 13725 15
12/6/18 1:32:38 PM fetch http://manifoldcf.apache.org/en_US/security.html 200 13725 417
12/6/18 1:32:34 PM document ingest (Null) http://manifoldcf.apache.org/en_US/books-and-presentations.html OK 11419 1 "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:34 PM process http://manifoldcf.apache.org/en_US/books-and-presentations.html OK 11419 14
12/6/18 1:32:33 PM fetch http://manifoldcf.apache.org/en_US/books-and-presentations.html 200 11419 371
12/6/18 1:32:31 PM document ingest (Null) http://manifoldcf.apache.org/en_US/download.html OK 144128 1 "Accept-Ranges":1,"Keep-Alive":1,"Server":1,"ETag":1,"Connection":1,"Vary":1,"Last-Modified":1,"Content-Type":1
12/6/18 1:32:31 PM process http://manifoldcf.apache.org/en_US/download.html OK 144128 8
12/6/18 1:32:28 PM fetch http://manifoldcf.apache.org/en_US/download.html 200 144128 2443
{code}
Next:
(1) I modified the job to remove the one seed I had, and saved it
(2) Ran the job again
(3) Immediately retrieved a Simple History report:
{code}
12/6/18 1:35:20 PM output notification (Null) OK 0 1
12/6/18 1:35:10 PM job end 1544121003866(test) 0 1
12/6/18 1:35:00 PM document deletion (Null) http://manifoldcf.apache.org/ja_JP/release-documentation.html OK 0 1
12/6/18 1:35:00 PM document deletion (Null) http://manifoldcf.apache.org/skin/profile.css OK 0 1
12/6/18 1:35:00 PM document deletion (Null) http://manifoldcf.apache.org/ja_JP/download.html OK 0 1
12/6/18 1:35:00 PM document deletion (Null) http://manifoldcf.apache.org/en_US/developer-resources.html OK 0 1
12/6/18 1:35:00 PM document deletion (Null) http://manifoldcf.apache.org/en_US/who.html OK 0 1
12/6/18 1:35:00 PM document deletion (Null) http://manifoldcf.apache.org/ja_JP/developer-resources.html OK 0 1
12/6/18 1:35:00 PM document deletion (Null) http://manifoldcf.apache.org/ja_JP/index.html OK 0 1
12/6/18 1:35:00 PM
{code}
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16711409#comment-16711409 ] Karl Wright commented on CONNECTORS-1562: - Hi Tim, All the functionality you say doesn't work is exercised by integration tests. I will happily do a walkthrough today at some point to confirm this. It is an extremely busy day for me, however, so please be patient.
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16710279#comment-16710279 ] Karl Wright commented on CONNECTORS-1562: - Hi [~SteenTi], you will still not get unreachable documents deleted if you run your job using the "minimal" cycle. Please be sure you are using the "full" cycle. If you need very short cycles, you will need to make a tradeoff between getting new content in and removing old content. Typically we recommend that you schedule your job to use "minimal" crawls most of the time, but use "full" runs periodically to clean out unreachable documents. If you believe you are running "full" crawls and there is still no cleanup, I can assure you that the Web Connector has automated tests that verify it cleans up unreachable documents properly. So there would be two possibilities: (1) this is specific to changes in seeds, or (2) the Elastic Search Connector is transmitting deletes that are failing silently for some reason. In order to figure out which it is, please run a cycle manually, and look at the Simple History report to see if deletions are logged.
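One way to follow up on this advice is to check Elasticsearch directly after a "full" run. The sketch below is illustrative only: the host, index name, and example document URL are assumptions you must substitute for your own setup. It relies on the fact (noted in CONNECTORS-1529) that the ES output connector uses the document URL as the _id, so a GET by id shows whether a delete actually reached the index. The script just composes and prints the command rather than hitting a live cluster:

```shell
#!/bin/sh
# Sketch only -- host, index name, and document URL below are assumptions.
ES="http://localhost:9200"
INDEX="manifoldcf"
# The ES output connector uses the document URL as the Elasticsearch _id;
# it must be URL-encoded when used as a path segment:
DOC_ID="http%3A%2F%2Fmanifoldcf.apache.org%2Fen_US%2Fwho.html"

# After a "full" crawl that dropped this document's seed, running this command
# should return 404 if the delete reached the index, 200 if it silently survived.
CHECK="curl -s -o /dev/null -w '%{http_code}' ${ES}/${INDEX}/_doc/${DOC_ID}"
echo "$CHECK"
```

Comparing that status code against the "document deletion" lines in the Simple History report tells you whether the problem is on the crawler side or the indexing side.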
[jira] [Resolved] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1562. - Resolution: Not A Problem
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709779#comment-16709779 ] Karl Wright commented on CONNECTORS-1562: - "Dynamic rescan" is the same thing as "continuous crawling". You don't want that if you want document deletions to be detected on a schedule. In fact, jobs never end in this mode; they run indefinitely. There's a whole book chapter on this and the user guide also mentions this: http://manifoldcf.apache.org/release/release-2.11/en_US/end-user-documentation.html#jobs
[jira] [Comment Edited] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709688#comment-16709688 ] Karl Wright edited comment on CONNECTORS-1562 at 12/5/18 7:35 AM: -- Hi [~SteenTi], I see this is the web connector. Can you tell me what kind of crawl you are doing? If this is a continuous crawl, or you kicked it off with "Start minimal", that's expected. was (Author: kwri...@metacarta.com): Hi [~SteenTi], can you tell me what repository connector you are using, and what kind of crawl you are doing? If this is a continuous crawl, or you kicked it off with "Start minimal", that's expected with most repository connectors. But in any case it's the repository connector that determines what happens and how deletions are found.
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709688#comment-16709688 ] Karl Wright commented on CONNECTORS-1562: - Hi [~SteenTi], can you tell me what repository connector you are using, and what kind of crawl you are doing? If this is a continuous crawl, or you kicked it off with "Start minimal", that's expected with most repository connectors. But in any case it's the repository connector that determines what happens and how deletions are found.
[jira] [Assigned] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1562: --- Assignee: Karl Wright
[jira] [Resolved] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'
[ https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1546. - Resolution: Fixed > Optimize Elasticsearch performance by removing 'forcemerge' > --- > > Key: CONNECTORS-1546 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1546 > Project: ManifoldCF > Issue Type: Improvement > Components: Elastic Search connector > Reporter: Hans Van Goethem > Assignee: Steph van Schalkwyk > Priority: Major > Fix For: ManifoldCF 2.12 > > > After crawling with ManifoldCF, forcemerge is applied to optimize the Elasticsearch index. This optimization makes Elasticsearch faster for read operations but not for write operations. On the contrary, performance on write operations becomes worse after every forcemerge. > Can you remove this forcemerge in ManifoldCF to optimize performance for recurrent crawling to Elasticsearch? > If someone needs this forcemerge, it can be applied manually against Elasticsearch directly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'
[ https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16706804#comment-16706804 ] Karl Wright commented on CONNECTORS-1546: - Hi [~st...@remcam.net], can you let me know what happened to this? We're trying to get 2.12 ready for completion. Thanks!!
[jira] [Updated] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'
[ https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1546: Fix Version/s: ManifoldCF 2.12
[jira] [Resolved] (CONNECTORS-1522) Add SSL trust certificates list to ElasticSearch output connector
[ https://issues.apache.org/jira/browse/CONNECTORS-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1522. - Resolution: Fixed Still needs testing. That has been left to [~svanschalkwyk] to complete. > Add SSL trust certificates list to ElasticSearch output connector > - > > Key: CONNECTORS-1522 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1522 > Project: ManifoldCF > Issue Type: Improvement > Components: Elastic Search connector >Affects Versions: ManifoldCF 2.10 >Reporter: Steph van Schalkwyk >Assignee: Karl Wright >Priority: Minor > Fix For: ManifoldCF 2.12 > > > Add "SSL trust certificate list" to Elasticsearch output connector. > Add User Id, Password functionality to ES output connector. > Above as per SOLR output connector. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1560) Improve tika-server robustness via -spawnChild
[ https://issues.apache.org/jira/browse/CONNECTORS-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16705329#comment-16705329 ] Karl Wright commented on CONNECTORS-1560: - [~talli...@apache.org], ManifoldCF does not ship the Tika Server. We provide a transformation connector that talks to it, but that is all. There is also an embedded Tika transformer which works for many people, but if people run into difficulties with it we recommend using the external server and setting it up themselves. > Improve tika-server robustness via -spawnChild > -- > > Key: CONNECTORS-1560 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1560 > Project: ManifoldCF > Issue Type: Wish >Reporter: Tim Allison >Priority: Major > > I'd encourage you to consider adopting the new {{-spawnChild}} mode in > tika-server. See the documentation here: > https://wiki.apache.org/tika/TikaJAXRS#Making%20Tika%20Server%20Robust%20to%20OOMs,%20Infinite%20Loops%20and%20Memory%20Leaks > The small downside is that the server can go down for a few seconds during > the restart. Clients have to be prepared for an IOException on files that > are being parsed when the child server goes down and/or if the child is being > restarted. The upside is that your users will be protected against infinite > loops, OOM and memory leaks...things that we used to just hope never > happened...but they do, and they will. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
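For anyone who does take the external-server route the comment recommends, the flag from the ticket is passed when launching tika-server. A minimal launch sketch follows; the jar filename and port are assumptions for illustration (check the Tika wiki page linked in the ticket for the full option list). The script only composes and prints the command rather than actually starting a server:

```shell
#!/bin/sh
# Sketch only -- jar filename and port below are assumptions.
TIKA_JAR="tika-server-1.19.1.jar"
PORT="9998"

# -spawnChild runs parsing in a forked child process that the parent restarts
# after OOMs, infinite loops, or memory leaks, per the wiki page cited above.
# Clients must still tolerate an IOException while the child restarts.
CMD="java -jar ${TIKA_JAR} -spawnChild -p ${PORT}"
echo "$CMD"
```

A ManifoldCF Tika service connection would then point at that host and port, and (per CONNECTORS-1556) retry on the 503s the server returns while its child restarts.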
[jira] [Resolved] (CONNECTORS-1560) Improve tika-server robustness via -spawnChild
[ https://issues.apache.org/jira/browse/CONNECTORS-1560?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1560. - Resolution: Won't Fix
[jira] [Commented] (CONNECTORS-1559) Logging Is Not working as expected
[ https://issues.apache.org/jira/browse/CONNECTORS-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16702381#comment-16702381 ] Karl Wright commented on CONNECTORS-1559: - As for an example logging.xml -- there's one shipped with every example. Please just read the documentation??? > Logging Is Not working as expected > -- > > Key: CONNECTORS-1559 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1559 > Project: ManifoldCF > Issue Type: Bug > Affects Versions: ManifoldCF 2.10 > Reporter: Krishna > Assignee: Karl Wright > Priority: Major > > We are using the ManifoldCF multi-process file-based installation, and the normal log4j properties are not working as expected; ManifoldCF is trying to log into an OS log which we have not configured. > > If you can share a sample logging.xml and explain how logging works, that will be helpful. > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1559) Logging Is Not working as expected
[ https://issues.apache.org/jira/browse/CONNECTORS-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1559. - Resolution: Not A Problem I don't know what you are talking about. There is a book, you know, which goes into many of these details. It's free. It has examples. Maybe you could look at that before opening tickets like this? https://github.com/DaddyWri/manifoldcfinaction/tree/master/pdfs
[jira] [Resolved] (CONNECTORS-1559) Logging Is Not working as expected
[ https://issues.apache.org/jira/browse/CONNECTORS-1559?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1559. - Resolution: Not A Problem Assignee: Karl Wright Logging is described in the "how to build and deploy" page, here: https://manifoldcf.apache.org/release/release-2.11/en_US/how-to-build-and-deploy.html#The+ManifoldCF+configuration+files There are two places where logging may be configured: system-wide loggers controlled by properties.xml, and local loggers controlled by the logging.xml file.
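To make that two-file split concrete, here is a hedged properties.xml sketch. The `org.apache.manifoldcf.logconfigfile` property is the documented hook that points ManifoldCF at a local log4j configuration; the file paths and the example logger property are assumptions for illustration, so verify both against the how-to-build-and-deploy page linked above:

```xml
<!-- properties.xml sketch: paths and the example logger name below are
     assumptions; confirm against the how-to-build-and-deploy documentation. -->
<configuration>
  <!-- Local log4j loggers are configured in the file this property points at: -->
  <property name="org.apache.manifoldcf.logconfigfile" value="./logging.xml"/>
  <!-- System-wide logger levels are also set here in properties.xml, e.g.
       (hypothetical logger name, shown only to illustrate the mechanism): -->
  <property name="org.apache.manifoldcf.connectors" value="DEBUG"/>
</configuration>
```

With that split, "logging into an OS log" symptoms usually mean the logconfigfile property is missing or points at the wrong path, so log4j falls back to its defaults.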
[jira] [Commented] (CONNECTORS-1558) Action Button is Missing in Status Job
[ https://issues.apache.org/jira/browse/CONNECTORS-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16700885#comment-16700885 ] Karl Wright commented on CONNECTORS-1558: - I'm afraid this report is completely unintelligible, and it doesn't describe a bug either. So I'm closing it. Please communicate via us...@manifoldcf.apache.org for questions like this. > Action Button is Missing in Status Job > -- > > Key: CONNECTORS-1558 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1558 > Project: ManifoldCF > Issue Type: Bug > Affects Versions: ManifoldCF 2.10 > Reporter: Krishna > Priority: Major > > We configured the Elastic connector with the ManifoldCF server. We are using ManifoldCF 2.10 and Elastic 5.6. Even though no job is running, the Agent process has been running for 2 days, and all it's printing in the Simple History is the job end message. > > Could it be possible to release this job so we can stop the process from running? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1558) Action Button is Missing in Status Job
[ https://issues.apache.org/jira/browse/CONNECTORS-1558?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1558. - Resolution: Incomplete
[jira] [Commented] (CONNECTORS-1557) HTML Tag extractor
[ https://issues.apache.org/jira/browse/CONNECTORS-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16694406#comment-16694406 ] Karl Wright commented on CONNECTORS-1557: - The best way to deliver the code is as a patch attachment to a ticket like this. I hope that the transformer you wrote is consistent with the other transformers that ManifoldCF provides, e.g. the HTML Extractor and the Metadata Adjuster. Generally we are not fond of transformers that take on more than the most basic part of what might be structured as a multi-part transformation. From your description it sounds like you've basically extended the HTML extractor and added functionality to it similar to what the Metadata Adjuster does. If that's true, it might be good to only provide the extraction functionality extension from CSS to the HTML extractor, and let the Metadata Adjuster handle the field mappings. Please let me know how you want to proceed. > HTML Tag extractor > -- > > Key: CONNECTORS-1557 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1557 > Project: ManifoldCF > Issue Type: New Feature >Reporter: Donald Van den Driessche >Priority: Major > > I wrote a HTML Tag extractor, based on the HTML Extractor. > I needed to extract specific HTML tags and transfer them to their own field > in my output repository. > Input > * Englobing tag (CSS selector) > * Blacklist (CSS selector) > * Fieldmapping (CSS selector) > * Strip HTML > Process > * Retrieve Englobing tag > * Remove blacklist > * Map selected CSS selectors in Fieldmapping (arrays if multiple finds) + > strip HTML (if requested) > * Englobing tag minus blacklist: strip HTML (if requested) and return as > output (content) > How can I best deliver the source code? > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (CONNECTORS-1557) HTML Tag extractor
[ https://issues.apache.org/jira/browse/CONNECTORS-1557?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1557: --- Assignee: Karl Wright > HTML Tag extractor > -- > > Key: CONNECTORS-1557 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1557 > Project: ManifoldCF > Issue Type: New Feature >Reporter: Donald Van den Driessche >Assignee: Karl Wright >Priority: Major > > I wrote a HTML Tag extractor, based on the HTML Extractor. > I needed to extract specific HTML tags and transfer them to their own field > in my output repository. > Input > * Englobing tag (CSS selector) > * Blacklist (CSS selector) > * Fieldmapping (CSS selector) > * Strip HTML > Process > * Retrieve Englobing tag > * Remove blacklist > * Map selected CSS selectors in Fieldmapping (arrays if multiple finds) + > strip HTML (if requested) > * Englobing tag minus blacklist: strip HTML (if requested) and return as > output (content) > How can I best deliver the source code? > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (CONNECTORS-1556) Integrate changes in retry handling to address TIKA-2776
[ https://issues.apache.org/jira/browse/CONNECTORS-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1556: Attachment: CONNECTORS-1556.patch > Integrate changes in retry handling to address TIKA-2776 > > > Key: CONNECTORS-1556 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1556 > Project: ManifoldCF > Issue Type: Bug > Components: Tika service connector >Reporter: Karl Wright >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.12 > > Attachments: CONNECTORS-1556.patch > > > The Tika service extractor currently retries on some conditions but does not > handle the case where the external Tika service is restarting itself. This > generates a 503 error. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1556) Integrate changes in retry handling to address TIKA-2776
[ https://issues.apache.org/jira/browse/CONNECTORS-1556?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1556. - Resolution: Fixed r1846627 > Integrate changes in retry handling to address TIKA-2776 > > > Key: CONNECTORS-1556 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1556 > Project: ManifoldCF > Issue Type: Bug > Components: Tika service connector >Reporter: Karl Wright >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.12 > > Attachments: CONNECTORS-1556.patch > > > The Tika service extractor currently retries on some conditions but does not > handle the case where the external Tika service is restarting itself. This > generates a 503 error. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (CONNECTORS-1556) Integrate changes in retry handling to address TIKA-2776
Karl Wright created CONNECTORS-1556: --- Summary: Integrate changes in retry handling to address TIKA-2776 Key: CONNECTORS-1556 URL: https://issues.apache.org/jira/browse/CONNECTORS-1556 Project: ManifoldCF Issue Type: Bug Components: Tika service connector Reporter: Karl Wright Assignee: Karl Wright Fix For: ManifoldCF 2.12 The Tika service extractor currently retries on some conditions but does not handle the case where the external Tika service is restarting itself. This generates a 503 error. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
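The fix itself lives in the Tika service connector's Java code; purely as an illustration of the idea — a 503 from a restarting service is transient, so retry with backoff instead of failing the document — here is a hedged sketch in which `do_request`, the attempt count, and the delays are all made-up parameters:

```python
import time

RETRIABLE = {503}  # service unavailable, e.g. the external Tika service is restarting

def request_with_retry(do_request, max_attempts=4, base_delay=0.25):
    """Call do_request() -> (status, body); retry retriable HTTP statuses
    with exponential backoff instead of treating them as hard failures."""
    for attempt in range(max_attempts):
        status, body = do_request()
        if status not in RETRIABLE:
            return status, body
        if attempt + 1 < max_attempts:
            # Back off before the next attempt: base_delay, 2x, 4x, ...
            time.sleep(base_delay * (2 ** attempt))
    return status, body  # retries exhausted; the caller decides whether to requeue
```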
[jira] [Resolved] (CONNECTORS-1554) Job stuck during crawl documents on folder
[ https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1554. - Resolution: Cannot Reproduce > Job stuck during crawl documents on folder > -- > > Key: CONNECTORS-1554 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1554 > Project: ManifoldCF > Issue Type: Bug > Components: Active Directory authority, File system connector, Tika > extractor >Affects Versions: ManifoldCF 2.11 > Environment: Ubuntu Server 18.04 > ManifoldCF 2.11 > Solr 7.5.0 > Tika Server 1.19.1 >Reporter: Mario Bisonti >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.11 > > Attachments: SimpleHistory.png, manifoldcf.log > > > Hello. > When I start a job that indexes a Windows share, it gets stuck after about > 15 minutes. > > I see errors in manifoldcf.log, as you can see in the attachment. > > I attached the "Simple History" with the last documents crawled. > Thanks a lot. > Mario > [^manifoldcf.log]!SimpleHistory.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder
[ https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16678101#comment-16678101 ] Karl Wright commented on CONNECTORS-1554: - [~bisontim], there are several approved models under which you can run ManifoldCF. They are each represented by an example directory in the distribution. But the way you propose running everything under Tomcat is not one of these. If you indeed want to run ManifoldCF as a single process (with the pitfalls that entails, including issues regarding starvation of UI resources during heavy crawling), you can simply deploy the combined ManifoldCF war file. Instructions are on the "how to build and deploy" page. > Job stuck during crawl documents on folder > -- > > Key: CONNECTORS-1554 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1554 > Project: ManifoldCF > Issue Type: Bug > Components: Active Directory authority, File system connector, Tika > extractor >Affects Versions: ManifoldCF 2.11 > Environment: Ubuntu Server 18.04 > ManifoldCF 2.11 > Solr 7.5.0 > Tika Server 1.19.1 >Reporter: Mario Bisonti >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.11 > > Attachments: SimpleHistory.png, manifoldcf.log > > > Hallo. > When I start a job that index a Windows Share, it stucks after a 15 minutes > near. > > I see error in ManifoldCF.log as you can see in the attachment > > I attached "Simple History" with the last documents crawled. > Thanks a lot. > Mario > [^manifoldcf.log]!SimpleHistory.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder
[ https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16677450#comment-16677450 ] Karl Wright commented on CONNECTORS-1554: - Note that if you perform the lock-clean procedure *as described*, all the documents should be reprioritized in any case, so all crawling should resume. After that, if you wind up with stuck documents it should be possible to look at the simple history for one of the stuck ones to see what happened to it. The document retry logic has not changed for years, and was last changed in a minor way to address this very problem back in 2015. Documents that get retried wind up being given to a thread that recomputes their priority. The need to do this is signaled by the "needspriority" field being set to "Y", and then the reprioritization threads kick in and set the priority eventually. So if you have jobqueue entries with the docpriority value of 1E9+1, a status of "P" or "G", and a needspriority field NOT set to 'Y', then those documents are stuck and I don't know how they got there. So I need to know what happened to them that caused this. > Job stuck during crawl documents on folder > -- > > Key: CONNECTORS-1554 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1554 > Project: ManifoldCF > Issue Type: Bug > Components: Active Directory authority, File system connector, Tika > extractor >Affects Versions: ManifoldCF 2.11 > Environment: Ubuntu Server 18.04 > ManifoldCF 2.11 > Solr 7.5.0 > Tika Server 1.19.1 >Reporter: Mario Bisonti >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.11 > > Attachments: SimpleHistory.png, manifoldcf.log > > > Hallo. > When I start a job that index a Windows Share, it stucks after a 15 minutes > near. > > I see error in ManifoldCF.log as you can see in the attachment > > I attached "Simple History" with the last documents crawled. > Thanks a lot. > Mario > [^manifoldcf.log]!SimpleHistory.png! 
> -- This message was sent by Atlassian JIRA (v7.6.3#76005)
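Karl's criteria for a genuinely stuck document can be turned into a quick diagnostic. The sketch below is Python over rows fetched from the `jobqueue` table; the column names are taken from the comment above and should be verified against the actual schema before use:

```python
# Equivalent SQL sketch (column names from the comment above; verify first):
#   SELECT id, docpriority, status, needspriority FROM jobqueue
#   WHERE docpriority >= 1000000001 AND status IN ('P', 'G')
#     AND (needspriority IS NULL OR needspriority != 'Y');

STUCK_PRIORITY = 1e9 + 1  # the "parked" docpriority value mentioned in the comment

def is_stuck(row):
    """row: mapping with 'docpriority', 'status', and 'needspriority' keys.
    True when the document is parked but not flagged for reprioritization."""
    return (row["docpriority"] >= STUCK_PRIORITY
            and row["status"] in ("P", "G")
            and row.get("needspriority") != "Y")
```

Rows matching the predicate are the ones worth pulling a Simple History for.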
[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder
[ https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676928#comment-16676928 ] Karl Wright commented on CONNECTORS-1554: - Hi [~bisontim], you are using file synchronization, as I feared. This is deprecated. You really want to be using Zookeeper synchronization. Furthermore, your process of cleaning the locks is wrong. The Tomcat web apps you are using do not include the agents process, and therefore you are cleaning the locks out from under a running agents process! That's never going to work. The proper process is: (1) shutdown tomcat (2) shutdown agents process (3) clean locks (4) start agents process (5) start tomcat You do not need to shut down solr or postgresql for this; in fact, that's counterproductive. > Job stuck during crawl documents on folder > -- > > Key: CONNECTORS-1554 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1554 > Project: ManifoldCF > Issue Type: Bug > Components: Active Directory authority, File system connector, Tika > extractor >Affects Versions: ManifoldCF 2.11 > Environment: Ubuntu Server 18.04 > ManifoldCF 2.11 > Solr 7.5.0 > Tika Server 1.19.1 >Reporter: Mario Bisonti >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.11 > > Attachments: SimpleHistory.png, manifoldcf.log > > > Hallo. > When I start a job that index a Windows Share, it stucks after a 15 minutes > near. > > I see error in ManifoldCF.log as you can see in the attachment > > I attached "Simple History" with the last documents crawled. > Thanks a lot. > Mario > [^manifoldcf.log]!SimpleHistory.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder
[ https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676820#comment-16676820 ] Karl Wright commented on CONNECTORS-1554: - Hi [~bisontim], I note the following in your log: {code} ERROR 2018-11-06T14:31:47,730 (Agents thread) - Exception tossed: Service 'A' of type 'AGENT_org.apache.manifoldcf.crawler.system.CrawlerAgent is not active org.apache.manifoldcf.core.interfaces.ManifoldCFException: Service 'A' of type 'AGENT_org.apache.manifoldcf.crawler.system.CrawlerAgent is not active at org.apache.manifoldcf.core.lockmanager.BaseLockManager.endServiceActivity(BaseLockManager.java:462) ~[mcf-core.jar:?] at org.apache.manifoldcf.core.lockmanager.LockManager.endServiceActivity(LockManager.java:172) ~[mcf-core.jar:?] at org.apache.manifoldcf.agents.system.AgentsDaemon.checkAgents(AgentsDaemon.java:289) ~[mcf-agents.jar:?] at org.apache.manifoldcf.agents.system.AgentsDaemon$AgentsThread.run(AgentsDaemon.java:209) [mcf-agents.jar:?] {code} This makes me concerned that you might not be shutting down the agents process cleanly. If you are using file-based synchronization, this could lead to stuck locks, which would explain the behavior you are seeing quite well. Can you confirm that you are using zookeeper? Thanks in advance. > Job stuck during crawl documents on folder > -- > > Key: CONNECTORS-1554 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1554 > Project: ManifoldCF > Issue Type: Bug > Components: Active Directory authority, File system connector, Tika > extractor >Affects Versions: ManifoldCF 2.11 > Environment: Ubuntu Server 18.04 > ManifoldCF 2.11 > Solr 7.5.0 > Tika Server 1.19.1 >Reporter: Mario Bisonti >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.11 > > Attachments: SimpleHistory.png, manifoldcf.log > > > Hallo. > When I start a job that index a Windows Share, it stucks after a 15 minutes > near. 
> > I see error in ManifoldCF.log as you can see in the attachment > > I attached "Simple History" with the last documents crawled. > Thanks a lot. > Mario > [^manifoldcf.log]!SimpleHistory.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
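Switching from the deprecated file-based synchronization to ZooKeeper, as recommended above, is a properties change. The fragment below mirrors what the multiprocess-zk-example in the ManifoldCF distribution configures; treat the exact property names, the connect string, and the timeout value as assumptions to check against the documentation for your version:

```xml
<!-- properties.xml: use ZooKeeper instead of file-based synchronization -->
<property name="org.apache.manifoldcf.lockmanagerclass"
          value="org.apache.manifoldcf.core.lockmanager.ZooKeeperLockManager"/>
<!-- host:port of your ZooKeeper ensemble (placeholder value) -->
<property name="org.apache.manifoldcf.zookeeper.connectstring" value="localhost:2181"/>
<property name="org.apache.manifoldcf.zookeeper.sessiontimeout" value="300000"/>
```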
[jira] [Commented] (CONNECTORS-1554) Job stuck during crawl documents on folder
[ https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676817#comment-16676817 ] Karl Wright commented on CONNECTORS-1554: - Hi [~bisontim], I need the Simple History of one of the documents that is "stuck". You will need to have it go back far enough to find out what happened to that one document last. Thanks in advance!! > Job stuck during crawl documents on folder > -- > > Key: CONNECTORS-1554 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1554 > Project: ManifoldCF > Issue Type: Bug > Components: Active Directory authority, File system connector, Tika > extractor >Affects Versions: ManifoldCF 2.11 > Environment: Ubuntu Server 18.04 > ManifoldCF 2.11 > Solr 7.5.0 > Tika Server 1.19.1 >Reporter: Mario Bisonti >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.11 > > Attachments: SimpleHistory.png, manifoldcf.log > > > Hallo. > When I start a job that index a Windows Share, it stucks after a 15 minutes > near. > > I see error in ManifoldCF.log as you can see in the attachment > > I attached "Simple History" with the last documents crawled. > Thanks a lot. > Mario > [^manifoldcf.log]!SimpleHistory.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (CONNECTORS-1554) Job stuck during crawl documents on folder
[ https://issues.apache.org/jira/browse/CONNECTORS-1554?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1554: --- Assignee: Karl Wright > Job stuck during crawl documents on folder > -- > > Key: CONNECTORS-1554 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1554 > Project: ManifoldCF > Issue Type: Bug > Components: Active Directory authority, File system connector, Tika > extractor >Affects Versions: ManifoldCF 2.11 > Environment: Ubuntu Server 18.04 > ManifoldCF 2.11 > Solr 7.5.0 > Tika Server 1.19.1 >Reporter: Mario Bisonti >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.11 > > Attachments: SimpleHistory.png, manifoldcf.log > > > Hallo. > When I start a job that index a Windows Share, it stucks after a 15 minutes > near. > > I see error in ManifoldCF.log as you can see in the attachment > > I attached "Simple History" with the last documents crawled. > Thanks a lot. > Mario > [^manifoldcf.log]!SimpleHistory.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1553) Upgrade to SolrJ 6.6.5
[ https://issues.apache.org/jira/browse/CONNECTORS-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1553. - Resolution: Won't Fix > Upgrade to SolrJ 6.6.5 > -- > > Key: CONNECTORS-1553 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1553 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.11 >Reporter: Furkan KAMACI >Assignee: Furkan KAMACI >Priority: Major > Fix For: ManifoldCF 2.12 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1553) Upgrade to SolrJ 6.6.5
[ https://issues.apache.org/jira/browse/CONNECTORS-1553?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16676329#comment-16676329 ] Karl Wright commented on CONNECTORS-1553: - [~kamaci], we updated to SolrJ 7.4.x for release 2.11. We should not go back. > Upgrade to SolrJ 6.6.5 > -- > > Key: CONNECTORS-1553 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1553 > Project: ManifoldCF > Issue Type: Task > Components: Lucene/SOLR connector >Affects Versions: ManifoldCF 2.11 >Reporter: Furkan KAMACI >Assignee: Furkan KAMACI >Priority: Major > Fix For: ManifoldCF 2.12 > > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'
[ https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672605#comment-16672605 ] Karl Wright commented on CONNECTORS-1546: - I didn't see a commit go by. Were you able to commit? > Optimize Elasticsearch performance by removing 'forcemerge' > --- > > Key: CONNECTORS-1546 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1546 > Project: ManifoldCF > Issue Type: Improvement > Components: Elastic Search connector >Reporter: Hans Van Goethem >Assignee: Steph van Schalkwyk >Priority: Major > > After crawling with ManifoldCF, forcemerge is applied to optimize the > Elasticsearch index. This optimization makes Elasticsearch faster for > read operations but not for write operations. On the contrary, performance on > write operations becomes worse after every forcemerge. > Can you remove this forcemerge in ManifoldCF to optimize performance for > recurrent crawling to Elasticsearch? > If someone needs this forcemerge, it can be applied manually against > Elasticsearch directly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
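If the automatic force-merge were removed as the ticket proposes, the same optimization could still be applied manually: Elasticsearch exposes it as `POST /<index>/_forcemerge`. A small Python sketch that only builds the request — host, index name, and segment count are placeholders:

```python
from urllib import request

def build_forcemerge_request(base_url, index, max_num_segments=1):
    """Build the POST request for Elasticsearch's _forcemerge endpoint.
    Send it with urllib.request.urlopen(req) when actually running it."""
    url = f"{base_url}/{index}/_forcemerge?max_num_segments={max_num_segments}"
    return request.Request(url, method="POST")
```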
[jira] [Commented] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation
[ https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672435#comment-16672435 ] Karl Wright commented on CONNECTORS-1552: - Looks good, but I'd suggest making sure the text capitalization style is consistent with everything else in the connector. > Apache ManifoldCF Elastic Connector for Basic Authorisation > --- > > Key: CONNECTORS-1552 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1552 > Project: ManifoldCF > Issue Type: Improvement > Components: Elastic Search connector >Affects Versions: ManifoldCF 2.10 >Reporter: Krishna Agrawal >Assignee: Steph van Schalkwyk >Priority: Major > Fix For: ManifoldCF 2.12 > > Attachments: screenshot-1.png > > > We are using the Apache Manifold CF to connect the elastic search as our > Elastic server is protected url there is no way we are able to connect from > the Admin console. > If we remove the authentication connector works well but we want to access by > passing username and password. > Please guide us so that we can complete our set up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1529) Add "url" output element to ES Output Connector (required when used with the Web Repository Connector)
[ https://issues.apache.org/jira/browse/CONNECTORS-1529?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16672425#comment-16672425 ] Karl Wright commented on CONNECTORS-1529: - As long as it's a new field, seems that backwards compatibility is preserved, so I'm OK with it. > Add "url" output element to ES Output Connector (required when used with the > Web Repository Connector) > -- > > Key: CONNECTORS-1529 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1529 > Project: ManifoldCF > Issue Type: Improvement > Components: Elastic Search connector >Affects Versions: ManifoldCF 2.10 >Reporter: Steph van Schalkwyk >Assignee: Steph van Schalkwyk >Priority: Major > Fix For: ManifoldCF 2.12 > > Attachments: elasticsearch.patch, image-2018-09-06-10-28-45-008.png > > > Add "url" (copy of the _id field) to ES Output. > ES no longer supports copying from _id (copy-to) in the schema. > As per > !image-2018-09-06-10-28-45-008.png! > -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (LUCENE-8540) Geo3d quantization test failure for MAX/MIN encoding values
[ https://issues.apache.org/jira/browse/LUCENE-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16670641#comment-16670641 ] Karl Wright commented on LUCENE-8540: - [~ivera] Looks reasonable as far as I can tell. The question is whether the decode scaling factor is 'correct' but I think changing that will cause people to need to reindex, so this is a better fix. > Geo3d quantization test failure for MAX/MIN encoding values > --- > > Key: LUCENE-8540 > URL: https://issues.apache.org/jira/browse/LUCENE-8540 > Project: Lucene - Core > Issue Type: Bug > Components: modules/spatial3d >Reporter: Ignacio Vera >Assignee: Ignacio Vera >Priority: Major > Attachments: LUCENE-8540.patch > > > Here is a reproducible error: > {code:java} > 08:45:21[junit4] Suite: org.apache.lucene.spatial3d.TestGeo3DPoint > 08:45:21[junit4] IGNOR/A 0.01s J1 | TestGeo3DPoint.testRandomBig > 08:45:21[junit4]> Assumption #1: 'nightly' test group is disabled > (@Nightly()) > 08:45:21[junit4] 2> NOTE: reproduce with: ant test > -Dtestcase=TestGeo3DPoint -Dtests.method=testQuantization > -Dtests.seed=4CB20CF248F6211 -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=ga-IE -Dtests.timezone=America/Bogota -Dtests.asserts=true > -Dtests.file.encoding=US-ASCII > 08:45:21[junit4] ERROR 0.20s J1 | TestGeo3DPoint.testQuantization <<< > 08:45:21[junit4]> Throwable #1: java.lang.IllegalArgumentException: > value=-1.0011188543037526 is out-of-bounds (less than than WGS84's > -planetMax=-1.0011188539924791) > 08:45:21[junit4]> at > __randomizedtesting.SeedInfo.seed([4CB20CF248F6211:32220FD9326E7F33]:0) > 08:45:21[junit4]> at > org.apache.lucene.spatial3d.Geo3DUtil.encodeValue(Geo3DUtil.java:56) > 08:45:21[junit4]> at > org.apache.lucene.spatial3d.TestGeo3DPoint.testQuantization(TestGeo3DPoint.java:1228) > 08:45:21[junit4]> at java.lang.Thread.run(Thread.java:748) > 08:45:21[junit4] 2> NOTE: test params are: codec=Asserting(Lucene70): > 
{id=PostingsFormat(name=LuceneVarGapDocFreqInterval)}, > docValues:{id=DocValuesFormat(name=Asserting), > point=DocValuesFormat(name=Lucene70)}, maxPointsInLeafNode=659, > maxMBSortInHeap=6.225981846119071, sim=RandomSimilarity(queryNorm=false): {}, > locale=ga-IE, timezone=America/Bogota > 08:45:21[junit4] 2> NOTE: Linux 2.6.32-754.6.3.el6.x86_64 amd64/Oracle > Corporation 1.8.0_181 > (64-bit)/cpus=16,threads=1,free=466116320,total=536346624 > 08:45:21[junit4] 2> NOTE: All tests run in this JVM: [GeoPointTest, > RandomGeoPolygonTest, TestGeo3DPoint] > 08:45:21[junit4] Completed [18/18 (1!)] on J1 in 19.83s, 14 tests, 1 > error, 1 skipped <<< FAILURES!{code} > > It seems this test will fail if encoding = Geo3DUtil.MIN_ENCODED_VALUE or > encoding = Geo3DUtil.MAX_ENCODED_VALUE. > It is related with https://issues.apache.org/jira/browse/LUCENE-7327 > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
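The failure class here is generic to quantization: encoding floors a value into a cell, and decoding back to a point of that cell can land outside the original bounds at the extremes unless the decode scaling (or a clamp) accounts for it. A deliberately simplified illustration of that round-trip effect — not Lucene's actual Geo3D encoding:

```python
import math

def encode(value, scale):
    """Quantize a real value to an integer cell index (floor)."""
    return int(math.floor(value * scale))

def decode(code, scale):
    # Decoding to the cell *center* can exceed the original maximum at the
    # boundary, which is what re-encoding then rejects as out-of-bounds.
    return (code + 0.5) / scale
```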
[jira] [Assigned] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation
[ https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1552: --- Assignee: Steph van Schalkwyk (was: Karl Wright) > Apache ManifoldCF Elastic Connector for Basic Authorisation > --- > > Key: CONNECTORS-1552 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1552 > Project: ManifoldCF > Issue Type: Improvement > Components: Elastic Search connector >Affects Versions: ManifoldCF 2.10 >Reporter: Krishna Agrawal >Assignee: Steph van Schalkwyk >Priority: Major > Fix For: ManifoldCF 2.12 > > > We are using the Apache Manifold CF to connect the elastic search as our > Elastic server is protected url there is no way we are able to connect from > the Admin console. > If we remove the authentication connector works well but we want to access by > passing username and password. > Please guide us so that we can complete our set up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation
[ https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16667589#comment-16667589 ] Karl Wright commented on CONNECTORS-1552: - The ES connector does not currently support any ES authentication requirements whatsoever. This is therefore an enhancement to the current connector, not a bug. Enhancement requests are looked at based on time and availability of the volunteers working on the ManifoldCF project. I would suggest that if you have time-critical need for a new feature, you consider adding it yourself. The earliest I could look at this would be next weekend and that is not guaranteed. > Apache ManifoldCF Elastic Connector for Basic Authorisation > --- > > Key: CONNECTORS-1552 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1552 > Project: ManifoldCF > Issue Type: Improvement > Components: Elastic Search connector >Affects Versions: ManifoldCF 2.10 >Reporter: Krishna Agrawal >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.12 > > > We are using the Apache Manifold CF to connect the elastic search as our > Elastic server is protected url there is no way we are able to connect from > the Admin console. > If we remove the authentication connector works well but we want to access by > passing username and password. > Please guide us so that we can complete our set up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (CONNECTORS-1552) Apache ManifoldCF Elastic Connector for Basic Authorisation
[ https://issues.apache.org/jira/browse/CONNECTORS-1552?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1552: --- Assignee: Karl Wright Priority: Major (was: Blocker) Fix Version/s: ManifoldCF 2.12 Component/s: Elastic Search connector Issue Type: Improvement (was: Bug) > Apache ManifoldCF Elastic Connector for Basic Authorisation > --- > > Key: CONNECTORS-1552 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1552 > Project: ManifoldCF > Issue Type: Improvement > Components: Elastic Search connector >Affects Versions: ManifoldCF 2.10 >Reporter: Krishna Agrawal >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.12 > > > We are using the Apache Manifold CF to connect the elastic search as our > Elastic server is protected url there is no way we are able to connect from > the Admin console. > If we remove the authentication connector works well but we want to access by > passing username and password. > Please guide us so that we can complete our set up. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1551) Various confluence connector issues
[ https://issues.apache.org/jira/browse/CONNECTORS-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1551. - Resolution: Fixed r1844778 > Various confluence connector issues > --- > > Key: CONNECTORS-1551 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1551 > Project: ManifoldCF > Issue Type: Bug > Components: Confluence connector >Affects Versions: ManifoldCF 2.11 >Reporter: Karl Wright >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.12 > > Attachments: CONNECTORS-1551.patch > > > I've just made the patch to extend mcf-confluence-connector. > The official site says that I can create a JIRA ticket for improvements. > But I cannot access the JIRA via the firewall in our office. > Can someone create a ticket instead of me? > The patch is attached to this mail. > [Extension] > o Support the page type 'blogpost' as well as 'page'. (*1) > o Include the Japanese message catalog. > [Bug Fix] > o Ugly message when the 'Port' value is invalid. > o Ugly message of 'Process Attachments' in 'View a Job'. > o Some null pointer exceptions. > (*1) > Confluence has 2 different types of page. > The current connector can only find 'page' typed pages. > This extension can find both of them selectively. > Thanks. > Takashi SHIRAI -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (CONNECTORS-1551) Various confluence connector issues
[ https://issues.apache.org/jira/browse/CONNECTORS-1551?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1551: Attachment: CONNECTORS-1551.patch > Various confluence connector issues > --- > > Key: CONNECTORS-1551 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1551 > Project: ManifoldCF > Issue Type: Bug > Components: Confluence connector >Affects Versions: ManifoldCF 2.11 >Reporter: Karl Wright >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.12 > > Attachments: CONNECTORS-1551.patch > > > I've just made the patch to extend mcf-confluence-connector. > The official site says that I can create a JIRA ticket for improvements. > But I cannot access the JIRA via the firewall in our office. > Can someone create a ticket instead of me? > The patch is attached to this mail. > [Extension] > o Support the page type 'blogpost' as well as 'page'. (*1) > o Include the Japanese message catalog. > [Bug Fix] > o Ugly message when the 'Port' value is invalid. > o Ugly message of 'Process Attachments' in 'View a Job'. > o Some null pointer exceptions. > (*1) > Confluence has 2 different types of page. > The current connector can only find 'page' typed pages. > This extension can find both of them selectively. > Thanks. > Takashi SHIRAI -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (CONNECTORS-1551) Various confluence connector issues
Karl Wright created CONNECTORS-1551: --- Summary: Various confluence connector issues Key: CONNECTORS-1551 URL: https://issues.apache.org/jira/browse/CONNECTORS-1551 Project: ManifoldCF Issue Type: Bug Components: Confluence connector Affects Versions: ManifoldCF 2.11 Reporter: Karl Wright Assignee: Karl Wright Fix For: ManifoldCF 2.12 I've just made the patch to extend mcf-confluence-connector. The official site says that I can create a JIRA ticket for improvements. But I cannot access the JIRA via the firewall in our office. Can someone create a ticket instead of me? The patch is attached to this mail. [Extension] o Support the page type 'blogpost' as well as 'page'. (*1) o Include the Japanese message catalog. [Bug Fix] o Ugly message when the 'Port' value is invalid. o Ugly message of 'Process Attachments' in 'View a Job'. o Some null pointer exceptions. (*1) Confluence has 2 different types of page. The current connector can only find 'page' typed pages. This extension can find both of them selectively. Thanks. Takashi SHIRAI -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (LUCENE-8540) Geo3d quantization test failure for MAX/MIN encoding values
[ https://issues.apache.org/jira/browse/LUCENE-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16660515#comment-16660515 ] Karl Wright commented on LUCENE-8540: - Hi [~ivera], can you have a look at this? I'm quite busy today unfortunately. > Geo3d quantization test failure for MAX/MIN encoding values > --- > > Key: LUCENE-8540 > URL: https://issues.apache.org/jira/browse/LUCENE-8540 > Project: Lucene - Core > Issue Type: Bug > Components: modules/spatial3d >Reporter: Ignacio Vera >Priority: Major > > Here is a reproducible error: > {code:java} > 08:45:21[junit4] Suite: org.apache.lucene.spatial3d.TestGeo3DPoint > 08:45:21[junit4] IGNOR/A 0.01s J1 | TestGeo3DPoint.testRandomBig > 08:45:21[junit4]> Assumption #1: 'nightly' test group is disabled > (@Nightly()) > 08:45:21[junit4] 2> NOTE: reproduce with: ant test > -Dtestcase=TestGeo3DPoint -Dtests.method=testQuantization > -Dtests.seed=4CB20CF248F6211 -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=ga-IE -Dtests.timezone=America/Bogota -Dtests.asserts=true > -Dtests.file.encoding=US-ASCII > 08:45:21[junit4] ERROR 0.20s J1 | TestGeo3DPoint.testQuantization <<< > 08:45:21[junit4]> Throwable #1: java.lang.IllegalArgumentException: > value=-1.0011188543037526 is out-of-bounds (less than than WGS84's > -planetMax=-1.0011188539924791) > 08:45:21[junit4]> at > __randomizedtesting.SeedInfo.seed([4CB20CF248F6211:32220FD9326E7F33]:0) > 08:45:21[junit4]> at > org.apache.lucene.spatial3d.Geo3DUtil.encodeValue(Geo3DUtil.java:56) > 08:45:21[junit4]> at > org.apache.lucene.spatial3d.TestGeo3DPoint.testQuantization(TestGeo3DPoint.java:1228) > 08:45:21[junit4]> at java.lang.Thread.run(Thread.java:748) > 08:45:21[junit4] 2> NOTE: test params are: codec=Asserting(Lucene70): > {id=PostingsFormat(name=LuceneVarGapDocFreqInterval)}, > docValues:{id=DocValuesFormat(name=Asserting), > point=DocValuesFormat(name=Lucene70)}, maxPointsInLeafNode=659, > maxMBSortInHeap=6.225981846119071, 
sim=RandomSimilarity(queryNorm=false): {}, > locale=ga-IE, timezone=America/Bogota > 08:45:21[junit4] 2> NOTE: Linux 2.6.32-754.6.3.el6.x86_64 amd64/Oracle > Corporation 1.8.0_181 > (64-bit)/cpus=16,threads=1,free=466116320,total=536346624 > 08:45:21[junit4] 2> NOTE: All tests run in this JVM: [GeoPointTest, > RandomGeoPolygonTest, TestGeo3DPoint] > 08:45:21[junit4] Completed [18/18 (1!)] on J1 in 19.83s, 14 tests, 1 > error, 1 skipped <<< FAILURES!{code} > > It seems this test will fail if encoding = Geo3DUtil.MIN_ENCODED_VALUE or > encoding = Geo3DUtil.MAX_ENCODED_VALUE. > It is related with https://issues.apache.org/jira/browse/LUCENE-7327 > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-8540) Geo3d quantization test failure for MAX/MIN encoding values
[ https://issues.apache.org/jira/browse/LUCENE-8540?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned LUCENE-8540: --- Assignee: Ignacio Vera > Geo3d quantization test failure for MAX/MIN encoding values > --- > > Key: LUCENE-8540 > URL: https://issues.apache.org/jira/browse/LUCENE-8540 > Project: Lucene - Core > Issue Type: Bug > Components: modules/spatial3d >Reporter: Ignacio Vera >Assignee: Ignacio Vera >Priority: Major > > Here is a reproducible error: > {code:java} > 08:45:21[junit4] Suite: org.apache.lucene.spatial3d.TestGeo3DPoint > 08:45:21[junit4] IGNOR/A 0.01s J1 | TestGeo3DPoint.testRandomBig > 08:45:21[junit4]> Assumption #1: 'nightly' test group is disabled > (@Nightly()) > 08:45:21[junit4] 2> NOTE: reproduce with: ant test > -Dtestcase=TestGeo3DPoint -Dtests.method=testQuantization > -Dtests.seed=4CB20CF248F6211 -Dtests.slow=true -Dtests.badapples=true > -Dtests.locale=ga-IE -Dtests.timezone=America/Bogota -Dtests.asserts=true > -Dtests.file.encoding=US-ASCII > 08:45:21[junit4] ERROR 0.20s J1 | TestGeo3DPoint.testQuantization <<< > 08:45:21[junit4]> Throwable #1: java.lang.IllegalArgumentException: > value=-1.0011188543037526 is out-of-bounds (less than than WGS84's > -planetMax=-1.0011188539924791) > 08:45:21[junit4]> at > __randomizedtesting.SeedInfo.seed([4CB20CF248F6211:32220FD9326E7F33]:0) > 08:45:21[junit4]> at > org.apache.lucene.spatial3d.Geo3DUtil.encodeValue(Geo3DUtil.java:56) > 08:45:21[junit4]> at > org.apache.lucene.spatial3d.TestGeo3DPoint.testQuantization(TestGeo3DPoint.java:1228) > 08:45:21[junit4]> at java.lang.Thread.run(Thread.java:748) > 08:45:21[junit4] 2> NOTE: test params are: codec=Asserting(Lucene70): > {id=PostingsFormat(name=LuceneVarGapDocFreqInterval)}, > docValues:{id=DocValuesFormat(name=Asserting), > point=DocValuesFormat(name=Lucene70)}, maxPointsInLeafNode=659, > maxMBSortInHeap=6.225981846119071, sim=RandomSimilarity(queryNorm=false): {}, > locale=ga-IE, 
timezone=America/Bogota > 08:45:21[junit4] 2> NOTE: Linux 2.6.32-754.6.3.el6.x86_64 amd64/Oracle > Corporation 1.8.0_181 > (64-bit)/cpus=16,threads=1,free=466116320,total=536346624 > 08:45:21[junit4] 2> NOTE: All tests run in this JVM: [GeoPointTest, > RandomGeoPolygonTest, TestGeo3DPoint] > 08:45:21[junit4] Completed [18/18 (1!)] on J1 in 19.83s, 14 tests, 1 > error, 1 skipped <<< FAILURES!{code} > > It seems this test will fail if encoding = Geo3DUtil.MIN_ENCODED_VALUE or > encoding = Geo3DUtil.MAX_ENCODED_VALUE. > It is related with https://issues.apache.org/jira/browse/LUCENE-7327 > > > -- This message was sent by Atlassian JIRA (v7.6.3#76005) - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
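The boundary failure above can be illustrated with a toy quantization scheme. This is a hypothetical sketch, not the actual Geo3DUtil arithmetic: only the planetMax constant comes from the test log, and the encode/decode formulas here are invented to show the general mechanism — the minimum bucket's decoded representative can land a fraction of a quantization step below -planetMax and then fail the encoder's own bounds check, mirroring the IllegalArgumentException in the trace.

```java
// Hypothetical model of the failure mode; NOT the real Geo3DUtil code.
class QuantizationBoundaryDemo {
    // planetMax for WGS84, taken from the test log
    static final double PLANET_MAX = 1.0011188539924791;
    static final double DECODE = PLANET_MAX / (1L << 31); // one quantization step

    // Encode with a bounds check, like Geo3DUtil.encodeValue in the stack trace.
    static int encode(double x) {
        if (x < -PLANET_MAX || x > PLANET_MAX)
            throw new IllegalArgumentException("value=" + x + " is out-of-bounds");
        return (int) Math.ceil(x / DECODE);
    }

    // Decode each bucket to the midpoint *below* its edge. For interior buckets
    // this is fine, but the minimum bucket decodes DECODE/2 below -planetMax.
    static double decode(int e) {
        return (e - 0.5) * DECODE;
    }

    public static void main(String[] args) {
        int minEncoded = encode(-PLANET_MAX);   // lands in the minimum bucket
        double decoded = decode(minEncoded);
        System.out.println("decoded=" + decoded + " vs -planetMax=" + (-PLANET_MAX));
        try {
            encode(decoded);                    // re-encode, as the test does
        } catch (IllegalArgumentException ex) {
            System.out.println("re-encode failed: " + ex.getMessage());
        }
    }
}
```

The shape of the bug — a round-trip through the extreme encoded values escaping the legal range — is what "test will fail if encoding = Geo3DUtil.MIN_ENCODED_VALUE or MAX_ENCODED_VALUE" describes; the real fix has to make decode's boundary representatives stay inside [-planetMax, planetMax].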
[jira] [Resolved] (CONNECTORS-1550) HTML Tag mapping
[ https://issues.apache.org/jira/browse/CONNECTORS-1550?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1550. - Resolution: Not A Problem Hi [~DonaldVdD], please post questions like this to the us...@manifoldcf.apache.org mailing list. Jira is meant for bugs and enhancement requests. Thank you! > HTML Tag mapping > > > Key: CONNECTORS-1550 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1550 > Project: ManifoldCF > Issue Type: Wish > Components: Elastic Search connector, Tika extractor, Web connector >Affects Versions: ManifoldCF 2.10 >Reporter: Donald Van den Driessche >Priority: Major > > I’ll be crawling a website with the standard Web connector. I want to extract > just certain HTML tags like , and . > I’ve set up an HTML extractor transformation connector and the internal Tika > transformation connector. But I can’t find any place to do a mapping to the > output for this. > > Do I have to write my own transformation connector to extract the content of > these tags? Or is there a built-in solution? -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (CONNECTORS-1549) Include and exclude rules order lost
[ https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1549: Attachment: CONNECTORS-1549.patch > Include and exclude rules order lost > > > Key: CONNECTORS-1549 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1549 > Project: ManifoldCF > Issue Type: Bug > Components: API, JCIFS connector >Affects Versions: ManifoldCF 2.11 >Reporter: Julien Massiera >Assignee: Karl Wright >Priority: Critical > Fix For: ManifoldCF 2.12 > > Attachments: CONNECTORS-1549.patch, > image-2018-10-18-18-28-14-547.png, image-2018-10-18-18-33-01-577.png, > image-2018-10-18-18-34-01-542.png > > > The include and exclude rules that can be defined in the job configuration > for the JCIFS connector can be combined, and the defined order is really > important. > The problem is that when one retrieves the job configuration as a JSON object > through the API, the include and exclude rules are split into two different > arrays instead of one (one for each type of rule). So the order is > completely lost when one tries to recreate the job via the API and the > JSON object. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1549) Include and exclude rules order lost
[ https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1549. - Resolution: Fixed r1844293 > Include and exclude rules order lost > > > Key: CONNECTORS-1549 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1549 > Project: ManifoldCF > Issue Type: Bug > Components: API, JCIFS connector >Affects Versions: ManifoldCF 2.11 >Reporter: Julien Massiera >Assignee: Karl Wright >Priority: Critical > Fix For: ManifoldCF 2.12 > > Attachments: CONNECTORS-1549.patch, > image-2018-10-18-18-28-14-547.png, image-2018-10-18-18-33-01-577.png, > image-2018-10-18-18-34-01-542.png > > > The include and exclude rules that can be defined in the job configuration > for the JCIFS connector can be combined and the defined order is really > important. > The problem is that when one retrieve the job configuration as a json object > through the API, the include and exclude rules are splitted in two diffrent > arrays instead of one (one for each type of rule). So, the order is > completely lost when one try to recreate the job thanks to the API and the > JSON object. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (CONNECTORS-1549) Include and exclude rules order lost
[ https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1549: Fix Version/s: ManifoldCF 2.12 > Include and exclude rules order lost > > > Key: CONNECTORS-1549 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1549 > Project: ManifoldCF > Issue Type: Bug > Components: API, JCIFS connector >Affects Versions: ManifoldCF 2.11 >Reporter: Julien Massiera >Assignee: Karl Wright >Priority: Critical > Fix For: ManifoldCF 2.12 > > Attachments: image-2018-10-18-18-28-14-547.png, > image-2018-10-18-18-33-01-577.png, image-2018-10-18-18-34-01-542.png > > > The include and exclude rules that can be defined in the job configuration > for the JCIFS connector can be combined and the defined order is really > important. > The problem is that when one retrieve the job configuration as a json object > through the API, the include and exclude rules are splitted in two diffrent > arrays instead of one (one for each type of rule). So, the order is > completely lost when one try to recreate the job thanks to the API and the > JSON object. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1549) Include and exclude rules order lost
[ https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16656073#comment-16656073 ] Karl Wright commented on CONNECTORS-1549: - I found the issue and have attached a patch. Thanks! > Include and exclude rules order lost > > > Key: CONNECTORS-1549 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1549 > Project: ManifoldCF > Issue Type: Bug > Components: API, JCIFS connector >Affects Versions: ManifoldCF 2.11 >Reporter: Julien Massiera >Assignee: Karl Wright >Priority: Critical > Attachments: image-2018-10-18-18-28-14-547.png, > image-2018-10-18-18-33-01-577.png, image-2018-10-18-18-34-01-542.png > > > The include and exclude rules that can be defined in the job configuration > for the JCIFS connector can be combined and the defined order is really > important. > The problem is that when one retrieve the job configuration as a json object > through the API, the include and exclude rules are splitted in two diffrent > arrays instead of one (one for each type of rule). So, the order is > completely lost when one try to recreate the job thanks to the API and the > JSON object. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1549) Include and exclude rules order lost
[ https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655986#comment-16655986 ] Karl Wright commented on CONNECTORS-1549: - Hi [~julienFL], sorry for the delay. First note that you can always use the order-preserving form even if MCF outputs the JSON in the other "sugary" form. So this should unblock you. Second, I'm looking at the code that generates the output in Configuration.java: {code} // The new JSON parser uses hash order for object keys. So it isn't good enough to just detect that there's an // intermingling. Instead we need to detect the existence of more than one key; that implies that we need to do order preservation. String lastChildType = null; boolean needAlternate = false; int i = 0; while (i < getChildCount()) { ConfigurationNode child = findChild(i++); String key = child.getType(); List list = childMap.get(key); if (list == null) { // We found no existing list, so create one list = new ArrayList(); childMap.put(key,list); childList.add(key); } // Key order comes into play when we have elements of different types within the same child. if (lastChildType != null && !lastChildType.equals(key)) { needAlternate = true; break; } list.add(child); lastChildType = key; } if (needAlternate) { // Can't use the array representation. We'll need to start a _children_ object, and enumerate // each child. So, the JSON will look like: // :{_attribute_:xxx,_children_:[{_type_:, ...},{_type_:, ...}, ...]} ... {code} The (needAlternate) clause is the one that writes the specification in the verbose form. The logic seems like it would detect any time there's a subtree with a different key under a given level and set "needAlternate". I'll stare at it some more, but right now I'm having trouble seeing how this fails. 
> Include and exclude rules order lost > > > Key: CONNECTORS-1549 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1549 > Project: ManifoldCF > Issue Type: Bug > Components: API, JCIFS connector >Affects Versions: ManifoldCF 2.11 >Reporter: Julien Massiera >Assignee: Karl Wright >Priority: Critical > Attachments: image-2018-10-18-18-28-14-547.png, > image-2018-10-18-18-33-01-577.png, image-2018-10-18-18-34-01-542.png > > > The include and exclude rules that can be defined in the job configuration > for the JCIFS connector can be combined and the defined order is really > important. > The problem is that when one retrieve the job configuration as a json object > through the API, the include and exclude rules are splitted in two diffrent > arrays instead of one (one for each type of rule). So, the order is > completely lost when one try to recreate the job thanks to the API and the > JSON object. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
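The detection loop quoted in that comment can be reduced to a standalone sketch (class and method names here are mine, not ManifoldCF's). The key property is that needAlternate flips to true as soon as two different child keys appear at the same level, because any sequence containing two distinct keys must have some adjacent pair that differs:

```java
import java.util.List;

class NeedAlternateDemo {
    // Simplified version of the Configuration.java loop quoted above:
    // returns true when the order-preserving ("verbose") JSON form is needed.
    static boolean needAlternate(List<String> childTypes) {
        String lastChildType = null;
        for (String key : childTypes) {
            // Key order matters once two different types share the same level.
            if (lastChildType != null && !lastChildType.equals(key)) return true;
            lastChildType = key;
        }
        return false;
    }

    public static void main(String[] args) {
        System.out.println(needAlternate(List.of("include", "include")));            // false
        System.out.println(needAlternate(List.of("include", "exclude", "include"))); // true
        System.out.println(needAlternate(List.of("include", "exclude")));            // true
    }
}
```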
[jira] [Commented] (CONNECTORS-1549) Include and exclude rules order lost
[ https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16655223#comment-16655223 ] Karl Wright commented on CONNECTORS-1549: - Hi [~julienFL], there was a similar ticket a while back for the file system connector. Let me explain what the solution was and see if you still think there is a problem. (1) The actual internal representation of a Document Specification is XML. (2) For the API, we convert the XML to JSON and back. (3) Because a complete and unambiguous conversion between these formats is quite ugly, we have multiple ways of doing the conversion, so that we allow "syntactic sugar" in the JSON for specific cases where the conversion can be done simply. (4) A while back, there was a bug in the code that determined whether it was possible to use syntactic sugar of the specific kind that would lead to two independent lists for the File System Connector's document specification, so for a while what was *output* when you exported the Job was incorrect, and order would be lost if you re-imported it. The solution was to (a) fix the bug, and (b) get the person using the API to use the correct, unambiguous JSON format instead of the "sugary" format. This preserves order. The way to see if this is what you are up against is to create a JCIFS job with a complex rule set that has both inclusions and exclusions, and export it via the API. If the JSON looks different from what you are expecting, then try replicating that format when you import via the API. > Include and exclude rules order lost > > > Key: CONNECTORS-1549 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1549 > Project: ManifoldCF > Issue Type: Bug > Components: API, JCIFS connector >Affects Versions: ManifoldCF 2.11 >Reporter: Julien Massiera >Assignee: Karl Wright >Priority: Critical > > The include and exclude rules that can be defined in the job configuration > for the JCIFS connector can be combined and the defined order is really > important. 
> The problem is that when one retrieves the job configuration as a JSON object > through the API, the include and exclude rules are split into two different > arrays instead of one (one for each type of rule). So the order is > completely lost when one tries to recreate the job via the API and the > JSON object. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
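The order-loss mechanism described in this thread can be shown with a minimal sketch (the rule strings and helper are illustrative, not ManifoldCF code). Grouping an interleaved include/exclude list by rule type, as the "sugary" JSON form does, yields two separate arrays from which the original interleaving cannot be recovered:

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RuleOrderDemo {
    // Group an interleaved {type, pattern} rule list by type, the way the
    // "sugary" JSON form emits one array per rule type.
    static Map<String, List<String>> groupByType(List<String[]> rules) {
        Map<String, List<String>> grouped = new LinkedHashMap<>();
        for (String[] r : rules)
            grouped.computeIfAbsent(r[0], k -> new ArrayList<>()).add(r[1]);
        return grouped;
    }

    public static void main(String[] args) {
        List<String[]> rules = List.of(
            new String[]{"include", "*.doc"},
            new String[]{"exclude", "tmp/*"},
            new String[]{"include", "*.pdf"});
        // Prints {include=[*.doc, *.pdf], exclude=[tmp/*]} -- the fact that
        // "tmp/*" sat *between* the two include rules is gone, so replaying
        // the two arrays on import cannot reproduce the original evaluation order.
        System.out.println(groupByType(rules));
    }
}
```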
[jira] [Assigned] (CONNECTORS-1549) Include and exclude rules order lost
[ https://issues.apache.org/jira/browse/CONNECTORS-1549?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1549: --- Assignee: Karl Wright > Include and exclude rules order lost > > > Key: CONNECTORS-1549 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1549 > Project: ManifoldCF > Issue Type: Bug > Components: API, JCIFS connector >Affects Versions: ManifoldCF 2.11 >Reporter: Julien Massiera >Assignee: Karl Wright >Priority: Critical > > The include and exclude rules that can be defined in the job configuration > for the JCIFS connector can be combined and the defined order is really > important. > The problem is that when one retrieve the job configuration as a json object > through the API, the include and exclude rules are splitted in two diffrent > arrays instead of one (one for each type of rule). So, the order is > completely lost when one try to recreate the job thanks to the API and the > JSON object. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (CONNECTORS-1548) CMIS output connector test fails with versioning state error
[ https://issues.apache.org/jira/browse/CONNECTORS-1548?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1548: Description: While working on the upgrade to Tika 1.19.1, I ran into CMIS output connector test failures. Specifically, here's the trace: {code} [junit] org.apache.manifoldcf.core.interfaces.ManifoldCFException: The versioning state flag is imcompatible to the type definition. [junit] at org.apache.manifoldcf.agents.output.cmisoutput.CmisOutputConnector.addOrReplaceDocumentWithException(CmisOutputConnector.java:994) {code} Nested exception is: {code} [junit] Caused by: org.apache.chemistry.opencmis.commons.exceptions.CmisConstraintException: The versioning state flag is imcompatible to the type definition. [junit] at org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.convertStatusCode(AbstractAtomPubService.java:514) [junit] at org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.post(AbstractAtomPubService.java:717) [junit] at org.apache.chemistry.opencmis.client.bindings.spi.atompub.ObjectServiceImpl.createDocument(ObjectServiceImpl.java:122) [junit] at org.apache.chemistry.opencmis.client.runtime.SessionImpl.createDocument(SessionImpl.java:1158) {code} This may (or may not) be related to the Tika code now using a different implementation of jaxb. I've moved all of jaxb and its dependent classes into connector-common-lib accordingly, and have no specific inclusions of jaxb in any connector class that would need it to be in connector-lib. It has been committed to trunk; r1844137. Please verify (or disprove) that the problem is the new jaxb implementation. If it is we'll need to figure out why CMIS cares which implementation is used. was: While working on the upgrade to Tika 1.19.1, I ran into CMIS output connector test failures. 
Specifically, here's the trace: {code} [junit] org.apache.manifoldcf.core.interfaces.ManifoldCFException: The versioning state flag is imcompatible to the type definition. [junit] at org.apache.manifoldcf.agents.output.cmisoutput.CmisOutputConnector.addOrReplaceDocumentWithException(CmisOutputConnector.java:994) {code} This may (or may not) be related to the Tika code now using a different implementation of jaxb. I've moved all of jaxb and its dependent classes into connector-common-lib accordingly, and have no specific inclusions of jaxb in any connector class that would need it to be in connector-lib. It has been committed to trunk; r1844137. Please verify (or disprove) that the problem is the new jaxb implementation. If it is we'll need to figure out why CMIS cares which implementation is used. > CMIS output connector test fails with versioning state error > > > Key: CONNECTORS-1548 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1548 > Project: ManifoldCF > Issue Type: Bug > Components: CMIS Output Connector >Reporter: Karl Wright >Assignee: Piergiorgio Lucidi >Priority: Major > Fix For: ManifoldCF 2.12 > > > While working on the upgrade to Tika 1.19.1, I ran into CMIS output connector > test failures. Specifically, here's the trace: > {code} > [junit] org.apache.manifoldcf.core.interfaces.ManifoldCFException: The > versioning state flag is imcompatible to the type definition. > [junit] at > org.apache.manifoldcf.agents.output.cmisoutput.CmisOutputConnector.addOrReplaceDocumentWithException(CmisOutputConnector.java:994) > {code} > Nested exception is: > {code} > [junit] Caused by: > org.apache.chemistry.opencmis.commons.exceptions.CmisConstraintException: The > versioning state flag is imcompatible to the type definition. 
> [junit] at > org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.convertStatusCode(AbstractAtomPubService.java:514) > [junit] at > org.apache.chemistry.opencmis.client.bindings.spi.atompub.AbstractAtomPubService.post(AbstractAtomPubService.java:717) > [junit] at > org.apache.chemistry.opencmis.client.bindings.spi.atompub.ObjectServiceImpl.createDocument(ObjectServiceImpl.java:122) > [junit] at > org.apache.chemistry.opencmis.client.runtime.SessionImpl.createDocument(SessionImpl.java:1158) > {code} > This may (or may not) be related to the Tika code now using a different > implementation of jaxb. I've moved all of jaxb and its dependent classes > into connector-common-lib accordingly, and have no specific inclusions of > jaxb in any connector class that would need it to be in connector-lib. > It has been committed to trunk; r1844137. Please verify (or disprove) that > the problem is the new jaxb implementation. If it is
[jira] [Created] (CONNECTORS-1548) CMIS output connector test fails with versioning state error
Karl Wright created CONNECTORS-1548: --- Summary: CMIS output connector test fails with versioning state error Key: CONNECTORS-1548 URL: https://issues.apache.org/jira/browse/CONNECTORS-1548 Project: ManifoldCF Issue Type: Bug Components: CMIS Output Connector Reporter: Karl Wright Assignee: Piergiorgio Lucidi Fix For: ManifoldCF 2.12 While working on the upgrade to Tika 1.19.1, I ran into CMIS output connector test failures. Specifically, here's the trace: {code} [junit] org.apache.manifoldcf.core.interfaces.ManifoldCFException: The versioning state flag is imcompatible to the type definition. [junit] at org.apache.manifoldcf.agents.output.cmisoutput.CmisOutputConnector.addOrReplaceDocumentWithException(CmisOutputConnector.java:994) {code} This may (or may not) be related to the Tika code now using a different implementation of jaxb. I've moved all of jaxb and its dependent classes into connector-common-lib accordingly, and have no specific inclusions of jaxb in any connector class that would need it to be in connector-lib. It has been committed to trunk; r1844137. Please verify (or disprove) that the problem is the new jaxb implementation. If it is we'll need to figure out why CMIS cares which implementation is used. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Resolved] (CONNECTORS-1547) No activity record for excluded documents in WebCrawlerConnector
[ https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright resolved CONNECTORS-1547. - Resolution: Fixed r1844120 > No activity record for excluded documents in WebCrawlerConnector > > > Key: CONNECTORS-1547 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1547 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Reporter: Olivier Tavard >Assignee: Karl Wright >Priority: Minor > Fix For: ManifoldCF 2.12 > > Attachments: manifoldcf_local_files.log, manifoldcf_web.log, > simple_history_files.jpg, simple_history_web.jpg > > > Hi, > I noticed that there is no activity record logged for documents excluded by > the Document Filter transformation connector in the WebCrawler connector. > To reproduce the issue on MCF out of the box: > Null output connector > Web repository connector > Job: > - DocumentFilter added which only accepts application/msword (doc/docx) > documents > The simple history does not mention the documents excluded (except for HTML > documents). They have fetch activity and that's all (see > simple_history_web.jpeg). > We can only see the documents excluded in the MCF log (with DEBUG verbosity > activated on connectors): > {code:java} > Removing url > 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png' > because it had the wrong content type ('image/png'){code} > (see manifoldcf_local_files.log) > The related code is in WebcrawlerConnector.java l.904: > {code:java} > fetchStatus.contextMessage = "it had the wrong content type > ('"+contentType+"')"; > fetchStatus.resultSignal = RESULT_NO_DOCUMENT; > activityResultCode = null;{code} > The activityResultCode is null. 
> > > If we configure the same job but for a Local File System connector with the > same Document Filter transformation connector, the simple history mentions > all the excluded documents (see > simple_history_files.jpeg), and the code uses a specific error code with > an activity record logged (class FileConnector l. 415): > {code:java} > if (!activities.checkMimeTypeIndexable(mimeType)) > { > errorCode = activities.EXCLUDED_MIMETYPE; > errorDesc = "Excluded because mime type ('"+mimeType+"')"; > Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because > mime type ('"+mimeType+"') was excluded by output connector."); > activities.noDocument(documentIdentifier,versionString); > continue; > }{code} > > So I think the Web Crawler connector should have the same behaviour as > FileConnector and explicitly mention all the documents excluded by the user's > configuration. > > Best regards, > Olivier -- This message was sent by Atlassian JIRA (v7.6.3#76005)
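The behavioural fix being requested can be sketched in miniature (every name below is invented for illustration, not the actual WebcrawlerConnector fields): the exclusion path should always yield an explicit, non-null result code, the way FileConnector's EXCLUDED_MIMETYPE branch does, so that an activity row is recorded for every excluded document:

```java
// Hypothetical sketch of the desired behavior; names are illustrative only.
class ExclusionLoggingDemo {
    static final String EXCLUDED_CONTENT_TYPE = "EXCLUDEDCONTENTTYPE";

    // Decide the activity result code for a fetched document.
    // The pre-fix web connector effectively returned null for the excluded
    // case, so the simple history showed only the fetch and nothing else.
    static String resultCodeFor(String contentType, boolean accepted) {
        return accepted ? "OK" : EXCLUDED_CONTENT_TYPE;
    }

    public static void main(String[] args) {
        // An excluded image now produces a loggable code instead of null.
        System.out.println("activity result code: " + resultCodeFor("image/png", false));
    }
}
```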
[jira] [Updated] (CONNECTORS-1547) No activity record for excluded documents in WebCrawlerConnector
[ https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright updated CONNECTORS-1547: Fix Version/s: ManifoldCF 2.12 > No activity record for for excluded documents in WebCrawlerConnector > > > Key: CONNECTORS-1547 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1547 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Reporter: Olivier Tavard >Assignee: Karl Wright >Priority: Minor > Fix For: ManifoldCF 2.12 > > Attachments: manifoldcf_local_files.log, manifoldcf_web.log, > simple_history_files.jpg, simple_history_web.jpg > > > Hi, > I noticed that there is no activity record logged for documents excluded by > the Document Filter transformation connector in the WebCrawler connector. > To reproduce the issue on MCF out of the box : > Null output connector > Web repository connector > Job : > - DocumentFilter added which only accepts application/msword (doc/docx) > documents > The simple history does not mention the documents excluded (excepted for html > documents). They have fetch activity and that's all (see > simple_history_web.jpeg). > We can only see the documents excluded by the MCF log (with DEBUG verbosity > activity on connectors) : > {code:java} > Removing url > 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png' > because it had the wrong content type ('image/png'){code} > (see manifoldcf_local_files.log) > The related code is in WebcrawlerConnector.java l.904 : > {code:java} > fetchStatus.contextMessage = "it had the wrong content type > ('"+contentType+"')"; > fetchStatus.resultSignal = RESULT_NO_DOCUMENT; > activityResultCode = null;{code} > The activityResultCode is null. 
> > > If we configure the same job but for a Local File system connector with the > same Document Filter transformation connector, the simple history mentions > all the documents excluded in the simple history (see > simple_history_files.jpeg) and the code mentions a specific error code with > an activity record logged (class FileConnector l. 415) : > {code:java} > if (!activities.checkMimeTypeIndexable(mimeType)) > { > errorCode = activities.EXCLUDED_MIMETYPE; > errorDesc = "Excluded because mime type ('"+mimeType+"')"; > Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because > mime type ('"+mimeType+"') was excluded by output connector."); > activities.noDocument(documentIdentifier,versionString); > continue; > }{code} > > So the Web Crawler connector should have the same behaviour than for > FileConnector and explicitly mention all the documents excluded by the user I > think. > > Best regards, > Olivier -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Assigned] (CONNECTORS-1547) No activity record for excluded documents in WebCrawlerConnector
[ https://issues.apache.org/jira/browse/CONNECTORS-1547?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Karl Wright reassigned CONNECTORS-1547: --- Assignee: Karl Wright > No activity record for for excluded documents in WebCrawlerConnector > > > Key: CONNECTORS-1547 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1547 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Reporter: Olivier Tavard >Assignee: Karl Wright >Priority: Minor > Attachments: manifoldcf_local_files.log, manifoldcf_web.log, > simple_history_files.jpg, simple_history_web.jpg > > > Hi, > I noticed that there is no activity record logged for documents excluded by > the Document Filter transformation connector in the WebCrawler connector. > To reproduce the issue on MCF out of the box : > Null output connector > Web repository connector > Job : > - DocumentFilter added which only accepts application/msword (doc/docx) > documents > The simple history does not mention the documents excluded (excepted for html > documents). They have fetch activity and that's all (see > simple_history_web.jpeg). > We can only see the documents excluded by the MCF log (with DEBUG verbosity > activity on connectors) : > {code:java} > Removing url > 'https://www.datafari.com/assets/img/Logo_Datafari_4_Condensed_No_D_20180606_30x30.png' > because it had the wrong content type ('image/png'){code} > (see manifoldcf_local_files.log) > The related code is in WebcrawlerConnector.java l.904 : > {code:java} > fetchStatus.contextMessage = "it had the wrong content type > ('"+contentType+"')"; > fetchStatus.resultSignal = RESULT_NO_DOCUMENT; > activityResultCode = null;{code} > The activityResultCode is null. 
> > > If we configure the same job but for a Local File system connector with the > same Document Filter transformation connector, the simple history mentions > all the documents excluded in the simple history (see > simple_history_files.jpeg) and the code mentions a specific error code with > an activity record logged (class FileConnector l. 415) : > {code:java} > if (!activities.checkMimeTypeIndexable(mimeType)) > { > errorCode = activities.EXCLUDED_MIMETYPE; > errorDesc = "Excluded because mime type ('"+mimeType+"')"; > Logging.connectors.debug("Skipping file '"+documentIdentifier+"' because > mime type ('"+mimeType+"') was excluded by output connector."); > activities.noDocument(documentIdentifier,versionString); > continue; > }{code} > > So the Web Crawler connector should have the same behaviour than for > FileConnector and explicitly mention all the documents excluded by the user I > think. > > Best regards, > Olivier -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'
[ https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651950#comment-16651950 ] Karl Wright commented on CONNECTORS-1546: - I agree with your decision. > Optimize Elasticsearch performance by removing 'forcemerge' > --- > > Key: CONNECTORS-1546 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1546 > Project: ManifoldCF > Issue Type: Improvement > Components: Elastic Search connector >Reporter: Hans Van Goethem >Assignee: Steph van Schalkwyk >Priority: Major > > After crawling with ManifoldCF, forcemerge is applied to optimize the > Elasticsearch index. This optimization makes Elasticsearch faster for > read operations but not for write operations. On the contrary, performance on > write operations becomes worse after every forcemerge. > Can you remove this forcemerge in ManifoldCF to optimize performance for > recurrent crawling to Elasticsearch? > If someone needs this forcemerge, it can be applied manually against > Elasticsearch directly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1546) Optimize Elasticsearch performance by removing 'forcemerge'
[ https://issues.apache.org/jira/browse/CONNECTORS-1546?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16651761#comment-16651761 ] Karl Wright commented on CONNECTORS-1546: - Hi [~st...@remcam.net], can you comment on this? > Optimize Elasticsearch performance by removing 'forcemerge' > --- > > Key: CONNECTORS-1546 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1546 > Project: ManifoldCF > Issue Type: Improvement > Components: Elastic Search connector >Reporter: Hans Van Goethem >Assignee: Steph van Schalkwyk >Priority: Major > > After crawling with ManifoldCF, forcemerge is applied to optimize the > Elasticsearch index. This optimization makes Elasticsearch faster for > read operations but not for write operations. On the contrary, performance on > write operations becomes worse after every forcemerge. > Can you remove this forcemerge in ManifoldCF to optimize performance for > recurrent crawling to Elasticsearch? > If someone needs this forcemerge, it can be applied manually against > Elasticsearch directly. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
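For anyone who wants the manual force merge the reporter mentions, Elasticsearch exposes it as a documented REST endpoint (POST /<index>/_forcemerge). This sketch only builds the request path; the host, index name, and segment count are placeholders, and the actual HTTP call is shown in a comment since it needs a reachable cluster:

```java
// Build the Elasticsearch force-merge endpoint for a manual merge after crawling.
class ForceMergeDemo {
    // max_num_segments is the documented query parameter controlling how far
    // the merge goes; 1 collapses the index to a single segment.
    static String forceMergeEndpoint(String index, int maxNumSegments) {
        return "/" + index + "/_forcemerge?max_num_segments=" + maxNumSegments;
    }

    public static void main(String[] args) {
        String endpoint = forceMergeEndpoint("myindex", 1);
        System.out.println("POST " + endpoint);
        // To actually run it against a cluster at localhost:9200:
        // java.net.http.HttpRequest req = java.net.http.HttpRequest
        //     .newBuilder(java.net.URI.create("http://localhost:9200" + endpoint))
        //     .POST(java.net.http.HttpRequest.BodyPublishers.noBody()).build();
        // java.net.http.HttpClient.newHttpClient()
        //     .send(req, java.net.http.HttpResponse.BodyHandlers.ofString());
    }
}
```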