[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731261#comment-16731261 ] Tim Steenbeke commented on CONNECTORS-1562: --- [~kwri...@metacarta.com] I think there was some miscommunication: the "stopped working" issue was found by my colleague Donald Van den Driessche, so I didn't have any more info than what he gave me. I recreated the issue and this is the error: {code:java} Error: Repeated service interruptions - failure processing document: Stream Closed{code} !Screenshot from 2018-12-31 11-17-29.png! The question I wanted answered is: how are we supposed to set up the job with the data we have? What you see as the best solution might not be the right solution for us. I asked this, and you only responded to the other issue with ManifoldCF; it looked like you avoided the question. You suggested using the sitemap URL with excludes, but that is simply not possible: the exclude list is too big, and no regular expression is possible because of the randomness of the links. On this point I also thought that you were looking into this and had found a fix or edited the code. I'm sorry if my text came across as blunt, but I'm just trying to get information and I didn't know any other way to draw your attention to the full picture of the comment. English is not my first language, so I'm sorry for my limited vocabulary; Google Translate also doesn't help here. I hope we can continue this communication to get to a solution, hopefully one that works for both of us.
> Documents unreachable due to hopcount are not considered unreachable on
> cleanup pass
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
>                      Elasticsearch 6.3.2
>                      Web input connector
>                      Elastic output connector
>                      Job crawls website input and outputs content to Elasticsearch
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>             Fix For: ManifoldCF 2.12
>
>         Attachments: Screenshot from 2018-12-31 11-17-29.png,
>                      manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to
> keep it running even after updating it.
> After the rerun the unreachable documents don't get deleted.
> It only adds documents when they can be reached.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: Screenshot from 2018-12-31 11-17-29.png
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731315#comment-16731315 ] Tim Steenbeke commented on CONNECTORS-1562: --- Is this the error?
{code:java}
WARN 2018-12-31T08:24:46,453 (Worker thread '32') - Service interruption reported for job 1546241012417 connection 'repo_website-en': IO exception: Stream Closed
WARN 2018-12-31T08:28:52,471 (Worker thread '6') - Service interruption reported for job 1546241012417 connection 'repo_website-en': IO exception: Stream Closed
WARN 2018-12-31T08:32:10,699 (Worker thread '13') - Service interruption reported for job 1546241012417 connection 'repo_website-en': IO exception: Stream Closed
ERROR 2018-12-31T08:32:10,750 (Worker thread '13') - Exception tossed: Repeated service interruptions - failure processing document: Stream Closed
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Stream Closed
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:489) [mcf-pull-agent.jar:?]
Caused by: java.io.IOException: Stream Closed
    at java.io.FileInputStream.readBytes(Native Method) ~[?:1.8.0_191]
    at java.io.FileInputStream.read(FileInputStream.java:255) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_191]
    at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_191]
    at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex$IndexRequestEntity.writeTo(ElasticSearchIndex.java:221) ~[?:?]
    at org.apache.http.impl.execchain.RequestEntityProxy.writeTo(RequestEntityProxy.java:121) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:156) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:160) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:238) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection$CallThread.run(ElasticSearchConnection.java:133) ~[?:?]
WARN 2018-12-31T08:33:35,958 (Job notification thread) - ES: Commit failed: {"error":"Incorrect HTTP method for uri [/website-en/_optimize] and method [GET], allowed: [POST]","status":405}
WARN 2018-12-31T08:34:46,024 (Job notification thread) - ES: Commit failed: {"error":"Incorrect HTTP method for uri [/pintra/_optimize] and method [GET], allowed: [POST]","status":405}
{code}
The timestamps are 1h off; it's running in a Docker container that has a different timezone at the moment.
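The `Caused by` in the stack trace above is the JDK refusing to read a `FileInputStream` that was already closed, which is what happens when an HTTP retry reuses a request entity whose underlying stream was consumed and closed on the first attempt. A minimal sketch of just that JDK failure mode (hypothetical temp file; this is not ManifoldCF code, only the behavior that produces this exact message):

```java
import java.io.*;

public class StreamClosedSketch {
    public static void main(String[] args) throws Exception {
        File doc = File.createTempFile("doc", ".txt"); // stand-in for a document body
        doc.deleteOnExit();

        InputStream in = new FileInputStream(doc);
        in.close(); // the first send attempt consumed and closed the stream

        try {
            in.read(); // a retry that reuses the same stream fails here
        } catch (IOException e) {
            System.out.println(e.getMessage()); // prints: Stream Closed
        }
    }
}
```

This matches the log's "IO exception: Stream Closed": the exception is about the already-closed local stream, not about the Elasticsearch connection itself.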
[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723782#comment-16723782 ] Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 7:48 AM: - [~kwri...@metacarta.com] If we update to ManifoldCF 2.12, can we then use the seed map as we originally intended? So we create a job with X seeds, ES output, web input, and hop count 0 for links and redirects:
# Put X seeds in the seed map
# Run the job
# X documents get pushed to ES
# Update the job to have X minus 20 seeds, wait till the scheduled time
# Run the job
# 20 documents get deleted from ES
# X minus 20 documents get updated
# Wait till the scheduled time
# ...
Will it work like this?

was (Author: steenti): If we update to ManifoldCF 2.12, can we then use the seed map as we originally intended? So we create a job with X seeds, ES output, web input, and hop count 0 for links and redirects: # Put X seeds in the seed map # Run the job # X documents get pushed to ES # Update the job to have X minus 20 seeds, wait till the scheduled time # Run the job # 20 documents get deleted from ES # X minus 20 documents get updated # Wait till the scheduled time # ... Will it work like this?
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723782#comment-16723782 ] Tim Steenbeke commented on CONNECTORS-1562: --- If we update to ManifoldCF 2.12, can we then use the seed map as we originally intended? So we create a job with X seeds, ES output, web input, and hop count 0 for links and redirects:
# Put X seeds in the seed map
# Run the job
# X documents get pushed to ES
# Update the job to have X minus 20 seeds, wait till the scheduled time
# Run the job
# 20 documents get deleted from ES
# X minus 20 documents get updated
# Wait till the scheduled time
# ...
Will it work like this?
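The expectation in the steps above can be stated as a set difference: with hop count 0, the reachable set is exactly the seed list, so a cleanup pass should delete precisely the seeds that were dropped. A sketch with hypothetical URLs standing in for X and X minus 20 (this models the behavior being asked about, not ManifoldCF internals):

```java
import java.util.*;

public class SeedCleanupSketch {
    public static void main(String[] args) {
        // Hypothetical seed lists: the job before and after the update.
        Set<String> previousSeeds = new HashSet<>(Arrays.asList("a", "b", "c", "d"));
        Set<String> updatedSeeds  = new HashSet<>(Arrays.asList("a", "b"));

        // Documents that should disappear from the ES index on the next run:
        // everything that was indexed but is no longer a seed.
        Set<String> toDelete = new HashSet<>(previousSeeds);
        toDelete.removeAll(updatedSeeds);

        System.out.println(toDelete); // the dropped seeds, c and d
    }
}
```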
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724016#comment-16724016 ] Tim Steenbeke commented on CONNECTORS-1562: --- There is no regex; there is no possibility to make a regex for this. That's the issue with creating the exclude/blacklist. 'Started acting strange' means it stopped working and crashed. But that is not the question. Please answer my question: is this the way we have to run the job?
{panel}
_*Tim Steenbeke added a comment*_
We create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop count 1 for links and 0 for redirects:
# Run the job
# +-29000 documents get pushed to ES
# The sitemap gets updated (e.g. 29000 URLs become 28990 URLs)
# Wait till the scheduled time
# Run the job
# Documents get added/deleted (e.g. 10 documents deleted)
# Wait till the scheduled time
# ...
{panel}
[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723981#comment-16723981 ] Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 11:52 AM: -- [~kwri...@metacarta.com] - So then with the seed map URL: we create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop count 1 for links and 0 for redirects:
# Run the job
# +-29000 documents get pushed to ES
# The sitemap gets updated (e.g. 29000 URLs become 28990 URLs)
# Wait till the scheduled time
# Run the job
# Documents get added/deleted (e.g. 10 documents deleted)
# Wait till the scheduled time
# ...
Last time we tried this, ManifoldCF started acting strange because of the number of URLs/links in the sitemap URL (sitemap URL: [https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en=true]) (blacklist URLs: [https://www.uantwerpen.be/admin/system/sitemap/sitemap_revokes.aspx?lang=en=true])

was (Author: steenti): [~kwri...@metacarta.com] - So then with the seed map URL: we create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop count 1 for links and 0 for redirects: # Run the job # +-29000 documents get pushed to ES # The sitemap gets updated (e.g. 29000 URLs become 28990 URLs) # Wait till the scheduled time # Run the job # Documents get added/deleted (e.g. 10 documents deleted) # Wait till the scheduled time # ... Last time we tried this, ManifoldCF started acting strange because of the number of URLs/links in the sitemap URL (sitemap URL: [https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en=true])
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723981#comment-16723981 ] Tim Steenbeke commented on CONNECTORS-1562: --- [~kwri...@metacarta.com] - So then with the seed map URL: we create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop count 1 for links and 0 for redirects:
# Run the job
# +-29000 documents get pushed to ES
# The sitemap gets updated (e.g. 29000 URLs become 28990 URLs)
# Wait till the scheduled time
# Run the job
# Documents get added/deleted (e.g. 10 documents deleted)
# Wait till the scheduled time
# ...
Last time we tried this, ManifoldCF started acting strange because of the number of URLs/links in the sitemap URL (sitemap URL: [https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en=true])
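The hop-count setting in the steps above ("1 for links") means only pages at most one link away from the seed stay reachable. A sketch of that reachability semantics over a hypothetical link graph (a simplified model, not ManifoldCF's actual hopcount implementation):

```java
import java.util.*;

public class HopCountSketch {
    // Hypothetical link graph: page -> pages it links to.
    static final Map<String, List<String>> LINKS = new HashMap<>();
    static {
        LINKS.put("sitemap", Arrays.asList("page1", "page2"));
        LINKS.put("page1", Arrays.asList("page3"));
    }

    // Breadth-first expansion from the seed, limited to maxHops link hops.
    static Set<String> reachable(String seed, int maxHops) {
        Set<String> seen = new HashSet<>(Collections.singleton(seed));
        List<String> frontier = Collections.singletonList(seed);
        for (int hop = 0; hop < maxHops; hop++) {
            List<String> next = new ArrayList<>();
            for (String url : frontier)
                for (String out : LINKS.getOrDefault(url, Collections.<String>emptyList()))
                    if (seen.add(out)) next.add(out);
            frontier = next;
        }
        return seen;
    }

    public static void main(String[] args) {
        // With link hop count 1, page3 (two hops from the seed) is unreachable
        // and should be a deletion candidate on a cleanup pass.
        System.out.println(reachable("sitemap", 1)); // sitemap, page1, page2
    }
}
```

With the sitemap itself as the single seed, the question in this thread is whether documents that fall out of this reachable set after a sitemap update are deleted on the next non-continuous run.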
[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724016#comment-16724016 ] Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 12:47 PM: -- There is no regex; there is no possibility to make a regex for this. That's the issue with creating the exclude/blacklist. There is already a regex being used for images, documents, and other files that don't have to be crawled. 'Started acting strange' means it stopped working and crashed: because of the number of URLs it stopped indexing; no messages or errors were given, the job just stopped working. But that is not the question. Please answer my question: is this the way we have to run the job?
{panel}
_*Tim Steenbeke added a comment*_
We create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop count 1 for links and 0 for redirects:
# Run the job
# +-29000 documents get pushed to ES
# The sitemap gets updated (e.g. 29000 URLs become 28990 URLs)
# Wait till the scheduled time
# Run the job
# Documents get added/deleted (e.g. 10 documents deleted)
# Wait till the scheduled time
# ...
{panel}

was (Author: steenti): There is no regex; there is no possibility to make a regex for this. That's the issue with creating the exclude/blacklist. 'Started acting strange' means it stopped working and crashed. But that is not the question. Please answer my question: is this the way we have to run the job? {panel} _*Tim Steenbeke added a comment*_ We create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop count 1 for links and 0 for redirects: # Run the job # +-29000 documents get pushed to ES # The sitemap gets updated (e.g. 29000 URLs become 28990 URLs) # Wait till the scheduled time # Run the job # Documents get added/deleted (e.g. 10 documents deleted) # Wait till the scheduled time # ...{panel}
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: (was: 30URLSeeds.png)
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: (was: Screenshot from 2018-12-10 14-07-46.png)
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: (was: 3URLSeed.png)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716761#comment-16716761 ] Tim Steenbeke commented on CONNECTORS-1562: --- [~kwri...@metacarta.com] I have a URL with the full sitemap that has to be crawled ~^(and a full exclude sitemap)^~. If I use this URL as the seed, do I have to set the hop filters to a specific value (e.g. redirect: 0 and link: 1)? If one or multiple links are deleted from this sitemap, will the documents be deleted from ES? How should I set up the job to keep only the crawled sites in the sitemap?
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717023#comment-16717023 ] Tim Steenbeke commented on CONNECTORS-1562: --- Due to customer requirements and sources, the best option is to work with a list of seeds created from the sitemap. Whenever there is an update and a seed is removed, it should be removed from Elastic. Therefore, is it possible to reopen the issue and test how and why documents that aren't in the seed list anymore don't get deleted?
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709712#comment-16709712 ] Tim Steenbeke commented on CONNECTORS-1562: --- The crawling is scheduled as a dynamic rescan of the documents. !Screenshot from 2018-12-05 09-01-46.png!
[jira] [Updated] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: Screenshot from 2018-12-05 09-01-46.png
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709986#comment-16709986 ] Tim Steenbeke commented on CONNECTORS-1562: --- The documentation states: {code:java} A typical non-continuous run of a job has the following stages of execution: Adding the job's new, changed, or deleted starting points to the queue ("seeding") Fetching documents, discovering new documents, and detecting deletions Removing no-longer-included documents from the queue Jobs can also be run "continuously", which means that the job never completes, unless it is aborted. A continuous run has different stages of execution: Adding the job's new, changed, or deleted starting points to the queue ("seeding") Fetching documents, discovering new documents, and detecting deletions, while reseeding periodically Note that continuous jobs cannot remove no-longer-included documents from the queue. They can only remove documents that have been deleted from the repository.{code} Both should detect deletions, but only a non-continuous run should delete the unreachable documents. Knowing this, I changed the job to a non-continuous job that starts every 5 minutes for testing.
Even when the job is non-continuous, it doesn't delete the unreachable documents; it keeps all documents indexed in Elasticsearch.
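The documented behavior can be sketched as a toy model (plain Python, not ManifoldCF code; the function and data structures are illustrative only): a non-continuous run seeds, fetches up to the hop limit, and then prunes everything the crawl could no longer reach.

```python
# Toy model of the documented non-continuous run (not ManifoldCF code):
# seed -> fetch/discover -> remove no-longer-included documents.
def run_job(seeds, links, index, max_hops=0):
    """Crawl from seeds up to max_hops, then delete unreachable docs."""
    reachable = set(seeds)
    frontier = set(seeds)
    for _ in range(max_hops):
        frontier = {t for u in frontier for t in links.get(u, [])} - reachable
        reachable |= frontier
    index |= reachable   # fetch/ingest stage: reachable docs are indexed
    index &= reachable   # cleanup stage: a non-continuous run prunes the rest
    return index

index = run_job({"a", "b", "c"}, {}, set())   # first run indexes 3 docs
index = run_job({"a"}, {}, index)             # rerun with fewer seeds
print(sorted(index))  # ['a'] -- 'b' and 'c' should have been deleted
```

That final pruning step is exactly what the bug report says is not happening against the Elasticsearch index.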
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ] Tim Steenbeke commented on CONNECTORS-1562: --- Hi [~kwri...@metacarta.com], So I set up a job as you explained above. The scheduler now worked fine, even with multiple values. I tested the same with the ES output connector, and it also started at the scheduled time. So it seems there was an issue in the import of the job schedule, which has now been resolved. Next, I edited the seeds, deleted some links, and let the job run on schedule again. There were 0 deletions, and the Simple History also showed 0 deletion messages. (Also on the null output, but this is probably normal because it's Null.)
[jira] [Comment Edited] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ] Tim Steenbeke edited comment on CONNECTORS-1562 at 12/10/18 8:55 AM: - Hi [~kwri...@metacarta.com], So I set up a job as you explained above. The scheduler now worked fine, even with multiple values. I tested the same with the ES output connector, and it also started at the scheduled time. So it seems there was an issue in the import of the job schedule, which has now been resolved. Next, I edited the seeds, deleted some links, and let the job run on schedule again. There were 0 deletions, and the Simple History also showed 0 deletion messages. Also in the Document Status for the job there were no deletions registered. (Also on the null output, but this is probably normal because it's Null.)
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714610#comment-16714610 ] Tim Steenbeke commented on CONNECTORS-1562: --- ManifoldCF doesn't delete documents it should delete. You quote the text where I say there were no deletions and then ask me if there were any? (On a side note: it did, however, just delete documents that shouldn't have been indexed in the first place, documents that were added to ES but weren't in scope in the original run.)
[jira] [Comment Edited] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714610#comment-16714610 ] Tim Steenbeke edited comment on CONNECTORS-1562 at 12/10/18 11:42 AM: -- ManifoldCF doesn't delete documents it should delete. You quote the text where I say there were no deletions and then ask me if there were any? (On a side note: it did, however, just delete 3 documents and not 10.)
[jira] [Issue Comment Deleted] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Comment: was deleted (was: Manifold doesn't delete documents it should delete. you quote the text where i say there were no deletions and than ask me if there were any ? ( on a site-note: It did however just deleted 3 documents and not 10 so it partially worked))
[jira] [Comment Edited] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714610#comment-16714610 ] Tim Steenbeke edited comment on CONNECTORS-1562 at 12/10/18 11:49 AM: -- ManifoldCF doesn't delete documents it should delete. You quote the text where I say there were no deletions and then ask me if there were any? (On a side note: it did, however, just delete 3 documents and not 10, so it partially worked.)
[jira] [Updated] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: 30URLSeeds.png
[jira] [Updated] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: 3URLSeed.png
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714692#comment-16714692 ] Tim Steenbeke commented on CONNECTORS-1562: --- # I created a job with a Null output connector # put 30 URLs as seeds # set the hop filter to 0 so no links or redirects will be checked # ran the job. Check Simple History: all the documents get fetched and processed (unless RESPONSECODENOTINDEXABLE) # I edited the job # deleted all but 3 URLs, so the seeds are now just 3 URLs # ran the job. Check Simple History: all documents still get fetched even though they aren't in the seeds anymore; no document gets deleted and the job ends !30URLSeeds.png! !3URLSeed.png! !Screenshot from 2018-12-10 14-07-46.png!
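The expectation behind these steps can be written out explicitly (a sketch with hypothetical URLs; only the counts match the repro above): with a hop count of 0 nothing is discovered beyond the seeds themselves, so shrinking the seed list from 30 entries to 3 should produce exactly 27 deletions.

```python
# Hypothetical seed lists mirroring the 30-seed run and the 3-seed rerun.
original_seeds = {f"https://example.com/page{i}" for i in range(30)}
reduced_seeds = {f"https://example.com/page{i}" for i in range(3)}

# With hop count 0, each seed maps to exactly one indexed document, so the
# documents expected to become unreachable are precisely the removed seeds.
expected_deletions = original_seeds - reduced_seeds
print(len(expected_deletions))  # 27
```

The Simple History in the screenshots shows 0 deletions instead of the 27 this arithmetic predicts.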
[jira] [Updated] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: Screenshot from 2018-12-10 14-07-46.png
[jira] [Updated] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: (was: Screenshot from 2018-12-05 09-01-46.png)
[jira] [Created] (CONNECTORS-1562) Document removal Elastic
Tim Steenbeke created CONNECTORS-1562: - Summary: Document removal Elastic Key: CONNECTORS-1562 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 Project: ManifoldCF Issue Type: Bug Components: Elastic Search connector, Web connector Affects Versions: ManifoldCF 2.11 Environment: ManifoldCF 2.11 Elasticsearch 6.3.2 Web input connector Elastic output connector Job crawls website input and outputs content to Elastic Reporter: Tim Steenbeke My documents aren't removed from the Elasticsearch index after rerunning the changed seeds. I update my job to change the seed map and rerun it, or use the scheduler to keep it running even after updating it. After the rerun, the unreachable documents don't get deleted. It only adds documents when they can be reached.
[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1567: -- Attachment: bandwidth_test_abc.png > export of web connection bandwidth throttling > - > > Key: CONNECTORS-1567 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1567 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12 >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > Attachments: bandwidth.png, bandwidth_test_abc.png > > > When exporting the web connector using the API, it doesn't export the > bandwidth throttling. > Then, when importing this connector into a clean ManifoldCF, it creates the > connector with the default bandwidth settings. > When using the connector in a job it works properly. > The issue here is that the connector isn't created with the correct bandwidth > throttling. > And the connector causes issues in the UI when trying to view or edit it. 
> (related to issue: > [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568]) > e.g.: > {code:java} > { > "name": "test_web", > "configuration": null, > "_PARAMETER_": [ > { > "_attribute_name": "Email address", > "_value_": "tim.steenbeke@formica.digital" > }, > { > "_attribute_name": "Robots usage", > "_value_": "all" > }, > { > "_attribute_name": "Meta robots tags usage", > "_value_": "all" > }, > { > "_attribute_name": "Proxy host", > "_value_": "" > }, > { > "_attribute_name": "Proxy port", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication domain", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication user name", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication password", > "_value_": "" > } > ] > }, > "description": "Website repository standard settup", > "throttle": null, > "max_connections": 10, > "class_name": > "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", > "acl_authority": null > }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
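The reported symptom can be checked mechanically on the exported JSON (a sketch; the trimmed document below is hypothetical, with field names taken from the export above): a faithful export would carry the throttling data, but here both the top-level throttle object and the per-bin bandwidth description are absent.

```python
import json

# Trimmed, hypothetical version of the exported connection shown above.
exported = json.loads('''
{
  "name": "test_web",
  "configuration": {"_PARAMETER_": []},
  "throttle": null,
  "max_connections": 10
}
''')

# Bandwidth throttling is configured per bin ("bindesc") inside the
# configuration; generic throttling lives in the top-level "throttle" object.
has_bandwidth = "bindesc" in exported["configuration"]
has_throttle = exported.get("throttle") is not None
print(has_bandwidth, has_throttle)  # False False: throttling lost on export
```

Importing such a document into a clean instance would therefore create the connector with no throttling at all, which matches the behavior described.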
[jira] [Comment Edited] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737909#comment-16737909 ] Tim Steenbeke edited comment on CONNECTORS-1567 at 1/9/19 7:05 AM: --- But are bandwidth throttles and throttling the same thing in ManifoldCF? The bandwidth throttle is a different object in the response JSON, or am I mistaken? Also, I don't understand what you mean by old-form; the example is the response from a 'repositoryconnections' GET call on ManifoldCF 2.11. The documentation also only mentions throttling, not bandwidth, for both 2.11 and 2.12. ([JSON repository connection objects 2.12|https://manifoldcf.apache.org/release/release-2.12/en_US/programmatic-operation.html#Repository+connection+objects]) *Response for curl -X GET [http://localhost:8345/mcf-api-service/json/repositoryconnections] -H 'content-type: application/json':* {code:java} { "throttle": { "match_description": "testable regex", "rate": "1.666E-4", "match": "test reg" }, "max_connections": "20", "configuration": { "trust": { "_attribute_trusteverything": "true", "_value_": "", "_attribute_urlregexp": ".*" }, "bindesc": { "maxkbpersecond": { "_value_": "", "_attribute_value": "64" }, "_attribute_caseinsensitive": "false", "maxconnections": { "_value_": "", "_attribute_value": "2" }, "maxfetchesperminute": { "_value_": "", "_attribute_value": "12" }, "_attribute_binregexp": "test regex", "_value_": "" }, "_PARAMETER_": [ { "_value_": "tim.steenbeke@formica.digital", "_attribute_name": "Email address" }, { "_value_": "all", "_attribute_name": "Robots usage" }, { "_value_": "all", "_attribute_name": "Meta robots tags usage" }, { "_value_": "proxyhost", "_attribute_name": "Proxy host" }, { "_value_": "port", "_attribute_name": "Proxy port" }, { "_value_": "domain", "_attribute_name": "Proxy authentication domain" }, { "_value_": "admin", "_attribute_name": "Proxy authentication user name" }, { "_value_": 
"5qNuZnChiobQlUozw2quhCGsgYVazxVVbAUjc3Hk5Mc=", "_attribute_name": "Proxy authentication password" } ], "accesscredential": [ { "_value_": "", "_attribute_type": "basic", "_attribute_username": "admin", "_attribute_urlregexp": "some acces creds", "_attribute_password": "RkBMPT2W2ZC7XebgFp5PSuYSdCDnik4GKd130+PtXRk=", "_attribute_domain": "localhost:8080" }, { "_value_": "", "_attribute_type": "session", "_attribute_urlregexp": "url regex" } ] }, "name": "abc_test", "description": "test abc", "isnew": "false", "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector" } {code} *For the following bandwidth setup:* !bandwidth_test_abc.png! *So then I would do the following to set bandwidth and throttling to null:* {code:java} { "throttle": null,<<<--- null for throttling "max_connections": "20", "configuration": { "trust": { "_attribute_trusteverything": "true", "_value_": "", "_attribute_urlregexp": ".*" }, "bindesc": null,<<<--- null for bandwidth "_PARAMETER_": [ { "_value_":
[jira] [Comment Edited] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737909#comment-16737909 ] Tim Steenbeke edited comment on CONNECTORS-1567 at 1/9/19 7:06 AM: --- But are bandwidth throttles and throttling the same thing for ManifoldCF? The bandwidth throttle is a different object in the response JSON, or am I mistaken? Also, I don't understand what you mean by "old form": the example is the response from a 'repositoryconnections' GET call on ManifoldCF 2.11. The documentation also only mentions throttling, not bandwidth, for both 2.11 and 2.12: [JSON repository connection objects|https://manifoldcf.apache.org/release/release-2.12/en_US/programmatic-operation.html#Repository+connection+objects]

*Response for curl -X GET http://localhost:8345/mcf-api-service/json/repositoryconnections -H 'content-type: application/json'*
{code:java}
{
  "throttle": {
    "match_description": "testable regex",
    "rate": "1.666E-4",
    "match": "test reg"
  },
  "max_connections": "20",
  "configuration": {
    "trust": {
      "_attribute_trusteverything": "true",
      "_value_": "",
      "_attribute_urlregexp": ".*"
    },
    "bindesc": {
      "maxkbpersecond": { "_value_": "", "_attribute_value": "64" },
      "_attribute_caseinsensitive": "false",
      "maxconnections": { "_value_": "", "_attribute_value": "2" },
      "maxfetchesperminute": { "_value_": "", "_attribute_value": "12" },
      "_attribute_binregexp": "test regex",
      "_value_": ""
    },
    "_PARAMETER_": [
      { "_value_": "tim.steenbeke@formica.digital", "_attribute_name": "Email address" },
      { "_value_": "all", "_attribute_name": "Robots usage" },
      { "_value_": "all", "_attribute_name": "Meta robots tags usage" },
      { "_value_": "proxyhost", "_attribute_name": "Proxy host" },
      { "_value_": "port", "_attribute_name": "Proxy port" },
      { "_value_": "domain", "_attribute_name": "Proxy authentication domain" },
      { "_value_": "admin", "_attribute_name": "Proxy authentication user name" },
      { "_value_": "5qNuZnChiobQlUozw2quhCGsgYVazxVVbAUjc3Hk5Mc=", "_attribute_name": "Proxy authentication password" }
    ],
    "accesscredential": [
      { "_value_": "", "_attribute_type": "basic", "_attribute_username": "admin", "_attribute_urlregexp": "some acces creds", "_attribute_password": "RkBMPT2W2ZC7XebgFp5PSuYSdCDnik4GKd130+PtXRk=", "_attribute_domain": "localhost:8080" },
      { "_value_": "", "_attribute_type": "session", "_attribute_urlregexp": "url regex" }
    ]
  },
  "name": "abc_test",
  "description": "test abc",
  "isnew": "false",
  "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector"
}
{code}
*For the following bandwidth setup:* !bandwidth_test_abc.png!

*So then I would do the following to set bandwidth and throttling to null:*
{code:java}
{
  "throttle": null,   <<<--- null for throttling
  "max_connections": "20",
  "configuration": {
    "trust": { "_attribute_trusteverything": "true", "_value_": "", "_attribute_urlregexp": ".*" },
    "bindesc": null,   <<<--- null for bandwidth
    "_PARAMETER_": [
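The edit proposed above (null out "throttle" and "bindesc", then write the object back) can be sketched offline. This is a minimal illustration, not ManifoldCF client code: `strip_throttling` is a hypothetical helper, and the field names are simply taken from the GET response shown in this comment.

```python
import json

def strip_throttling(connection):
    """Return a deep copy of a repository-connection object with the
    connection-level "throttle" nulled out, and the web connector's
    bandwidth bin ("bindesc") nulled out if present."""
    conn = json.loads(json.dumps(connection))  # cheap deep copy
    conn["throttle"] = None
    config = conn.get("configuration")
    if isinstance(config, dict) and "bindesc" in config:
        config["bindesc"] = None
    return conn

# trimmed-down version of the exported connection above
exported = {
    "throttle": {"rate": "1.666E-4", "match": "test reg"},
    "max_connections": "20",
    "configuration": {"trust": {}, "bindesc": {"_attribute_binregexp": "test regex"}},
}
cleaned = strip_throttling(exported)
```

The cleaned object would then be sent back to the same mcf-api-service endpoint with a write call (the verb and exact path depend on the API version, so they are left out here).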
[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1567: -- Attachment: bandwidth.png > export of web connection bandwidth throttling > - > > Key: CONNECTORS-1567 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1567 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12 >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > Attachments: bandwidth.png > > > When exporting the web connector using the API, it doesn't export the > bandwidth throttling. > Then when importing this connector into a clean ManifoldCF, it creates the > connector with default bandwidth settings. > When using the connector in a job it works properly. > The issue here is that the connector isn't created with the correct bandwidth > throttling, and the connector causes errors in the UI when trying to view or edit it. > (related to issue: > [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568]) > e.g.: > {code:java} > { > "name": "test_web", > "configuration": null, > "_PARAMETER_": [ > { > "_attribute_name": "Email address", > "_value_": "tim.steenbeke@formica.digital" > }, > { > "_attribute_name": "Robots usage", > "_value_": "all" > }, > { > "_attribute_name": "Meta robots tags usage", > "_value_": "all" > }, > { > "_attribute_name": "Proxy host", > "_value_": "" > }, > { > "_attribute_name": "Proxy port", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication domain", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication user name", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication password", > "_value_": "" > } > ] > }, > "description": "Website repository standard settup", > "throttle": null, > "max_connections": 10, > "class_name": > "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", > "acl_authority": null > }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735659#comment-16735659 ] Tim Steenbeke commented on CONNECTORS-1562: --- We created everything via the UI and then extracted the full setup. Then, after deleting everything, we used the API to POST it back to ManifoldCF. That is when I get this error, but only with web repository connections. I also noticed that the bandwidth is not exported to the connector JSON. > Documents unreachable due to hopcount are not considered unreachable on > cleanup pass > > > Key: CONNECTORS-1562 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 > Project: ManifoldCF > Issue Type: Bug > Components: Elastic Search connector, Web connector >Affects Versions: ManifoldCF 2.11 > Environment: ManifoldCF 2.11 > Elasticsearch 6.3.2 > Web input connector > Elastic output connector > Job crawls website input and outputs content to Elasticsearch >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Critical > Labels: starter > Fix For: ManifoldCF 2.12 > > Attachments: Screenshot from 2018-12-31 11-17-29.png, > manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced > > Original Estimate: 4h > Remaining Estimate: 4h > > My documents aren't removed from the Elasticsearch index after rerunning the > changed seeds. > I update my job to change the seed map and rerun it, or use the scheduler to > keep it running even after updating it. > After the rerun the unreachable documents don't get deleted. > It only adds documents when they can be reached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737170#comment-16737170 ] Tim Steenbeke commented on CONNECTORS-1562: --- But if the hop-count is 2, then it will go too far beyond the sitemap and will add documents that aren't supposed to be indexed. Because the sitemap is the full whitelist, we set the hop-count to 1 so it doesn't hop from a whitelisted URL to a possibly non-whitelisted URL. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
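The hop-count behaviour described in this comment amounts to a depth-limited breadth-first crawl. The following is a toy model of that semantics, not the web connector's actual implementation; the link graph and URL names are invented for illustration.

```python
from collections import deque

def crawl_within_hops(seeds, links, max_hops):
    """Reach every URL within max_hops link hops of a seed.
    'links' maps a URL to the URLs it links to."""
    reached = set(seeds)
    frontier = deque((url, 0) for url in seeds)
    while frontier:
        url, hops = frontier.popleft()
        if hops == max_hops:
            continue  # don't expand past the hop budget
        for target in links.get(url, []):
            if target not in reached:
                reached.add(target)
                frontier.append((target, hops + 1))
    return reached

# seed = the sitemap; hop-count 1 stays on pages linked directly
# from the seed, while hop-count 2 would also pull in pages linked
# from those pages (possibly outside the whitelist)
links = {"sitemap": ["whitelisted-a", "whitelisted-b"],
         "whitelisted-a": ["non-whitelisted"]}
```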
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735590#comment-16735590 ] Tim Steenbeke commented on CONNECTORS-1562: --- I have the throttle on null and max_connections on 10, which was the standard setting. I'm also getting an error when I try to open my web output connector; all other connectors and job editing work. I'm building the ManifoldCF connectors and jobs using the API.

*HTTP ERROR 500* Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason: Server Error

*Caused by:*
{code:java}
org.apache.jasper.JasperException: An exception occurred processing JSP page /editconnection.jsp at line 564

561:
562: if (className.length() > 0)
563: {
564: RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName);
565: }
566: %>
567:

Stacktrace:
at org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:497)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
at org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86)
at org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866)
at org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83)
at org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155)
at org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:388)
... 23 more
{code}
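The NullPointerException above comes from Base64.decodeString being handed a null certificate-store string while the edit page builds the Certificates tab; a later comment on CONNECTORS-1568 traces this to a missing "trust" JSON object in the imported configuration. A client-side pre-import check like the following (a hypothetical helper, not part of ManifoldCF) could flag such connection objects before they are POSTed back:

```python
def missing_trust(connection):
    """True when a web repository-connection object has no usable
    "trust" node in its configuration -- the shape that still
    crawled fine but crashed the edit UI."""
    config = connection.get("configuration")
    if not isinstance(config, dict):
        return True
    return not isinstance(config.get("trust"), dict)

# the broken import in this ticket had "configuration": null
broken = {"name": "test_web", "configuration": None}
ok = {"name": "abc_test",
      "configuration": {"trust": {"_attribute_trusteverything": "true"}}}
```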
[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1567: -- Description: When exporting the web connector using the API, it doesn't export the bandwidth throttling. Then when importing this connector into a clean ManifoldCF, it creates the connector with default bandwidth settings. When using the connector in a job it works properly. The issue here is that the connector isn't created with the correct bandwidth throttling, and the connector causes errors in the UI when trying to view or edit it. e.g.: {code:java} { "name": "test_web", "configuration": null, "_PARAMETER_": [ { "_attribute_name": "Email address", "_value_": "tim.steenbeke@formica.digital" }, { "_attribute_name": "Robots usage", "_value_": "all" }, { "_attribute_name": "Meta robots tags usage", "_value_": "all" }, { "_attribute_name": "Proxy host", "_value_": "" }, { "_attribute_name": "Proxy port", "_value_": "" }, { "_attribute_name": "Proxy authentication domain", "_value_": "" }, { "_attribute_name": "Proxy authentication user name", "_value_": "" }, { "_attribute_name": "Proxy authentication password", "_value_": "" } ] }, "description": "Website repository standard settup", "throttle": null, "max_connections": 10, "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", "acl_authority": null }{code} was: When exporting the web connector using the API, it doesn't export the bandwidth throttling. This then also doesn't create a connector with bandwidth throttling, which causes errors in the UI when trying to view or edit. When using this connector in a job it will use the default bandwidth. e.g.: {code:java} { "name": "test_web", "configuration": null, "_PARAMETER_": [ { "_attribute_name": "Email address", "_value_": "tim.steenbeke@formica.digital" }, { "_attribute_name": "Robots usage", "_value_": "all" }, { "_attribute_name": "Meta robots tags usage", "_value_": "all" }, { "_attribute_name": "Proxy host", "_value_": "" }, { "_attribute_name": "Proxy port", "_value_": "" }, { "_attribute_name": "Proxy authentication domain", "_value_": "" }, { "_attribute_name": "Proxy authentication user name", "_value_": "" }, { "_attribute_name": "Proxy authentication password", "_value_": "" } ] }, "description": "Website repository standard settup", "throttle": null, "max_connections": 10, "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", "acl_authority": null }{code}
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736873#comment-16736873 ] Tim Steenbeke commented on CONNECTORS-1562: --- OK, I will create a new ticket for both issues. Testing the connector with the UI, with bandwidth disabled and max connections set to 20, we were able to crawl all sites. Now I'm still stuck with the deletion issue: if a site is removed from the sitemap, will it be removed from Elasticsearch, since it is no longer reachable? We put the sitemap (the big one) as seed, and if one or a few URLs get removed, they should get removed from Elasticsearch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
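The cleanup semantics being asked for in this comment reduce to a set difference between crawl runs. The sketch below states that expectation only; it is not how ManifoldCF tracks documents internally, and the URL values are made up.

```python
def urls_to_delete(previous_sitemap, current_sitemap):
    """URLs that were seeded on the last run but are gone from the
    current sitemap; these are the documents that should disappear
    from the Elasticsearch index on the next cleanup pass."""
    return set(previous_sitemap) - set(current_sitemap)

# one URL dropped from the sitemap between runs
stale = urls_to_delete(["a", "b", "c"], ["a", "c"])  # {"b"}
```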
[jira] [Commented] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738159#comment-16738159 ] Tim Steenbeke commented on CONNECTORS-1567: --- Same problem as CONNECTORS-1568; we found the issue after debugging and fixed it. Thank you for the help. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1568) UI error imported web connection
[ https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738153#comment-16738153 ] Tim Steenbeke commented on CONNECTORS-1568: --- While debugging the project we found a missing JSON object for trust; this broke the UI, but the connector itself still worked. We have now fixed the bug, so thank you for the help. > UI error imported web connection > > > Key: CONNECTORS-1568 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1568 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12 >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > > Using the ManifoldCF API, we export a web repository connector with basic > settings. > Then we import the web connector using the ManifoldCF API. > The connector gets imported and can be used in a job. > When trying to view or edit the connector in the UI, the following error pops up. > (connected to issue: > [CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567]) > *HTTP ERROR 500* > Problem accessing /mcf-crawler-ui/editconnection.jsp. 
Reason: > Server Error > *Caused by:* > {code:java} > org.apache.jasper.JasperException: An exception occurred processing JSP page > /editconnection.jsp at line 564 > 561: > 562: if (className.length() > 0) > 563: { > 564: > RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new > > org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName); > 565: } > 566: %> > 567: > Stacktrace: > at > org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521) > at > org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430) > at > org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313) > at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769) > at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > at > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577) > at > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125) > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) > at > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) > at > org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) > at 
org.eclipse.jetty.server.Server.handle(Server.java:497) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311) > at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248) > at > org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.NullPointerException > at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164) > at > org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.(KeystoreManager.java:86) > at > org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47) > at > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701) > at > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866) > at > org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83) > at > org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155) > at >
[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1567: -- Attachment: (was: bandwidth.png) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738090#comment-16738090 ] Tim Steenbeke commented on CONNECTORS-1567: --- Setting _"binddesc"_ to null doesn't seem to help. The process you describe is how we do it: # Make the connector in the UI # Test the connector # Extract the connector # Clean ManifoldCF # Import the connector # Test the connector So the output should then be in the new format, because we use 2.11. > export of web connection bandwidth throttling > - > > Key: CONNECTORS-1567 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1567 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12 >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > Attachments: bandwidth_test_abc.png > > > When exporting the web connector using the API, it doesn't export the > bandwidth throttling. > Then, when importing this connector into a clean ManifoldCF, it creates the > connector with the default bandwidth. > When using the connector in a job it works properly. > The issue here is that the connector isn't created with the correct bandwidth > throttling. > The connector also gives errors in the UI when trying to view or edit it. 
> (related to issue: > [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568]) > e.g.: > {code:java} > { > "name": "test_web", > "configuration": null, > "_PARAMETER_": [ > { > "_attribute_name": "Email address", > "_value_": "tim.steenbeke@formica.digital" > }, > { > "_attribute_name": "Robots usage", > "_value_": "all" > }, > { > "_attribute_name": "Meta robots tags usage", > "_value_": "all" > }, > { > "_attribute_name": "Proxy host", > "_value_": "" > }, > { > "_attribute_name": "Proxy port", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication domain", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication user name", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication password", > "_value_": "" > } > ] > }, > "description": "Website repository standard settup", > "throttle": null, > "max_connections": 10, > "class_name": > "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", > "acl_authority": null > }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732056#comment-16732056 ] Tim Steenbeke commented on CONNECTORS-1562: --- Yes, the seed document used as the sitemap contains approximately 23,000+ URLs ([https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en=true]). Curl fetches it completely, but it takes some time. > Documents unreachable due to hopcount are not considered unreachable on > cleanup pass > > > Key: CONNECTORS-1562 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 > Project: ManifoldCF > Issue Type: Bug > Components: Elastic Search connector, Web connector >Affects Versions: ManifoldCF 2.11 > Environment: ManifoldCF 2.11 > Elasticsearch 6.3.2 > Web input connector > Elastic output connector > Job crawls website input and outputs content to Elastic >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Critical > Labels: starter > Fix For: ManifoldCF 2.12 > > Attachments: Screenshot from 2018-12-31 11-17-29.png, > manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced > > Original Estimate: 4h > Remaining Estimate: 4h > > My documents aren't removed from the ElasticSearch index after rerunning the > changed seeds. > I update my job to change the seed map and rerun it, or use the scheduler to > keep it running even after updating it. > After the rerun, the unreachable documents don't get deleted. > It only adds documents when they can be reached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (CONNECTORS-1568) UI error imported web connection
[ https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1568: -- Description: Using the ManifoldCF API, we export a web repository connector with basic settings. Then we import the web connector using the ManifoldCF API. The connector gets imported and can be used in a job. When trying to view or edit the connector in the UI, the following error pops up. (connected to issue: [CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567]) *HTTP ERROR 500* Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason: Server Error *Caused by:* {code:java} org.apache.jasper.JasperException: An exception occurred processing JSP page /editconnection.jsp at line 564 561: 562: if (className.length() > 0) 563: { 564: RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName); 565: } 566: %> 567: Stacktrace: at org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430) at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313) at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125) at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.eclipse.jetty.server.Server.handle(Server.java:497) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248) at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NullPointerException at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164) at org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86) at org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866) at org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83) at org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155) at org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916) at 
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:388) ... 23 more {code} *Caused by:* {code:java} java.lang.NullPointerException at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164) at org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86) at org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701) at
[jira] [Created] (CONNECTORS-1567) export of web connection bandwidth throttling
Tim Steenbeke created CONNECTORS-1567: - Summary: export of web connection bandwidth throttling Key: CONNECTORS-1567 URL: https://issues.apache.org/jira/browse/CONNECTORS-1567 Project: ManifoldCF Issue Type: Bug Components: Web connector Affects Versions: ManifoldCF 2.12, ManifoldCF 2.11 Reporter: Tim Steenbeke When exporting the web connector using the API, it doesn't export the bandwidth throttling. Importing therefore doesn't create a connector with bandwidth throttling either, which gives errors in the UI when trying to view or edit it. When using this connector in a job it will use the default bandwidth. e.g.: {code:java} { "name": "test_web", "configuration": null, "_PARAMETER_": [ { "_attribute_name": "Email address", "_value_": "tim.steenbeke@formica.digital" }, { "_attribute_name": "Robots usage", "_value_": "all" }, { "_attribute_name": "Meta robots tags usage", "_value_": "all" }, { "_attribute_name": "Proxy host", "_value_": "" }, { "_attribute_name": "Proxy port", "_value_": "" }, { "_attribute_name": "Proxy authentication domain", "_value_": "" }, { "_attribute_name": "Proxy authentication user name", "_value_": "" }, { "_attribute_name": "Proxy authentication password", "_value_": "" } ] }, "description": "Website repository standard settup", "throttle": null, "max_connections": 10, "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", "acl_authority": null }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (CONNECTORS-1568) UI error imported web connection
Tim Steenbeke created CONNECTORS-1568: - Summary: UI error imported web connection Key: CONNECTORS-1568 URL: https://issues.apache.org/jira/browse/CONNECTORS-1568 Project: ManifoldCF Issue Type: Bug Components: Web connector Affects Versions: ManifoldCF 2.12, ManifoldCF 2.11 Reporter: Tim Steenbeke Using the ManifoldCF API, we export a web repository connector with basic settings. Then we import the web connector using the ManifoldCF API. The connector gets imported and can be used in a job. When trying to view or edit the connector in the UI, the following error pops up. *HTTP ERROR 500* Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason: Server Error *Caused by:* {code:java} org.apache.jasper.JasperException: An exception occurred processing JSP page /editconnection.jsp at line 564 561: 562: if (className.length() > 0) 563: { 564: RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName); 565: } 566: %> 567: Stacktrace: at org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430) at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313) at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125) at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.eclipse.jetty.server.Server.handle(Server.java:497) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248) at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NullPointerException at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164) at org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86) at org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866) at org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83) at org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155) at org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916) at 
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:388) ... 23 more {code} *Caused by:* {code:java} java.lang.NullPointerException at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164) at org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86) at org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
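The root NullPointerException comes from Base64.decodeString being handed a null keystore string: the imported configuration apparently omits the certificate keystore data that the UI expects. The sketch below is hypothetical illustration, not the actual ManifoldCF KeystoreManager code; it shows the kind of defensive guard that avoids the NPE, assuming the constructor receives the keystore as a base64-encoded String.

```java
// Hypothetical guard: treat a missing (null) keystore string as an empty
// keystore instead of passing it straight to the base64 decoder, which
// would throw a NullPointerException as seen in the trace above.
public final class KeystoreGuardSketch {

    static byte[] decodeKeystoreData(String base64Data) {
        if (base64Data == null) {
            // Imported configuration omitted the keystore entirely.
            return new byte[0];
        }
        return java.util.Base64.getDecoder().decode(base64Data);
    }

    public static void main(String[] args) {
        System.out.println(decodeKeystoreData(null).length);       // 0, no NPE
        System.out.println(decodeKeystoreData("aGVsbG8=").length); // 5 ("hello")
    }
}
```

Whether the real fix belongs in KeystoreManager or in the API import path (writing an empty keystore value on import) is a design choice for the maintainers; the guard merely localizes the failure.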
[jira] [Updated] (CONNECTORS-1568) UI error imported web connection
[ https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1568: -- Description: Using the ManifoldCF API, we export a web repository connector with basic settings. Then we import the web connector using the ManifoldCF API. The connector gets imported and can be used in a job. When trying to view or edit the connector in the UI, the following error pops up. (connected to issue: [CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567]) *HTTP ERROR 500* Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason: Server Error *Caused by:* {code:java} org.apache.jasper.JasperException: An exception occurred processing JSP page /editconnection.jsp at line 564 561: 562: if (className.length() > 0) 563: { 564: RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName); 565: } 566: %> 567: Stacktrace: at org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430) at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313) at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125) at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.eclipse.jetty.server.Server.handle(Server.java:497) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248) at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NullPointerException at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164) at org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86) at org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866) at org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83) at org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155) at org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916) at 
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:388) ... 23 more {code} *Caused by:* {code:java} java.lang.NullPointerException at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164) at org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86) at org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701) at
[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1567: -- Description: When exporting the web connector using the API, it doesn't export the bandwidth throttling. Then, when importing this connector into a clean ManifoldCF, it creates the connector with the default bandwidth. When using the connector in a job it works properly. The issue here is that the connector isn't created with the correct bandwidth throttling. The connector also gives errors in the UI when trying to view or edit it. (related to issue: [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568]) e.g.: {code:java} { "name": "test_web", "configuration": null, "_PARAMETER_": [ { "_attribute_name": "Email address", "_value_": "tim.steenbeke@formica.digital" }, { "_attribute_name": "Robots usage", "_value_": "all" }, { "_attribute_name": "Meta robots tags usage", "_value_": "all" }, { "_attribute_name": "Proxy host", "_value_": "" }, { "_attribute_name": "Proxy port", "_value_": "" }, { "_attribute_name": "Proxy authentication domain", "_value_": "" }, { "_attribute_name": "Proxy authentication user name", "_value_": "" }, { "_attribute_name": "Proxy authentication password", "_value_": "" } ] }, "description": "Website repository standard settup", "throttle": null, "max_connections": 10, "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", "acl_authority": null }{code} was: When exporting the web connector using the API, it doesn't export the bandwidth throttling. Then, when importing this connector into a clean ManifoldCF, it creates the connector with the default bandwidth. When using the connector in a job it works properly. The issue here is that the connector isn't created with the correct bandwidth throttling. The connector also gives errors in the UI when trying to view or edit it. 
e.g.: {code:java} { "name": "test_web", "configuration": null, "_PARAMETER_": [ { "_attribute_name": "Email address", "_value_": "tim.steenbeke@formica.digital" }, { "_attribute_name": "Robots usage", "_value_": "all" }, { "_attribute_name": "Meta robots tags usage", "_value_": "all" }, { "_attribute_name": "Proxy host", "_value_": "" }, { "_attribute_name": "Proxy port", "_value_": "" }, { "_attribute_name": "Proxy authentication domain", "_value_": "" }, { "_attribute_name": "Proxy authentication user name", "_value_": "" }, { "_attribute_name": "Proxy authentication password", "_value_": "" } ] }, "description": "Website repository standard settup", "throttle": null, "max_connections": 10, "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", "acl_authority": null }{code} > export of web connection bandwidth throttling > - > > Key: CONNECTORS-1567 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1567 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Major > > When exporting the web connector using the API, it doesn't export the > bandwidth throttling. > Then, when importing this connector into a clean ManifoldCF, it creates the > connector with the default bandwidth. > When using the connector in a job it works properly. > The issue here is that the connector isn't created with the correct bandwidth > throttling. > The connector also gives errors in the UI when trying to view or edit it. 
> (related to issue: > [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568]) > e.g.: > {code:java} > { > "name": "test_web", > "configuration": null, > "_PARAMETER_": [ > { > "_attribute_name": "Email address", > "_value_": "tim.steenbeke@formica.digital" > }, > { > "_attribute_name": "Robots usage", > "_value_": "all" > }, > { > "_attribute_name": "Meta robots tags usage", > "_value_": "all" > }, > { > "_attribute_name": "Proxy host", > "_value_": "" > }, > { > "_attribute_name": "Proxy port", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication domain", > "_value_": "" > }, >
[jira] [Commented] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774146#comment-16774146 ] Tim Steenbeke commented on CONNECTORS-1584: --- {panel:title=Failure notice sent by mailer-dae...@apache.org} Hi. This is the qmail-send program at apache.org. I'm afraid I wasn't able to deliver your message to the following addresses. This is a permanent error; I've given up. Sorry it didn't work out. : Must be sent from an @apache.org address or a subscriber address or an address in LDAP. --- Below this line is a copy of the message. From: Steenbeke Tim To: "u...@manifoldcf.apache.org" Subject: Regex support Date: Mon, 18 Feb 2019 10:35:40 +0000 {panel}
[jira] [Commented] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774147#comment-16774147 ] Tim Steenbeke commented on CONNECTORS-1584: --- Three colleagues and I tried mailing the address, and we all got this same response. So it is the right address then; I thought we had made a mistake. > regex documentation > --- > > Key: CONNECTORS-1584 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1584 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Minor > > What types of regexes do the ManifoldCF include and exclude lists support, and what is the > general regex support? > At the moment I'm using a web repository connection and an Elastic output > connection. > I'm trying to exclude URLs that link to documents, > e.g. website.com/document/path/this.pdf and > website.com/document/path/other.PDF. > The issue I'm having is that the regexes I have found so far don't work > case-insensitively, so for every possible case I have to add a new line, > e.g.: > {code:java} > .*.pdf$ and .*.PDF$ and .*.Pdf and ... .{code} > Is it possible to add documentation on what types of regex can be used, or > maybe a tool to test your regex and see whether it is supported by ManifoldCF? > I tried mailing this question to > [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail > address returns a failure notice. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774033#comment-16774033 ] Tim Steenbeke commented on CONNECTORS-1584: --- If the address is userS, I think the site should be updated, because the address mentioned in the FAQ is user. [https://manifoldcf.apache.org/release/release-2.12/en_US/faq.html] Also, thanks for responding. > regex documentation > --- > > Key: CONNECTORS-1584 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1584 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Minor > > What types of regexes do the ManifoldCF include and exclude lists support, and what is the > general regex support? > At the moment I'm using a web repository connection and an Elastic output > connection. > I'm trying to exclude URLs that link to documents, > e.g. website.com/document/path/this.pdf and > website.com/document/path/other.PDF. > The issue I'm having is that the regexes I have found so far don't work > case-insensitively, so for every possible case I have to add a new line, > e.g.: > {code:java} > .*.pdf$ and .*.PDF$ and .*.Pdf and ... .{code} > Is it possible to add documentation on what types of regex can be used, or > maybe a tool to test your regex and see whether it is supported by ManifoldCF? > I tried mailing this question to > [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail > address returns a failure notice. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774033#comment-16774033 ] Tim Steenbeke edited comment on CONNECTORS-1584 at 2/21/19 12:37 PM: - If the address is user*s*, I think the site should be updated, because the address mentioned in the FAQ is user. [https://manifoldcf.apache.org/release/release-2.12/en_US/faq.html] Also, thanks for responding. was (Author: steenti): If the mail is userS I think the site should be updated because the mail mentioned in FAQ is user. [https://manifoldcf.apache.org/release/release-2.12/en_US/faq.html] Also thanks for responding. > regex documentation > --- > > Key: CONNECTORS-1584 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1584 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Minor > > What types of regexes do the ManifoldCF include and exclude lists support, and what is the > general regex support? > At the moment I'm using a web repository connection and an Elastic output > connection. > I'm trying to exclude URLs that link to documents, > e.g. website.com/document/path/this.pdf and > website.com/document/path/other.PDF. > The issue I'm having is that the regexes I have found so far don't work > case-insensitively, so for every possible case I have to add a new line, > e.g.: > {code:java} > .*.pdf$ and .*.PDF$ and .*.Pdf and ... .{code} > Is it possible to add documentation on what types of regex can be used, or > maybe a tool to test your regex and see whether it is supported by ManifoldCF? > I tried mailing this question to > [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail > address returns a failure notice. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (CONNECTORS-1584) regex documentation
Tim Steenbeke created CONNECTORS-1584:
-
Summary: regex documentation
Key: CONNECTORS-1584
URL: https://issues.apache.org/jira/browse/CONNECTORS-1584
Project: ManifoldCF
Issue Type: Improvement
Components: Web connector
Affects Versions: ManifoldCF 2.12
Reporter: Tim Steenbeke

What types of regexes do ManifoldCF's include and exclude rules support, and what is the general regex support?
At the moment I'm using a web repository connection and an Elastic output connection. I'm trying to exclude URLs that link to documents, e.g. website.com/document/path/this.pdf and website.com/document/path/other.PDF.
The issue I'm having is that the regexes I have found so far don't match case-insensitively, so for every possible casing I have to add a new line, e.g.: .*.pdf$ and .*.PDF$ and .*.Pdf and ...
Would it be possible to add documentation on what types of regex can be used, or maybe a tool to test your regex and see whether it is supported by ManifoldCF?
I tried mailing this question to [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail address returns a failure notice.
[jira] [Commented] (CONNECTORS-1575) inconsistant use of value-labels
[ https://issues.apache.org/jira/browse/CONNECTORS-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753912#comment-16753912 ] Tim Steenbeke commented on CONNECTORS-1575:
---
OK, thank you for your fast response.

> inconsistant use of value-labels
> -
>
> Key: CONNECTORS-1575
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1575
> Project: ManifoldCF
> Issue Type: Bug
> Components: API
> Affects Versions: ManifoldCF 2.12
> Reporter: Tim Steenbeke
> Priority: Minor
> Attachments: image-2019-01-28-11-57-46-738.png
>
> When retrieving a job using the API, there seem to be inconsistencies in the returned JSON of the job.
> For the schedule values 'hourofday', 'minutesofhour', etc., the label of the value is 'value', while for all other value labels it is '_value_'.
>
> !image-2019-01-28-11-57-46-738.png!
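The reported inconsistency can be illustrated with a hypothetical fragment of the job JSON. The structure below is reconstructed from the description alone (the attached screenshot is not available here), and `other_node` is a made-up placeholder for any non-schedule node: schedule fields such as `hourofday` reportedly label their value `"value"`, while everything else uses `"_value_"`.

```json
{
  "schedule": {
    "hourofday": { "value": "4" },
    "minutesofhour": { "value": "0" }
  },
  "other_node": { "_value_": "example" }
}
```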
[jira] [Created] (CONNECTORS-1575) inconsistant use of value-labels
Tim Steenbeke created CONNECTORS-1575:
-
Summary: inconsistant use of value-labels
Key: CONNECTORS-1575
URL: https://issues.apache.org/jira/browse/CONNECTORS-1575
Project: ManifoldCF
Issue Type: Bug
Components: API
Affects Versions: ManifoldCF 2.12
Reporter: Tim Steenbeke
Attachments: image-2019-01-28-11-57-46-738.png

When retrieving a job using the API, there seem to be inconsistencies in the returned JSON of the job. For the schedule values 'hourofday', 'minutesofhour', etc., the label of the value is 'value', while for all other value labels it is '_value_'.

!image-2019-01-28-11-57-46-738.png!