[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731261#comment-16731261 ] Tim Steenbeke commented on CONNECTORS-1562: --- [~kwri...@metacarta.com] I think there was some miscommunication: the "stopped working" issue was found by my colleague Donald Van den Driessche, so I didn't have any more info than what he gave me. I recreated the issue and this is the error: {code:java} Error: Repeated service interruptions - failure processing document: Stream Closed{code} !Screenshot from 2018-12-31 11-17-29.png! The question I wanted answered is: how are we supposed to set up the job with the data we have? What you see as the best solution might not be the right solution for us. I asked this, and you only responded to the other issue with ManifoldCF; it looked like you avoided the question. You suggested using the sitemap URL with excludes, but that is simply not possible: the exclude list is too big, and no regular expression is possible because of the randomness of the links. On this point I also thought that you were looking into this and had found a fix or edited the code. I'm sorry if my text came across as blunt, but I'm just trying to get information and I didn't know any other way to draw your attention to the full picture of the comment. English is not my first language, so I'm sorry for my limited vocabulary; Google Translate also doesn't help here. I hope we can continue this communication to get to a solution, hopefully one that works for both of us.
> Documents unreachable due to hopcount are not considered unreachable on
> cleanup pass
>
>                 Key: CONNECTORS-1562
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Elastic Search connector, Web connector
>    Affects Versions: ManifoldCF 2.11
>         Environment: ManifoldCF 2.11
>                      Elasticsearch 6.3.2
>                      Web input connector
>                      Elastic output connector
>                      Job crawls website input and outputs content to Elasticsearch
>            Reporter: Tim Steenbeke
>            Assignee: Karl Wright
>            Priority: Critical
>              Labels: starter
>             Fix For: ManifoldCF 2.12
>
>         Attachments: Screenshot from 2018-12-31 11-17-29.png,
>                      manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to
> keep it running even after updating it.
> After the rerun the unreachable documents don't get deleted.
> It only adds documents when they can be reached.

-- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: Screenshot from 2018-12-31 11-17-29.png
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16731315#comment-16731315 ] Tim Steenbeke commented on CONNECTORS-1562: --- Is this the error?
{code:java}
WARN 2018-12-31T08:24:46,453 (Worker thread '32') - Service interruption reported for job 1546241012417 connection 'repo_website-en': IO exception: Stream Closed
WARN 2018-12-31T08:28:52,471 (Worker thread '6') - Service interruption reported for job 1546241012417 connection 'repo_website-en': IO exception: Stream Closed
WARN 2018-12-31T08:32:10,699 (Worker thread '13') - Service interruption reported for job 1546241012417 connection 'repo_website-en': IO exception: Stream Closed
ERROR 2018-12-31T08:32:10,750 (Worker thread '13') - Exception tossed: Repeated service interruptions - failure processing document: Stream Closed
org.apache.manifoldcf.core.interfaces.ManifoldCFException: Repeated service interruptions - failure processing document: Stream Closed
    at org.apache.manifoldcf.crawler.system.WorkerThread.run(WorkerThread.java:489) [mcf-pull-agent.jar:?]
Caused by: java.io.IOException: Stream Closed
    at java.io.FileInputStream.readBytes(Native Method) ~[?:1.8.0_191]
    at java.io.FileInputStream.read(FileInputStream.java:255) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.readBytes(StreamDecoder.java:284) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.implRead(StreamDecoder.java:326) ~[?:1.8.0_191]
    at sun.nio.cs.StreamDecoder.read(StreamDecoder.java:178) ~[?:1.8.0_191]
    at java.io.InputStreamReader.read(InputStreamReader.java:184) ~[?:1.8.0_191]
    at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchIndex$IndexRequestEntity.writeTo(ElasticSearchIndex.java:221) ~[?:?]
    at org.apache.http.impl.execchain.RequestEntityProxy.writeTo(RequestEntityProxy.java:121) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.DefaultBHttpClientConnection.sendRequestEntity(DefaultBHttpClientConnection.java:156) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.conn.CPoolProxy.sendRequestEntity(CPoolProxy.java:160) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.protocol.HttpRequestExecutor.doSendRequest(HttpRequestExecutor.java:238) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.protocol.HttpRequestExecutor.execute(HttpRequestExecutor.java:123) ~[httpcore-4.4.10.jar:4.4.10]
    at org.apache.http.impl.execchain.MainClientExec.execute(MainClientExec.java:272) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.ProtocolExec.execute(ProtocolExec.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.execchain.RedirectExec.execute(RedirectExec.java:110) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.InternalHttpClient.doExecute(InternalHttpClient.java:185) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:83) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:108) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.http.impl.client.CloseableHttpClient.execute(CloseableHttpClient.java:56) ~[httpclient-4.5.6.jar:4.5.6]
    at org.apache.manifoldcf.agents.output.elasticsearch.ElasticSearchConnection$CallThread.run(ElasticSearchConnection.java:133) ~[?:?]
WARN 2018-12-31T08:33:35,958 (Job notification thread) - ES: Commit failed: {"error":"Incorrect HTTP method for uri [/website-en/_optimize] and method [GET], allowed: [POST]","status":405}
WARN 2018-12-31T08:34:46,024 (Job notification thread) - ES: Commit failed: {"error":"Incorrect HTTP method for uri [/pintra/_optimize] and method [GET], allowed: [POST]","status":405}
{code}
The timestamps are 1h off; it's running in a Docker container that has a different timezone at the moment.
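The `Caused by` in the stack trace above is the JDK refusing to read a `FileInputStream` that was already closed, which is what happens when an HTTP retry reuses a request entity whose underlying stream was consumed and closed on the first attempt. A minimal sketch of just that JDK failure mode (hypothetical temp file; this is not ManifoldCF code, only the behavior that produces this exact message):

```java
import java.io.*;

public class StreamClosedSketch {
    public static void main(String[] args) throws Exception {
        File doc = File.createTempFile("doc", ".txt"); // stand-in for a document body
        doc.deleteOnExit();

        InputStream in = new FileInputStream(doc);
        in.close(); // the first send attempt consumed and closed the stream

        try {
            in.read(); // a retry that reuses the same stream fails here
        } catch (IOException e) {
            System.out.println(e.getMessage()); // prints: Stream Closed
        }
    }
}
```

This matches the log's "IO exception: Stream Closed": the exception is about the already-closed local stream, not about the Elasticsearch connection itself.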
[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723782#comment-16723782 ] Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 7:48 AM: - [~kwri...@metacarta.com] If we update to ManifoldCF 2.12, can we then use the seed map as we originally intended? So we create a job with X seeds, ES output, web input, and hop count 0 for links and redirects:
# Put X seeds in the seed map
# Run the job
# X documents get pushed to ES
# Update the job to have X minus 20 seeds, wait till the scheduled time
# Run the job
# 20 documents get deleted from ES
# X minus 20 documents get updated
# Wait till the scheduled time
# ...
Will it work like this?

was (Author: steenti): If we update to ManifoldCF 2.12, can we then use the seed map as we originally intended? So we create a job with X seeds, ES output, web input, and hop count 0 for links and redirects: # Put X seeds in the seed map # Run the job # X documents get pushed to ES # Update the job to have X minus 20 seeds, wait till the scheduled time # Run the job # 20 documents get deleted from ES # X minus 20 documents get updated # Wait till the scheduled time # ... Will it work like this?
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723782#comment-16723782 ] Tim Steenbeke commented on CONNECTORS-1562: --- If we update to ManifoldCF 2.12, can we then use the seed map as we originally intended? So we create a job with X seeds, ES output, web input, and hop count 0 for links and redirects:
# Put X seeds in the seed map
# Run the job
# X documents get pushed to ES
# Update the job to have X minus 20 seeds, wait till the scheduled time
# Run the job
# 20 documents get deleted from ES
# X minus 20 documents get updated
# Wait till the scheduled time
# ...
Will it work like this?
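The expectation in the steps above can be stated as a set difference: with hop count 0, the reachable set is exactly the seed list, so a cleanup pass should delete precisely the seeds that were dropped. A sketch with hypothetical URLs standing in for X and X minus 20 (this models the behavior being asked about, not ManifoldCF internals):

```java
import java.util.*;

public class SeedCleanupSketch {
    public static void main(String[] args) {
        // Hypothetical seed lists: the job before and after the update.
        Set<String> previousSeeds = new HashSet<>(Arrays.asList("a", "b", "c", "d"));
        Set<String> updatedSeeds  = new HashSet<>(Arrays.asList("a", "b"));

        // Documents that should disappear from the ES index on the next run:
        // everything that was indexed but is no longer a seed.
        Set<String> toDelete = new HashSet<>(previousSeeds);
        toDelete.removeAll(updatedSeeds);

        System.out.println(toDelete); // the dropped seeds, c and d
    }
}
```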
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724016#comment-16724016 ] Tim Steenbeke commented on CONNECTORS-1562: --- There is no regex; there is no possibility to make a regex for this. That's the issue with creating the exclude/blacklist. 'Started acting strange' means it stopped working and crashed. But that is not the question. Please answer my question: is this the way we have to run the job?
{panel}
_*Tim Steenbeke added a comment*_
We create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop count 1 for links and 0 for redirects:
# Run the job
# +-29000 documents get pushed to ES
# The sitemap gets updated (e.g. 29000 URLs become 28990 URLs)
# Wait till the scheduled time
# Run the job
# Documents get added/deleted (e.g. 10 documents deleted)
# Wait till the scheduled time
# ...
{panel}
[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723981#comment-16723981 ] Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 11:52 AM: -- [~kwri...@metacarta.com] - So then with the seed map URL: we create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop count 1 for links and 0 for redirects:
# Run the job
# +-29000 documents get pushed to ES
# The sitemap gets updated (e.g. 29000 URLs become 28990 URLs)
# Wait till the scheduled time
# Run the job
# Documents get added/deleted (e.g. 10 documents deleted)
# Wait till the scheduled time
# ...
Last time we tried this, ManifoldCF started acting strange because of the number of URLs/links in the sitemap URL (sitemap URL: [https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en=true]) (blacklist URLs: [https://www.uantwerpen.be/admin/system/sitemap/sitemap_revokes.aspx?lang=en=true])

was (Author: steenti): [~kwri...@metacarta.com] - So then with the seed map URL: we create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop count 1 for links and 0 for redirects: # Run the job # +-29000 documents get pushed to ES # The sitemap gets updated (e.g. 29000 URLs become 28990 URLs) # Wait till the scheduled time # Run the job # Documents get added/deleted (e.g. 10 documents deleted) # Wait till the scheduled time # ... Last time we tried this, ManifoldCF started acting strange because of the number of URLs/links in the sitemap URL (sitemap URL: [https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en=true])
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16723981#comment-16723981 ] Tim Steenbeke commented on CONNECTORS-1562: --- [~kwri...@metacarta.com] - So then with the seed map URL: we create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop count 1 for links and 0 for redirects:
# Run the job
# +-29000 documents get pushed to ES
# The sitemap gets updated (e.g. 29000 URLs become 28990 URLs)
# Wait till the scheduled time
# Run the job
# Documents get added/deleted (e.g. 10 documents deleted)
# Wait till the scheduled time
# ...
Last time we tried this, ManifoldCF started acting strange because of the number of URLs/links in the sitemap URL (sitemap URL: [https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en=true])
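The hop-count setting in the steps above ("1 for links") means only pages at most one link away from the seed stay reachable. A sketch of that reachability semantics over a hypothetical link graph (a simplified model, not ManifoldCF's actual hopcount implementation):

```java
import java.util.*;

public class HopCountSketch {
    // Hypothetical link graph: page -> pages it links to.
    static final Map<String, List<String>> LINKS = new HashMap<>();
    static {
        LINKS.put("sitemap", Arrays.asList("page1", "page2"));
        LINKS.put("page1", Arrays.asList("page3"));
    }

    // Breadth-first expansion from the seed, limited to maxHops link hops.
    static Set<String> reachable(String seed, int maxHops) {
        Set<String> seen = new HashSet<>(Collections.singleton(seed));
        List<String> frontier = Collections.singletonList(seed);
        for (int hop = 0; hop < maxHops; hop++) {
            List<String> next = new ArrayList<>();
            for (String url : frontier)
                for (String out : LINKS.getOrDefault(url, Collections.<String>emptyList()))
                    if (seen.add(out)) next.add(out);
            frontier = next;
        }
        return seen;
    }

    public static void main(String[] args) {
        // With link hop count 1, page3 (two hops from the seed) is unreachable
        // and should be a deletion candidate on a cleanup pass.
        System.out.println(reachable("sitemap", 1)); // sitemap, page1, page2
    }
}
```

With the sitemap itself as the single seed, the question in this thread is whether documents that fall out of this reachable set after a sitemap update are deleted on the next non-continuous run.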
[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16724016#comment-16724016 ] Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 12:47 PM: -- There is no regex; there is no possibility to make a regex for this. That's the issue with creating the exclude/blacklist. There is already a regex being used for images, documents, and other files that don't have to be crawled. 'Started acting strange' means it stopped working and crashed: because of the number of URLs it stopped indexing; no messages or errors were given, the job just stopped working. But that is not the question. Please answer my question: is this the way we have to run the job?
{panel}
_*Tim Steenbeke added a comment*_
We create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop count 1 for links and 0 for redirects:
# Run the job
# +-29000 documents get pushed to ES
# The sitemap gets updated (e.g. 29000 URLs become 28990 URLs)
# Wait till the scheduled time
# Run the job
# Documents get added/deleted (e.g. 10 documents deleted)
# Wait till the scheduled time
# ...
{panel}

was (Author: steenti): There is no regex; there is no possibility to make a regex for this. That's the issue with creating the exclude/blacklist. 'Started acting strange' means it stopped working and crashed. But that is not the question. Please answer my question: is this the way we have to run the job? {panel} _*Tim Steenbeke added a comment*_ We create a job with 1 seed, which is the full seed map (+29000 URLs), ES output, web input, and hop count 1 for links and 0 for redirects: # Run the job # +-29000 documents get pushed to ES # The sitemap gets updated (e.g. 29000 URLs become 28990 URLs) # Wait till the scheduled time # Run the job # Documents get added/deleted (e.g. 10 documents deleted) # Wait till the scheduled time # ...{panel}
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: (was: 30URLSeeds.png)
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: (was: Screenshot from 2018-12-10 14-07-46.png)
[jira] [Updated] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: (was: 3URLSeed.png)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16716761#comment-16716761 ] Tim Steenbeke commented on CONNECTORS-1562: --- [~kwri...@metacarta.com] I have a URL with the full sitemap that has to be crawled ~^(and a full exclude sitemap)^~. If I use this URL as the seed, do I have to set the hop filters to a specific value (e.g. redirect: 0 and link: 1)? If one or multiple links are deleted from this sitemap, will the documents be deleted from ES? How should I set up the job to keep only the crawled sites in the sitemap?
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16717023#comment-16717023 ] Tim Steenbeke commented on CONNECTORS-1562: --- Due to customer requirements and sources, the best option is to work with a list of seeds created from the sitemap. Whenever there is an update and a seed is removed, it should be removed from Elastic. Therefore, is it possible to reopen the issue and test how and why documents that aren't in the seed list anymore don't get deleted?
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16709712#comment-16709712 ] Tim Steenbeke commented on CONNECTORS-1562: --- The crawling is scheduled as a dynamic rescan of the documents. !Screenshot from 2018-12-05 09-01-46.png!
[jira] [Updated] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: Screenshot from 2018-12-05 09-01-46.png
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16709986#comment-16709986 ] Tim Steenbeke commented on CONNECTORS-1562: --- The documentation states: {code:java} A typical non-continuous run of a job has the following stages of execution: Adding the job's new, changed, or deleted starting points to the queue ("seeding") Fetching documents, discovering new documents, and detecting deletions Removing no-longer-included documents from the queue Jobs can also be run "continuously", which means that the job never completes, unless it is aborted. A continuous run has different stages of execution: Adding the job's new, changed, or deleted starting points to the queue ("seeding") Fetching documents, discovering new documents, and detecting deletions, while reseeding periodically Note that continuous jobs cannot remove no-longer-included documents from the queue. They can only remove documents that have been deleted from the repository.{code} Both should detect deletions, but only a non-continuous run should delete the unreachable documents. Knowing this, I changed the job to a non-continuous job that starts every 5 minutes for testing.
Even when the job is non-continuous, it doesn't delete the unreachable documents; it keeps all documents indexed in Elasticsearch.
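The documented behavior can be sketched as a toy model (plain Python, not ManifoldCF code; the function and data structures are illustrative only): a non-continuous run seeds, fetches up to the hop limit, and then prunes everything the crawl could no longer reach.

```python
# Toy model of the documented non-continuous run (not ManifoldCF code):
# seed -> fetch/discover -> remove no-longer-included documents.
def run_job(seeds, links, index, max_hops=0):
    """Crawl from seeds up to max_hops, then delete unreachable docs."""
    reachable = set(seeds)
    frontier = set(seeds)
    for _ in range(max_hops):
        frontier = {t for u in frontier for t in links.get(u, [])} - reachable
        reachable |= frontier
    index |= reachable   # fetch/ingest stage: reachable docs are indexed
    index &= reachable   # cleanup stage: a non-continuous run prunes the rest
    return index

index = run_job({"a", "b", "c"}, {}, set())   # first run indexes 3 docs
index = run_job({"a"}, {}, index)             # rerun with fewer seeds
print(sorted(index))  # ['a'] -- 'b' and 'c' should have been deleted
```

That final pruning step is exactly what the bug report says is not happening against the Elasticsearch index.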
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ] Tim Steenbeke commented on CONNECTORS-1562: --- Hi [~kwri...@metacarta.com], So I set up a job as you explained above. The scheduler now worked fine, even with multiple values. I tested the same with the ES output connector, and it also started at the scheduled time. So it seems there was an issue in the import of the job schedule, which has now been resolved. Next, I edited the seeds, deleted some links, and let the job run on schedule again. There were 0 deletions, and the Simple History also showed 0 deletion messages. (Also on the null output, but this is probably normal because it's Null.)
[jira] [Comment Edited] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=1671#comment-1671 ] Tim Steenbeke edited comment on CONNECTORS-1562 at 12/10/18 8:55 AM: - Hi [~kwri...@metacarta.com], So I set up a job as you explained above. The scheduler now worked fine, even with multiple values. I tested the same with the ES output connector, and it also started at the scheduled time. So it seems there was an issue in the import of the job schedule, which has now been resolved. Next, I edited the seeds, deleted some links, and let the job run on schedule again. There were 0 deletions, and the Simple History also showed 0 deletion messages. Also in the Document Status for the job there were no deletions registered. (Also on the null output, but this is probably normal because it's Null.)
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714610#comment-16714610 ] Tim Steenbeke commented on CONNECTORS-1562: --- ManifoldCF doesn't delete documents it should delete. You quote the text where I say there were no deletions and then ask me if there were any? (On a side note: it did, however, just delete documents that shouldn't have been indexed in the first place, documents that were added to ES but weren't in scope in the original run.)
[jira] [Comment Edited] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714610#comment-16714610 ] Tim Steenbeke edited comment on CONNECTORS-1562 at 12/10/18 11:42 AM: -- ManifoldCF doesn't delete documents it should delete. You quote the text where I say there were no deletions and then ask me if there were any? (On a side note: it did, however, just delete 3 documents and not 10.)
[jira] [Issue Comment Deleted] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Comment: was deleted (was: Manifold doesn't delete documents it should delete. you quote the text where i say there were no deletions and than ask me if there were any ? ( on a site-note: It did however just deleted 3 documents and not 10 so it partially worked))
[jira] [Comment Edited] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714610#comment-16714610 ] Tim Steenbeke edited comment on CONNECTORS-1562 at 12/10/18 11:49 AM: -- ManifoldCF doesn't delete documents it should delete. You quote the text where I say there were no deletions and then ask me if there were any? (On a side note: it did, however, just delete 3 documents and not 10, so it partially worked.)
[jira] [Updated] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: 30URLSeeds.png
[jira] [Updated] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: 3URLSeed.png
[jira] [Commented] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16714692#comment-16714692 ] Tim Steenbeke commented on CONNECTORS-1562: --- # I created a job with a Null output connector # put 30 URLs as seeds # set the hop filter to 0 so no links or redirects will be checked # ran the job. Check Simple History: all the documents get fetched and processed (unless RESPONSECODENOTINDEXABLE) # I edited the job # deleted all but 3 URLs, so the seeds are now just 3 URLs # ran the job. Check Simple History: all documents still get fetched even though they aren't in the seeds anymore; no document gets deleted and the job ends !30URLSeeds.png! !3URLSeed.png! !Screenshot from 2018-12-10 14-07-46.png!
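The expectation behind these steps can be written out explicitly (a sketch with hypothetical URLs; only the counts match the repro above): with a hop count of 0 nothing is discovered beyond the seeds themselves, so shrinking the seed list from 30 entries to 3 should produce exactly 27 deletions.

```python
# Hypothetical seed lists mirroring the 30-seed run and the 3-seed rerun.
original_seeds = {f"https://example.com/page{i}" for i in range(30)}
reduced_seeds = {f"https://example.com/page{i}" for i in range(3)}

# With hop count 0, each seed maps to exactly one indexed document, so the
# documents expected to become unreachable are precisely the removed seeds.
expected_deletions = original_seeds - reduced_seeds
print(len(expected_deletions))  # 27
```

The Simple History in the screenshots shows 0 deletions instead of the 27 this arithmetic predicts.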
[jira] [Updated] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: Screenshot from 2018-12-10 14-07-46.png
[jira] [Updated] (CONNECTORS-1562) Document removal Elastic
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1562: -- Attachment: (was: Screenshot from 2018-12-05 09-01-46.png)
[jira] [Created] (CONNECTORS-1562) Document removal Elastic
Tim Steenbeke created CONNECTORS-1562: - Summary: Document removal Elastic Key: CONNECTORS-1562 URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 Project: ManifoldCF Issue Type: Bug Components: Elastic Search connector, Web connector Affects Versions: ManifoldCF 2.11 Environment: ManifoldCF 2.11 Elasticsearch 6.3.2 Web input connector Elastic output connector Job crawls website input and outputs content to Elastic Reporter: Tim Steenbeke My documents aren't removed from the Elasticsearch index after rerunning the changed seeds. I update my job to change the seed map and rerun it, or use the scheduler to keep it running even after updating it. After the rerun, the unreachable documents don't get deleted. It only adds documents when they can be reached.
[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1567: -- Attachment: bandwidth_test_abc.png > export of web connection bandwidth throttling > - > > Key: CONNECTORS-1567 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1567 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12 >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > Attachments: bandwidth.png, bandwidth_test_abc.png > > > When exporting the web connector using the API, it doesn't export the > bandwidth throttling. > Then, when importing this connector into a clean ManifoldCF, it creates the > connector with the default bandwidth settings. > When using the connector in a job it works properly. > The issue here is that the connector isn't created with the correct bandwidth > throttling. > And the connector causes issues in the UI when trying to view or edit it. 
> (related to issue: > [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568]) > e.g.: > {code:java} > { > "name": "test_web", > "configuration": null, > "_PARAMETER_": [ > { > "_attribute_name": "Email address", > "_value_": "tim.steenbeke@formica.digital" > }, > { > "_attribute_name": "Robots usage", > "_value_": "all" > }, > { > "_attribute_name": "Meta robots tags usage", > "_value_": "all" > }, > { > "_attribute_name": "Proxy host", > "_value_": "" > }, > { > "_attribute_name": "Proxy port", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication domain", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication user name", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication password", > "_value_": "" > } > ] > }, > "description": "Website repository standard settup", > "throttle": null, > "max_connections": 10, > "class_name": > "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", > "acl_authority": null > }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
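The reported symptom can be checked mechanically on the exported JSON (a sketch; the trimmed document below is hypothetical, with field names taken from the export above): a faithful export would carry the throttling data, but here both the top-level throttle object and the per-bin bandwidth description are absent.

```python
import json

# Trimmed, hypothetical version of the exported connection shown above.
exported = json.loads('''
{
  "name": "test_web",
  "configuration": {"_PARAMETER_": []},
  "throttle": null,
  "max_connections": 10
}
''')

# Bandwidth throttling is configured per bin ("bindesc") inside the
# configuration; generic throttling lives in the top-level "throttle" object.
has_bandwidth = "bindesc" in exported["configuration"]
has_throttle = exported.get("throttle") is not None
print(has_bandwidth, has_throttle)  # False False: throttling lost on export
```

Importing such a document into a clean instance would therefore create the connector with no throttling at all, which matches the behavior described.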
[jira] [Comment Edited] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737909#comment-16737909 ] Tim Steenbeke edited comment on CONNECTORS-1567 at 1/9/19 7:05 AM: --- But are bandwidth throttles and throttling the same thing in ManifoldCF? The bandwidth throttle is a different object in the response JSON, or am I mistaken? Also, I don't understand what you mean by old-form; the example is the response from a 'repositoryconnections' GET call on ManifoldCF 2.11. The documentation also only mentions throttling, not bandwidth, for both 2.11 and 2.12. ([JSON repository connection objects 2.12|https://manifoldcf.apache.org/release/release-2.12/en_US/programmatic-operation.html#Repository+connection+objects]) *Response for curl -X GET [http://localhost:8345/mcf-api-service/json/repositoryconnections] -H 'content-type: application/json':* {code:java} { "throttle": { "match_description": "testable regex", "rate": "1.666E-4", "match": "test reg" }, "max_connections": "20", "configuration": { "trust": { "_attribute_trusteverything": "true", "_value_": "", "_attribute_urlregexp": ".*" }, "bindesc": { "maxkbpersecond": { "_value_": "", "_attribute_value": "64" }, "_attribute_caseinsensitive": "false", "maxconnections": { "_value_": "", "_attribute_value": "2" }, "maxfetchesperminute": { "_value_": "", "_attribute_value": "12" }, "_attribute_binregexp": "test regex", "_value_": "" }, "_PARAMETER_": [ { "_value_": "tim.steenbeke@formica.digital", "_attribute_name": "Email address" }, { "_value_": "all", "_attribute_name": "Robots usage" }, { "_value_": "all", "_attribute_name": "Meta robots tags usage" }, { "_value_": "proxyhost", "_attribute_name": "Proxy host" }, { "_value_": "port", "_attribute_name": "Proxy port" }, { "_value_": "domain", "_attribute_name": "Proxy authentication domain" }, { "_value_": "admin", "_attribute_name": "Proxy authentication user name" }, { "_value_": 
"5qNuZnChiobQlUozw2quhCGsgYVazxVVbAUjc3Hk5Mc=", "_attribute_name": "Proxy authentication password" } ], "accesscredential": [ { "_value_": "", "_attribute_type": "basic", "_attribute_username": "admin", "_attribute_urlregexp": "some acces creds", "_attribute_password": "RkBMPT2W2ZC7XebgFp5PSuYSdCDnik4GKd130+PtXRk=", "_attribute_domain": "localhost:8080" }, { "_value_": "", "_attribute_type": "session", "_attribute_urlregexp": "url regex" } ] }, "name": "abc_test", "description": "test abc", "isnew": "false", "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector" } {code} *For the following bandwidth setup:* !bandwidth_test_abc.png! *So then I would do the following to set bandwidth and throttling to null:* {code:java} { "throttle": null,<<<--- null for throttling "max_connections": "20", "configuration": { "trust": { "_attribute_trusteverything": "true", "_value_": "", "_attribute_urlregexp": ".*" }, "bindesc": null,<<<--- null for bandwidth "_PARAMETER_": [ { "_value_":
[jira] [Comment Edited] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737909#comment-16737909 ] Tim Steenbeke edited comment on CONNECTORS-1567 at 1/9/19 7:06 AM: --- But are bandwidth throttles and throttling the same thing for ManifoldCF? The bandwidth throttle is a different object in the response JSON, or am I mistaken? Also, I don't understand what you mean by "old form": the example is the response from a 'repositoryconnections' GET call on ManifoldCF 2.11. The documentation also only mentions throttling, not bandwidth, for both 2.11 and 2.12: [JSON repository connection objects|https://manifoldcf.apache.org/release/release-2.12/en_US/programmatic-operation.html#Repository+connection+objects]

*Response for curl -X GET http://localhost:8345/mcf-api-service/json/repositoryconnections -H 'content-type: application/json'*
{code:java}
{
  "throttle": {
    "match_description": "testable regex",
    "rate": "1.666E-4",
    "match": "test reg"
  },
  "max_connections": "20",
  "configuration": {
    "trust": {
      "_attribute_trusteverything": "true",
      "_value_": "",
      "_attribute_urlregexp": ".*"
    },
    "bindesc": {
      "maxkbpersecond": { "_value_": "", "_attribute_value": "64" },
      "_attribute_caseinsensitive": "false",
      "maxconnections": { "_value_": "", "_attribute_value": "2" },
      "maxfetchesperminute": { "_value_": "", "_attribute_value": "12" },
      "_attribute_binregexp": "test regex",
      "_value_": ""
    },
    "_PARAMETER_": [
      { "_value_": "tim.steenbeke@formica.digital", "_attribute_name": "Email address" },
      { "_value_": "all", "_attribute_name": "Robots usage" },
      { "_value_": "all", "_attribute_name": "Meta robots tags usage" },
      { "_value_": "proxyhost", "_attribute_name": "Proxy host" },
      { "_value_": "port", "_attribute_name": "Proxy port" },
      { "_value_": "domain", "_attribute_name": "Proxy authentication domain" },
      { "_value_": "admin", "_attribute_name": "Proxy authentication user name" },
      { "_value_": "5qNuZnChiobQlUozw2quhCGsgYVazxVVbAUjc3Hk5Mc=", "_attribute_name": "Proxy authentication password" }
    ],
    "accesscredential": [
      { "_value_": "", "_attribute_type": "basic", "_attribute_username": "admin", "_attribute_urlregexp": "some acces creds", "_attribute_password": "RkBMPT2W2ZC7XebgFp5PSuYSdCDnik4GKd130+PtXRk=", "_attribute_domain": "localhost:8080" },
      { "_value_": "", "_attribute_type": "session", "_attribute_urlregexp": "url regex" }
    ]
  },
  "name": "abc_test",
  "description": "test abc",
  "isnew": "false",
  "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector"
}
{code}
*For the following bandwidth setup:* !bandwidth_test_abc.png!

*So then I would do the following to set bandwidth and throttling to null:*
{code:java}
{
  "throttle": null,   <<<--- null for throttling
  "max_connections": "20",
  "configuration": {
    "trust": { "_attribute_trusteverything": "true", "_value_": "", "_attribute_urlregexp": ".*" },
    "bindesc": null,   <<<--- null for bandwidth
    "_PARAMETER_": [
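The edit proposed above (null out "throttle" and "bindesc", then write the object back) can be sketched offline. This is a minimal illustration, not ManifoldCF client code: `strip_throttling` is a hypothetical helper, and the field names are simply taken from the GET response shown in this comment.

```python
import json

def strip_throttling(connection):
    """Return a deep copy of a repository-connection object with the
    connection-level "throttle" nulled out, and the web connector's
    bandwidth bin ("bindesc") nulled out if present."""
    conn = json.loads(json.dumps(connection))  # cheap deep copy
    conn["throttle"] = None
    config = conn.get("configuration")
    if isinstance(config, dict) and "bindesc" in config:
        config["bindesc"] = None
    return conn

# trimmed-down version of the exported connection above
exported = {
    "throttle": {"rate": "1.666E-4", "match": "test reg"},
    "max_connections": "20",
    "configuration": {"trust": {}, "bindesc": {"_attribute_binregexp": "test regex"}},
}
cleaned = strip_throttling(exported)
```

The cleaned object would then be sent back to the same mcf-api-service endpoint with a write call (the verb and exact path depend on the API version, so they are left out here).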
[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1567: -- Attachment: bandwidth.png > export of web connection bandwidth throttling > - > > Key: CONNECTORS-1567 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1567 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12 >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > Attachments: bandwidth.png > > > When exporting the web connector using the API, it doesn't export the > bandwidth throttling. > Then when importing this connector into a clean ManifoldCF, it creates the > connector with default bandwidth settings. > When using the connector in a job it works properly. > The issue here is that the connector isn't created with the correct bandwidth > throttling, and the connector causes errors in the UI when trying to view or edit it. > (related to issue: > [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568]) > e.g.: > {code:java} > { > "name": "test_web", > "configuration": null, > "_PARAMETER_": [ > { > "_attribute_name": "Email address", > "_value_": "tim.steenbeke@formica.digital" > }, > { > "_attribute_name": "Robots usage", > "_value_": "all" > }, > { > "_attribute_name": "Meta robots tags usage", > "_value_": "all" > }, > { > "_attribute_name": "Proxy host", > "_value_": "" > }, > { > "_attribute_name": "Proxy port", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication domain", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication user name", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication password", > "_value_": "" > } > ] > }, > "description": "Website repository standard settup", > "throttle": null, > "max_connections": 10, > "class_name": > "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", > "acl_authority": null > }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735659#comment-16735659 ] Tim Steenbeke commented on CONNECTORS-1562: --- We created everything via the UI and then extracted the full setup. Then, after deleting everything, we used the API to POST it back to ManifoldCF. That is when I get this error, but only with web repository connections. I also noticed that the bandwidth is not exported to the connector JSON. > Documents unreachable due to hopcount are not considered unreachable on > cleanup pass > > > Key: CONNECTORS-1562 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 > Project: ManifoldCF > Issue Type: Bug > Components: Elastic Search connector, Web connector >Affects Versions: ManifoldCF 2.11 > Environment: ManifoldCF 2.11 > Elasticsearch 6.3.2 > Web input connector > Elastic output connector > Job crawls website input and outputs content to Elasticsearch >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Critical > Labels: starter > Fix For: ManifoldCF 2.12 > > Attachments: Screenshot from 2018-12-31 11-17-29.png, > manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced > > Original Estimate: 4h > Remaining Estimate: 4h > > My documents aren't removed from the Elasticsearch index after rerunning the > changed seeds. > I update my job to change the seed map and rerun it, or use the scheduler to > keep it running even after updating it. > After the rerun the unreachable documents don't get deleted. > It only adds documents when they can be reached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737170#comment-16737170 ] Tim Steenbeke commented on CONNECTORS-1562: --- But if the hop-count is 2, then it will go too far beyond the sitemap and will add documents that aren't supposed to be indexed. Because the sitemap is the full whitelist, we set the hop-count to 1 so it doesn't hop from a whitelisted URL to a possibly non-whitelisted URL. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
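The hop-count behaviour described in this comment amounts to a depth-limited breadth-first crawl. The following is a toy model of that semantics, not the web connector's actual implementation; the link graph and URL names are invented for illustration.

```python
from collections import deque

def crawl_within_hops(seeds, links, max_hops):
    """Reach every URL within max_hops link hops of a seed.
    'links' maps a URL to the URLs it links to."""
    reached = set(seeds)
    frontier = deque((url, 0) for url in seeds)
    while frontier:
        url, hops = frontier.popleft()
        if hops == max_hops:
            continue  # don't expand past the hop budget
        for target in links.get(url, []):
            if target not in reached:
                reached.add(target)
                frontier.append((target, hops + 1))
    return reached

# seed = the sitemap; hop-count 1 stays on pages linked directly
# from the seed, while hop-count 2 would also pull in pages linked
# from those pages (possibly outside the whitelist)
links = {"sitemap": ["whitelisted-a", "whitelisted-b"],
         "whitelisted-a": ["non-whitelisted"]}
```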
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16735590#comment-16735590 ] Tim Steenbeke commented on CONNECTORS-1562: --- I have the throttle on null and max_connections on 10, which was the standard setting. I'm also getting an error when I try to open my web output connector; all other connectors and job editing work. I'm building the ManifoldCF connectors and jobs using the API.

*HTTP ERROR 500* Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason: Server Error

*Caused by:*
{code:java}
org.apache.jasper.JasperException: An exception occurred processing JSP page /editconnection.jsp at line 564

561:
562: if (className.length() > 0)
563: {
564: RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName);
565: }
566: %>
567:

Stacktrace:
at org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430)
at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
at org.eclipse.jetty.server.Server.handle(Server.java:497)
at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
at org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86)
at org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866)
at org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83)
at org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155)
at org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916)
at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:388)
... 23 more
{code}
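The NullPointerException above comes from Base64.decodeString being handed a null certificate-store string while the edit page builds the Certificates tab; a later comment on CONNECTORS-1568 traces this to a missing "trust" JSON object in the imported configuration. A client-side pre-import check like the following (a hypothetical helper, not part of ManifoldCF) could flag such connection objects before they are POSTed back:

```python
def missing_trust(connection):
    """True when a web repository-connection object has no usable
    "trust" node in its configuration -- the shape that still
    crawled fine but crashed the edit UI."""
    config = connection.get("configuration")
    if not isinstance(config, dict):
        return True
    return not isinstance(config.get("trust"), dict)

# the broken import in this ticket had "configuration": null
broken = {"name": "test_web", "configuration": None}
ok = {"name": "abc_test",
      "configuration": {"trust": {"_attribute_trusteverything": "true"}}}
```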
[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1567: -- Description: When exporting the web connector using the API, it doesn't export the bandwidth throttling. Then when importing this connector into a clean ManifoldCF, it creates the connector with default bandwidth settings. When using the connector in a job it works properly. The issue here is that the connector isn't created with the correct bandwidth throttling, and the connector causes errors in the UI when trying to view or edit it. e.g.: {code:java} { "name": "test_web", "configuration": null, "_PARAMETER_": [ { "_attribute_name": "Email address", "_value_": "tim.steenbeke@formica.digital" }, { "_attribute_name": "Robots usage", "_value_": "all" }, { "_attribute_name": "Meta robots tags usage", "_value_": "all" }, { "_attribute_name": "Proxy host", "_value_": "" }, { "_attribute_name": "Proxy port", "_value_": "" }, { "_attribute_name": "Proxy authentication domain", "_value_": "" }, { "_attribute_name": "Proxy authentication user name", "_value_": "" }, { "_attribute_name": "Proxy authentication password", "_value_": "" } ] }, "description": "Website repository standard settup", "throttle": null, "max_connections": 10, "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", "acl_authority": null }{code} was: When exporting the web connector using the API, it doesn't export the bandwidth throttling. This then also doesn't create a connector with bandwidth throttling, which causes errors in the UI when trying to view or edit. When using this connector in a job it will use the default bandwidth. e.g.: {code:java} { "name": "test_web", "configuration": null, "_PARAMETER_": [ { "_attribute_name": "Email address", "_value_": "tim.steenbeke@formica.digital" }, { "_attribute_name": "Robots usage", "_value_": "all" }, { "_attribute_name": "Meta robots tags usage", "_value_": "all" }, { "_attribute_name": "Proxy host", "_value_": "" }, { "_attribute_name": "Proxy port", "_value_": "" }, { "_attribute_name": "Proxy authentication domain", "_value_": "" }, { "_attribute_name": "Proxy authentication user name", "_value_": "" }, { "_attribute_name": "Proxy authentication password", "_value_": "" } ] }, "description": "Website repository standard settup", "throttle": null, "max_connections": 10, "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", "acl_authority": null }{code}
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736873#comment-16736873 ] Tim Steenbeke commented on CONNECTORS-1562: --- OK, I will create a new ticket for both issues. Testing the connector with the UI, with bandwidth disabled and max connections set to 20, we were able to crawl all sites. Now I'm still stuck with the deletion issue: if a site is removed from the sitemap, will it be removed from Elasticsearch, since it is no longer reachable? We put the sitemap (the big one) as seed, and if one or a few URLs get removed, they should get removed from Elasticsearch. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
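The cleanup semantics being asked for in this comment reduce to a set difference between crawl runs. The sketch below states that expectation only; it is not how ManifoldCF tracks documents internally, and the URL values are made up.

```python
def urls_to_delete(previous_sitemap, current_sitemap):
    """URLs that were seeded on the last run but are gone from the
    current sitemap; these are the documents that should disappear
    from the Elasticsearch index on the next cleanup pass."""
    return set(previous_sitemap) - set(current_sitemap)

# one URL dropped from the sitemap between runs
stale = urls_to_delete(["a", "b", "c"], ["a", "c"])  # {"b"}
```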
[jira] [Commented] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738159#comment-16738159 ] Tim Steenbeke commented on CONNECTORS-1567: --- Same problem as CONNECTORS-1568; we found the issue after debugging and fixed it. Thank you for the help. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1568) UI error imported web connection
[ https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738153#comment-16738153 ] Tim Steenbeke commented on CONNECTORS-1568: --- While debugging the project we found a missing JSON object for trust; this broke the UI, but the connector itself still worked. We have now fixed the bug, so thank you for the help. > UI error imported web connection > > > Key: CONNECTORS-1568 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1568 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12 >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > > Using the ManifoldCF API, we export a web repository connector with basic > settings. > Then we import the web connector using the ManifoldCF API. > The connector gets imported and can be used in a job. > When trying to view or edit the connector in the UI, the following error pops up. > (connected to issue: > [CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567]) > *HTTP ERROR 500* > Problem accessing /mcf-crawler-ui/editconnection.jsp. 
Reason: > Server Error > *Caused by:* > {code:java} > org.apache.jasper.JasperException: An exception occurred processing JSP page > /editconnection.jsp at line 564 > 561: > 562: if (className.length() > 0) > 563: { > 564: > RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new > > org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName); > 565: } > 566: %> > 567: > Stacktrace: > at > org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521) > at > org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430) > at > org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313) > at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260) > at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) > at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769) > at > org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) > at > org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577) > at > org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223) > at > org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125) > at > org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) > at > org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) > at > org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059) > at > org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) > at > org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) > at > org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) > at > org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) > at 
org.eclipse.jetty.server.Server.handle(Server.java:497) > at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311) > at > org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248) > at > org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540) > at > org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610) > at > org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539) > at java.lang.Thread.run(Thread.java:748) > Caused by: java.lang.NullPointerException > at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164) > at > org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.(KeystoreManager.java:86) > at > org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47) > at > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701) > at > org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866) > at > org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83) > at > org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155) > at >
[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1567: -- Attachment: (was: bandwidth.png) -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16738090#comment-16738090 ] Tim Steenbeke commented on CONNECTORS-1567: --- Setting _"binddesc"_ to null doesn't seem to help. The process you describe is how we do it: # Make the connector in the UI # Test the connector # Extract the connector # Clean ManifoldCF # Import the connector # Test the connector So the output should then be in the new format, because we use 2.11. > export of web connection bandwidth throttling > - > > Key: CONNECTORS-1567 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1567 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12 >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Major > Fix For: ManifoldCF 2.13 > > Attachments: bandwidth_test_abc.png > > > When exporting the web connector using the API, it doesn't export the > bandwidth throttling. > Then, when importing this connector into a clean ManifoldCF, it creates the > connector with the default bandwidth. > When using the connector in a job it works properly. > The issue here is that the connector isn't created with the correct bandwidth > throttling. > The connector also gives errors in the UI when trying to view or edit it. 
> (related to issue: > [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568]) > e.g.: > {code:java} > { > "name": "test_web", > "configuration": null, > "_PARAMETER_": [ > { > "_attribute_name": "Email address", > "_value_": "tim.steenbeke@formica.digital" > }, > { > "_attribute_name": "Robots usage", > "_value_": "all" > }, > { > "_attribute_name": "Meta robots tags usage", > "_value_": "all" > }, > { > "_attribute_name": "Proxy host", > "_value_": "" > }, > { > "_attribute_name": "Proxy port", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication domain", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication user name", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication password", > "_value_": "" > } > ] > }, > "description": "Website repository standard settup", > "throttle": null, > "max_connections": 10, > "class_name": > "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", > "acl_authority": null > }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass
[ https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16732056#comment-16732056 ] Tim Steenbeke commented on CONNECTORS-1562: --- Yes, the seed document used as the sitemap contains approximately 23,000+ URLs ([https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en=true]). Curl fetches it completely, but it takes some time. > Documents unreachable due to hopcount are not considered unreachable on > cleanup pass > > > Key: CONNECTORS-1562 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1562 > Project: ManifoldCF > Issue Type: Bug > Components: Elastic Search connector, Web connector >Affects Versions: ManifoldCF 2.11 > Environment: ManifoldCF 2.11 > Elasticsearch 6.3.2 > Web input connector > Elastic output connector > Job crawls website input and outputs content to Elastic >Reporter: Tim Steenbeke >Assignee: Karl Wright >Priority: Critical > Labels: starter > Fix For: ManifoldCF 2.12 > > Attachments: Screenshot from 2018-12-31 11-17-29.png, > manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced > > Original Estimate: 4h > Remaining Estimate: 4h > > My documents aren't removed from the ElasticSearch index after rerunning the > changed seeds. > I update my job to change the seed map and rerun it, or use the scheduler to > keep it running even after updating it. > After the rerun, the unreachable documents don't get deleted. > It only adds documents when they can be reached. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Updated] (CONNECTORS-1568) UI error imported web connection
[ https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1568: -- Description: Using the ManifoldCF API, we export a web repository connector with basic settings. Then we import the web connector using the ManifoldCF API. The connector gets imported and can be used in a job. When trying to view or edit the connector in the UI, the following error pops up. (connected to issue: [CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567]) *HTTP ERROR 500* Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason: Server Error *Caused by:* {code:java} org.apache.jasper.JasperException: An exception occurred processing JSP page /editconnection.jsp at line 564 561: 562: if (className.length() > 0) 563: { 564: RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName); 565: } 566: %> 567: Stacktrace: at org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430) at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313) at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125) at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.eclipse.jetty.server.Server.handle(Server.java:497) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248) at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NullPointerException at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164) at org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86) at org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866) at org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83) at org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155) at org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916) at 
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:388) ... 23 more {code} *Caused by:* {code:java} java.lang.NullPointerException at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164) at org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86) at org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701) at
[jira] [Created] (CONNECTORS-1567) export of web connection bandwidth throttling
Tim Steenbeke created CONNECTORS-1567: - Summary: export of web connection bandwidth throttling Key: CONNECTORS-1567 URL: https://issues.apache.org/jira/browse/CONNECTORS-1567 Project: ManifoldCF Issue Type: Bug Components: Web connector Affects Versions: ManifoldCF 2.12, ManifoldCF 2.11 Reporter: Tim Steenbeke When exporting the web connector using the API, it doesn't export the bandwidth throttling. Importing therefore doesn't create a connector with bandwidth throttling either, which gives errors in the UI when trying to view or edit it. When using this connector in a job it will use the default bandwidth. e.g.: {code:java} { "name": "test_web", "configuration": null, "_PARAMETER_": [ { "_attribute_name": "Email address", "_value_": "tim.steenbeke@formica.digital" }, { "_attribute_name": "Robots usage", "_value_": "all" }, { "_attribute_name": "Meta robots tags usage", "_value_": "all" }, { "_attribute_name": "Proxy host", "_value_": "" }, { "_attribute_name": "Proxy port", "_value_": "" }, { "_attribute_name": "Proxy authentication domain", "_value_": "" }, { "_attribute_name": "Proxy authentication user name", "_value_": "" }, { "_attribute_name": "Proxy authentication password", "_value_": "" } ] }, "description": "Website repository standard settup", "throttle": null, "max_connections": 10, "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", "acl_authority": null }{code} -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (CONNECTORS-1568) UI error imported web connection
Tim Steenbeke created CONNECTORS-1568: - Summary: UI error imported web connection Key: CONNECTORS-1568 URL: https://issues.apache.org/jira/browse/CONNECTORS-1568 Project: ManifoldCF Issue Type: Bug Components: Web connector Affects Versions: ManifoldCF 2.12, ManifoldCF 2.11 Reporter: Tim Steenbeke Using the ManifoldCF API, we export a web repository connector with basic settings. Then we import the web connector using the ManifoldCF API. The connector gets imported and can be used in a job. When trying to view or edit the connector in the UI, the following error pops up. *HTTP ERROR 500* Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason: Server Error *Caused by:* {code:java} org.apache.jasper.JasperException: An exception occurred processing JSP page /editconnection.jsp at line 564 561: 562: if (className.length() > 0) 563: { 564: RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName); 565: } 566: %> 567: Stacktrace: at org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430) at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313) at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125) at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.eclipse.jetty.server.Server.handle(Server.java:497) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248) at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NullPointerException at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164) at org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86) at org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866) at org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83) at org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155) at org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916) at 
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:388) ... 23 more {code} *Caused by:* {code:java} java.lang.NullPointerException at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164) at org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86) at org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
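The root NullPointerException comes from Base64.decodeString being handed a null keystore string: the imported configuration apparently omits the certificate keystore data that the UI expects. The sketch below is hypothetical illustration, not the actual ManifoldCF KeystoreManager code; it shows the kind of defensive guard that avoids the NPE, assuming the constructor receives the keystore as a base64-encoded String.

```java
// Hypothetical guard: treat a missing (null) keystore string as an empty
// keystore instead of passing it straight to the base64 decoder, which
// would throw a NullPointerException as seen in the trace above.
public final class KeystoreGuardSketch {

    static byte[] decodeKeystoreData(String base64Data) {
        if (base64Data == null) {
            // Imported configuration omitted the keystore entirely.
            return new byte[0];
        }
        return java.util.Base64.getDecoder().decode(base64Data);
    }

    public static void main(String[] args) {
        System.out.println(decodeKeystoreData(null).length);       // 0, no NPE
        System.out.println(decodeKeystoreData("aGVsbG8=").length); // 5 ("hello")
    }
}
```

Whether the real fix belongs in KeystoreManager or in the API import path (writing an empty keystore value on import) is a design choice for the maintainers; the guard merely localizes the failure.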
[jira] [Updated] (CONNECTORS-1568) UI error imported web connection
[ https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1568: -- Description: Using the ManifoldCF API, we export a web repository connector with basic settings. Then we import the web connector using the ManifoldCF API. The connector gets imported and can be used in a job. When trying to view or edit the connector in the UI, the following error pops up. (connected to issue: [CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567]) *HTTP ERROR 500* Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason: Server Error *Caused by:* {code:java} org.apache.jasper.JasperException: An exception occurred processing JSP page /editconnection.jsp at line 564 561: 562: if (className.length() > 0) 563: { 564: RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName); 565: } 566: %> 567: Stacktrace: at org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430) at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313) at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769) at org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143) at org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577) at org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223) at org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125) at 
org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515) at org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185) at org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059) at org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141) at org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215) at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52) at org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97) at org.eclipse.jetty.server.Server.handle(Server.java:497) at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311) at org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248) at org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540) at org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610) at org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539) at java.lang.Thread.run(Thread.java:748) Caused by: java.lang.NullPointerException at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164) at org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86) at org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866) at org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83) at org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155) at org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916) at 
org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70) at javax.servlet.http.HttpServlet.service(HttpServlet.java:790) at org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:388) ... 23 more {code} *Caused by:* {code:java} java.lang.NullPointerException at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164) at org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86) at org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47) at org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701) at
[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling
[ https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tim Steenbeke updated CONNECTORS-1567: -- Description: When exporting the web connector using the API, it doesn't export the bandwidth throttling. Then, when importing this connector into a clean ManifoldCF, it creates the connector with the default bandwidth. When using the connector in a job it works properly. The issue here is that the connector isn't created with the correct bandwidth throttling. The connector also gives errors in the UI when trying to view or edit it. (related to issue: [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568]) e.g.: {code:java} { "name": "test_web", "configuration": null, "_PARAMETER_": [ { "_attribute_name": "Email address", "_value_": "tim.steenbeke@formica.digital" }, { "_attribute_name": "Robots usage", "_value_": "all" }, { "_attribute_name": "Meta robots tags usage", "_value_": "all" }, { "_attribute_name": "Proxy host", "_value_": "" }, { "_attribute_name": "Proxy port", "_value_": "" }, { "_attribute_name": "Proxy authentication domain", "_value_": "" }, { "_attribute_name": "Proxy authentication user name", "_value_": "" }, { "_attribute_name": "Proxy authentication password", "_value_": "" } ] }, "description": "Website repository standard settup", "throttle": null, "max_connections": 10, "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", "acl_authority": null }{code} was: When exporting the web connector using the API, it doesn't export the bandwidth throttling. Then, when importing this connector into a clean ManifoldCF, it creates the connector with the default bandwidth. When using the connector in a job it works properly. The issue here is that the connector isn't created with the correct bandwidth throttling. The connector also gives errors in the UI when trying to view or edit it. 
e.g.: {code:java} { "name": "test_web", "configuration": null, "_PARAMETER_": [ { "_attribute_name": "Email address", "_value_": "tim.steenbeke@formica.digital" }, { "_attribute_name": "Robots usage", "_value_": "all" }, { "_attribute_name": "Meta robots tags usage", "_value_": "all" }, { "_attribute_name": "Proxy host", "_value_": "" }, { "_attribute_name": "Proxy port", "_value_": "" }, { "_attribute_name": "Proxy authentication domain", "_value_": "" }, { "_attribute_name": "Proxy authentication user name", "_value_": "" }, { "_attribute_name": "Proxy authentication password", "_value_": "" } ] }, "description": "Website repository standard settup", "throttle": null, "max_connections": 10, "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector", "acl_authority": null }{code} > export of web connection bandwidth throttling > - > > Key: CONNECTORS-1567 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1567 > Project: ManifoldCF > Issue Type: Bug > Components: Web connector >Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Major > > When exporting the web connector using the API, it doesn't export the > bandwidth throttling. > Then, when importing this connector into a clean ManifoldCF, it creates the > connector with the default bandwidth. > When using the connector in a job it works properly. > The issue here is that the connector isn't created with the correct bandwidth > throttling. > The connector also gives errors in the UI when trying to view or edit it. 
> (related to issue: > [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568]) > e.g.: > {code:java} > { > "name": "test_web", > "configuration": null, > "_PARAMETER_": [ > { > "_attribute_name": "Email address", > "_value_": "tim.steenbeke@formica.digital" > }, > { > "_attribute_name": "Robots usage", > "_value_": "all" > }, > { > "_attribute_name": "Meta robots tags usage", > "_value_": "all" > }, > { > "_attribute_name": "Proxy host", > "_value_": "" > }, > { > "_attribute_name": "Proxy port", > "_value_": "" > }, > { > "_attribute_name": "Proxy authentication domain", > "_value_": "" > }, >
[jira] [Commented] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774146#comment-16774146 ] Tim Steenbeke commented on CONNECTORS-1584: --- {panel:title=Failure notice sent by mailer-dae...@apache.org} Hi. This is the qmail-send program at apache.org. I'm afraid I wasn't able to deliver your message to the following addresses. This is a permanent error; I've given up. Sorry it didn't work out. : Must be sent from an @apache.org address or a subscriber address or an address in LDAP. --- Below this line is a copy of the message. From: Steenbeke Tim To: "u...@manifoldcf.apache.org" Subject: Regex support Date: Mon, 18 Feb 2019 10:35:40 +0000 {panel}
[jira] [Commented] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774147#comment-16774147 ] Tim Steenbeke commented on CONNECTORS-1584: --- Three colleagues and I tried mailing the address, and we all got this same response. So it is the right address then; I thought we had made a mistake. > regex documentation > --- > > Key: CONNECTORS-1584 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1584 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Minor > > What types of regexes do the ManifoldCF include and exclude lists support, and what is the > general regex support? > At the moment I'm using a web repository connection and an Elastic output > connection. > I'm trying to exclude URLs that link to documents, > e.g. website.com/document/path/this.pdf and > website.com/document/path/other.PDF. > The issue I'm having is that the regexes I have found so far don't work > case-insensitively, so for every possible case I have to add a new line, > e.g.: > {code:java} > .*.pdf$ and .*.PDF$ and .*.Pdf and ... .{code} > Is it possible to add documentation on what types of regex can be used, or > maybe a tool to test your regex and see whether it is supported by ManifoldCF? > I tried mailing this question to > [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail > address returns a failure notice. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Commented] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774033#comment-16774033 ] Tim Steenbeke commented on CONNECTORS-1584: --- If the address is userS, I think the site should be updated, because the address mentioned in the FAQ is user. [https://manifoldcf.apache.org/release/release-2.12/en_US/faq.html] Also, thanks for responding. > regex documentation > --- > > Key: CONNECTORS-1584 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1584 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Minor > > What types of regexes do the ManifoldCF include and exclude lists support, and what is the > general regex support? > At the moment I'm using a web repository connection and an Elastic output > connection. > I'm trying to exclude URLs that link to documents, > e.g. website.com/document/path/this.pdf and > website.com/document/path/other.PDF. > The issue I'm having is that the regexes I have found so far don't work > case-insensitively, so for every possible case I have to add a new line, > e.g.: > {code:java} > .*.pdf$ and .*.PDF$ and .*.Pdf and ... .{code} > Is it possible to add documentation on what types of regex can be used, or > maybe a tool to test your regex and see whether it is supported by ManifoldCF? > I tried mailing this question to > [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail > address returns a failure notice. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Comment Edited] (CONNECTORS-1584) regex documentation
[ https://issues.apache.org/jira/browse/CONNECTORS-1584?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16774033#comment-16774033 ] Tim Steenbeke edited comment on CONNECTORS-1584 at 2/21/19 12:37 PM: - If the address is user*s*, I think the site should be updated, because the address mentioned in the FAQ is user. [https://manifoldcf.apache.org/release/release-2.12/en_US/faq.html] Also, thanks for responding. was (Author: steenti): If the mail is userS I think the site should be updated because the mail mentioned in FAQ is user. [https://manifoldcf.apache.org/release/release-2.12/en_US/faq.html] Also thanks for responding. > regex documentation > --- > > Key: CONNECTORS-1584 > URL: https://issues.apache.org/jira/browse/CONNECTORS-1584 > Project: ManifoldCF > Issue Type: Improvement > Components: Web connector >Affects Versions: ManifoldCF 2.12 >Reporter: Tim Steenbeke >Priority: Minor > > What types of regexes do the ManifoldCF include and exclude lists support, and what is the > general regex support? > At the moment I'm using a web repository connection and an Elastic output > connection. > I'm trying to exclude URLs that link to documents, > e.g. website.com/document/path/this.pdf and > website.com/document/path/other.PDF. > The issue I'm having is that the regexes I have found so far don't work > case-insensitively, so for every possible case I have to add a new line, > e.g.: > {code:java} > .*.pdf$ and .*.PDF$ and .*.Pdf and ... .{code} > Is it possible to add documentation on what types of regex can be used, or > maybe a tool to test your regex and see whether it is supported by ManifoldCF? > I tried mailing this question to > [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail > address returns a failure notice. -- This message was sent by Atlassian JIRA (v7.6.3#76005)
[jira] [Created] (CONNECTORS-1584) regex documentation
Tim Steenbeke created CONNECTORS-1584:
-
Summary: regex documentation
Key: CONNECTORS-1584
URL: https://issues.apache.org/jira/browse/CONNECTORS-1584
Project: ManifoldCF
Issue Type: Improvement
Components: Web connector
Affects Versions: ManifoldCF 2.12
Reporter: Tim Steenbeke

What types of regexes do ManifoldCF's include and exclude rules support, and what is the general regex support?
At the moment I'm using a web repository connection and an Elastic output connection. I'm trying to exclude URLs that link to documents, e.g. website.com/document/path/this.pdf and website.com/document/path/other.PDF.
The issue I'm having is that the regexes I have found so far don't match case-insensitively, so for every possible casing I have to add a new line, e.g.: .*.pdf$ and .*.PDF$ and .*.Pdf and ...
Would it be possible to add documentation on what types of regex can be used, or maybe a tool to test your regex and see whether it is supported by ManifoldCF?
I tried mailing this question to [u...@manifoldcf.apache.org|mailto:u...@manifoldcf.apache.org] but this mail address returns a failure notice.
[jira] [Commented] (CONNECTORS-1575) inconsistant use of value-labels
[ https://issues.apache.org/jira/browse/CONNECTORS-1575?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16753912#comment-16753912 ] Tim Steenbeke commented on CONNECTORS-1575:
---
OK, thank you for your fast response.

> inconsistant use of value-labels
> -
>
> Key: CONNECTORS-1575
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1575
> Project: ManifoldCF
> Issue Type: Bug
> Components: API
> Affects Versions: ManifoldCF 2.12
> Reporter: Tim Steenbeke
> Priority: Minor
> Attachments: image-2019-01-28-11-57-46-738.png
>
> When retrieving a job using the API, there seem to be inconsistencies in the returned JSON of the job.
> For the schedule values 'hourofday', 'minutesofhour', etc., the label of the value is 'value', while for all other value labels it is '_value_'.
>
> !image-2019-01-28-11-57-46-738.png!
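The reported inconsistency can be illustrated with a hypothetical fragment of the job JSON. The structure below is reconstructed from the description alone (the attached screenshot is not available here), and `other_node` is a made-up placeholder for any non-schedule node: schedule fields such as `hourofday` reportedly label their value `"value"`, while everything else uses `"_value_"`.

```json
{
  "schedule": {
    "hourofday": { "value": "4" },
    "minutesofhour": { "value": "0" }
  },
  "other_node": { "_value_": "example" }
}
```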
[jira] [Created] (CONNECTORS-1575) inconsistant use of value-labels
Tim Steenbeke created CONNECTORS-1575:
-
Summary: inconsistant use of value-labels
Key: CONNECTORS-1575
URL: https://issues.apache.org/jira/browse/CONNECTORS-1575
Project: ManifoldCF
Issue Type: Bug
Components: API
Affects Versions: ManifoldCF 2.12
Reporter: Tim Steenbeke
Attachments: image-2019-01-28-11-57-46-738.png

When retrieving a job using the API, there seem to be inconsistencies in the returned JSON of the job. For the schedule values 'hourofday', 'minutesofhour', etc., the label of the value is 'value', while for all other value labels it is '_value_'.

!image-2019-01-28-11-57-46-738.png!