[jira] [Comment Edited] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Tim Steenbeke (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737909#comment-16737909
 ] 

Tim Steenbeke edited comment on CONNECTORS-1567 at 1/9/19 7:05 AM:
---

But are bandwidth throttles and throttling the same thing in ManifoldCF? The 
bandwidth throttle is a different object in the response JSON, or am I mistaken?

Also, I don't understand what you mean by old-form; the example is the 
response from a 'repositoryconnections' GET call on ManifoldCF 2.11.
The documentation for both 2.11 and 2.12 also only mentions throttling, not 
bandwidth: [JSON repository connector 
2.12|https://manifoldcf.apache.org/release/release-2.12/en_US/programmatic-operation.html#Repository+connection+objects]

*Response for curl -X GET 
[http://localhost:8345/mcf-api-service/json/repositoryconnections] -H 
'content-type: application/json':*

 
{code:java}
{
    "throttle": {
        "match_description": "testable regex",
        "rate": "1.666E-4",
        "match": "test reg"
    },
    "max_connections": "20",
    "configuration": {
        "trust": {
            "_attribute_trusteverything": "true",
            "_value_": "",
            "_attribute_urlregexp": ".*"
        },
        "bindesc": {
            "maxkbpersecond": {
                "_value_": "",
                "_attribute_value": "64"
            },
            "_attribute_caseinsensitive": "false",
            "maxconnections": {
                "_value_": "",
                "_attribute_value": "2"
            },
            "maxfetchesperminute": {
                "_value_": "",
                "_attribute_value": "12"
            },
            "_attribute_binregexp": "test regex",
            "_value_": ""
        },
        "_PARAMETER_": [
            {
                "_value_": "tim.steenbeke@formica.digital",
                "_attribute_name": "Email address"
            },
            {
                "_value_": "all",
                "_attribute_name": "Robots usage"
            },
            {
                "_value_": "all",
                "_attribute_name": "Meta robots tags usage"
            },
            {
                "_value_": "proxyhost",
                "_attribute_name": "Proxy host"
            },
            {
                "_value_": "port",
                "_attribute_name": "Proxy port"
            },
            {
                "_value_": "domain",
                "_attribute_name": "Proxy authentication domain"
            },
            {
                "_value_": "admin",
                "_attribute_name": "Proxy authentication user name"
            },
            {
                "_value_": "5qNuZnChiobQlUozw2quhCGsgYVazxVVbAUjc3Hk5Mc=",
                "_attribute_name": "Proxy authentication password"
            }
        ],
        "accesscredential": [
            {
                "_value_": "",
                "_attribute_type": "basic",
                "_attribute_username": "admin",
                "_attribute_urlregexp": "some acces creds",
                "_attribute_password": "RkBMPT2W2ZC7XebgFp5PSuYSdCDnik4GKd130+PtXRk=",
                "_attribute_domain": "localhost:8080"
            },
            {
                "_value_": "",
                "_attribute_type": "session",
                "_attribute_urlregexp": "url regex"
            }
        ]
    },
    "name": "abc_test",
    "description": "test abc",
    "isnew": "false",
    "class_name": "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector"
}
{code}
 

*For the following bandwidth setup:*

!bandwidth_test_abc.png!

*So then I would do the following to set bandwidth and throttling to null:*
{code:java}
{
    "throttle": null,<<<--- null for throttling
    "max_connections": "20",
    "configuration": {
    "trust": {
    "_attribute_trusteverything": "true",
    "_value_": "",
    "_attribute_urlregexp": ".*"
    },
    "bindesc": null,<<<--- null for bandwidth
  

[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Tim Steenbeke (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Steenbeke updated CONNECTORS-1567:
--
Attachment: bandwidth_test_abc.png

> export of web connection bandwidth throttling
> -
>
> Key: CONNECTORS-1567
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1567
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
> Attachments: bandwidth.png, bandwidth_test_abc.png
>
>
> When exporting the web connector using the API, the bandwidth throttling 
> is not exported.
> Then, when this connector is imported into a clean ManifoldCF, it is created 
> with basic bandwidth settings.
> When the connector is used in a job, it works properly.
> The issue is that the connector isn't created with the correct bandwidth 
> throttling.
> The connector also causes problems in the UI when trying to view or edit it 
> (related to issue: 
> [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568]).
> e.g.:
> {code:java}
> {
>   "name": "test_web",
>   "configuration": null,
> "_PARAMETER_": [
>   {
> "_attribute_name": "Email address",
> "_value_": "tim.steenbeke@formica.digital"
>   },
>   {
> "_attribute_name": "Robots usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Meta robots tags usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Proxy host",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy port",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication domain",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication user name",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication password",
> "_value_": ""
>   }
> ]
>   },
>   "description": "Website repository standard settup",
>   "throttle": null,
>   "max_connections": 10,
>   "class_name": 
> "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector",
>   "acl_authority": null
> }{code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Tim Steenbeke (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Steenbeke updated CONNECTORS-1567:
--
Attachment: bandwidth.png



[jira] [Resolved] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright resolved CONNECTORS-1567.
-
Resolution: Cannot Reproduce

Was already fixed; the JSON reported was old-form and thus not necessarily 
correct.




[jira] [Commented] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737581#comment-16737581
 ] 

Karl Wright commented on CONNECTORS-1567:
-

Reading and writing sides match.

In XML, the format would look like this:

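{code}
<repositoryconnection>
  ...
  <throttle>
    <match>match_value</match>
    <match_description>description</match_description>
    <rate>rate_value</rate>
  </throttle>
</repositoryconnection>
{code}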

This gets translated to JSON, which should merge the "throttle" fields into one 
throttle array, like this:

{code}
throttle: [ {... first throttle ... }, {... second throttle ... } ...]
{code}

That's obviously not happening and I need to figure out why.
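
The merge described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not ManifoldCF's actual JSON writer: `ThrottleJsonSketch`, `Node`, `toJson`, and `throttle` are hypothetical names, and the rule that a single same-named child is written as a plain object rather than a one-element array is an assumption made for the sketch.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

final class ThrottleJsonSketch {
  // Hypothetical stand-in for a configuration node: a name plus either a text value or children.
  static final class Node {
    final String name;
    final String value;
    final List<Node> children = new ArrayList<>();

    Node(String name, String value) {
      this.name = name;
      this.value = value;
    }

    Node add(Node child) {
      children.add(child);
      return this;
    }
  }

  // Renders a node's content as JSON. Children that share a name are grouped,
  // and a group with more than one member becomes a single JSON array.
  // String escaping is omitted to keep the sketch short.
  static String toJson(Node node) {
    if (node.children.isEmpty()) {
      return "\"" + node.value + "\"";
    }
    Map<String, List<Node>> byName = new LinkedHashMap<>();
    for (Node child : node.children) {
      byName.computeIfAbsent(child.name, k -> new ArrayList<>()).add(child);
    }
    List<String> fields = new ArrayList<>();
    for (Map.Entry<String, List<Node>> entry : byName.entrySet()) {
      List<Node> group = entry.getValue();
      String rendered = group.size() == 1
          ? toJson(group.get(0))
          : group.stream().map(ThrottleJsonSketch::toJson).collect(Collectors.joining(",", "[", "]"));
      fields.add("\"" + entry.getKey() + "\":" + rendered);
    }
    return "{" + String.join(",", fields) + "}";
  }

  // Builds one throttle node with a match and a rate child.
  static Node throttle(String match, String rate) {
    return new Node("throttle", null)
        .add(new Node("match", match))
        .add(new Node("rate", rate));
  }

  public static void main(String[] args) {
    Node connection = new Node("repositoryconnection", null)
        .add(new Node("name", "abc_test"))
        .add(throttle("a", "1.0"))
        .add(throttle("b", "2.0"));
    System.out.println(toJson(connection));
  }
}
```

With two throttle children, the sketch emits one "throttle" key whose value is an array of two objects, which is the shape the comment expects from the real translation.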




[jira] [Commented] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737559#comment-16737559
 ] 

Karl Wright commented on CONNECTORS-1567:
-

The output code is common to all connections and looks correct:

{code}
String[] throttles = connection.getThrottles();
j = 0;
while (j < throttles.length)
{
  String match = throttles[j++];
  String description = connection.getThrottleDescription(match);
  float rate = connection.getThrottleValue(match);
  child = new ConfigurationNode(CONNECTIONNODE_THROTTLE);
  ConfigurationNode throttleChildNode;

  throttleChildNode = new ConfigurationNode(CONNECTIONNODE_MATCH);
  throttleChildNode.setValue(match);
  child.addChild(child.getChildCount(),throttleChildNode);

  if (description != null)
  {
    throttleChildNode = new ConfigurationNode(CONNECTIONNODE_MATCHDESCRIPTION);
    throttleChildNode.setValue(description);
    child.addChild(child.getChildCount(),throttleChildNode);
  }

  throttleChildNode = new ConfigurationNode(CONNECTIONNODE_RATE);
  throttleChildNode.setValue(new Float(rate).toString());
  child.addChild(child.getChildCount(),throttleChildNode);

  connectionNode.addChild(connectionNode.getChildCount(),child);
}
{code}

Note that the throttles are an array, so if there are no throttles, one should 
expect null or an empty array to be output.  Checking the reading side now.



[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-08 Thread Michael Osipov (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737272#comment-16737272
 ] 

Michael Osipov commented on CONNECTORS-1564:


[~kwri...@metacarta.com], sorry, my bad. I will have a look at it tomorrow.

> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more effective instead of the following:
>  * Send a HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with a "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.
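
For illustration only, the header that would be sent in the first request can be built as in the sketch below. This is not the attached patch and does not use the Solr connector's HTTP client; `PreemptiveBasicAuthSketch` and `basicAuthHeader` are hypothetical names, and the credentials in `main` are made up. It only shows how a Basic `Authorization` value is formed (Base64 of `user:password`), using the standard `java.util.Base64` class.

```java
import java.nio.charset.StandardCharsets;
import java.util.Base64;

final class PreemptiveBasicAuthSketch {
  // Builds the value of the Authorization header for HTTP Basic authentication:
  // the literal "Basic " followed by Base64("user:password").
  static String basicAuthHeader(String user, String password) {
    String credentials = user + ":" + password;
    String encoded = Base64.getEncoder()
        .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    return "Basic " + encoded;
  }

  public static void main(String[] args) {
    // Hypothetical credentials; a preemptive client would attach this header
    // to the first POST instead of waiting for a 401 response.
    System.out.println("Authorization: " + basicAuthHeader("solr", "secret"));
  }
}
```

Sending the header up front saves one round trip per request, at the cost of transmitting credentials even when the server would not have asked for them.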





[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737201#comment-16737201
 ] 

Karl Wright commented on CONNECTORS-1562:
-

You are correct; the hopcount of zero will capture the whitelist, and a 
hopcount of 1 will capture everything the whitelist refers to.  My apologies.
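The hop counting described here can be sketched as a breadth-first walk from the seed. This is only a model of the idea, not MCF's actual hopcount code; the class name, link graph, and document names are made up for illustration.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical model of hop counting; not taken from the ManifoldCF sources.
final class HopcountSketch {

    /** Returns every document within maxHops link hops of the seeds, mapped to its hop count. */
    static Map<String, Integer> reachable(Map<String, List<String>> links,
                                          List<String> seeds, int maxHops) {
        Map<String, Integer> hops = new LinkedHashMap<>();
        Deque<String> queue = new ArrayDeque<>();
        for (String seed : seeds) {
            if (hops.putIfAbsent(seed, 0) == null) {
                queue.add(seed); // seeds sit at hop count zero
            }
        }
        while (!queue.isEmpty()) {
            String doc = queue.remove();
            int next = hops.get(doc) + 1;
            if (next > maxHops) {
                continue; // links out of this document would exceed the limit
            }
            for (String target : links.getOrDefault(doc, List.of())) {
                if (hops.putIfAbsent(target, next) == null) {
                    queue.add(target);
                }
            }
        }
        return hops;
    }

    public static void main(String[] args) {
        // The sitemap (the whitelist) lists /a and /b; /a additionally links to /c.
        Map<String, List<String>> links = Map.of(
                "sitemap.xml", List.of("/a", "/b"),
                "/a", List.of("/c"));
        System.out.println(reachable(links, List.of("sitemap.xml"), 1));
        System.out.println(reachable(links, List.of("sitemap.xml"), 2));
    }
}
```

With a limit of 1 the walk stops at the pages the sitemap lists; a limit of 2 also picks up /c, which the sitemap never mentions.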


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web inputconnector
> elastic outputconnector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.





[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-08 Thread Tim Steenbeke (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737170#comment-16737170
 ] 

Tim Steenbeke commented on CONNECTORS-1562:
---

But if the hop-count is 2, then it will go too far in the sitemap and will add 
documents that aren't supposed to be indexed.

Because the sitemap is the full whitelist, we set the hop-count to 1 so it 
doesn't hop from a whitelisted URL to a possibly non-whitelisted URL.

> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web inputconnector
> elastic outputconnector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.





[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1567:

Fix Version/s: ManifoldCF 2.13

> export of web connection bandwidth throttling
> -
>
> Key: CONNECTORS-1567
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1567
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> When exporting the web connector using the API, it doesn't export the 
> bandwidth throttling.
>  Then, when importing this connector into a clean ManifoldCF, it creates the 
> connector with basic bandwidth.
>  When using the connector in a job it works properly.
> The issue here is that the connector isn't created with the correct bandwidth 
> throttling.
>  And the connector gives issues in the UI when trying to view or edit.
> (related to issue: 
> [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568])
> e.g.:
> {code:java}
> {
>   "name": "test_web",
>   "configuration": null,
> "_PARAMETER_": [
>   {
> "_attribute_name": "Email address",
> "_value_": "tim.steenbeke@formica.digital"
>   },
>   {
> "_attribute_name": "Robots usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Meta robots tags usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Proxy host",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy port",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication domain",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication user name",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication password",
> "_value_": ""
>   }
> ]
>   },
>   "description": "Website repository standard settup",
>   "throttle": null,
>   "max_connections": 10,
>   "class_name": 
> "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector",
>   "acl_authority": null
> }{code}
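A minimal sketch of the export/import round trip follows, using the JDK's {{java.net.http}} request builder. The base URL and the connection name {{test_web}} come from this thread. The per-connection resource path, the use of PUT for the import, and the crude string check for a {{bindesc}} node (the node that carries {{maxkbpersecond}} in a full export) are assumptions to verify against the programmatic-operation documentation. All class and method names are hypothetical, and the requests are only built, never sent.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper; resource paths and verbs are assumptions, see the text above.
final class ConnectionExportImportSketch {

    static final String API_BASE = "http://localhost:8345/mcf-api-service/json";

    /** Request that would fetch one repository connection (the "export"). */
    static HttpRequest exportRequest(String connectionName) {
        // Names containing spaces or other special characters would need URL-encoding first.
        return HttpRequest.newBuilder(URI.create(API_BASE + "/repositoryconnections/" + connectionName))
                .header("Content-Type", "application/json")
                .GET()
                .build();
    }

    /** Request that would write the exported JSON back unchanged (the "import"). */
    static HttpRequest importRequest(String connectionName, String exportedJson) {
        return HttpRequest.newBuilder(URI.create(API_BASE + "/repositoryconnections/" + connectionName))
                .header("Content-Type", "application/json")
                .PUT(HttpRequest.BodyPublishers.ofString(exportedJson))
                .build();
    }

    /** Crude sanity check: does the exported text mention a bandwidth bin at all? */
    static boolean mentionsBandwidthBins(String exportedJson) {
        return exportedJson.contains("\"bindesc\"");
    }

    public static void main(String[] args) {
        System.out.println(exportRequest("test_web"));
        System.out.println(importRequest("test_web", "{}"));
        System.out.println(mentionsBandwidthBins("{\"throttle\": null}"));
    }
}
```

Running the check on an export before importing it elsewhere would at least flag the case reported here, where the throttling settings are missing from the exported JSON.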





[jira] [Commented] (CONNECTORS-1568) UI error imported web connection

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737081#comment-16737081
 ] 

Karl Wright commented on CONNECTORS-1568:
-

The UI error is to be expected when the configuration data is corrupted, 
although I've already committed a fix for this particular brand of corruption.  
The bug is that a web configuration that is exported and then reimported gets 
corrupted.


> UI error imported web connection
> 
>
> Key: CONNECTORS-1568
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1568
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> Using the ManifoldCF API, we export a web repository connector with basic 
> settings.
>  Then we import the web connector using the ManifoldCF API.
>  The connector gets imported and can be used in a job.
>  When trying to view or edit the connector in the UI, the following error 
> pops up.
> (connected to issue: 
> [CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567])
> *HTTP ERROR 500*
>  Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason:
>      Server Error
> *Caused by:*
> {code:java}
> org.apache.jasper.JasperException: An exception occurred processing JSP page 
> /editconnection.jsp at line 564
> 561:
> 562: if (className.length() > 0)
> 563: {
> 564:   
> RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new
>  
> org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName);
> 565: }
> 566: %>
> 567:
> Stacktrace:
>     at 
> org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521)
>     at 
> org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430)
>     at 
> org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
>     at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
>     at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>     at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>     at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
>     at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>     at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>     at 
> org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
>     at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>     at org.eclipse.jetty.server.Server.handle(Server.java:497)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>     at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
>     at 
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>     at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
>     at 
> org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86)
>     at 
> org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
>     at 
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
>     at 
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866)
>     at 
> org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83)
>     at 
> 

[jira] [Updated] (CONNECTORS-1568) UI error imported web connection

2019-01-08 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright updated CONNECTORS-1568:

Fix Version/s: ManifoldCF 2.13

> UI error imported web connection
> 
>
> Key: CONNECTORS-1568
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1568
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
> Fix For: ManifoldCF 2.13
>
>
> Using the ManifoldCF API, we export a web repository connector with basic 
> settings.
>  Then we import the web connector using the ManifoldCF API.
>  The connector gets imported and can be used in a job.
>  When trying to view or edit the connector in the UI, the following error 
> pops up.
> (connected to issue: 
> [CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567])
> *HTTP ERROR 500*
>  Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason:
>      Server Error
> *Caused by:*
> {code:java}
> org.apache.jasper.JasperException: An exception occurred processing JSP page 
> /editconnection.jsp at line 564
> 561:
> 562: if (className.length() > 0)
> 563: {
> 564:   
> RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new
>  
> org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName);
> 565: }
> 566: %>
> 567:
> Stacktrace:
>     at 
> org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521)
>     at 
> org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430)
>     at 
> org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
>     at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
>     at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>     at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>     at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
>     at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>     at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>     at 
> org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
>     at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>     at org.eclipse.jetty.server.Server.handle(Server.java:497)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>     at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
>     at 
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>     at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
>     at 
> org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86)
>     at 
> org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
>     at 
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
>     at 
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866)
>     at 
> org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83)
>     at 
> org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155)
>     at 
> org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916)
>     at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     at 
> 

[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737070#comment-16737070
 ] 

Karl Wright commented on CONNECTORS-1564:
-

[~michael-o], Erlend provided the code above and it does supposedly enable the 
expect header.  Obviously that code is not working for some reason.  Can you 
review the code and tell us what we are doing wrong?


> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more effective instead of the following:
>  * Send a HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with a "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.





[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-08 Thread Michael Osipov (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16737057#comment-16737057
 ] 

Michael Osipov commented on CONNECTORS-1564:


[~erlendfg], the log file looks fine, because there is no {{Expect}} header and 
the client cannot replay the POST body. Can you verify that this option has 
been enabled? It pretty much seems like it is not. I'd like to verify that 
first. Can you retry with that header?
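For reference, an {{Expect: 100-continue}} request lets the server answer (for example with a 401) before the client transmits the body, so the body does not have to be replayed. The sketch below shows how such a request is flagged with the JDK's own {{java.net.http}} API. That is a different client from the HttpClient MCF uses, and the class name, URL, and payload are made up for illustration.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical example using the JDK client, not the client used by MCF.
final class ExpectContinueSketch {

    /** Builds a POST that asks the server for "100 Continue" before the body is sent. */
    static HttpRequest updateRequest(String updateUrl, String body) {
        return HttpRequest.newBuilder(URI.create(updateUrl))
                .expectContinue(true) // the JDK client adds the Expect header itself
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(body))
                .build();
    }

    public static void main(String[] args) {
        // Made-up Solr URL and document, for illustration only.
        HttpRequest request = updateRequest("http://localhost:8983/solr/core/update", "[{\"id\":\"1\"}]");
        System.out.println(request.method() + " expectContinue=" + request.expectContinue());
    }
}
```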

> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more effective instead of the following:
>  * Send a HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with a "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.





[jira] [Assigned] (CONNECTORS-1568) UI error imported web connection

2019-01-08 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1568:
---

Assignee: Karl Wright

> UI error imported web connection
> 
>
> Key: CONNECTORS-1568
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1568
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
>
> Using the ManifoldCF API, we export a web repository connector with basic 
> settings.
>  Then we import the web connector using the ManifoldCF API.
>  The connector gets imported and can be used in a job.
>  When trying to view or edit the connector in the UI, the following error 
> pops up.
> (connected to issue: 
> [CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567])
> *HTTP ERROR 500*
>  Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason:
>      Server Error
> *Caused by:*
> {code:java}
> org.apache.jasper.JasperException: An exception occurred processing JSP page 
> /editconnection.jsp at line 564
> 561:
> 562: if (className.length() > 0)
> 563: {
> 564:   
> RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new
>  
> org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName);
> 565: }
> 566: %>
> 567:
> Stacktrace:
>     at 
> org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521)
>     at 
> org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430)
>     at 
> org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
>     at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
>     at 
> org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
>     at 
> org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
>     at 
> org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
>     at 
> org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
>     at 
> org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
>     at 
> org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
>     at 
> org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
>     at 
> org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
>     at 
> org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
>     at 
> org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
>     at 
> org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
>     at org.eclipse.jetty.server.Server.handle(Server.java:497)
>     at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
>     at 
> org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
>     at 
> org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
>     at 
> org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
>     at java.lang.Thread.run(Thread.java:748)
> Caused by: java.lang.NullPointerException
>     at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
>     at 
> org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86)
>     at 
> org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
>     at 
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
>     at 
> org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866)
>     at 
> org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83)
>     at 
> org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155)
>     at 
> org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916)
>     at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
>     at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
>     at 
> 

[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736881#comment-16736881
 ] 

Karl Wright edited comment on CONNECTORS-1562 at 1/8/19 11:39 AM:
--

Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.

If you remove a document from the site map, and you want MCF to pick up that 
the document is now unreachable and should be removed, you can do this by 
setting a hopcount maximum that is large but also selecting "delete unreachable 
documents".  The only thing I'd caution you about if you use this approach is 
that links BETWEEN documents will also be traversed, so if you want the sitemap 
to be a whitelist then you want hopcount max = 2.
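The cleanup step implied by "delete unreachable documents" can be pictured as a set difference between what is currently indexed and what the latest crawl could still reach. The sketch below is only a model of that idea, not MCF's cleanup code; the class name and document names are hypothetical.

```java
import java.util.Set;
import java.util.TreeSet;

// Hypothetical model of the cleanup pass; not taken from the ManifoldCF sources.
final class CleanupSketch {

    /** Documents that are in the index but were not reached by the latest crawl. */
    static Set<String> toDelete(Set<String> indexed, Set<String> reachableNow) {
        Set<String> stale = new TreeSet<>(indexed);
        stale.removeAll(reachableNow);
        return stale;
    }

    public static void main(String[] args) {
        Set<String> indexed = Set.of("/a", "/b", "/c");
        Set<String> reachableNow = Set.of("/a", "/c"); // "/b" was dropped from the site map
        System.out.println(toDelete(indexed, reachableNow));
    }
}
```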



was (Author: kwri...@metacarta.com):
Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.

If you remove a document from the site map, and you want MCF to pick up that 
the document is now unreachable and should be removed, you can do this by 
setting a hopcount maximum that is large but also selecting "delete unreachable 
documents".  The only thing I'd caution you about if you use this approach is 
that links BETWEEN documents will also be traversed, so if you want the sitemap 
to be a whitelist then you want hopcount max = 1.


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web inputconnector
> elastic outputconnector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the ElasticSearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.





[jira] [Assigned] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Karl Wright (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Karl Wright reassigned CONNECTORS-1567:
---

Assignee: Karl Wright

> export of web connection bandwidth throttling
> -
>
> Key: CONNECTORS-1567
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1567
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Major
>
> When exporting the web connector using the API, it doesn't export the 
> bandwidth throttling.
>  Then, when importing this connector into a clean ManifoldCF, it creates the 
> connector with basic bandwidth.
>  When using the connector in a job it works properly.
> The issue here is that the connector isn't created with the correct bandwidth 
> throttling.
>  And the connector gives issues in the UI when trying to view or edit.
> (related to issue: 
> [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568])
> e.g.:
> {code:java}
> {
>   "name": "test_web",
>   "configuration": null,
> "_PARAMETER_": [
>   {
> "_attribute_name": "Email address",
> "_value_": "tim.steenbeke@formica.digital"
>   },
>   {
> "_attribute_name": "Robots usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Meta robots tags usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Proxy host",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy port",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication domain",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication user name",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication password",
> "_value_": ""
>   }
> ]
>   },
>   "description": "Website repository standard settup",
>   "throttle": null,
>   "max_connections": 10,
>   "class_name": 
> "org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector",
>   "acl_authority": null
> }{code}





[jira] [Commented] (CONNECTORS-1564) Support preemptive authentication to Solr connector

2019-01-08 Thread JIRA


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1564?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736933#comment-16736933
 ] 

Erlend Garåsen commented on CONNECTORS-1564:


[~michael-o], thanks for looking at this! This is what happens if I use the 
current MCF version without preemptive authentication. I cannot see any Expect 
header which may be one of the reasons why it fails. As you can see, the 
exception occurs at the end when MCF/HttpClient probably tries to do a second 
post after a 401 response. By using preemptive authentication, I can also see 
the base64 encoded username and password sent to the Solr server, but no 
authentication attempt is ever sent by the current MCF version.

[output_solr_client.txt|http://folk.uio.no/erlendfg/manifoldcf/output_solr_client.txt]

> Support preemptive authentication to Solr connector
> ---
>
> Key: CONNECTORS-1564
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1564
> Project: ManifoldCF
>  Issue Type: Improvement
>  Components: Lucene/SOLR connector
>Reporter: Erlend Garåsen
>Assignee: Karl Wright
>Priority: Major
> Attachments: CONNECTORS-1564.patch
>
>
> We should post preemptively in case the Solr server requires basic 
> authentication. This will make the communication between ManifoldCF and Solr 
> much more effective instead of the following:
>  * Send a HTTP POST request to Solr
>  * Solr sends a 401 response
>  * Send the same request, but with a "{{Authorization: Basic}}" header
> With preemptive authentication, we can send the header in the first request.





[jira] [Updated] (CONNECTORS-1568) UI error imported web connection

2019-01-08 Thread Tim Steenbeke (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Steenbeke updated CONNECTORS-1568:
--
Description: 
Using the ManifoldCF API, we export a web repository connector with basic 
settings.
 Then we import the web connector using the ManifoldCF API.
 The connector gets imported and can be used in a job.
 When trying to view or edit the connector in the UI, the following error 
pops up.

(connected to issue: 
[CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567])

*HTTP ERROR 500*
 Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason:
     Server Error

*Caused by:*
{code:java}
org.apache.jasper.JasperException: An exception occurred processing JSP page 
/editconnection.jsp at line 564

561:
562: if (className.length() > 0)
563: {
564:   
RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new 
org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName);
565: }
566: %>
567:


Stacktrace:
    at 
org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521)
    at 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
    at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
    at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
    at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:497)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
    at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
    at 
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
    at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
    at 
org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86)
    at 
org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
    at 
org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
    at 
org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866)
    at 
org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83)
    at 
org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155)
    at 
org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:388)
    ... 23 more
   {code}
*Caused by:*

 
{code:java}
java.lang.NullPointerException
    at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
    at 
org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86)
    at 
org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
    at 
org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
    at 

[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Tim Steenbeke (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Steenbeke updated CONNECTORS-1567:
--
Description: 
When exporting the web connector using the API, the bandwidth throttling is 
not exported.
 Then, when importing this connector into a clean ManifoldCF, the connector is 
created with basic bandwidth settings.
 When using the connector in a job, it works properly.

The issue here is that the connector isn't created with the correct bandwidth 
throttling.
 The connector also causes errors in the UI when trying to view or edit it.

(related to issue: 
[CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568])

e.g.:
{code:java}
{
  "name": "test_web",
  "configuration": null,
"_PARAMETER_": [
  {
"_attribute_name": "Email address",
"_value_": "tim.steenbeke@formica.digital"
  },
  {
"_attribute_name": "Robots usage",
"_value_": "all"
  },
  {
"_attribute_name": "Meta robots tags usage",
"_value_": "all"
  },
  {
"_attribute_name": "Proxy host",
"_value_": ""
  },
  {
"_attribute_name": "Proxy port",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication domain",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication user name",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication password",
"_value_": ""
  }
]
  },
  "description": "Website repository standard settup",
  "throttle": null,
  "max_connections": 10,
  "class_name": 
"org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector",
  "acl_authority": null
}{code}

  was:
When exporting the web connector using the API, the bandwidth throttling is 
not exported.
Then, when importing this connector into a clean ManifoldCF, the connector is 
created with basic bandwidth settings.
When using the connector in a job, it works properly.

The issue here is that the connector isn't created with the correct bandwidth 
throttling.
 The connector also causes errors in the UI when trying to view or edit it.

e.g.:
{code:java}
{
  "name": "test_web",
  "configuration": null,
"_PARAMETER_": [
  {
"_attribute_name": "Email address",
"_value_": "tim.steenbeke@formica.digital"
  },
  {
"_attribute_name": "Robots usage",
"_value_": "all"
  },
  {
"_attribute_name": "Meta robots tags usage",
"_value_": "all"
  },
  {
"_attribute_name": "Proxy host",
"_value_": ""
  },
  {
"_attribute_name": "Proxy port",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication domain",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication user name",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication password",
"_value_": ""
  }
]
  },
  "description": "Website repository standard settup",
  "throttle": null,
  "max_connections": 10,
  "class_name": 
"org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector",
  "acl_authority": null
}{code}


> export of web connection bandwidth throttling
> -
>
> Key: CONNECTORS-1567
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1567
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Priority: Major
>
> When exporting the web connector using the API, the bandwidth throttling is 
> not exported.
>  Then, when importing this connector into a clean ManifoldCF, the connector 
> is created with basic bandwidth settings.
>  When using the connector in a job, it works properly.
> The issue here is that the connector isn't created with the correct bandwidth 
> throttling.
>  The connector also causes errors in the UI when trying to view or edit it.
> (related to issue: 
> [CONNECTORS-1568|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1568])
> e.g.:
> {code:java}
> {
>   "name": "test_web",
>   "configuration": null,
> "_PARAMETER_": [
>   {
> "_attribute_name": "Email address",
> "_value_": "tim.steenbeke@formica.digital"
>   },
>   {
> "_attribute_name": "Robots usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Meta robots tags usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Proxy host",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy port",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication domain",
> "_value_": ""
>   },
>   

[jira] [Updated] (CONNECTORS-1568) UI error imported web connection

2019-01-08 Thread Tim Steenbeke (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1568?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Steenbeke updated CONNECTORS-1568:
--
Description: 
Using the ManifoldCF API, we export a web repository connector with basic 
settings.
 Then we import the web connector using the ManifoldCF API.
 The connector gets imported and can be used in a job.
 When trying to view or edit the connector in the UI, the following error pops up.

(connected to issue: 
[CONNECTORS-1567|https://issues.apache.org/jira/projects/CONNECTORS/issues/CONNECTORS-1567])

*HTTP ERROR 500*
 Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason:
     Server Error

*Caused by:*
{code:java}
org.apache.jasper.JasperException: An exception occurred processing JSP page 
/editconnection.jsp at line 564

561:
562: if (className.length() > 0)
563: {
564:   
RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new 
org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName);
565: }
566: %>
567:


Stacktrace:
    at 
org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521)
    at 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
    at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
    at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
    at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:497)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
    at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
    at 
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
    at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
    at 
org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86)
    at 
org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
    at 
org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
    at 
org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866)
    at 
org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83)
    at 
org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155)
    at 
org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:388)
    ... 23 more
   {code}
*Caused by:*

 
{code:java}
java.lang.NullPointerException
    at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
    at 
org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86)
    at 
org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
    at 
org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
    at 

[jira] [Created] (CONNECTORS-1568) UI error imported web connection

2019-01-08 Thread Tim Steenbeke (JIRA)
Tim Steenbeke created CONNECTORS-1568:
-

 Summary: UI error imported web connection
 Key: CONNECTORS-1568
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1568
 Project: ManifoldCF
  Issue Type: Bug
  Components: Web connector
Affects Versions: ManifoldCF 2.12, ManifoldCF 2.11
Reporter: Tim Steenbeke


Using the ManifoldCF API, we export a web repository connector with basic 
settings.
Then we import the web connector using the ManifoldCF API.
The connector gets imported and can be used in a job.
When trying to view or edit the connector in the UI, the following error pops up.

*HTTP ERROR 500*
 Problem accessing /mcf-crawler-ui/editconnection.jsp. Reason:
     Server Error

*Caused by:*
{code:java}
org.apache.jasper.JasperException: An exception occurred processing JSP page 
/editconnection.jsp at line 564

561:
562: if (className.length() > 0)
563: {
564:   
RepositoryConnectorFactory.outputConfigurationBody(threadContext,className,new 
org.apache.manifoldcf.ui.jsp.JspWrapper(out,adminprofile),pageContext.getRequest().getLocale(),parameters,tabName);
565: }
566: %>
567:


Stacktrace:
    at 
org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:521)
    at 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:430)
    at org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:313)
    at org.apache.jasper.servlet.JspServlet.service(JspServlet.java:260)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at org.eclipse.jetty.servlet.ServletHolder.handle(ServletHolder.java:769)
    at 
org.eclipse.jetty.servlet.ServletHandler.doHandle(ServletHandler.java:585)
    at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:143)
    at 
org.eclipse.jetty.security.SecurityHandler.handle(SecurityHandler.java:577)
    at 
org.eclipse.jetty.server.session.SessionHandler.doHandle(SessionHandler.java:223)
    at 
org.eclipse.jetty.server.handler.ContextHandler.doHandle(ContextHandler.java:1125)
    at org.eclipse.jetty.servlet.ServletHandler.doScope(ServletHandler.java:515)
    at 
org.eclipse.jetty.server.session.SessionHandler.doScope(SessionHandler.java:185)
    at 
org.eclipse.jetty.server.handler.ContextHandler.doScope(ContextHandler.java:1059)
    at 
org.eclipse.jetty.server.handler.ScopedHandler.handle(ScopedHandler.java:141)
    at 
org.eclipse.jetty.server.handler.ContextHandlerCollection.handle(ContextHandlerCollection.java:215)
    at org.eclipse.jetty.server.handler.HandlerList.handle(HandlerList.java:52)
    at 
org.eclipse.jetty.server.handler.HandlerWrapper.handle(HandlerWrapper.java:97)
    at org.eclipse.jetty.server.Server.handle(Server.java:497)
    at org.eclipse.jetty.server.HttpChannel.handle(HttpChannel.java:311)
    at 
org.eclipse.jetty.server.HttpConnection.onFillable(HttpConnection.java:248)
    at 
org.eclipse.jetty.io.AbstractConnection$2.run(AbstractConnection.java:540)
    at 
org.eclipse.jetty.util.thread.QueuedThreadPool.runJob(QueuedThreadPool.java:610)
    at 
org.eclipse.jetty.util.thread.QueuedThreadPool$3.run(QueuedThreadPool.java:539)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
    at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
    at 
org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86)
    at 
org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
    at 
org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)
    at 
org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.outputConfigurationBody(WebcrawlerConnector.java:1866)
    at 
org.apache.manifoldcf.core.interfaces.ConnectorFactory.outputThisConfigurationBody(ConnectorFactory.java:83)
    at 
org.apache.manifoldcf.crawler.interfaces.RepositoryConnectorFactory.outputConfigurationBody(RepositoryConnectorFactory.java:155)
    at 
org.apache.jsp.editconnection_jsp._jspService(editconnection_jsp.java:916)
    at org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
    at javax.servlet.http.HttpServlet.service(HttpServlet.java:790)
    at 
org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:388)
    ... 23 more
   {code}
*Caused by:*

 
{code:java}
java.lang.NullPointerException
    at org.apache.manifoldcf.core.common.Base64.decodeString(Base64.java:164)
    at 
org.apache.manifoldcf.connectorcommon.keystore.KeystoreManager.<init>(KeystoreManager.java:86)
    at 
org.apache.manifoldcf.connectorcommon.interfaces.KeystoreManagerFactory.make(KeystoreManagerFactory.java:47)
    at 
org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector.fillInCertificatesTab(WebcrawlerConnector.java:1701)

[jira] [Updated] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Tim Steenbeke (JIRA)


 [ 
https://issues.apache.org/jira/browse/CONNECTORS-1567?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Tim Steenbeke updated CONNECTORS-1567:
--
Description: 
When exporting the web connector using the API, the bandwidth throttling is 
not exported.
Then, when importing this connector into a clean ManifoldCF, the connector is 
created with basic bandwidth settings.
When using the connector in a job, it works properly.

The issue here is that the connector isn't created with the correct bandwidth 
throttling.
 The connector also causes errors in the UI when trying to view or edit it.

e.g.:
{code:java}
{
  "name": "test_web",
  "configuration": null,
"_PARAMETER_": [
  {
"_attribute_name": "Email address",
"_value_": "tim.steenbeke@formica.digital"
  },
  {
"_attribute_name": "Robots usage",
"_value_": "all"
  },
  {
"_attribute_name": "Meta robots tags usage",
"_value_": "all"
  },
  {
"_attribute_name": "Proxy host",
"_value_": ""
  },
  {
"_attribute_name": "Proxy port",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication domain",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication user name",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication password",
"_value_": ""
  }
]
  },
  "description": "Website repository standard settup",
  "throttle": null,
  "max_connections": 10,
  "class_name": 
"org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector",
  "acl_authority": null
}{code}

  was:
When exporting the web connector using the API, the bandwidth throttling is 
not exported.

As a result, importing it does not create a connector with bandwidth 
throttling either, which causes errors in the UI when trying to view or edit it.

When this connector is used in a job, it uses the default bandwidth.

e.g.:
{code:java}
{
  "name": "test_web",
  "configuration": null,
"_PARAMETER_": [
  {
"_attribute_name": "Email address",
"_value_": "tim.steenbeke@formica.digital"
  },
  {
"_attribute_name": "Robots usage",
"_value_": "all"
  },
  {
"_attribute_name": "Meta robots tags usage",
"_value_": "all"
  },
  {
"_attribute_name": "Proxy host",
"_value_": ""
  },
  {
"_attribute_name": "Proxy port",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication domain",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication user name",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication password",
"_value_": ""
  }
]
  },
  "description": "Website repository standard settup",
  "throttle": null,
  "max_connections": 10,
  "class_name": 
"org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector",
  "acl_authority": null
}{code}


> export of web connection bandwidth throttling
> -
>
> Key: CONNECTORS-1567
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1567
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Web connector
>Affects Versions: ManifoldCF 2.11, ManifoldCF 2.12
>Reporter: Tim Steenbeke
>Priority: Major
>
> When exporting the web connector using the API, the bandwidth throttling is 
> not exported.
> Then, when importing this connector into a clean ManifoldCF, the connector 
> is created with basic bandwidth settings.
> When using the connector in a job, it works properly.
> The issue here is that the connector isn't created with the correct bandwidth 
> throttling.
>  The connector also causes errors in the UI when trying to view or edit it.
> e.g.:
> {code:java}
> {
>   "name": "test_web",
>   "configuration": null,
> "_PARAMETER_": [
>   {
> "_attribute_name": "Email address",
> "_value_": "tim.steenbeke@formica.digital"
>   },
>   {
> "_attribute_name": "Robots usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Meta robots tags usage",
> "_value_": "all"
>   },
>   {
> "_attribute_name": "Proxy host",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy port",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication domain",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication user name",
> "_value_": ""
>   },
>   {
> "_attribute_name": "Proxy authentication password",
> "_value_": ""
>   }
> ]
>   },
>   "description": "Website repository standard settup",
>   "throttle": null,
>   "max_connections": 10,
>   "class_name": 
> 

[jira] [Created] (CONNECTORS-1567) export of web connection bandwidth throttling

2019-01-08 Thread Tim Steenbeke (JIRA)
Tim Steenbeke created CONNECTORS-1567:
-

 Summary: export of web connection bandwidth throttling
 Key: CONNECTORS-1567
 URL: https://issues.apache.org/jira/browse/CONNECTORS-1567
 Project: ManifoldCF
  Issue Type: Bug
  Components: Web connector
Affects Versions: ManifoldCF 2.12, ManifoldCF 2.11
Reporter: Tim Steenbeke


When exporting the web connector using the API, the bandwidth throttling is 
not exported.

As a result, importing it does not create a connector with bandwidth 
throttling either, which causes errors in the UI when trying to view or edit it.

When this connector is used in a job, it uses the default bandwidth.

e.g.:
{code:java}
{
  "name": "test_web",
  "configuration": null,
"_PARAMETER_": [
  {
"_attribute_name": "Email address",
"_value_": "tim.steenbeke@formica.digital"
  },
  {
"_attribute_name": "Robots usage",
"_value_": "all"
  },
  {
"_attribute_name": "Meta robots tags usage",
"_value_": "all"
  },
  {
"_attribute_name": "Proxy host",
"_value_": ""
  },
  {
"_attribute_name": "Proxy port",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication domain",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication user name",
"_value_": ""
  },
  {
"_attribute_name": "Proxy authentication password",
"_value_": ""
  }
]
  },
  "description": "Website repository standard settup",
  "throttle": null,
  "max_connections": 10,
  "class_name": 
"org.apache.manifoldcf.crawler.connectors.webcrawler.WebcrawlerConnector",
  "acl_authority": null
}{code}
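
The gap described above could be detected with a small check on the exported object before it is imported elsewhere. This is only a minimal sketch: it assumes the export has already been parsed into a dictionary, and that bandwidth settings, when present, appear as a `bindesc` entry under `configuration` (as in the `repositoryconnections` response quoted in the comments on this issue). The function name `missing_throttling` is hypothetical and not part of ManifoldCF.

```python
import json


def missing_throttling(connection: dict) -> list:
    """Return the throttling-related fields that are absent from an exported connection."""
    missing = []
    # Connection-level throttle object; the export above has "throttle": null.
    if not connection.get("throttle"):
        missing.append("throttle")
    # Assumption: bandwidth bins live under configuration.bindesc.
    configuration = connection.get("configuration") or {}
    if "bindesc" not in configuration:
        missing.append("configuration.bindesc")
    return missing


# Trimmed-down version of the export shown above.
exported = json.loads("""
{
  "name": "test_web",
  "configuration": null,
  "throttle": null,
  "max_connections": 10
}
""")
print(missing_throttling(exported))
```

Running such a check on both the source and the target instance would show whether the throttling is lost on export or on import.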



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736881#comment-16736881
 ] 

Karl Wright commented on CONNECTORS-1562:
-

Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.



> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web inputconnector
> elastic outputconnector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.





[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736881#comment-16736881
 ] 

Karl Wright edited comment on CONNECTORS-1562 at 1/8/19 8:38 AM:
-

Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.

If you remove a document from the site map, and you want MCF to pick up that 
the document is now unreachable and should be removed, you can do this by 
setting a hopcount maximum that is large but also selecting "delete unreachable 
documents".  The only thing I'd caution you about if you use this approach is 
that links BETWEEN documents will also be traversed, so if you want the sitemap 
to be a whitelist then you want hopcount max = 1.
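
To illustrate the hop-count caveat, here is a small sketch of a hop-limited breadth-first walk. It assumes the link graph is a plain dictionary, that the sitemap seed is hop 0, and that the pages it lists are hop 1. It models the idea only: `reachable_within` and the sample graph are hypothetical, not ManifoldCF code.

```python
from collections import deque


def reachable_within(links: dict, seed: str, max_hops: int) -> set:
    """Breadth-first walk from the seed, following at most max_hops links."""
    seen = {seed}
    queue = deque([(seed, 0)])
    while queue:
        url, hops = queue.popleft()
        if hops == max_hops:
            continue  # do not follow links beyond the hop limit
        for target in links.get(url, []):
            if target not in seen:
                seen.add(target)
                queue.append((target, hops + 1))
    return seen


links = {
    "sitemap.xml": ["page-a", "page-b"],
    "page-a": ["page-c"],  # a link BETWEEN documents, not listed in the sitemap
}
print(sorted(reachable_within(links, "sitemap.xml", 1)))
print(sorted(reachable_within(links, "sitemap.xml", 5)))
```

With a limit of 1, only the pages listed in the sitemap are reached; with a larger limit, `page-c` is also picked up through the link from `page-a`.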



was (Author: kwri...@metacarta.com):
Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.



> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web inputconnector
> elastic outputconnector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.





[jira] [Commented] (CONNECTORS-1565) Upgrade commons-collections to 3.2.2 (CVE-2015-6420)

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1565?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736878#comment-16736878
 ] 

Karl Wright commented on CONNECTORS-1565:
-

I'm concerned that we would break something because essentially it disables 
behavior (you need to turn on the behavior if you want it now, explicitly).  
Nevertheless, if all the integration tests we have pass, I'm OK with it.  The 
worst that can happen is that somebody will open a ticket against one of our 
connectors and we'll have to roll it back.


> Upgrade commons-collections to 3.2.2 (CVE-2015-6420)
> 
>
> Key: CONNECTORS-1565
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1565
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Framework core
>Affects Versions: ManifoldCF 2.12
>Reporter: Markus Schuch
>Assignee: Markus Schuch
>Priority: Critical
> Fix For: ManifoldCF next
>
>
> We should upgrade commons-collections to 3.2.2 due to a known security issue 
> with 3.2.1
> https://commons.apache.org/proper/commons-collections/security-reports.html
> Further reading:
> [http://foxglovesecurity.com/2015/11/06/what-do-weblogic-websphere-jboss-jenkins-opennms-andyour-application-have-in-common-this-vulnerability/]
> [https://www.cvedetails.com/cve/CVE-2015-6420/]





[jira] [Commented] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-08 Thread Tim Steenbeke (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736873#comment-16736873
 ] 

Tim Steenbeke commented on CONNECTORS-1562:
---

Ok, I will create a new ticket for both issues.

Testing the connector with the UI, with bandwidth throttling disabled and max 
connections set to 20, we were able to crawl all sites.

Now I'm still stuck on the deletion issue: if a site is removed from the 
sitemap, will it be removed from Elasticsearch, since it is no longer accessible?
We use the sitemap (the big one) as the seed, and if one or a few URLs get 
removed, they should also be removed from Elasticsearch.

> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: Manifoldcf 2.11
> Elasticsearch 6.3.2
> Web inputconnector
> elastic outputconnector
> Job crawls website input and outputs content to elastic
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seedmap and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun, the unreachable documents don't get deleted.
> It only adds documents when they can be reached.


