[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736881#comment-16736881
 ] 

Karl Wright edited comment on CONNECTORS-1562 at 1/8/19 11:39 AM:
--

Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as designed.

If you remove a document from the site map and you want MCF to pick up that 
the document is now unreachable and should be removed, you can do this by 
setting a large hopcount maximum and also selecting "delete unreachable 
documents".  The one thing I'd caution you about with this approach is that 
links BETWEEN documents will also be traversed, so if you want the sitemap 
to be a whitelist then you want hopcount max = 2.
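
For illustration, here is a minimal sketch of pushing such a change through the 
ManifoldCF REST API.  The endpoint layout follows the standard programmatic 
interface, but the hopcount-related JSON field names and the job id below are 
assumptions; fetch one of your existing jobs first and mirror its real structure.

{code}
# Minimal sketch (Python + requests), assuming a local ManifoldCF instance with
# the JSON API enabled. The hopcount-related field names and the job id are
# illustrative assumptions; mirror the structure of an existing job definition.
import requests

BASE = "http://localhost:8345/mcf-api-service/json"  # adjust for your install
JOB_ID = "1234567890"                                # hypothetical job id

# Read the current job definition so only the relevant settings change.
resp = requests.get(f"{BASE}/jobs/{JOB_ID}")
resp.raise_for_status()
job = resp.json()

# Assumed fields: "delete unreachable documents" mode plus a hopcount maximum
# of 2 for "link" hops (use a large value instead if the sitemap is not meant
# to be a strict whitelist).
job["job"]["hopcount_mode"] = "accurate"
job["job"]["hopcounts"] = [{"link_type": "link", "count": 2}]

resp = requests.put(f"{BASE}/jobs/{JOB_ID}", json=job)
resp.raise_for_status()
{code}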



was (Author: kwri...@metacarta.com):
Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.

If you remove a document from the site map, and you want MCF to pick up that 
the document is now unreachable and should be removed, you can do this by 
setting a hopcount maximum that is large but also selecting "delete unreachable 
documents".  The only thing I'd caution you about if you use this approach is 
that links BETWEEN documents will also be traversed, so if you want the sitemap 
to be a whitelist then you want hopcount max = 1.


> Documents unreachable due to hopcount are not considered unreachable on 
> cleanup pass
> 
>
> Key: CONNECTORS-1562
> URL: https://issues.apache.org/jira/browse/CONNECTORS-1562
> Project: ManifoldCF
>  Issue Type: Bug
>  Components: Elastic Search connector, Web connector
>Affects Versions: ManifoldCF 2.11
> Environment: ManifoldCF 2.11
> Elasticsearch 6.3.2
> Web input connector
> Elasticsearch output connector
> Job crawls a website as input and outputs the content to Elasticsearch
>Reporter: Tim Steenbeke
>Assignee: Karl Wright
>Priority: Critical
>  Labels: starter
> Fix For: ManifoldCF 2.12
>
> Attachments: Screenshot from 2018-12-31 11-17-29.png, 
> manifoldcf.log.cleanup, manifoldcf.log.init, manifoldcf.log.reduced
>
>   Original Estimate: 4h
>  Remaining Estimate: 4h
>
> My documents aren't removed from the Elasticsearch index after rerunning the 
> changed seeds.
> I update my job to change the seed map and rerun it, or use the scheduler to 
> keep it running even after updating it.
> After the rerun the unreachable documents don't get deleted.
> It only adds documents when they can be reached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2019-01-08 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16736881#comment-16736881
 ] 

Karl Wright edited comment on CONNECTORS-1562 at 1/8/19 8:38 AM:
-

Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.

If you remove a document from the site map, and you want MCF to pick up that 
the document is now unreachable and should be removed, you can do this by 
setting a hopcount maximum that is large but also selecting "delete unreachable 
documents".  The only thing I'd caution you about if you use this approach is 
that links BETWEEN documents will also be traversed, so if you want the sitemap 
to be a whitelist then you want hopcount max = 1.



was (Author: kwri...@metacarta.com):
Good to know that you got beyond the crawling issue.
If you run any MCF job to completion, all no-longer-present documents should be 
removed from the index.  That applies to web jobs too.  So I expect that to 
work as per design.





[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-31 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16731290#comment-16731290
 ] 

Karl Wright edited comment on CONNECTORS-1562 at 12/31/18 12:06 PM:


{code}
Error: Repeated service interruptions - failure processing document: Stream 
Closed
{code}

This is not a crash; this just means that the job aborts.  It also comes with 
numerous stack traces in the manifoldcf log, one for each time the document 
retries.  Those stack traces would be very helpful to have.  Thanks!
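
In case it helps with collecting them, here is a small sketch that pulls the 
retry traces out of the log.  It assumes the "Stream Closed" text above starts 
each occurrence and that a blank line ends a trace; adjust if the log is laid 
out differently.

{code}
# Small sketch: print every block in manifoldcf.log that starts at a
# "Stream Closed" line and ends at the next blank line (an assumption about
# how the traces are laid out; adjust the delimiters if needed).
capturing = False
block = []
with open("manifoldcf.log", encoding="utf-8", errors="replace") as log:
    for line in log:
        if "Stream Closed" in line:
            if block:
                print("".join(block) + "----")
            block, capturing = [], True
        if capturing:
            block.append(line)
            if not line.strip():
                capturing = False
if block:
    print("".join(block))
{code}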



was (Author: kwri...@metacarta.com):
{code}
Error: Repeated service interruptions - failure processing document: Stream 
Closed
{code}

This is not a crash; this just means that the job aborts.  It also comes with 
numerous stack traces, one for each time the document retries.  That stack 
trace would be very helpful to have.  Thanks!




[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-18 Thread Tim Steenbeke (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16724016#comment-16724016
 ] 

Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 12:47 PM:
--

There is no regex, and there is no way to make a regex for this.  That's the 
issue with creating the exclude/blacklist.  There is already a regex being used 
for images, documents and other files that don't have to be crawled.

By 'started acting strange' I mean it stopped working and crashed: because of 
the number of URLs it stopped indexing, no messages or errors were given, the 
job just stopped working.

This is not the question; please answer my question.
 Is this the way we have to run the job:
 is this the way we have to run the job:
{panel}
_*Tim Steenbeke added a comment*_

we create a job with 1 seed, which is the full seed map (+-29000 URLs), ES 
output, web input and hopcount 1 for links and 0 for redirects:
 # run job
 # +-29000 documents get pushed to ES
 # sitemap gets updated (e.g. 29000 URLs become 28990 URLs)
 # wait till scheduled time
 # run job
 # documents get added/deleted (e.g. 10 documents deleted)
 # wait till scheduled time
 # ...{panel}
 


was (Author: steenti):
There is no regex, there is no possibility to make a regex for this. That's the 
issue with creating the exclude/blacklist.


'started acting strange' stopped working and crashed.

This is not the question. answer my question please.
is this the way we have to run the job:
{panel}
_*Tim Steenbeke added a comment*_

we create a job with 1 seed, which is the full seed-map (+29000 URL's), ES 
output, web input and Hop-count 1 for links and 0 for redirect:
 # run job
 # +-29000 documents get pushed to ES
 # sitemap get's updated (e.g.: 29000 URL's become 28990 URL's)
 # wait till scheduled time
 # run job
 # documents get add/deleted (e.g.: 10 documents deleted)
 # wait till scheduled time
 # ...{panel}
 



[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-18 Thread Tim Steenbeke (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723981#comment-16723981
 ] 

Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 11:52 AM:
--

[~kwri...@metacarta.com] - So then with the seed map URL:

we create a job with 1 seed, which is the full seed map (+-29000 URLs), ES 
output, web input and hopcount 1 for links and 0 for redirects:
 # run job
 # +-29000 documents get pushed to ES
 # sitemap gets updated (e.g. 29000 URLs become 28990 URLs)
 # wait till scheduled time
 # run job
 # documents get added/deleted (e.g. 10 documents deleted)
 # wait till scheduled time
 # ...

Last time we tried this, ManifoldCF started acting strange because of the 
number of URLs/links in the sitemap URL
 (sitemap url: 
[https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en=true])
(blacklist URLs: 
[https://www.uantwerpen.be/admin/system/sitemap/sitemap_revokes.aspx?lang=en=true])
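
As a sanity check between scheduled runs, something like the sketch below can 
compare how many URLs the sitemap currently lists with how many documents the 
index holds.  It assumes the sitemap is standard sitemap XML; the sitemap URL 
and the index name are placeholders for this installation.

{code}
# Sketch: compare the number of <loc> entries in the sitemap with the document
# count in the Elasticsearch index. URLs and index name are placeholders.
import xml.etree.ElementTree as ET
import requests

SITEMAP_URL = "https://www.example.org/sitemap.xml"    # placeholder sitemap URL
ES_COUNT_URL = "http://localhost:9200/website/_count"  # placeholder index "website"

root = ET.fromstring(requests.get(SITEMAP_URL).content)
ns = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}  # standard sitemap namespace
expected = len(root.findall("sm:url/sm:loc", ns))

indexed = requests.get(ES_COUNT_URL).json()["count"]  # Elasticsearch _count API
print(f"sitemap URLs: {expected}  indexed: {indexed}  difference: {indexed - expected}")
{code}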


was (Author: steenti):
[~kwri...@metacarta.com] - So then with the seed map URL:

we create a job with 1 seed, which is the full seed-map (+29000 URL's), ES 
output, web input and Hop-count 1 for links and 0 for redirect:
 # run job
 # +-29000 documents get pushed to ES
 # sitemap get's updated (e.g.: 29000 URL's become 28990 URL's)
 # wait till scheduled time
 # run job
 # documents get add/deleted (e.g.: 10 documents deleted)
 # wait till scheduled time
 # ...

Last time we tried this manifold started acting strange because of the amount 
of url's/links located in the sitemap URL
(sitemap url: 
[https://www.uantwerpen.be/admin/system/sitemap/sitemap.aspx?lang=en=true])



[jira] [Comment Edited] (CONNECTORS-1562) Documents unreachable due to hopcount are not considered unreachable on cleanup pass

2018-12-17 Thread Tim Steenbeke (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1562?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16723782#comment-16723782
 ] 

Tim Steenbeke edited comment on CONNECTORS-1562 at 12/18/18 7:48 AM:
-

[~kwri...@metacarta.com]
If we update to ManifoldCF 2.12, can we then use the seed map as we originally 
intended?

So we create a job with X seeds, ES output, web input and hopcount 0 for links 
and redirects:
 # put X seeds in the seed map
 # run job
 # X documents get pushed to ES
 # update job to have X minus 20 seeds
 # wait till scheduled time
 # run job
 # 20 documents get deleted from ES
 # X minus 20 documents get updated
 # wait till scheduled time
 # ...

Will it work like this?
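
For what it's worth, here is a tiny sketch of diffing the old and new seed 
lists up front, so the expected number of deletions (the 20 above) is known 
before the scheduled run; the file names are placeholders.

{code}
# Tiny sketch: diff two seed lists (one URL per line; file names are
# placeholders) to see how many deletions and additions the next run should
# produce.
def read_seeds(path):
    with open(path, encoding="utf-8") as f:
        return {line.strip() for line in f if line.strip()}

previous = read_seeds("seeds_previous.txt")
current = read_seeds("seeds_current.txt")

removed = previous - current   # should be deleted from the index
added = current - previous     # should be newly indexed

print(f"expected deletions: {len(removed)}, expected additions: {len(added)}")
{code}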


was (Author: steenti):
If we update to manifold 2.12 can we than use the seedmap as originaly intended 
by us ?

so we create a job with X seeds, ES output, web input and HopCount 0 for links 
and redirect: 
 # Put X seeds in seedmap
 # run job
 # X documents get pushed to ES
 # update job to have X minus 20 seeds
wait till scheduled time

 
 # run job
 # 20 documents get deleted from ES
 # X minus 20 documents get updated
 # wait till scheduled time
 # ...

Will it work like this ?
