[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything

2019-04-26 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826879#comment-16826879
 ] 

Karl Wright commented on CONNECTORS-1602:
-----------------------------------------

[~DonaldVdD] ManifoldCF keeps a queue of documents which it recrawls.  A job 
completes only when no document remains in a state where it needs to be 
fetched.  In a continuous job, every document is requeued as soon as it has 
been fetched, so that point is never reached.
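
The queue behaviour described above can be pictured with a small toy model. 
This is only a sketch, not ManifoldCF code: the function name, the 
max_fetches cut-off and the in-memory deque are assumptions made for 
illustration, and the real crawler schedules each requeued document for a 
later time rather than fetching it again right away.

```python
from collections import deque


def run_job(seeds, continuous, max_fetches):
    """Toy model of a crawl job's document queue (hypothetical, not ManifoldCF)."""
    queue = deque(seeds)              # documents that still need to be fetched
    fetched = []
    while queue and len(fetched) < max_fetches:
        doc = queue.popleft()
        fetched.append(doc)           # stand-in for the real fetch
        if continuous:
            queue.append(doc)         # requeued at once, so the queue never drains
    completed = not queue             # "done" only when nothing is left to fetch
    return fetched, completed
```

With continuous=False the loop drains the queue and the job completes; with 
continuous=True it only stops because of the artificial max_fetches limit.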

As for session-based login: if you set up your login sequence properly, then 
when a fetched document needs a fresh cookie, the login takes place at that 
point and a new cookie is used.
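
As a rough illustration of "log in at the point a fetch needs a fresh 
cookie", here is a minimal sketch. FakeSite, fetch and the cookie jar dict 
are all hypothetical names, and the 302 redirect is an assumed way for the 
site to signal an expired session; the actual Web connector drives this 
from its configured login sequence.

```python
class FakeSite:
    """In-memory stand-in for a site with expiring sessions (hypothetical)."""

    def __init__(self):
        self.valid = set()            # cookies the site currently accepts
        self.issued = 0               # number of logins performed so far

    def login(self):
        self.issued += 1
        cookie = f"session-{self.issued}"
        self.valid = {cookie}         # a new login invalidates older cookies
        return cookie

    def get(self, url, cookie):
        if cookie not in self.valid:
            return 302, "/login"      # assumed signal: redirect to the login page
        return 200, f"content of {url}"


def fetch(site, url, jar):
    """Fetch url; log in and retry once if the site asks for a fresh session."""
    status, body = site.get(url, jar.get("cookie"))
    if status == 302:
        jar["cookie"] = site.login()  # login happens only when it is needed
        status, body = site.get(url, jar["cookie"])
    return status, body
```

The same jar is reused across fetches, so a long-running crawl logs in again 
only when a session has actually expired.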


> Continuous crawling doesn't recrawl everything
> ----------------------------------------------
>
>                 Key: CONNECTORS-1602
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1602
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>            Reporter: Donald Van den Driessche
>            Priority: Major
>
> When crawling a website in continuous crawling mode, we saw that not all 
> documents are recrawled.
> The site is quite extensive. We figured out that, after crawling, a 
> document/page gets a recrawl timestamp between the recrawl interval and 
> the max recrawl interval.
> But if these values occur within the first crawl, Manifold starts recrawling 
> those, but seems to ignore the rest of the website. Also, sometimes documents 
> get recrawled 5 times while others don't get recrawled at all, apparently due 
> to the same issue.
>  
> Is it possible to shed a bit more light on continuous crawling?
> Is it a good system to use for crawling an (extensive) website?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything

2019-04-26 Thread Donald Van den Driessche (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826755#comment-16826755
 ] 

Donald Van den Driessche commented on CONNECTORS-1602:
------------------------------------------------------

Karl,

The website we're crawling also needs session-based login.

What happens with cookies in a continuous crawl?



[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything

2019-04-26 Thread Donald Van den Driessche (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826727#comment-16826727
 ] 

Donald Van den Driessche commented on CONNECTORS-1602:
------------------------------------------------------

Thanks.

I know it runs continuously, but I'm wondering what happens when the recrawl 
timestamp is reached for documents. Will it first recrawl and then continue 
crawling, or continue crawling and then do the recrawl, or simultaneously 
crawl and recrawl? The last might slow down the crawling speed.



[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything

2019-04-26 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826725#comment-16826725
 ] 

Karl Wright commented on CONNECTORS-1602:
-----------------------------------------

Hi [~DonaldVdD], MCF keeps crude statistics on how often the doc changes.  As I 
said, it gets recrawled *eventually*, and if it has not changed, the time until 
the next crawl is doubled, up to the maximum the job is configured for.
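
The doubling rule can be written down as a short sketch. The function names 
and the minimum/maximum parameters are illustrative, not ManifoldCF 
identifiers, and resetting to the minimum after a change is an assumption: 
the thread only says that the wait doubles, up to the configured maximum, 
while the document stays unchanged.

```python
def next_interval(current, changed, minimum, maximum):
    """Return the wait before the next recrawl, in any consistent time unit."""
    if changed:
        return minimum                    # assumption: a change resets the wait
    return min(current * 2, maximum)      # unchanged: double, capped at the maximum


def schedule(changes, minimum, maximum):
    """Apply next_interval over a sequence of 'did it change?' observations."""
    interval = minimum
    waits = []
    for changed in changes:
        interval = next_interval(interval, changed, minimum, maximum)
        waits.append(interval)
    return waits
```

For example, with a minimum of 60 and a maximum of 400, three unchanged 
crawls give waits of 120, 240 and 400, which is why a rarely changing 
document can go a long time between recrawls.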

As for when the job "stops", the continuous crawl jobs do not stop.  They run 
indefinitely until manually aborted.




[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything

2019-04-26 Thread Donald Van den Driessche (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826707#comment-16826707
 ] 

Donald Van den Driessche commented on CONNECTORS-1602:
------------------------------------------------------

Ok, thanks. That already clears some things up.

How does Manifold know a document doesn't change that often if it isn't crawled?

If a full crawl takes about 8 hours but you make your recrawl intervals 
smaller than that, will it start recrawling before the job has completed a full 
run? And if so, might that interfere with the termination of the job, so that 
it never gets to a full run?



[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything

2019-04-25 Thread Karl Wright (JIRA)


[ 
https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=16826349#comment-16826349
 ] 

Karl Wright commented on CONNECTORS-1602:
-----------------------------------------

Continuous crawling bases the next crawl time on the last time the document 
changed.  In general, it doubles the crawling interval, up to the maximum, 
before retrying.  So if your document doesn't change very often, the crawler 
may wait quite some time before revisiting it.

The best way to see what it is going to do is to find the document in the 
Document Status report, and see when ManifoldCF intends to recrawl it.


