[jira] [Commented] (CONNECTORS-1602) Continuous crawling doesn't recrawl everything

Karl Wright (JIRA) Fri, 26 Apr 2019 03:48:36 -0700


    [ 
https://issues.apache.org/jira/browse/CONNECTORS-1602?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16826879#comment-16826879
 ]


Karl Wright commented on CONNECTORS-1602:
-----------------------------------------

[~DonaldVdD] ManifoldCF keeps a queue of documents to recrawl.  A crawl is 
complete only when no documents remain in a state where they need to be 
fetched.  For a continuous job, every document is requeued immediately after 
it is fetched, so this never happens.
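
As a rough illustration of the queue behaviour described above, here is a toy 
model in Python.  It is only a sketch and not ManifoldCF's actual scheduler: 
the function name, the fixed recrawl interval, and the (due_time, doc) queue 
entries are all assumptions made for the example.

```python
import heapq


def simulate_continuous_crawl(docs, recrawl_interval, steps):
    """Toy model of a continuous job: each fetched document is requeued at once.

    Hypothetical sketch; not how ManifoldCF is implemented internally.
    """
    # Queue entries are (due_time, doc); every document is due at time 0.
    queue = [(0, doc) for doc in docs]
    heapq.heapify(queue)
    fetch_counts = {doc: 0 for doc in docs}

    for _ in range(steps):
        due, doc = heapq.heappop(queue)   # next document that needs fetching
        fetch_counts[doc] += 1            # "fetch" it
        # Requeue immediately with a later due time, so the queue never drains.
        heapq.heappush(queue, (due + recrawl_interval, doc))

    return fetch_counts, len(queue)


counts, still_queued = simulate_continuous_crawl(["a", "b", "c"], 10, 7)
print(counts, still_queued)
```

However many steps are run, the queue still holds every document afterwards, 
which is why a continuous job never reaches the "completed" state.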

As for session-based login: if you set up your login sequence properly, then 
whenever a document is fetched that needs a fresh cookie, the login will take 
place at that point and a new cookie will be used.
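
The login-on-demand idea can be sketched as follows.  This is a hedged, 
self-contained illustration only: FakeSite, the "LOGIN_REQUIRED" marker, and 
the fetch() helper are hypothetical stand-ins, not the web connector's API or 
its login sequence configuration.

```python
class FakeSite:
    """Hypothetical stand-in for a site with session-based login."""

    def __init__(self):
        self.valid_cookie = None
        self.logins = 0

    def login(self):
        # Issue a new session cookie and remember it as the only valid one.
        self.logins += 1
        self.valid_cookie = f"session-{self.logins}"
        return self.valid_cookie

    def expire_session(self):
        self.valid_cookie = None

    def get(self, url, cookie):
        # A real site would typically answer with a login page or redirect.
        if cookie is None or cookie != self.valid_cookie:
            return "LOGIN_REQUIRED"
        return f"content of {url}"


def fetch(site, url, state):
    """Fetch url; log in only when the response shows a fresh cookie is needed."""
    body = site.get(url, state.get("cookie"))
    if body == "LOGIN_REQUIRED":
        state["cookie"] = site.login()          # login takes place at this point
        body = site.get(url, state["cookie"])   # retry with the new cookie
    return body


site, state = FakeSite(), {}
print(fetch(site, "/a", state))   # triggers the first login
site.expire_session()
print(fetch(site, "/b", state))   # session expired, so a second login happens
```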


> Continuous crawling doesn't recrawl everything
> ----------------------------------------------
>
>                 Key: CONNECTORS-1602
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1602
>             Project: ManifoldCF
>          Issue Type: Bug
>          Components: Web connector
>            Reporter: Donald Van den Driessche
>            Priority: Major
>
> When crawling a website in continuous crawling mode we saw that not all 
> documents are recrawled.
> The site is quite extensive. We figured out that, after crawling, a 
> document/page gets a recrawl timestamp somewhere between the recrawl 
> interval and the max recrawl interval.
> But if these timestamps fall within the first crawl, Manifold starts 
> recrawling those documents and seems to ignore the rest of the website. Also, 
> some documents get recrawled 5 times while others don't get recrawled at 
> all, apparently due to the same issue.
>  
> Is it possible to shed a bit more light on the continuous crawling?
> Is it a good system to use for crawling an (extensive) website?



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
