[jira] [Commented] (CONNECTORS-1497) Re-index seeded modified documents when the re-crawl interval is infinity and connector model is MODEL_ADD_CHANGE

Karl Wright (JIRA) Tue, 27 Feb 2018 00:16:33 -0800

    [ 
https://issues.apache.org/jira/browse/CONNECTORS-1497?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16378201#comment-16378201
 ]


Karl Wright commented on CONNECTORS-1497:
-----------------------------------------

Seeding is a well-defined connector contract that, at this point, has nothing 
to do with document scheduling, even in a continuous job.  The contract 
identifies specific documents based on the repository's capabilities, and those 
documents are not chosen based on what you want processed first, but rather on 
the requirements of the connector model.  Conflating the two, I think, may 
obligate connectors to manage their own document scheduling and pick the 
documents they want processed first.  That's a significant contract change, and 
I'm quite concerned about that. 

The reason you want to do this at all is because you don't actually want 
document recrawls to take place on any schedule at all -- you've set the 
recrawl time to infinity.  That basically defeats the continuous crawl model 
entirely and presumes that documents once crawled are never changed or deleted 
unless you reseed them.  So the real reason you want to do this is to give a 
connector complete scheduling control over which documents are processed when.  
Presumably, your connector knows about deletions too, then?  Is there any 
reason it shouldn't be written as MODEL_ADD_CHANGE_DELETE?  Continuous 
MODEL_ADD_CHANGE_DELETE jobs are a new thing so if this is your use case we 
should think it through carefully.


> Re-index seeded modified documents when the re-crawl interval is infinity and 
>   connector model is MODEL_ADD_CHANGE
> -------------------------------------------------------------------------------------------------------------------
>
>                 Key: CONNECTORS-1497
>                 URL: https://issues.apache.org/jira/browse/CONNECTORS-1497
>             Project: ManifoldCF
>          Issue Type: Improvement
>          Components: Framework agents process
>    Affects Versions: ManifoldCF 2.9.1
>            Reporter: Ahmed Mahfouz
>            Assignee: Karl Wright
>            Priority: Major
>         Attachments: CONNECTORS-1497.patch, CONNECTORS-1497.patch2, 
> CONNECTORS-1497.patch3
>
>
> I am trying to avoid a full scan of all documents, for better efficiency with 
> a large number of documents. I tried many different settings for the jobs, 
> but I couldn't accomplish that. In particular, when the repository connector 
> model is MODEL_ADD_CHANGE, I expected seeded modified documents to be 
> re-indexed immediately, like new seeds, but I found that they use the 
> re-crawl time as the scheduled time and wait for the full scan to get 
> re-indexed. I avoided the full scan by setting the re-crawl interval to 
> infinity, but my modified document seeds still were not getting indexed. 
> After digging into the code for quite some time, I made some modifications 
> to the JobManager and it worked for me. I would like to share the change 
> with you for review, so I opened this ticket.
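
The behavior described in the report above can be illustrated with a toy 
model. This is only a sketch under stated assumptions, not ManifoldCF code: 
the names next_check_time, due, and INFINITY are hypothetical, and the 
scheduling rule is simply the one the reporter describes (new seeds are due 
immediately, already-known seeds are due after the re-crawl interval).

```python
import math

# Hypothetical stand-in for a re-crawl interval of "infinity".
INFINITY = math.inf


def next_check_time(now, known, recrawl_interval):
    """Toy rule taken from the report: a new seed is due right away,
    while an already-known seed is due only after the re-crawl interval."""
    if not known:
        return now
    return now + recrawl_interval


def due(documents, now):
    """Return the identifiers whose scheduled time has been reached."""
    return sorted(doc for doc, when in documents.items() if when <= now)


queue = {
    "doc-new": next_check_time(now=100, known=False, recrawl_interval=INFINITY),
    "doc-modified": next_check_time(now=100, known=True, recrawl_interval=INFINITY),
}

# However far the clock advances, only the new seed ever becomes due.
print(due(queue, now=10**9))
```

In this model the modified document is never picked up, because adding an 
infinite interval to any time yields a time that is never reached. That 
matches the symptom in the report, and it is why the choice of connector 
model and re-crawl interval has to be considered together.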



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
