If you have 3 million documents, then each time you run a crawl it will check 
each document that matches your query, correct?
 
Just want to make sure I understand.  That could really take a lot of time.  
 
Wouldn't it be better to store a last crawled date and then limit the query 
based on that date, so you're only indexing things the repo server says have 
changed?  The current method seems better suited to things like websites/wikis 
where you can't really query based on modified dates.
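
A minimal sketch of the idea above, to make it concrete. This is not
ManifoldCF code: the document dictionaries, the "modified" field, and the
changed_since helper are all hypothetical names chosen for illustration.

```python
from datetime import datetime, timezone

def changed_since(documents, last_crawl):
    """Return only the documents modified after the stored last-crawl date."""
    return [d for d in documents if d["modified"] > last_crawl]

# Hypothetical repository contents, stand-ins for real query results.
docs = [
    {"id": "a", "modified": datetime(2013, 2, 1, tzinfo=timezone.utc)},
    {"id": "b", "modified": datetime(2013, 2, 11, tzinfo=timezone.utc)},
]

# Date stored at the end of the previous crawl (assumed value).
last_crawl = datetime(2013, 2, 10, tzinfo=timezone.utc)

# Only documents the repository reports as changed would be indexed.
to_index = changed_since(docs, last_crawl)
```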
 
-mark

From: Karl Wright <[email protected]>
To: [email protected]; Mark Lugert <[email protected]> 
Sent: Monday, February 11, 2013 5:10 PM
Subject: Re: new documents

Actually, it doesn't reindex everything.  It only reindexes those
documents that have "changed", using the connector's idea of what that
means.  For SharePoint, it's the modify date; for Alfresco and CMIS I
don't know, but others on this list might.

Also, don't confuse rechecking with reindexing.  ManifoldCF *will*
need to scan through the documents in many cases, but it will do a
minimal amount of work for each one.
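
To illustrate the difference between rechecking and reindexing, here is a
rough sketch. It is not ManifoldCF's actual implementation: the crawl
function, the in-memory dictionaries, and the use of a "modified" string as
the version are all assumptions made for the example.

```python
def crawl(repository, seen_versions, index):
    """Recheck every document, but reindex only those whose version changed."""
    rechecked = 0
    reindexed = 0
    for doc_id, doc in repository.items():
        rechecked += 1
        version = doc["modified"]  # cheap check: compare a version marker only
        if seen_versions.get(doc_id) == version:
            continue  # unchanged since the last crawl, so skip the expensive work
        index[doc_id] = doc["content"]  # expensive step: fetch and index content
        seen_versions[doc_id] = version
        reindexed += 1
    return rechecked, reindexed

# Hypothetical repository with two documents.
repository = {
    "a": {"modified": "2013-02-01", "content": "alpha"},
    "b": {"modified": "2013-02-11", "content": "beta"},
}
seen_versions, index = {}, {}

first = crawl(repository, seen_versions, index)   # everything is new
repository["b"] = {"modified": "2013-02-12", "content": "beta v2"}
second = crawl(repository, seen_versions, index)  # only "b" has changed
```

In this sketch both runs look at every document, but the second run
reindexes only the one whose version marker differs from the stored one.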

Karl

On Mon, Feb 11, 2013 at 3:35 PM, Mark Lugert <[email protected]> wrote:
> Hi Karl,
>
> If I use the SharePoint, Alfresco, or CMIS repo connectors, how can I make
> it only index new documents that match my queries?
>
> Right now I'm seeing it reindex everything that matches my query every time
> the job runs.
>
> I have it set to scan all documents once, but it still rescans everything
> every time I start the job.  Is this a config issue on my part?
>
> thanks,
> Mark
