If you have 3 million documents, then each time you run a crawl it will check each document that matches your query, correct? Just want to make sure I understand. That could really take a lot of time.

Wouldn't it be better to store a last-crawled date and then limit the query based on that date, so you're only indexing things the repo server says have changed? The current method seems better suited to things like websites/wikis, where you can't really query based on modified dates.

-mark

From: Karl Wright <[email protected]>
To: [email protected]; Mark Lugert <[email protected]>
Sent: Monday, February 11, 2013 5:10 PM
Subject: Re: new documents

Actually, it doesn't reindex everything. It only reindexes those documents that have "changed", using the connector's idea of what that means. For SharePoint, it's the modify date, for Alfresco and CMIS I don't know but others on this list might.

Also, don't confuse rechecking with reindexing. ManifoldCF *will* need to scan through the documents in many cases, but it will do a minimal amount of work for each one.

Karl

On Mon, Feb 11, 2013 at 3:35 PM, Mark Lugert <[email protected]> wrote:
> Hi Karl,
>
> If I use the sharepoint, alfresco, or cmis repo connectors how can I make
> it only index new documents that match my queries?
>
> Right now I'm seeing it reindex everything that matches my query every time
> the job runs.
>
> I have it set to scan all documents once, but still rescans everything every
> time I start the job. Is this a config issue on my part?
>
> thanks,
> Mark
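
To make the rechecking vs. reindexing distinction above concrete, here is a minimal Python sketch. It is not ManifoldCF code: `CrawlState`, `crawl`, and the version strings are hypothetical names chosen for illustration, and the "version" is assumed to be something cheap to obtain, such as a modify date. Every document is rechecked on each run, but only documents whose version differs from the stored one are reindexed.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class CrawlState:
    # Hypothetical store: document id -> version string seen at last index time.
    versions: dict[str, str] = field(default_factory=dict)


def crawl(repo_docs: dict[str, str], state: CrawlState) -> tuple[int, list[str]]:
    """Recheck every document, but reindex only those whose version changed.

    repo_docs maps document id -> current version string (e.g. a modify date).
    Returns (number of documents rechecked, ids of documents reindexed).
    """
    rechecked = 0
    reindexed: list[str] = []
    for doc_id, version in repo_docs.items():
        rechecked += 1  # cheap step: compare one version string per document
        if state.versions.get(doc_id) != version:
            # Expensive step (assumed): fetch content and push it to the index.
            reindexed.append(doc_id)
            state.versions[doc_id] = version
    return rechecked, reindexed


if __name__ == "__main__":
    state = CrawlState()
    docs = {"a": "2013-02-01", "b": "2013-02-05"}
    print(crawl(docs, state))   # first run: both documents are new
    docs["b"] = "2013-02-11"    # simulate an edit to document "b"
    print(crawl(docs, state))   # second run: both rechecked, only "b" reindexed
```

In this sketch the per-document cost of a recheck is a single comparison, which is the sense in which a scan over millions of documents can still be far cheaper than reindexing them.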
