Hi Lalit,

First --- you may want to sign up for this list, or, since your question is really more of a user question, sign up for [email protected]. Otherwise I need to moderate your mail through.

Second --- the answer to your question about incremental crawling is "yes, MCF does this out of the box". I highly suggest buying the book to understand how it works: http://www.manning.com/wright

Third --- there are two ways to run a job: "normal" and "minimal". The "normal" crawling cycle discovers documents, crawls those and any documents they reference, and then cleans up those documents that could not be reached during the crawl. "Minimal" does the same but doesn't try to clean up removed documents at the end of the crawl. For your purposes you will need both cycles: "minimal" most of the time, "normal" once in a while.

Fourth --- continuous crawling is OK for some tasks but not others. It's well suited for a situation where you really only want fresh content and you want to expire older content.

I really recommend reading the book, though, because I can't give you all the detail you will need in a post.

Karl

On Tue, Feb 18, 2014 at 4:30 AM, lalit jangra <[email protected]> wrote:

> Hi,
>
> I am working on integrating manifoldcf or mcf with alfresco cms as
> repository connector using CMIS query and using solr as output channel
> where all index are stored. I am able to do it fine & can search documents
> in solr index.
>
> Now as part of implementation, i am planning to introduce multiple
> repository such as sharepoint, file systems etc. so now i have three
> document repositories : alfresco, sharepoint & filesystem. I am planning to
> have scheduled jobs which run through each of repositories and crawl these
> at particular intervals. But i have following contentions.
>
> 1. Although i am scheduling jobs for frequent intervals, i want to make
> sure that mcf jobs pick only those content which are either added new or
> updated say i have 100 docs during current job run but say 110 at next job
> run so i only want to run jobs for new 10 docs not entire 110 docs.
> 2. As there are relatively lesser mcf tutorials available, i have no means
> to ensure that mcf jobs behaves this way but i assume it is intelligent
> enough to behave this way but again no proof to substantiate it.
> 3. I want to know more about mcf job schedule type : scan every document
> once/rescan documents directly. Similarly i want to know more about job
> invocation : complete/minimal. i would be sorry for being a newbie.
> 4. Also i am considering about doing some custom coding to ensure that only
> latest/updated docs are eligible for processing but again going thru code
> only as less documentation available.
> 5. Is it wise to do custom coding in this case or mcf provides all these
> features OOTB.
>
> I would appreciate for any response.
>
> Regards,
> Lalit Jangra.
>
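
To make the "normal" versus "minimal" distinction from the reply above concrete, here is a rough sketch in Python. It is a toy model and not ManifoldCF code: the function name `run_job`, the dictionaries, and the idea of comparing a per-document version string are all assumptions made for illustration. It only shows the behaviour the reply describes: both modes pick up new or changed documents, and only "normal" cleans up documents that could no longer be reached.

```python
# Toy model only: this is NOT the ManifoldCF API. All names are hypothetical.

def run_job(repository, index, mode="normal"):
    """Simulate one job run against an in-memory repository.

    repository: dict of document id -> version string (current source state)
    index:      dict of document id -> version string last indexed (mutated)
    mode:       "normal" also removes unreachable documents; "minimal" does not
    Returns (processed, deleted) as sorted lists of document ids.
    """
    if mode not in ("normal", "minimal"):
        raise ValueError("mode must be 'normal' or 'minimal'")

    # Both modes: only documents that are new or whose version changed
    # are processed again (assumed version-string comparison).
    processed = []
    for doc_id, version in repository.items():
        if index.get(doc_id) != version:
            index[doc_id] = version
            processed.append(doc_id)

    # Only "normal": clean up documents that were not reached in this crawl.
    deleted = []
    if mode == "normal":
        for doc_id in list(index):
            if doc_id not in repository:
                del index[doc_id]
                deleted.append(doc_id)

    return sorted(processed), sorted(deleted)


if __name__ == "__main__":
    repo = {f"doc{i}": "v1" for i in range(100)}
    idx = {}
    run_job(repo, idx, "minimal")          # first run indexes all 100
    for i in range(100, 110):
        repo[f"doc{i}"] = "v1"             # 10 new documents appear
    processed, deleted = run_job(repo, idx, "minimal")
    print(len(processed), "processed,", len(deleted), "deleted")
```

In this model a schedule of frequent "minimal" runs with an occasional "normal" run keeps the index fresh cheaply and still removes deleted documents eventually, which matches the advice in the reply.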
