Hi. I just saw this thread. I believe Microsoft recommends a dedicated document source instance for larger corpora. I know from my SharePoint days that we often frustrated users by making SP very slow while we were crawling, which was mostly solved by having a dedicated source node.

S
On Sat, Feb 9, 2019, 2:10 AM Karl Wright <daddy...@gmail.com> wrote:

> Hi Gaurav,
>
> The number of connections you permit should depend on the resources on the SharePoint instance you're crawling. ManifoldCF will limit the number of connections to that instance to the number you select. Making it larger might help if there are a lot of resources on the SharePoint side, but in my experience that's usually not realistic, and just increasing the connection count can even have a paradoxical effect. So that will require a back and forth with the people running the SharePoint instances.
>
> Once you can confirm that SharePoint is no longer the bottleneck (I'm pretty certain it is right now), then the next step would be database performance optimization. For Postgres running on Linux, you should be pretty much pegging the CPUs on the DB machine if you've got all the other bottlenecks eliminated. If you aren't pegging those CPUs and/or the machine is IO bound, there has to be another bottleneck somewhere and you'll need to find it.
>
> Karl
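As a rough way to see whether the DB machine is genuinely busy rather than idle or waiting, you can look at pg_stat_activity and dead-tuple counts while a crawl is running. A minimal sketch, assuming psycopg2 is installed and using a placeholder DSN (the connection details below are hypothetical, not ManifoldCF defaults):

    # Quick look at whether the ManifoldCF Postgres database is busy or
    # bloated while a crawl is running. The DSN below is a placeholder.
    import psycopg2

    DSN = "host=db-host dbname=manifoldcf user=manifoldcf password=secret"  # hypothetical

    with psycopg2.connect(DSN) as conn:
        with conn.cursor() as cur:
            # Are backends actively running queries, or mostly idle/waiting?
            cur.execute("""
                SELECT state, wait_event_type, count(*)
                FROM pg_stat_activity
                WHERE datname = current_database()
                GROUP BY state, wait_event_type
                ORDER BY count(*) DESC
            """)
            for state, wait_event_type, n in cur.fetchall():
                print(f"{n:4d} backends  state={state}  wait={wait_event_type}")

            # Dead-tuple counts: large numbers suggest vacuum isn't keeping up.
            cur.execute("""
                SELECT relname, n_live_tup, n_dead_tup
                FROM pg_stat_user_tables
                ORDER BY n_dead_tup DESC
                LIMIT 10
            """)
            for relname, live, dead in cur.fetchall():
                print(f"{relname}: live={live} dead={dead}")

If the backends are mostly idle or waiting while the crawler threads are maxed out, the bottleneck is more likely on the SharePoint or network side than in Postgres.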
> On Sat, Feb 9, 2019 at 1:10 AM Gaurav G <goyalgaur...@gmail.com> wrote:
>
>> Hi Karl,
>>
>> Thanks for your insights. So I'm thinking of exploring the following options to get the most optimal performance. Your thoughts? Is the first option the one that might give the most bang for the buck?
>>
>> 1) Ask the SharePoint application team to dedicate a web and app server specifically for crawling. Also, on a related point, is there any optimal value for the number of concurrent repository connections? Currently we have it at about 40; not sure if increasing it further will improve speeds.
>> 2) Split the crawling between two sets of ManifoldCF and Postgres servers running on 4 different VMs but with a smaller configuration, say 4 cores and 12 GB RAM.
>> 3) Co-locate the crawlers in the same data center as the SharePoint servers. Currently they are in different DCs with dedicated MPLS connectivity.
>>
>> Thanks,
>> Gaurav
>>
>> On Sat, Feb 9, 2019 at 3:03 AM Karl Wright <daddy...@gmail.com> wrote:
>>
>>> The problem is not the speed of Manifold, but rather the work it has to do and the performance of SharePoint. All the speed in the world in the crawler will not fix the bottleneck that is SharePoint.
>>>
>>> Karl
>>>
>>> On Fri, Feb 8, 2019 at 4:06 PM Gaurav G <goyalgaur...@gmail.com> wrote:
>>>
>>>> Got it.
>>>> Is there any way we can increase the speed of the minimal crawl? Currently we are running one VM for ManifoldCF with 8 cores and 32 GB RAM. Postgres runs on another machine with a similar configuration. We have tuned the Postgres and ManifoldCF parameters as per the recommendations. We run a full vacuum once daily.
>>>>
>>>> Would switching to a multi-process configuration with ManifoldCF running on two servers give a boost?
>>>>
>>>> Thanks,
>>>> Gaurav
>>>>
>>>> On Saturday, February 9, 2019, Karl Wright <daddy...@gmail.com> wrote:
>>>>
>>>>> It does the minimum necessary. That means it can't do it in less. If this is a business requirement, then you should be angry with whoever made this requirement.
>>>>>
>>>>> SharePoint doesn't give you the ability to grab all changes or added documents up front. You have to crawl to discover them. That is how it is built and MCF cannot change it.
>>>>>
>>>>> Karl
>>>>>
>>>>> On Fri, Feb 8, 2019, 2:14 PM Gaurav G <goyalgaur...@gmail.com> wrote:
>>>>>
>>>>>> Hi Karl,
>>>>>>
>>>>>> Thanks for the response. We tried scheduling a minimal crawl every 15 minutes. At the end of fifteen minutes it stops with about 3000 docs in processing state and takes about 20-25 minutes to stop. Then the question becomes when to schedule the next crawl. And also, in those 15 minutes, would it have picked up all the adds and updates first, or could they be part of the 3000 docs which are still in processing state and would get picked up in the next run? The number of docs that actually change in a 30-minute period won't be more than 200.
>>>>>>
>>>>>> Being able to capture adds and updates within 30 minutes is a key business requirement.
>>>>>>
>>>>>> Thanks,
>>>>>> Gaurav
>>>>>>
>>>>>> On Friday, February 8, 2019, Karl Wright <daddy...@gmail.com> wrote:
>>>>>>
>>>>>>> Hi Gaurav,
>>>>>>>
>>>>>>> The right way to do this is to schedule "minimal" crawls every 15 minutes (which will process only the minimum needed to deal with adds and updates), and periodically perform "full" crawls (which will also include deletions).
>>>>>>>
>>>>>>> Thanks,
>>>>>>> Karl
>>>>>>>
>>>>>>> On Fri, Feb 8, 2019 at 10:11 AM Gaurav G <goyalgaur...@gmail.com> wrote:
>>>>>>>
>>>>>>>> Hi All,
>>>>>>>>
>>>>>>>> We're trying to crawl a SharePoint repo with about 30000 docs. Ideally we would like to be able to synchronize changes with the repo within 30 minutes. We are scheduling incremental crawling on this. Our observation is that a full crawl takes about 60-75 minutes. So if we schedule the incremental crawl for 30 minutes, in what order would it process the changes? Would it first bring in the adds and updates and then process the rest of the docs? What kind of logic is there in the incremental crawl?
>>>>>>>> We also tried the continuous crawl to achieve this. However, somehow the continuous crawl was not picking up new documents.
>>>>>>>>
>>>>>>>> Thanks,
>>>>>>>> Gaurav
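On the question of when to schedule the next crawl: one option is to poll the ManifoldCF API service and start the next minimal run only once the previous job is no longer active. A rough sketch follows; the endpoint URL, the jobstatuses resource, and the JSON field names are assumptions based on the MCF REST API as I remember it, so verify them against the documentation for your release.

    # Poll the ManifoldCF API service until no job is active, then it is
    # safe to kick off the next minimal crawl. Endpoint URL and JSON field
    # names are assumptions -- check your MCF version's REST API docs.
    import json
    import time
    import urllib.request

    API_URL = "http://localhost:8345/mcf-api-service/json/jobstatuses"  # adjust host/port

    def job_statuses():
        with urllib.request.urlopen(API_URL) as resp:
            data = json.loads(resp.read().decode("utf-8"))
        statuses = data.get("jobstatus", [])
        # A single job can come back as a dict rather than a list.
        return statuses if isinstance(statuses, list) else [statuses]

    while True:
        active = [s for s in job_statuses()
                  if s.get("status") not in ("done", "not yet run")]
        if not active:
            print("No active jobs; safe to start the next minimal crawl.")
            break
        for s in active:
            print(s.get("job_id"), s.get("status"), s.get("documents_processed"))
        time.sleep(60)  # re-check once a minute

Polling like this avoids guessing how long the in-process backlog (the roughly 3000 documents mentioned above) will take to drain before the next run can start.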